Lakes, Swamps, Ponds, and Other Bodies of Data
In our continued effort to demystify analytics-related jargon, this post will define all data collections named after types of buildings and bodies of water. Figurative language is nice and a helpful mnemonic device, but some of these terms are getting into extended-metaphor territory, a murky place indeed.
First, it’s worth asking why the buildings-and-water terms are so ubiquitous. Why is this a thing?
According to Dataversity, the first of these words was “datamart,” coined in the 1970s and followed by “data warehouse” a decade later. Then, in 2010, James Dixon extended the metaphor by likening a datamart to a “store of bottled water”:
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
Data and water have so much in common (liquidity, commercial value, fuel potential) that it’s no surprise the metaphor took on a life of its own from there. Let’s see where the current leads.
Data Lake
A repository of raw data in its native format. The term describes any large data pool in which the schema and data requirements are not defined until the data is queried.
Example: An unstructured repository of Google Maps search information, including string, numeric, array, geospatial, and image data types.
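To make "schema undefined before querying" concrete, here is a minimal Python sketch. It is not how any particular data lake product works: the `RAW_LAKE` records and the `read_with_schema` helper are hypothetical names invented for illustration. The point is only that raw records sit in their native form and each query decides which fields it cares about.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (JSON lines here).
# Nothing forces them to share the same fields.
RAW_LAKE = [
    '{"query": "coffee near me", "lat": 40.7, "lon": -74.0}',
    '{"query": "gas station", "tags": ["fuel", "24h"]}',
    '{"image_id": "img-001", "bytes": 20480}',
]


def read_with_schema(raw_records, fields):
    """Apply a schema at query time.

    Keeps only the records that contain every requested field,
    and projects each one down to just those fields.
    """
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if all(field in record for field in fields):
            rows.append({field: record[field] for field in fields})
    return rows


# Two different "schemas" imposed on the same raw data.
searches = read_with_schema(RAW_LAKE, ["query"])
located = read_with_schema(RAW_LAKE, ["query", "lat", "lon"])
```

Each caller brings its own idea of structure, which is flexible but also means nobody has vouched for the data up front.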
Data Ponds
A series of isolated repositories of raw data in its native format, also referred to as “data puddles,” used as temporary staging locations for raw, just-imported information. The data is then typically added to a data lake.
Example: Porting in ticket sales data for a particular theme park before adding it to a data lake containing all information for all parks in the system.
Data Swamp
A repository of ungoverned data, typically a data pond or data lake of questionable quality. The data swamp is often invoked as an argument for cultivating data reservoirs rather than data lakes.
Example: An unmanaged database of philanthropic gifts whose records do not agree with the organization’s other financial data and therefore cannot be trusted.
Data Reservoir
A repository of data that has undergone information management and governance, which typically includes access controls, transformations enforcing semantic consistency, and cataloging methods. The term is possibly analogous to “data warehouse.”
Example: A centralized medical database that is regularly updated by medical staff and governed by a dedicated team of data stewards. Data is primed for reporting and tenanted according to staff clearance levels.
Data Warehouse
A repository of data that has undergone ETL (Extract, Transform, Load) processing, which may include information management and governance, for the purpose of integrating data from diverse sources and making it easier to analyze.
Example: Clothing manufacturing, shipping, and sales data that has been consolidated into a single database, shaped, and released to business users.
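As a rough illustration of the ETL idea, the sketch below consolidates two tiny extracts into one table. Everything in it is an assumption made for the example: the `MANUFACTURING` and `SALES` tuples, the `sales_fact` table, and the `transform` and `load` helpers are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Extract: hypothetical rows pulled from two separate source systems.
# Values arrive as strings, as they often do in raw exports.
MANUFACTURING = [("sku-1", "T-shirt", "4.50"), ("sku-2", "Jeans", "12.00")]
SALES = [("sku-1", " 3 ", "19.99"), ("sku-2", "1", "49.99"), ("sku-1", "2", "19.99")]


def transform(manufacturing, sales):
    """Clean up types and join each sale to its product name and unit cost."""
    cost = {sku: float(unit_cost) for sku, _name, unit_cost in manufacturing}
    name = {sku: product for sku, product, _cost in manufacturing}
    rows = []
    for sku, quantity, price in sales:
        qty = int(quantity.strip())
        revenue = round(qty * float(price), 2)
        total_cost = round(qty * cost[sku], 2)
        rows.append((sku, name[sku], qty, revenue, total_cost))
    return rows


def load(conn, rows):
    """Write the shaped rows into a single consolidated table."""
    conn.execute(
        "CREATE TABLE sales_fact "
        "(sku TEXT, product TEXT, quantity INTEGER, revenue REAL, cost REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(MANUFACTURING, SALES))

# Business users can now ask one question across both sources.
totals = warehouse.execute(
    "SELECT product, SUM(quantity), SUM(revenue) "
    "FROM sales_fact GROUP BY product ORDER BY product"
).fetchall()
```

The cleanup happens once, before loading, so every later query sees the same shaped data.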
Data Mart
A repository of data that has undergone ETL and is tailored to the needs of a specific end user group. Datamarts (also “data marts”) can either be crafted from a data warehouse or combined to form a data warehouse.
Example: A subset of the data warehouse example above where the data has been groomed for a particular user set, such as a sales and marketing team.
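One simple way to picture a datamart carved out of a warehouse is a database view that exposes only what one team needs. The sketch below assumes that approach; real marts are often separate tables or databases. The `warehouse_orders` table, its columns, and the `sales_marketing_mart` view are hypothetical names, with in-memory SQLite again standing in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A hypothetical warehouse table holding data of interest to several teams.
conn.execute(
    "CREATE TABLE warehouse_orders "
    "(order_id INTEGER, region TEXT, product TEXT, "
    "revenue REAL, unit_cost REAL, carrier TEXT)"
)
conn.executemany(
    "INSERT INTO warehouse_orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "East", "T-shirt", 59.97, 13.50, "CarrierA"),
        (2, "West", "Jeans", 49.99, 12.00, "CarrierB"),
        (3, "East", "Jeans", 99.98, 24.00, "CarrierA"),
    ],
)

# The "mart": only the columns and grain a sales and marketing team needs.
# Manufacturing cost and shipping carrier are deliberately left out.
conn.execute(
    """
    CREATE VIEW sales_marketing_mart AS
    SELECT region, product, SUM(revenue) AS revenue
    FROM warehouse_orders
    GROUP BY region, product
    """
)

mart_rows = conn.execute(
    "SELECT region, product, revenue FROM sales_marketing_mart "
    "ORDER BY region, product"
).fetchall()
mart_columns = [
    col[0] for col in conn.execute("SELECT * FROM sales_marketing_mart").description
]
```

Because the view reads from the warehouse table, the mart stays in step with it instead of drifting into a silo of its own.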
Data Silo
A repository of data that exists in isolation from other repositories of data. Data silos can be intentional or the accidental result of mismanaged data channels.
Example: A series of spreadsheets maintained by different people, or a database that has been disconnected from other systems to control a data breach.
Source: PC Mag
Data Lake vs. Data Warehouse
Since data lakes, marts, and warehouses seem to be the most commonly confused terms, we’ve given them extra treatment below. For an even deeper dive, check out this cheat sheet from TechTarget!
If we were to arrange these three terms in process order, data lakes would come first because they contain raw data in its native format. That data is either queried directly or ETL’d into a data warehouse, which may be further partitioned into data marts.
Data Mart vs. Data Warehouse
As these definitions suggest, it’s important to know where your enterprise data originated, how it integrates with data from other sources, and how it will be made available to end users. Whether you call them ponds or silos, unintentionally isolated repositories of data should be avoided and instead incorporated into well-managed warehouses and reservoirs.
What other terms can we help demystify? Let us know in the comments!