Lakes, Swamps, Ponds, and Other Bodies of Data
In our continued effort to demystify analytics-related jargon, this post defines the data collections named after types of buildings and bodies of water. Figurative language is nice and a helpful mnemonic device, but some of these terms are getting into extended-metaphor territory, a murky place indeed.
First, it’s worth asking why the buildings-and-water terms are so ubiquitous. Why is this a thing?
According to Dataversity, the first of these words was “datamart,” coined in the 1970s and followed by “data warehouse” a decade later. Then, in 2010, James Dixon extended the metaphor by describing a datamart as a “store of bottled water”:
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
Data and water have so much in common (liquidity, commercial value, fuel potential) that it’s no surprise the metaphor took on a life of its own from there. Let’s see where the current leads.
Data Lake
A repository of raw data in its native format. The term describes any large data pool in which the schema and data requirements are not defined until the data is queried.
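Leaving the schema undefined until query time is often called "schema-on-read." Here is a minimal sketch of the idea in Python; the records, field names, and the `query_purchases` helper are all made up for illustration and do not come from any particular product.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (JSON lines).
# Nothing forces them to share fields or types.
RAW_LAKE = [
    '{"user": "ana", "amount": "19.99", "ts": "2016-03-01"}',
    '{"user": "ben", "amount": 5, "channel": "web"}',
    '{"sensor": "t-17", "reading": 21.4}',
]


def query_purchases(raw_lines):
    """Apply a schema at read time: keep only records that look like purchases."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if "user" in record and "amount" in record:
            # Types are coerced here, at query time, not when the data landed.
            rows.append({"user": str(record["user"]),
                         "amount": float(record["amount"])})
    return rows


purchases = query_purchases(RAW_LAKE)
total = sum(row["amount"] for row in purchases)
```

The sensor reading stays in the pool untouched; a different query could apply a different schema to the same raw lines.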
Data Ponds
A series of isolated repositories of raw data in its native format, also referred to as “data puddles.” Enterprises typically strive to unify their data ponds into a single data lake in order to eliminate the possibility of conflicting data sets. (See also “data silo.”)
Source: Cask Data
Data Swamp
A repository of ungoverned data, typically a data pond or data lake of questionable quality. The data swamp is typically used as an argument for cultivating data reservoirs rather than data lakes.
Data Reservoir
A repository of data that has undergone information management and governance, which typically includes access controls, transformations enforcing semantic consistency, and cataloging methods. The term is possibly analogous to “data warehouse.”
Data Warehouse
A repository of data that has undergone ETL (Extract, Transform, Load) processing, which may include information management and governance, for the purpose of integrating data from diverse sources and making it easier to analyze.
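To make the three ETL steps concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for the destination repository. The two source systems, their field names, and the `sales` table are hypothetical; real pipelines involve far more validation and governance than this.

```python
import sqlite3

# Hypothetical extracts from two source systems with different conventions.
CRM_ROWS = [{"customer": "Ana Diaz", "revenue_usd": "120.50"}]
SHOP_ROWS = [{"name": "ben li", "cents": 4550}]


def extract():
    """Extract: pull raw rows from each source."""
    return CRM_ROWS, SHOP_ROWS


def transform(crm_rows, shop_rows):
    """Transform: map both sources onto one schema (customer, revenue_usd, source)."""
    unified = []
    for row in crm_rows:
        unified.append((row["customer"].title(), float(row["revenue_usd"]), "crm"))
    for row in shop_rows:
        # The shop system stores cents, so convert to dollars.
        unified.append((row["name"].title(), row["cents"] / 100, "shop"))
    return unified


def load(rows, conn):
    """Load: write the unified rows into a single table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, revenue_usd REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
result = conn.execute(
    "SELECT customer, revenue_usd FROM sales ORDER BY customer"
).fetchall()
```

Once loaded, both sources can be queried together with consistent names and units, which is the integration the definition describes.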
Datamart
A repository of data that has undergone ETL and is tailored to the needs of a specific end-user group. Datamarts (also “data marts”) can either be crafted from a data warehouse or combined to form a data warehouse.
Data Silo
A repository of data that exists in isolation from other repositories of data (see also “data ponds”). Data silos can be intentional or the accidental result of mismanaged data channels.
Source: PC Mag
As these definitions suggest, it’s important to know where your enterprise data originated, how it integrates with data from other sources, and how it will be made available to end users. Whether you call them ponds or silos, unintentionally isolated repositories of data should be avoided and instead incorporated into well-managed warehouses and reservoirs.
What other terms can we help demystify? Let us know in the comments!