Lakes, Swamps, Ponds, and Other Bodies of Data

Nov 9, 2017

In our continued effort to demystify analytics-related jargon, this post will define all data collections named after types of buildings and bodies of water. Figurative language is pleasant and makes a helpful mnemonic device, but some of these terms are getting into extended-metaphor territory, a murky place indeed.

First, it’s worth asking why the buildings-and-water terms are so ubiquitous. Why is this a thing?

According to Dataversity, the first of these words was “datamart,” coined in the 1970s and followed by “data warehouse” a decade later. Then, in 2010, James Dixon extended the metaphor by conceiving of a datamart as a “store of bottled water”:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

Data and water have so much in common (liquidity, commercial value, fuel potential) that it’s no surprise the metaphor took on a life of its own from there. Let’s see where the current leads.

 

DATA LAKE

A repository of raw data in its native format. The term describes any large data pool in which the schema and data requirements are undefined before querying.

Source: TechTarget
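
To make "undefined before querying" concrete, here is a minimal sketch of the schema-on-read idea. The records, the field names, and the `read_with_schema` helper are all hypothetical and invented for illustration; a real data lake would sit on distributed storage rather than a Python list.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (nothing is validated on write).
RAW_LAKE = [
    '{"user": "ana", "amount": "19.99", "ts": "2017-11-01"}',
    '{"user": "bo", "amount": 5}',
    'not even json',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at query time: keep records that fit, coerced to the given types."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable data stays in the lake; this query just skips it
        if not isinstance(record, dict):
            continue
        try:
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            continue  # record lacks a field, or a value cannot be coerced
    return rows

# Two different "queries" impose two different schemas on the same raw data.
spend = read_with_schema(RAW_LAKE, {"user": str, "amount": float})
visits = read_with_schema(RAW_LAKE, {"user": str, "ts": str})
```

The same raw lines answer both questions, and each query decides for itself which fields matter and how to type them.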

 

DATA PONDS

A series of isolated repositories of raw data in its native format, also referred to as “data puddles.” Enterprises typically strive to unify their data ponds into a single data lake in order to eliminate the possibility of conflicting data sets. (See also “data silo.”)

Source: Cask Data

 

DATA SWAMP

A repository of ungoverned data, usually a data pond or data lake of questionable quality. The data swamp is often invoked as an argument for cultivating data reservoirs rather than data lakes.

Source: Gartner

 

DATA RESERVOIR

A repository of data that has undergone information management and governance, which typically includes access controls, transformations enforcing semantic consistency, and cataloging methods. The term is possibly analogous to “data warehouse.”

Sources: IBM, Dell EMC

 

DATA WAREHOUSE

A repository of data that has undergone ETL (Extract, Transform, Load) processing, which may include information management and governance, for the purpose of integrating data from diverse sources and making it easier to analyze.

Sources: TechTarget, Spotless Data
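
As a rough illustration of ETL, the sketch below integrates two made-up source extracts into one table using Python's built-in `sqlite3` module. The source rows, the `sales` table, and the `transform` and `load` helpers are assumptions for the example, not any vendor's pipeline.

```python
import sqlite3

# Extract: hypothetical rows pulled from two systems with different conventions.
CRM_ROWS = [("Ana", "ana@example.com", "1,200.50")]
SHOP_ROWS = [
    {"email": "ANA@EXAMPLE.COM ", "total_cents": 3000},
    {"email": "bo@example.com", "total_cents": 450},
]

def transform(crm_rows, shop_rows):
    """Transform: normalize both sources to (email, source, amount) rows."""
    for _name, email, amount in crm_rows:
        yield (email.strip().lower(), "crm", float(amount.replace(",", "")))
    for row in shop_rows:
        yield (row["email"].strip().lower(), "shop", row["total_cents"] / 100)

def load(conn, rows):
    """Load: write the normalized rows into a single warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (email TEXT, source TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(CRM_ROWS, SHOP_ROWS))

# Once integrated, one query spans both sources.
totals = dict(warehouse.execute("SELECT email, SUM(amount) FROM sales GROUP BY email"))
```

The transform step is where the "easier to analyze" part happens: email casing and currency formats are reconciled before anything reaches the warehouse.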

 

DATAMART

A repository of data that has undergone ETL and is tailored to the needs of a specific end user group. Datamarts (also “data marts”) can either be crafted from a data warehouse or combined to form a data warehouse.

Source: TechTarget
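
One way to picture a datamart crafted from a data warehouse is as a narrower, pre-aggregated view for a single team. The sketch below assumes a hypothetical `sales` warehouse table and a hypothetical `emea_sales_mart` view; it covers only the warehouse-to-mart direction, not the combining of marts into a warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny stand-in for a warehouse table (hypothetical data).
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "widget", 100.0), ("EMEA", "gadget", 50.0), ("APAC", "widget", 70.0)],
)

# The "mart": only the slice and the shape that the EMEA sales team needs.
conn.execute("""
    CREATE VIEW emea_sales_mart AS
    SELECT product, SUM(amount) AS revenue
    FROM sales
    WHERE region = 'EMEA'
    GROUP BY product
""")

mart_rows = conn.execute(
    "SELECT product, revenue FROM emea_sales_mart ORDER BY product"
).fetchall()
```

A view keeps the mart in sync with the warehouse automatically; a production mart might instead be a separate, periodically refreshed table.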

 

DATA SILO

A repository of data that exists in isolation from other repositories of data (see also “data ponds”). Data silos can be intentional or the accidental result of mismanaged data channels.

Source: PC Mag

 

As these definitions suggest, it’s important to know where your enterprise data originated, how it integrates with data from other sources, and how it will be made available to end users. Whether you call them ponds or silos, unintentionally isolated repositories of data should be avoided and instead incorporated into well-managed warehouses and reservoirs.

What other terms can we help demystify? Let us know in the comments!

 

Thumbnail Photo Credit: This modification of “Mount St. Helens” by Theo Crazzolara is licensed under CC BY 2.0.
