It may be part of your company's technology estate and data infrastructure, or if it's not, it almost certainly will be. But what is it?
This is part of our 'Tech for the Exec' series.
“A picture paints a thousand words”, as the saying goes. But equally, a name or a short description can conjure up a vivid image. The name “Data Lakehouse” did just that for me when I first heard it: a quaint, petite building on the edge of a vast, natural lake. So, what would this mean as a metaphor in the context of technology and digital infrastructure? Spoiler alert! Anything but the idyllic scene I’d imagined. Let’s jump back into the world of business and tech to explain.
Back to Business
Being able to perform effective data analytics efficiently has rapidly become one of the primary concerns for businesses, shaping the very foundations of organisational design, software and systems architecture (to support a more considered Enterprise Data Architecture), and data infrastructure. Even those of you far removed from your CTO and their band of technologists are likely to have felt the effects of good or poor data management. Whether it's driving your sales activities, your customer retention strategies, your revenue forecasting or ensuring you have your essential corpus of data to mine (supporting the development of your own AI-based systems), data management underpins them all.
Warehouses and Lakes
Both Data Warehouses and Data Lakes have played key roles in data infrastructure and data management to this point. They can be used in conjunction, a dynamic duo working together to store and structure your valuable data. In our experience, however, many enterprises have not invested sufficiently and typically run on a Data Warehouse alone.
Data lakes function as catch-all systems for new data, while data warehouses apply downstream structure and meaning to specific data from this system. Great, you might think. Well, yes…and no. Orchestrating these systems to provide reliable data can incur substantial costs in terms of time and resources. And here enters the concept of the Data Lakehouse. But before we delve into that, let’s take a whistle-stop tour of what Data Warehouses and Data Lakes do.
Data lakes: commonly built on big data platforms such as Apache Hadoop, these are known for being cost-effective and versatile storage spaces. They handle all sorts of data, from videos to text, creating an eclectic treasure trove. However, navigating these vast waters might require the guidance of data wizards (data scientists and engineers) to tame the data deluge. Additionally, as data governance is mostly implemented downstream from these systems, data lakes tend to be more susceptible to data silos and can subsequently degrade into a data swamp: scattered islands of data that should be connected. When this occurs, the usability of the data lake can be compromised.
Data warehouses: They're like wise librarians, gathering raw data from multiple sources and neatly organising it into a central repository that uses relational database infrastructure. These are central to data analytics and business intelligence applications such as enterprise reporting. Data warehouses employ ETL processes to extract, transform, and load data into the data warehouse repository. ETL processes are sometimes also the means of moving data from the warehouse to a 'destination' system or tool. However, this approach is limited by its inefficiency and cost, particularly as the number of data sources and the quantity of data grow over time. (We'll delve into this a little in our "Technically Curious" section at the end.)
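For readers who like to see the moving parts, here is a minimal sketch of the extract, transform and load steps in Python. Everything in it is an assumption made for illustration: the two source lists, the `sales` table and the in-memory SQLite database standing in for a warehouse are hypothetical, and a real ETL pipeline would be far more involved.

```python
import sqlite3

# Hypothetical raw records from two source systems (illustrative data only).
crm_rows = [{"customer": " Alice ", "spend": "120.50"}, {"customer": "Bob", "spend": "80"}]
shop_rows = [{"customer": "alice", "spend": "30.00"}]


def extract():
    """Extract: gather the raw records from every source."""
    return crm_rows + shop_rows


def transform(rows):
    """Transform: tidy up customer names and convert spend from text to a number."""
    return [(row["customer"].strip().title(), float(row["spend"])) for row in rows]


def load(conn, rows):
    """Load: write the cleaned rows into a structured table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, spend REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


# An in-memory SQLite database stands in for the warehouse (an assumption).
conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))

# Once loaded, reporting becomes a simple structured query.
totals = dict(conn.execute("SELECT customer, SUM(spend) FROM sales GROUP BY customer"))
```

The point to notice is that the cleaning work happens before the data lands in the table, which is what makes the final reporting query so simple, and also why the effort grows with every new data source.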
Now, imagine a hybrid realm where the wisdom of warehouses meets the adaptability of lakes. Behold, the Data Lakehouse!
The Data Lakehouse
Data lakehouses are gaining popularity as an infrastructure choice, merging the capabilities of data warehouses and data lakes.
A "Lakehouse" is a misnomer in my opinion. It's a warehouse, a factory, a logistics office and a distribution network sitting next to a reservoir. A powerhouse of ability and potential. But a Data Lakehouse is what it's called.
----------
For the Technically Curious Exec
If you want to look a little deeper into this subject, read on...
Data Lakehouses are generally products offered by cloud infrastructure providers that are moving into the traditional Data Warehouse market. This is useful to know because it helps to explain how they're able to (or almost able to) provide the benefits of both a Data Lake and a Data Warehouse, and what the trade-offs are, especially at an executive level.
- Data storage is cheap in a Data Lake: the provider has vast amounts of it and utilises cheap cloud storage. It's the data retrieval that is harder and thus costly. That's because the data has no structure, no strong relationships or ordering to make it easy to find the data requested. Retrieval relies on a huge number of computers churning through all the data, all done behind the scenes by the provider.
We say that Data Lakes are optimised for Data Write.
- The opposite is true for a Data Warehouse. Data storage is more expensive, primarily because it uses high-end, high-spec hardware under the hood, but also because the data is structured and contains copies, indexes (Look Up Tables) and duplications so that it can be linked together in different ways, enabling quick and cheap data retrieval. Think Librarian and Library.
We say that Data Warehouses are optimised for Data Read.
- A Data Lakehouse doesn't structure the data to organise and store it. Instead, it does some clever parsing, attempts to categorise it (apply type to the data), compresses it and calculates metadata on it. This allows for some 'short-cuts' that a Data Lake can't take when retrieving data for a query. It also allows the data to be chunked up and searched in parallel, using a large amount of compute power for that short period of time. Think of it like splitting a deck of cards into 5 parts and asking 5 people to find the King of Diamonds. It's much faster than one person performing this search. But you can't do this if you don't know that the stack of cards contains only one King of Diamonds. The Data Lake doesn't understand the rules of the data. A Data Lakehouse knows just enough to allow the search to be done by many computers working in parallel.
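The three retrieval styles above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any particular product works: the order records, the chunk size of 200 and all function names are hypothetical. Each lookup returns the record it found plus a rough measure of effort, so the difference is easy to see.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data set: 1,000 order records, stored as 5 chunks of 200.
records = [{"order_id": i, "amount": i * 2} for i in range(1000)]
chunks = [records[i:i + 200] for i in range(0, 1000, 200)]


def lake_lookup(order_id):
    """Lake style: no structure, so check records one by one until found."""
    checked = 0
    for rec in records:
        checked += 1
        if rec["order_id"] == order_id:
            return rec, checked
    return None, checked


# Warehouse style: an index built up front (extra storage, very cheap reads).
index = {rec["order_id"]: rec for rec in records}


def warehouse_lookup(order_id):
    """Warehouse style: a single index lookup."""
    return index.get(order_id), 1


# Lakehouse style: keep lightweight metadata (min and max id) for each chunk.
chunk_meta = [
    (min(r["order_id"] for r in c), max(r["order_id"] for r in c)) for c in chunks
]


def scan_chunk(chunk, order_id):
    """Search one chunk for the requested record."""
    return next((r for r in chunk if r["order_id"] == order_id), None)


def lakehouse_lookup(order_id):
    """Skip chunks whose metadata rules them out, then scan the rest in parallel."""
    candidates = [c for c, (lo, hi) in zip(chunks, chunk_meta) if lo <= order_id <= hi]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda c: scan_chunk(c, order_id), candidates))
    found = next((r for r in results if r is not None), None)
    return found, len(candidates)  # effort measured in chunks scanned
```

Calling `lake_lookup(750)` has to walk through 751 records, `warehouse_lookup(750)` is one index lookup, and `lakehouse_lookup(750)` uses the metadata to rule out four of the five chunks before any scanning starts. The metadata is the "knows just enough" part of the card-deck analogy.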
So, you can imagine that performance and operating costs may differ between all three systems. Data retrieval times of a Data Lakehouse are not quite as fast as those of a Data Warehouse. This is constantly improving, however, and retrieval in a Lakehouse can be sped up further at a greater cost.
Because these systems are built differently, the cost to operate them also differs. Which is cheaper? It depends on your usage patterns. If you remember that these providers usually separate out the cost of data storage from the cost of compute, you'll know to ask questions about what usage patterns your business has. Sometimes a Data Warehouse is the right solution for your business.
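To make the "it depends on your usage patterns" point concrete, here is a back-of-the-envelope sketch in Python. Every price in it is a made-up placeholder, not real vendor pricing, and the function names are hypothetical. It only illustrates how separately charged storage and compute can tip the answer either way.

```python
def monthly_cost(stored_tb, compute_hours, storage_price_per_tb, compute_price_per_hour):
    """Monthly bill when storage and compute are charged separately."""
    return stored_tb * storage_price_per_tb + compute_hours * compute_price_per_hour


# Made-up placeholder prices (assumptions for illustration only):
# cheap storage with pricier compute versus pricier storage with cheaper compute.
HYPOTHETICAL_PRICES = {
    "lakehouse": {"storage_price_per_tb": 20, "compute_price_per_hour": 6},
    "warehouse": {"storage_price_per_tb": 100, "compute_price_per_hour": 3},
}


def cheaper_option(stored_tb, compute_hours):
    """Return the cheaper system for a usage pattern, plus both monthly costs."""
    costs = {
        name: monthly_cost(stored_tb, compute_hours, **prices)
        for name, prices in HYPOTHETICAL_PRICES.items()
    }
    return min(costs, key=costs.get), costs


# Lots of data, queried rarely: cheap storage dominates the bill.
store_heavy = cheaper_option(stored_tb=50, compute_hours=100)

# Little data, queried constantly: cheap compute dominates the bill.
query_heavy = cheaper_option(stored_tb=5, compute_hours=400)
```

With these placeholder numbers the store-heavy business is better off with the lakehouse-style pricing and the query-heavy business with the warehouse-style pricing. Swap in your provider's actual rates and your own usage figures before drawing any conclusions.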
*ACID refers to atomicity, consistency, isolation, and durability, all of which are critical properties ensuring a transaction (a sequence of consecutive events and their consequent data updates) maintains data accuracy and consistency.
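As a small illustration of the atomicity part of ACID, the sketch below uses Python's built-in `sqlite3` module. The `accounts` table, the account names and the `transfer` function are hypothetical, chosen only to show a sequence of updates that either all succeed or are all undone.

```python
import sqlite3

# In-memory database with a rule that balances can never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the balance rule, so the earlier credit was undone too.
        return False


def balances(conn):
    """Return the current balance of every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer of 30 from A to B succeeds and both balances change together. A transfer of 500 fails on the debit, and the credit that had already been applied is rolled back, so no money appears from nowhere.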
Image Credits:
www.peakpx.com
www.oblivionstate.com