What is a Data Lakehouse?

Scott Potter • Dec 05, 2022

It may be part of your company's technology estate and data infrastructure, or if it's not, it almost certainly will be.  But what is it?

This is part of our 'Tech for the Exec' series.

“A picture paints a thousand words” as the saying goes.  But equally, a name or a short phrase can conjure up a vivid image.  For me, the name “Data Lakehouse” did just that when I first heard it: a quaint, petite building on the edge of a vast, natural lake.  So what would this mean as a metaphor in a technology and digital infrastructure context?  Spoiler alert!  Anything but the idyllic scene I’d imagined.  Let’s jump back into the world of business and tech to explain.

Back to Business

Being able to perform data analytics efficiently and effectively has rapidly become one of the primary concerns for businesses, shaping the very foundations of organisational design, software and systems architecture (to support a more considered Enterprise Data Architecture), and data infrastructure.  Even those of you far removed from your CTO and their band of technologists are likely to have felt the impact of data management, whether it's driving your sales activities, your customer retention strategies or your revenue forecasting, or giving you your own essential corpus of data to mine and to develop your own AI-based systems.


One such impact results from the expectation that the infrastructure must support a simple interface letting your team effortlessly converse with data and tackle intricate tasks using everyday language.

Warehouses and Lakes

Both Data Warehouses and Data Lakes have played key roles in data infrastructure to this point in time.  Data lakes and data warehouses can be used in conjunction, a dynamic duo, working together to store and structure your valuable data.  In our experience however, many enterprises have not invested sufficiently and are typically running from a Data Warehouse.


Data lakes function as catch-all systems for new data, while data warehouses apply downstream structure and meaning to specific data from this system. Great, you might think.  Well, yes…and no.  Orchestrating these systems to provide reliable data can incur substantial costs in terms of time and resources. And here enters the concept of the Data Lakehouse.  But before we delve into that, let’s take a whistle-stop tour of what Warehouses and Lakes do.

Data lakes: these are commonly built on big data platforms such as Apache Hadoop, known for being cost-effective and versatile storage spaces. They handle all sorts of data, from videos to text, creating an eclectic treasure trove. However, navigating these vast waters might require the guidance of data wizards—data scientists and engineers—to tame the data deluge. Additionally, as data governance is mostly implemented downstream from these systems, data lakes tend to be more susceptible to data silos, which can subsequently transform into a data swamp with scattered islands of data that should be connected. When this occurs, the usability of the data lake can be compromised.


Data warehouses: They're like wise librarians, gathering raw data from multiple sources and neatly organising it into a central repository that uses relational database infrastructure.  These are central to data analytics and business intelligence applications such as enterprise reporting. Data warehouses employ ETL processes to extract, transform, and load data into the data warehouse repository.  ETL processes are sometimes also the means of moving data from the warehouse to a 'destination' system or tool. However, the warehouse approach is limited by its inefficiency and cost, particularly as the number of data sources and the quantity of data grow over time. (We'll delve into this a little in our "Technically Curious" section at the end.)
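For readers who like to see the moving parts, here is a minimal sketch of an extract, transform and load step in Python. Everything in it is an assumption made for illustration: the sales export, the column names and the `orders` table are invented, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a sales system (the "source").
RAW_CSV = """order_id,customer,amount_gbp
1001, Acme Ltd ,250.00
1002,Birch & Co,
1003,acme ltd,100.50
"""

def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy customer names, skip rows with no amount, convert types."""
    clean = []
    for row in rows:
        amount = row["amount_gbp"].strip()
        if not amount:
            continue  # incomplete record, so it is not loaded
        clean.append((int(row["order_id"]), row["customer"].strip().title(), float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_gbp REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

# An in-memory SQLite database stands in for the warehouse (an assumption).
warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)

# Once the data is structured, reporting queries are short.
totals = warehouse.execute(
    "SELECT customer, SUM(amount_gbp) FROM orders GROUP BY customer"
).fetchall()
print(totals)
```

The point to notice is that all the tidying happens before the data is stored, which is what makes later reporting simple and is also where much of the effort goes as sources multiply.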


Now, imagine a hybrid realm where the wisdom of warehouses meets the adaptability of lakes. Behold, the Data Lakehouse!


The Data Lakehouse

Data lakehouses are gaining popularity as an infrastructure choice, merging the capabilities of data warehouses and data lakes.

They take data structures similar to those of data warehouses and combine them with the cost-efficient storage and flexibility of data lakes. This enables organisations to store and access large-scale data quickly and efficiently while also addressing potential data quality issues. And without diving too deep into the technical details, it's worth noting that data lakehouses also support ACID* transactions for substantial data workloads, ensuring a high level of data integrity.


In summary, a Data Lakehouse brings together the flexibility and scalability of data lakes with the management and data quality aspects of warehouses. It also reduces the necessity to transfer data between systems, mitigating privacy risks. By bringing these benefits under one data architecture, data teams can accelerate their data processing as they no longer need to straddle two disparate data systems to complete and scale more advanced analytics, such as machine learning.

A "Lakehouse" is a misnomer in my opinion.  It's a warehouse, a factory, a logistics office and a distribution network sitting next to a reservoir.  A powerhouse of ability and potential.  But a Data Lakehouse is what it's called.

----------


For the Technically Curious Exec

If you want to look a little deeper into this subject, read on...

Data Lakehouses are generally products offered by Cloud Infrastructure providers moving into the traditional Data Warehouse market.  This is useful to know because it helps to understand how they're able to (or almost able to) provide the benefits of a Data Lake and a Warehouse, and what the trade-offs are, especially at an executive level.

  • Data storage is cheap in a Data Lake: the provider has vast amounts of it and utilises cheap cloud storage.  It's the data retrieval that is harder and thus costly.  That's because the data has no structure, and no strong relationships or ordering to make it easy to search for the data requested.  It relies on a huge number of computers to churn through all the data, all done behind the scenes by the provider.  We say that Data Lakes are optimised for Data Write.
  • The opposite is true for a Data Warehouse.  Data storage is more expensive, primarily because it uses high-end, high-spec hardware under the hood.  It's also because the data is structured and contains copies, indexes (Look Up Tables) and duplications, so that it can be linked together in different ways, enabling quick and cheap data retrieval.  Think Librarian and Library.  We say that Data Warehouses are optimised for Data Read.
  • A Data Lakehouse doesn't structure the data to organise and store it.  Instead, it does some clever parsing, attempts to categorise it (apply a type to the data), compresses it and calculates metadata on it.  This allows for some 'short-cuts' that a Data Lake can't perform when retrieving data for a data query.  It also allows for the data to be chunked up and searched in parallel, using a large amount of compute power for that short period of time.  Think of it like splitting a deck of cards into 5 parts and asking 5 people to find the King of Diamonds.  It's much faster than one person performing this search.  But you can't do this if you don't know that the stack of rectangular cards with information on them contains only one King of Diamonds.  The Data Lake doesn't understand the rules of the data.  A Data Lakehouse knows just enough to allow the search to be done by many computers working in parallel.
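The 'short-cuts' and the parallel search described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular product works: the `chunks` of order values are invented, the metadata is just a minimum and maximum per chunk, and threads stand in for the provider's many computers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: order values split into chunks, like a deck split into piles.
chunks = [
    [12, 45, 7, 30],
    [210, 180, 305, 250],
    [99, 64, 81, 70],
    [520, 480, 610, 555],
    [150, 130, 175, 160],
]

# Metadata calculated once, when the data is written: min and max per chunk.
metadata = [{"min": min(c), "max": max(c)} for c in chunks]

def search_chunk(chunk, target):
    """One worker scans one chunk for the target value."""
    return [value for value in chunk if value == target]

def find(target):
    """Return the matching values and how many chunks had to be scanned."""
    # Short-cut: skip any chunk whose min/max range cannot contain the target.
    candidates = [
        chunk for chunk, meta in zip(chunks, metadata)
        if meta["min"] <= target <= meta["max"]
    ]
    # The remaining chunks are searched in parallel, one task per chunk.
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(search_chunk, candidates, [target] * len(candidates)))
    matches = [value for found in results for value in found]
    return matches, len(candidates)

matches, scanned = find(305)
print(matches, scanned)
```

In this sketch only one of the five chunks needs to be opened to find the value 305, because the stored minimum and maximum rule out the other four before any searching starts. A Data Lake with no such metadata would have to read all five.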


So, you can imagine that Performance and Operating Costs may differ between all three systems.  Data retrieval from a Data Lakehouse is not quite as fast as from a Data Warehouse, although this is constantly improving.  Retrieval speed in a Lakehouse can also be improved at a greater cost.


Because of the way that these systems are built differently, the cost to operate them also differs.  Which is cheaper?  It depends on your usage patterns.  If you remember that these providers usually separate out the cost of data storage from the cost of compute, you'll know to ask questions about what usage patterns your business has.  Sometimes a Data Warehouse is the right solution for your business.
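To see why the answer depends on usage, here is a back-of-the-envelope sketch in Python. All of the rates are made-up placeholders, not real vendor pricing, and the two profiles are hypothetical: profile A has cheap storage and pricier compute, profile B the reverse.

```python
# All rates below are made-up placeholders, not real vendor pricing.
def monthly_cost(stored_tb, compute_hours, storage_rate_per_tb, compute_rate_per_hour):
    """Storage and compute are billed separately, then added together."""
    return stored_tb * storage_rate_per_tb + compute_hours * compute_rate_per_hour

# Hypothetical profile A: cheap storage, pricier compute.
PROFILE_A = {"storage_rate_per_tb": 20.0, "compute_rate_per_hour": 4.0}
# Hypothetical profile B: pricier storage, cheaper compute.
PROFILE_B = {"storage_rate_per_tb": 100.0, "compute_rate_per_hour": 1.0}

def cheaper_profile(stored_tb, compute_hours):
    """Return the cheaper profile's label and both monthly costs."""
    cost_a = monthly_cost(stored_tb, compute_hours, **PROFILE_A)
    cost_b = monthly_cost(stored_tb, compute_hours, **PROFILE_B)
    return ("A" if cost_a < cost_b else "B"), cost_a, cost_b

# Lots of data that is rarely queried.
print(cheaper_profile(stored_tb=50, compute_hours=100))
# A little data that is queried heavily.
print(cheaper_profile(stored_tb=5, compute_hours=2000))
```

With these invented numbers, the business that stores a lot and queries rarely is better off on profile A, while the one that stores little and queries heavily is better off on profile B. The useful questions for your provider are therefore how much you store, and how often and how heavily you query it.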


*ACID refers to atomicity, consistency, isolation, and durability, all of which are critical properties ensuring a transaction (a sequence of consecutive events and their consequent data updates) maintains data accuracy and consistency.
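The "atomicity" part of ACID is the easiest to show in code. The sketch below uses Python's built-in SQLite database purely as a stand-in (an assumption; a lakehouse product will have its own mechanism), and the `accounts` table and budget figures are invented. A transfer is two updates, and either both happen or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK rule means a balance may never go below zero.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("sales", 100), ("marketing", 50)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move budget between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "sales", "marketing", 30)       # allowed
failed = transfer(conn, "sales", "marketing", 500)  # would overdraw sales
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer fails halfway through, after marketing has already been credited, yet the final balances show no trace of it. That is the data integrity guarantee the asterisk is pointing at.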


If you found this insightful, you might want to check out our Resources Hub.


Related Posts:

  • My AI Work Buddies - explores the importance of data models as part of a future-proof strategy to leverage AI technologies.


Image Credits:

www.peakpx.com

www.oblivionstate.com
