What is a Data Lakehouse?

Scott Potter • 5 December 2022

It may be part of your company's technology estate and data infrastructure, or if it's not, it almost certainly will be.  But what is it?

This is part of our 'Tech for the Exec' series.

“A picture paints one thousand words” as the saying goes.  But equally, a name or a short description can conjure up a vivid image.  The name “Data Lakehouse” did just that for me when I first heard it.  A quaint, petite building on the edge of a vast, natural lake.  So, what would this mean as a metaphor in the context of technology and digital infrastructure?   Spoiler alert!  Anything but this idyllic scene I’d imagined.  Let’s jump back into the world of business and tech to explain.

Back to Business

Being able to perform effective data analytics efficiently has rapidly become one of the primary concerns for businesses, shaping the very foundations of organisational design, software and systems architecture (to support a more considered Enterprise Data Architecture), and data infrastructure.  Even those of you far removed from your CTO and their band of technologists are likely to have felt the effects of good or poor data management.  Whether it's driving your sales activities, your customer retention strategies, your revenue forecasting or ensuring you have your essential corpus of data to mine (supporting the development of your own AI-based systems), data management underpins them all.


Warehouses and Lakes

Both Data Warehouses and Data Lakes have played key roles in data infrastructure and data management to this point.  Data lakes and data warehouses can be used in conjunction, a dynamic duo, working together to store and structure your valuable data.  In our experience, however, many enterprises have not invested sufficiently and typically run from a Data Warehouse only.


Data lakes function as catch-all systems for new data, while data warehouses apply downstream structure and meaning to specific data from this system. Great, you might think.  Well, yes…and no.  Orchestrating these systems to provide reliable data can incur substantial costs in terms of time and resources. And here enters the concept of the Data Lakehouse.  But before we delve into that, let’s take a whistle-stop tour of what Data Warehouses and Data Lakes do.

Data lakes: these are commonly built on big data platforms such as Apache Hadoop, known for being cost-effective and versatile storage spaces. They handle all sorts of data, from videos to text, creating an eclectic treasure trove. However, navigating these vast waters might require the guidance of data wizards—data scientists and engineers—to tame the data deluge. Additionally, as data governance is mostly implemented downstream from these systems, data lakes tend to be more susceptible to data silos, which can subsequently transform into a data swamp with scattered islands of data that should be connected. When this occurs, the usability of the data lake can be compromised.


Data warehouses: They're like wise librarians, gathering raw data from multiple sources and neatly organising it into a central repository that uses relational database infrastructure.  They are central to data analytics and business intelligence applications such as enterprise reporting. Data warehouses employ ETL processes to extract, transform, and load data into the warehouse repository.  ETL processes are sometimes also the means of moving data from the warehouse to a 'destination' system or tool. However, ETL is limited by its inefficiency and cost, particularly as the number of data sources and the quantity of data grow over time. (We'll delve into this a little in our "Technically Curious" section at the end.)
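For the curious, the extract, transform and load steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the two source lists, the `sales` table and every function name are made up for the example, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw records from two source systems (illustrative data only).
crm_rows = [{"customer": " Alice ", "spend": "120.50"}, {"customer": "Bob", "spend": "80"}]
shop_rows = [{"customer": "alice", "spend": "30.25"}]


def extract():
    """Extract: gather the raw rows from every source."""
    return crm_rows + shop_rows


def transform(rows):
    """Transform: tidy up names and convert spend from text to a number."""
    return [(r["customer"].strip().title(), float(r["spend"])) for r in rows]


def load(conn, rows):
    """Load: write the cleaned rows into a structured table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, spend REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three steps in order against an in-memory stand-in warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


def spend_by_customer(conn):
    """A typical reporting query once the data is structured."""
    query = "SELECT customer, SUM(spend) FROM sales GROUP BY customer ORDER BY customer"
    return dict(conn.execute(query))
```

Calling `spend_by_customer(run_etl())` returns one total per customer, with " Alice " and "alice" merged into a single entry. Every new source needs its own extract and transform logic, which is where the cost of ETL tends to grow.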


Now, imagine a hybrid realm where the wisdom of warehouses meets the adaptability of lakes. Behold, the Data Lakehouse!


The Data Lakehouse

Data lakehouses are gaining popularity as an infrastructure choice, merging the capabilities of data warehouses and data lakes.

They leverage similar data structures from data warehouses and combine them with the cost-efficient storage and flexibility of data lakes. This enables organisations to store and access large-scale data quickly and efficiently while also addressing potential data quality issues. And without diving too deep into the technical details, it's worth noting that data lakehouses also support ACID* transactions for substantial data workloads, ensuring a high level of data integrity.


In summary, a Data Lakehouse brings together the flexibility and scalability of data lakes with the management and data quality aspects of warehouses. It also reduces the necessity to transfer data between systems, mitigating privacy risks. By bringing these benefits under one data architecture, data teams can accelerate their data processing as they no longer need to straddle two disparate data systems to complete and scale more advanced analytics, such as machine learning.

A "Lakehouse" is a misnomer in my opinion.  It's a warehouse, a factory, a logistics office and a distribution network sitting next to a reservoir.  A powerhouse of ability and potential.  But a Data Lakehouse is what it's called.

----------


For the Technically Curious Exec

If you want to look a little deeper into this subject, read on...

Data Lakehouses are generally products offered by Cloud Infrastructure providers that are moving into the traditional Data Warehouse market.  This is useful to know because it helps to explain how they're able to (or almost able to) provide the benefits of both a Data Lake and a Data Warehouse, and what the trade-offs are, especially at an executive level.

  • Data storage is cheap in a Data Lake: the provider has vast amounts of it and uses low-cost cloud storage.  It's the data retrieval that is harder and thus costly.  That's because the data has no structure, no strong relationships or ordering to make it easy to find the data requested.  Retrieval relies on a huge number of computers churning through all the data, all done behind the scenes by the provider.  We say that Data Lakes are optimised for Data Write.
  • The opposite is true for a Data Warehouse.  Data storage is more expensive, primarily because it uses high-end, high-spec hardware under the hood.  It is also more expensive because the data is structured and stored with indexes (look-up tables), copies and duplications so that it can be linked together in different ways, enabling quick and cheap data retrieval.  Think Librarian and Library.  We say that Data Warehouses are optimised for Data Read.
  • A Data Lakehouse doesn't restructure the data in order to organise and store it.  Instead, it does some clever parsing, attempts to categorise it (apply types to the data), compresses it and calculates metadata on it.  This allows for some 'short-cuts' that a Data Lake can't take when retrieving data for a query.  It also allows the data to be chunked up and searched in parallel, using a large amount of compute power for that short period of time.  Think of it like splitting a deck of cards into 5 parts and asking 5 people to find the King of Diamonds.  It's much faster than one person performing this search.  But you can't do this unless you know that the stack of rectangular cards in front of you contains only one King of Diamonds.  The Data Lake doesn't understand the rules of the data.  A Data Lakehouse knows just enough to allow the search to be done by many computers working in parallel.
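
The 'short-cuts' and the deck-of-cards search in the last point can be sketched roughly as follows. This is an illustration under simplifying assumptions rather than how any particular product works: the chunk layout, the min/max metadata and all function names are hypothetical, and Python threads are used only to show the shape of a parallel search.

```python
from concurrent.futures import ThreadPoolExecutor


def build_chunks(values, chunk_size):
    """Split the data into chunks and record min/max metadata for each one."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        data = values[start:start + chunk_size]
        chunks.append({"min": min(data), "max": max(data), "data": data})
    return chunks


def search_chunk(chunk, target):
    """Scan one chunk in full and count the matches."""
    return sum(1 for v in chunk["data"] if v == target)


def lakehouse_style_count(chunks, target, workers=5):
    """Skip chunks the metadata rules out, then scan the rest in parallel.

    Returns (number of matches, number of chunks actually scanned).
    """
    # The short-cut: a chunk whose min/max range excludes the target
    # cannot contain it, so it is never read.
    candidates = [c for c in chunks if c["min"] <= target <= c["max"]]
    # The deck of cards: each remaining chunk is handed to a separate worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(lambda c: search_chunk(c, target), candidates))
    return sum(counts), len(candidates)
```

For example, with `chunks = build_chunks(list(range(100)), 20)` the data is split into 5 chunks, and `lakehouse_style_count(chunks, 42)` only needs to scan the one chunk whose range covers 42. A plain Data Lake, with no metadata to consult, would have to read all five.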


So, you can imagine that performance and operating costs may differ between all three systems.  Data retrieval from a Data Lakehouse is not quite as fast as from a Data Warehouse, although this is constantly improving.  Retrieval in a Lakehouse can also be made faster, at a greater cost.


Because of the way that these systems are built differently, the cost to operate them also differs.  Which is cheaper?  It depends on your usage patterns.  If you remember that these providers usually separate out the cost of data storage from the cost of compute, you'll know to ask questions about what usage patterns your business has.  Sometimes a Data Warehouse is the right solution for your business.
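
To see why the answer depends on usage patterns, here is a small back-of-the-envelope sketch. All of the rates below are invented purely for illustration (they are not real provider prices), and the two pricing profiles simply mirror the trade-off described above: cheap storage with dearer compute versus dearer storage with cheaper compute.

```python
# Hypothetical monthly rates, invented for illustration only.
PRICING = {
    "lakehouse_like": {"storage_per_tb": 20, "compute_per_hour": 6},
    "warehouse_like": {"storage_per_tb": 100, "compute_per_hour": 2},
}


def monthly_cost(profile, stored_tb, compute_hours):
    """Total monthly cost = storage charge + compute charge."""
    rates = PRICING[profile]
    return stored_tb * rates["storage_per_tb"] + compute_hours * rates["compute_per_hour"]


def cheaper_option(stored_tb, compute_hours):
    """Return the name of the cheaper profile for a given usage pattern."""
    return min(PRICING, key=lambda p: monthly_cost(p, stored_tb, compute_hours))
```

Under these made-up numbers, a business storing 100 TB but querying for only 50 hours a month comes out cheaper on the lakehouse-like profile, while one storing 5 TB and querying for 1,000 hours comes out cheaper on the warehouse-like profile. The point is the shape of the calculation, not the figures.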


*ACID refers to atomicity, consistency, isolation, and durability, all of which are critical properties ensuring a transaction (a sequence of consecutive events and their consequent data updates) maintains data accuracy and consistency.
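
As a small illustration of atomicity and consistency, the sketch below uses Python's built-in SQLite database rather than a lakehouse product, so treat it as an analogy. The `accounts` table, its rule that balances cannot go negative, and the function names are all assumptions made for the example.

```python
import sqlite3


def make_accounts():
    """Create a tiny in-memory database with a 'no negative balance' rule."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move money as one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The balance rule was broken, so the whole transfer was undone.
        return False


def balances(conn):
    """Read back the current balance of every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

Here an attempt to move 500 from account A fails the balance rule part-way through, and the credit already applied to account B is rolled back with it, so the data is never left half-updated.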


If you found this insightful, you might want to check out our Resources Hub.


Related Posts:

  • My AI Work Buddies - explores the importance of data models as part of a future-proof strategy to leverage AI technologies.


Image Credits:

www.peakpx.com

www.oblivionstate.com
