Many enterprises are turning to data lakes to help manage ever-growing volumes of information.
Such large repositories allow organisations to gather and store structured and unstructured data before handing it on for further management and processing in a data warehouse, database or enterprise application, or passing it to data scientists and to analytics and artificial intelligence (AI) tools.
And, given the potentially vast volumes of data at play and the need to scale as the business grows, more organisations are looking at the cloud as a data lake location.
What is a data lake?
Data lakes hold raw data. From the data lake, data travels downstream – generally for further processing, or to a database or enterprise application. The data lake is where the business’s various data streams are gathered, whether that is supply chain, customer, marketing or inventory data, or sensor data from plant and machinery.
Data in a data lake can be structured, unstructured or semi-structured. Firms can use metadata tagging to help find assets, but the assumption is that data will flow onwards into specialist applications or be worked on by data scientists and developers.
Amazon Web Services (AWS) offers a good working definition – a data lake is a “centralised repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data”.
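As a rough illustration of that “store as-is” model, the Python sketch below (using the boto3 SDK) drops a raw sensor file into an S3-backed lake unchanged, with a few metadata tags to make it easier to find later; the bucket, key and tag values are hypothetical.

```python
# Minimal sketch: landing raw data in an S3-backed data lake "as-is",
# with metadata tags to make it findable later. Bucket, key and tag
# values are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

with open("2024-05-01.json", "rb") as raw:
    s3.put_object(
        Bucket="acme-data-lake-raw",            # hypothetical landing bucket
        Key="sensors/plant-7/2024-05-01.json",  # raw file stored unchanged
        Body=raw,
        Metadata={                              # free-form tags for discovery
            "source": "plant-7-telemetry",
            "schema": "none",                   # no structure imposed up front
            "owner": "operations",
        },
    )
```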
This contrasts with a data warehouse, where information is held in structured databases that employees and enterprise applications can access directly.
Cloud data lakes: key features
The key feature of a cloud data lake is its scale, followed closely by ease of management. The hyperscale cloud providers’ data lakes run on object storage, which offers practically limitless capacity. The only constraint is likely to be the enterprise’s data storage budget.
As with other cloud storage technologies, cloud data lakes can scale up and down, to allow customers to adjust capacity and therefore cost, according to business requirements. The hyperscaler is responsible for adding capacity, hardware and software maintenance, redundancy and security, and so takes that burden off the data science team.
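One common way to keep that flexibility affordable is to tier older raw data into cheaper storage classes automatically. The boto3 sketch below shows the idea for an S3-backed lake; the bucket name, prefix and retention periods are illustrative assumptions, not recommendations.

```python
# Sketch: keeping data lake costs in check by tiering older raw objects
# to cheaper storage classes. Bucket, prefix and retention periods are
# illustrative assumptions only.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="acme-data-lake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-sensor-data",
                "Filter": {"Prefix": "sensors/"},
                "Status": "Enabled",
                "Transitions": [
                    # move to infrequent-access storage after 30 days...
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # ...and to archival storage after a year
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```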
“Managed data lake services from cloud hyperscalers allow data engineering teams to focus on business analytics, freeing them from the time-consuming tasks of maintaining on-site data lake infrastructure,” says Srivatsa Nori, a data expert at PA Consulting.
“The high reliability, availability and up-to-date technology offered by cloud hyperscalers make managed data lake infrastructures increasingly popular, as they ensure robust performance and minimal downtime.”
Cloud providers also offer sophisticated access controls and auditing, he adds, as well as streamlined billing through tools such as resource tagging.
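As an illustration of that kind of cost tagging, the short boto3 sketch below applies cost-allocation tags to a data lake bucket so storage spend can be broken out per team or project in the provider’s billing tools; the tag keys and values are hypothetical.

```python
# Sketch: resource tags on a data lake bucket so storage costs can be
# attributed per team or project in billing reports. Tag keys and
# values are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_tagging(
    Bucket="acme-data-lake-raw",
    Tagging={
        "TagSet": [
            {"Key": "cost-centre", "Value": "data-platform"},
            {"Key": "environment", "Value": "production"},
        ]
    },
)
```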
And, although data lakes and data warehouses have so far been largely separate, they are moving closer together, either running on a single platform or as “data lakehouses”.
“In a modern data architecture, there is a place for the data lake and data warehouse as they serve complementary purposes,” says Nori. “The cloud provides a powerful environment to unify both approaches.”
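One simple way to see that convergence is to point warehouse-style SQL directly at files already sitting in the lake. The sketch below does this with Amazon Athena via the boto3 SDK; the database, table and S3 locations are hypothetical.

```python
# Sketch: exposing files in the lake as a SQL table so warehouse-style
# queries can run over them in place, using Amazon Athena. Database,
# table and S3 locations are hypothetical.
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS lake.sensor_readings (
    device_id string,
    reading   double,
    ts        timestamp
)
STORED AS PARQUET
LOCATION 's3://acme-data-lake-raw/sensors/curated/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "lake"},
    ResultConfiguration={"OutputLocation": "s3://acme-athena-results/"},
)
```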
Pros and cons of cloud data lakes
Most of the benefits of hyperscale cloud storage apply equally to cloud data lakes, including scale, flexibility and ease of management.
Organisations also avoid the need for upfront capital expenditure, and the long lead times that come from datacentre construction and hardware installation.
Against this, organisations need to consider a potential loss of control, especially over cost. The flexible nature of cloud storage means costs can rise if a data lake is used more heavily than expected. Data teams also need to factor in egress and possible bandwidth charges, especially as they move data “downstream” into databases and other applications.
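As a back-of-envelope illustration, the short sketch below estimates a monthly egress bill; the data volume and per-gigabyte price are assumptions for illustration only, so check the provider’s current price list and tiering.

```python
# Back-of-envelope sketch of egress cost when moving data "downstream"
# out of a cloud data lake. The per-GB price is illustrative only;
# check the provider's current pricing.
egress_gb = 5_000        # data moved out per month (assumption)
price_per_gb = 0.09      # illustrative internet egress rate, USD per GB

monthly_egress_cost = egress_gb * price_per_gb
print(f"Estimated monthly egress cost: ${monthly_egress_cost:,.2f}")
```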
Security, confidentiality and data sovereignty remain barriers for some organisations. Regulations can put limits on where organisations hold data, and raw unprocessed data can be highly sensitive. The hyperscalers now offer availability zones and geographical limits on where they hold customers’ data. CIOs and CDOs need to ensure those limits meet business requirements.
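Where residency matters, the lake’s storage can be pinned to a specific region at creation time. The boto3 sketch below shows the idea with an S3 bucket in AWS’s London region; the bucket name is hypothetical, and regulatory obligations still need legal and compliance review.

```python
# Sketch: pinning a data lake bucket to a specific region to help meet
# data residency requirements. Bucket name and region are assumptions.
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")  # London region

s3.create_bucket(
    Bucket="acme-data-lake-raw-uk",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
)
```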
Performance, though, is not usually a barrier for large-scale data lake projects, because heavy-duty processing takes place further downstream. Performance matters more at the data warehouse level, where block storage – either in the cloud or on premises – is used for database storage.
Hyperscalers’ data lake offerings
For enterprises building data lakes in the cloud, Microsoft offers Azure Data Lake Storage (ADLS), as well as Azure Synapse for analytics and Azure Purview for data governance. ADLS Gen2 combines the capabilities of ADLS Gen1 with Azure Blob storage, while Synapse works with structured and unstructured data to support data lakehouses.
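A minimal sketch of landing a raw file in ADLS Gen2, assuming the azure-storage-file-datalake and azure-identity Python SDKs; the account, filesystem and path names are hypothetical, and credentials are assumed to come from the environment.

```python
# Sketch: writing a raw file into Azure Data Lake Storage Gen2.
# Account, filesystem and path names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://acmedatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

filesystem = service.get_file_system_client("raw")
file_client = filesystem.get_file_client("sensors/plant-7/2024-05-01.json")

with open("2024-05-01.json", "rb") as data:
    file_client.upload_data(data, overwrite=True)  # raw file stored as-is
```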
AWS provides AWS Lake Formation to build data lakes on S3 storage. This combines with Athena, Redshift Spectrum and SageMaker for data access, analytics and machine learning.
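The boto3 sketch below shows the general pattern of registering an S3 location with Lake Formation and granting a role query access to a catalogued table; the ARNs, database and table names are hypothetical.

```python
# Sketch: registering an S3 location with AWS Lake Formation and
# granting an analyst role query access to a catalogued table.
# ARNs, database and table names are hypothetical.
import boto3

lf = boto3.client("lakeformation")

# Tell Lake Formation to manage this part of the lake
lf.register_resource(
    ResourceArn="arn:aws:s3:::acme-data-lake-raw/sensors/curated/",
    UseServiceLinkedRole=True,
)

# Grant a role SELECT access to a table defined over that location
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    Resource={"Table": {"DatabaseName": "lake", "Name": "sensor_readings"}},
    Permissions=["SELECT"],
)
```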
Google takes a slightly different approach, combining Google Cloud Storage with open source tools, BigQuery and Vertex AI. Google also offers BigLake, which can combine storage across GCP, S3 and Azure, as well as creating a unified architecture for data lakes and data warehouses, and what Google calls an “open format data lakehouse”.
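A rough sketch of the Google pattern, assuming the google-cloud-storage and google-cloud-bigquery Python SDKs: land raw files in Cloud Storage, define a BigQuery external table over them, then query with standard SQL. The project, bucket and dataset names are hypothetical.

```python
# Sketch: raw files land in Cloud Storage, then BigQuery queries them
# in place via an external table. Project, bucket and dataset names
# are hypothetical.
from google.cloud import bigquery, storage

# Land a raw file in the Cloud Storage "lake" bucket
gcs = storage.Client()
bucket = gcs.bucket("acme-data-lake-raw")
bucket.blob("sensors/plant-7/2024-05-01.parquet").upload_from_filename(
    "2024-05-01.parquet"
)

# Define a BigQuery external table over the files in the lake
bq = bigquery.Client()
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://acme-data-lake-raw/sensors/plant-7/*.parquet"]

table = bigquery.Table("my-project.lake.sensor_readings")
table.external_data_configuration = external_config
bq.create_table(table, exists_ok=True)

# Query the lake data with standard SQL
rows = bq.query(
    "SELECT COUNT(*) AS n FROM `my-project.lake.sensor_readings`"
).result()
for row in rows:
    print(row.n)
```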