A recent article in Forbes reported that 95% of businesses need to manage unstructured data in some form, with over 40% indicating that they must do so frequently. With the increasing volume of data generated and the countless data sources available in the digital age, the future of big data may well be inextricably tied to the effective use of data lakes.
A data lake is a cost-effective and configurable repository for storing both structured and unstructured data in a flexible format. Data can be sourced internally or externally. A data lake is like a data warehouse in that both act as repositories; however, the two are designed for very different purposes. While data warehouses work best for highly structured data with a clear schema, data lakes are optimized for unstructured, semi-structured, and structured data alike.
Data typically comes from many types of raw sources, and businesses rarely rely on a single source to make data-driven decisions. In fact, one survey on business decision-making found that most organizations consult roughly five internal sources before reaching a data-driven decision.
These raw sources range from structured data, such as relational databases (with rows and columns), to unstructured data, including emails, documents, social media content, images, audio, and video. There are also semi-structured sources such as comma-separated values (CSV) files and logs.
Data lakes flexibly combine all of these inputs, whether structured, semi-structured, or unstructured, into a single repository, making it easy for data scientists and business analysts to pull exactly what they need for analysis. Ultimately, this saves time, effort, and cost when businesses and other organizations need to make important decisions derived from the synergy of their raw data.
Once a data lake is properly formed, it provides enormous flexibility, savings, and scalability. It allows developers to break down data silos and combine different types of analytics to capture insights for better business decisions. However, manually setting up and managing a data lake can be an extraordinarily complex and time-consuming process.
This is where Amazon Web Services (AWS) Lake Formation can make a huge difference.
A Lake in a Cloud
While it may sound a little like a peculiar weather report, AWS Lake Formation is exactly what its name suggests: it allows developers to form a data lake in the cloud using Amazon's resources and services. According to Amazon, Lake Formation is a service that makes it easy to deploy a secure, centralized, and curated data lake in a matter of days rather than weeks.
Lake Formation simplifies the process of discovering, cleansing, transforming, and ingesting data into a data lake from various internal and external sources. The data lake stores data both in its original form and in forms prepared for analysis.
With this service a developer or data scientist can load data from diverse sources, monitor those flows, set up partitions, turn on encryption, define transformation jobs, reorganize data into columnar format, remove redundant data, and configure access and security settings—among many other functions.
A Look Inside AWS Lake Formation
Lake Formation provides a user with many features and functions. Under the hood it uses several Amazon services, such as AWS Glue to orchestrate jobs, cleanse, and transform data. It also uses AWS Identity and Access Management (IAM) and AWS Key Management Service (KMS) to secure data. Along with those services, it leverages Amazon Athena to query data and Amazon SageMaker to analyze data.
The general workflow or process for using Lake Formation is as follows:
- Ingest data
- Catalog, index, and partition data
- Cleanse and transform data
- Secure and encrypt data
- Grant data access to users with audit capabilities
- Query data for analytics and machine learning (ML)
A data lake can be built quickly by importing data from databases already running in AWS or from external sources. Once a user specifies the existing databases, Lake Formation imports the data into a new data lake and records the metadata, or schema, in a central catalog.
Data can be imported from MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon Relational Database Service (RDS) or hosted in Amazon Elastic Compute Cloud (EC2). A user can also pull semi-structured and unstructured data from Amazon Simple Storage Service (S3) sources. Lake Formation can also ingest data from on-premises databases by connecting over Java Database Connectivity (JDBC).
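To make the JDBC path concrete, here is a minimal sketch of the payload used to register a JDBC source with AWS Glue, which Lake Formation relies on for ingestion. The connection name, URL, and credentials are hypothetical placeholders; in practice the dict would be passed to boto3's `glue.create_connection()`.

```python
# Sketch of a Glue ConnectionInput payload for a JDBC source (assumed names).
# The real call is boto3: glue.create_connection(ConnectionInput=conn).

def jdbc_connection_input(name, jdbc_url, username, password):
    """Build the ConnectionInput payload for registering a JDBC source."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
    }

conn = jdbc_connection_input(
    "onprem-orders-db",                              # hypothetical name
    "jdbc:mysql://db.example.internal:3306/orders",  # hypothetical URL
    "ingest_user",
    "REDACTED",
)
print(conn["ConnectionType"])  # JDBC
```

Once the connection exists, a Glue crawler or job can read through it and land the data in S3 for the lake.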
Catalog, Index, and Partition Data
As was mentioned, when preparing imported data, Lake Formation crawls and reads the data sources to extract technical metadata or schema definitions. As part of this process, it creates a searchable catalog from databases and object storage that describes and classifies this information for users. This allows users to discover and understand the available datasets in the data lake. A centralized data catalog describes available data and appropriate usage.
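The crawling step above can be sketched as a crawler definition, the mechanism AWS Glue (and, by extension, Lake Formation) uses to scan S3 data and populate the central Data Catalog with extracted schemas. The role ARN, database, path, and schedule below are hypothetical; the real call is boto3's `glue.create_crawler(**crawler)`.

```python
# Sketch of an AWS Glue crawler definition (hypothetical names and ARNs).
# A crawler reads the S3 path, infers schemas, and writes table definitions
# into the catalog database so users can discover the datasets.

crawler = {
    "Name": "sales-raw-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    "DatabaseName": "sales_db",            # catalog database to populate
    "Targets": {"S3Targets": [{"Path": "s3://my-data-lake/raw/sales/"}]},
    "Schedule": "cron(0 2 * * ? *)",       # crawl nightly at 02:00 UTC
}
print(crawler["Targets"]["S3Targets"][0]["Path"])
```

The tables the crawler creates are what later show up in the searchable catalog.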
Custom labels can also be added at the table and column level to more precisely define attributes for users. To streamline use of the data lake, Lake Formation provides a text-based search over the metadata and labels so users can quickly find what they need to analyze.
Moreover, once data is ingested, Lake Formation can optimize its partitioning in S3 to improve performance and ultimately reduce costs. Raw data may be loaded into partitions that are too small or too large. To optimize, Lake Formation organizes data by size, time period, and/or relevant keys. The result is faster scans and parallel, distributed reads for common queries.
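Partitioning by time period or key boils down to the Hive-style S3 layout convention that Glue-cataloged tables use, where partition values are encoded in the object prefix so query engines can skip irrelevant data. A minimal sketch, with hypothetical bucket and table names:

```python
from datetime import date

# Illustrative sketch of Hive-style S3 partitioning (bucket/table names are
# hypothetical). Each key=value segment in the prefix is a partition column,
# letting engines like Athena prune whole prefixes when a query filters on
# date or region.

def partition_prefix(bucket, table, day, region):
    """Build an S3 prefix partitioned by date and region."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"region={region}/"
    )

print(partition_prefix("my-data-lake", "sales", date(2021, 3, 7), "us-east-1"))
# s3://my-data-lake/sales/year=2021/month=03/day=07/region=us-east-1/
```

A query filtered to `year = 2021 AND month = 3` would then only scan objects under the matching prefixes.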
Cleanse and Transform Data
Once data has been ingested, transformations can be performed on the raw data to get it into the appropriate format for the data lake. Transformations might involve rewriting data in various formats for consistency. The purpose of doing so is to ensure the data is stored in a manner that makes it easy to use for analytics.
The data transformations are performed by AWS Glue, and the transformed data is written in columnar formats to increase query performance. With Lake Formation, a user can also create custom transformation jobs using AWS Glue and Apache Spark.
Lake Formation also uses machine learning (ML) to deduplicate data by finding matching records within a single database or across two databases.
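Lake Formation's actual deduplication uses an ML-based matching transform in AWS Glue; the following toy sketch only illustrates the underlying idea of collapsing near-duplicate records onto a normalized key. All record values are made up for illustration.

```python
# Toy illustration of record deduplication by key normalization. The real
# service uses ML matching, which can catch duplicates this simple approach
# would miss (typos, reordered names, etc.).

def normalize(record):
    """Normalize name/email so trivially different duplicates collide."""
    return (
        record["name"].strip().lower(),
        record["email"].strip().lower(),
    )

def deduplicate(records):
    seen, unique = set(), []
    for rec in records:
        key = normalize(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com"},  # near-duplicate
    {"name": "Alan Turing", "email": "alan@example.com"},
]
print(len(deduplicate(rows)))  # 2
```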
Secure and Encrypt Data
Lake Formation also provides robust security management features. The service leverages the encryption capabilities of S3 for data in the data lake, with AWS Key Management Service (KMS) managing the keys for server-side encryption. Data is also encrypted in transit. Separate accounts are used for source and destination regions, which protects against malicious deletions by insiders. This is especially important for sensitive data, such as customer credit card numbers.
Overall, Lake Formation provides a secure foundation for all data in the data lake.
Grant Data Access to Users with Audit Capabilities
As part of its secure foundation, Lake Formation allows administrators to fully define and manage access controls for the data lake. The service provides central access controls driven by security policy-based rules that let administrators define users and applications by role. This also helps meet regulatory compliance requirements for datasets governed by statutes such as the Health Insurance Portability and Accountability Act (HIPAA).
When rules have been correctly defined, Lake Formation enforces access controls at both table- and column-level granularity for users of Amazon Redshift Spectrum and Amazon Athena. For AWS Glue, access is enforced at the table level only.
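A column-level grant can be sketched as the payload shape for Lake Formation's GrantPermissions API: here an analyst role receives SELECT on only two columns of an orders table, keeping PII columns out of reach. The account ID, role, database, table, and column names are hypothetical; in practice this dict is passed to boto3's `lakeformation.grant_permissions()`.

```python
# Sketch of a GrantPermissions payload with column-level scoping
# (hypothetical ARNs and names). The real call is boto3:
# lakeformation.grant_permissions(**grant).

def column_level_grant(principal_arn, database, table, columns):
    """Build a grant restricting SELECT to the named columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

grant = column_level_grant(
    "arn:aws:iam::123456789012:role/analyst",  # hypothetical role
    "sales_db",
    "orders",
    ["order_id", "order_total"],  # no access to customer PII columns
)
print(grant["Permissions"])  # ['SELECT']
```

Queries from this role through Athena or Redshift Spectrum would then only see the granted columns.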
Along with role-based access to the data lake, Lake Formation provides comprehensive, real-time audit logs through AWS CloudTrail to monitor access and demonstrate compliance with centrally defined policies. This lets administrators see which users attempted to access what data, with which services, and when.
Query Data for Analytics and Machine Learning (ML)
The final part of building a data lake with Lake Formation is to make data available to authorized users—such as business analysts and data scientists—who will leverage it with analytics and ML.
With appropriate security permissions, data can be accessed from the data lake by running queries with analytics and ML services like Amazon Redshift and Amazon Athena. The data can then be analyzed using a powerful tool such as Amazon SageMaker.
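A query against the lake can be sketched as the parameters for Athena's StartQueryExecution API, run against a table cataloged by Lake Formation. The database, table, and bucket names are hypothetical; with boto3 and credentials in place, the dict would be unpacked into `athena.start_query_execution(**params)`.

```python
# Sketch of StartQueryExecution parameters for querying the lake via Athena
# (hypothetical database, table, and bucket names). Athena writes query
# results to the S3 location given in ResultConfiguration.

def athena_query_params(sql, database, results_bucket):
    """Build the parameter dict for athena.start_query_execution()."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": f"s3://{results_bucket}/athena-results/"
        },
    }

params = athena_query_params(
    "SELECT region, SUM(order_total) AS revenue "
    "FROM orders GROUP BY region",
    "sales_db",
    "my-data-lake",
)
print(params["QueryExecutionContext"]["Database"])  # sales_db
```

Lake Formation's access controls apply at query time, so the same call returns only the tables and columns the caller has been granted.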
Lake Formation is a service that is going to be essential for businesses and organizations to make the best use of the prodigious amount of data being generated today.
This flexible cloud service from Amazon simplifies the process of building data lakes, reducing creation time from months to days. The service provides a central point of control from which users can identify, ingest, clean, and transform data from thousands of sources. It also enforces security policies across multiple services, helping organizations gain and manage new insights and make better data-driven decisions.
The JBS Quick Launch Lab
1/2 Day Assessment
Quantify what it will take to implement your next big idea!
Our intensive half-day session will deliver tangible timelines, costs, high-level requirements, and recommendations for the architectures that will work best. Let JBS show you why over 20 years of experience matters.