Prepping and cleaning data is a time-consuming task that can be tedious and not all that pleasant. To make matters worse, an enormous amount of skill and cerebral energy can also be expended on completing this mundane and iterative procedure necessary for analytics and machine learning projects.
In an effort to spot anomalies in data, highly-skilled engineers and developers can spend days or weeks writing custom workflows or programs before the data actually gets analyzed and real value is extracted from it. To illustrate the immense level of effort required, a report issued by Forbes on “Cleaning Big Data” estimated that data scientists spend approximately 80% of their time exploring data and cleaning it.
The data preparation phase, while absolutely necessary, is often a challenging one that has in the past frequently required advanced data engineering skills to clean, parse, and proof data. One data engineer at a major tech company wryly referred to his position as a “data janitor.” Another data professional remarked in a similar fashion that “Data science is 99% preparation, 1% misinterpretation.”
So, to vastly improve the lives of all the hardworking “data janitors” out there—including the many analysts, data scientists, business users, and data engineers that routinely work with datasets to extract important information and insights—Amazon Web Services (AWS) recently expanded the functionality of an earlier application in its arsenal to make data cleansing more accessible and user-friendly.
This new application—AWS Glue DataBrew—is a visual data preparation tool that enables customers to clean and normalize data without writing code or expending the copious amounts of energy previously required to accomplish this requisite task.
What is AWS Glue DataBrew?
AWS Glue DataBrew, an extension to the existing AWS Glue software, has taken data preparation to a whole new level that will allow many other professionals to participate in the data preparation and cleansing step using a variety of templates and drag-and-drop activities.
It allows non-coders—more frequently found on the data analysis side as opposed to developers—to ingest and prepare data from a multitude of different sources. While popular with data analysts and scientists, it is also a very useful tool for software developers and engineers because it minimizes the time spent on undifferentiated, repetitive, and routine tasks.
Once ingested and cleaned, data processed through Glue DataBrew can then be used for reporting, visualization, and analysis where critical insights about an organization can be extracted and appropriate decisions made.
Moreover, it allows teams to more easily collaborate when turning terabytes of raw data into useable intelligence without the need to create clusters or manage any infrastructure. Basically, this program streamlines the steps data engineers and data scientists carry out known as extract, transform, and load (ETL) prior to placing the data in a cloud repository.
Glue DataBrew is a newer component of AWS Glue (a serverless ETL platform). However, Glue DataBrew now allows for many of these same AWS Glue tasks to be more easily done in a low-coding environment.
Basically, Glue DataBrew provides the same cleansing operations without complex coding. Analysts, data scientists, business users, and engineers can simply point-and-click on buttons and check off boxes in a user-friendly, interactive visual interface. It is targeted to all technical levels to make it easy to understand, combine, clean, and transform data.
The tool allows users to connect to many data sources such as Amazon S3, define transformations, output to various file formats, and schedule jobs. Glue DataBrew uses a non-destructive process by making copies of all the data—always leaving the original data unaltered. Through its Application Programming Interface (API) it can also access and integrate with many existing data pipelines
Lastly, with Glue DataBrew, customers only pay for what they use and there is no upfront commitment or costly user licenses making it an extremely affordable data prep tool.
Decoding the Details of AWS Glue DataBrew
Glue DataBrew, rolled out by Amazon in 2020, is a user interface designed on top of the existing AWS Glue—first introduced in 2016 as an advanced data preparation and ETL service for data engineers. Glue DataBrew makes use of clean data visualizations and seamless integrations with other important AWS services. It has a host of advanced capabilities which allow for:
(1) Profiling data,
(2) Cleaning and normalizing the data,
(3) Mapping data lineage, and
(4) Automating future processes with new incoming data.
These four basic capabilities also represent a general workflow for using this robust tool to simplify the ETL process.
The first step in working with Glue DataBrew is to connect to a data source or (multiple sources). These can come directly from databases, data warehouses, and data lakes. Once connected, a user can then profile the ingested data and evaluate its overall quality for use. This includes understanding data patterns and detecting anomalies.
The data profiling function offers statistics about the dataset (the first 20,000 rows) such as the number of valid values, the number of missing values, the number of unique rows, value distributions in each column, correlations, and the top 50 unique values. It is not meant to be a comprehensive analysis, but rather a basic first glance or preview of the quality of the data in columns.
Users can also specify which data quality statistics to auto-generate for datasets when running a profile job. This allows users to customize the profile with only the statistics that meet their needs such as determining duplicate values, correlations, and outliers. Moreover, users can also set up automated data quality alerts that monitor the quality of incoming data over time and discover changes to the data within minutes.
Cleaning and Normalizing the Data
Once a user has profiled the data, the next step is usually to create a data transformation project and give it a name. A project usually consists of a series of saved steps in a data transformation job that is known as a “recipe” in Glue DataBrew. A recipe ensures consistency when a user is updating several datasets. Recipes can be edited to change, remove, or add various steps.
The steps in a recipe are really data transformation directions to clean and normalize the dataset. With Glue DataBrew, a user can also choose from 250 built-in transformation functions to visualize and clean data using an interactive, point-and-click visual interface.
These ready-made transformations automate data preparation tasks such as filtering anomalies, converting data to standard formats, and correcting invalid values. The AWS Big Data Blog reported that some of the most common data preparation transformations in Glue DataBrew are:
- Handling and inputting missing values
- Combining datasets
- Creating columns
- Filtering data
- Aggregating data
- Managing categorical values
- Managing numerical values
The 250 point-and-click transformations also include removing nulls, replacing missing values, fixing schema inconsistencies, and creating columns based on functions. Transformations can even be used to apply natural language processing (NLP) techniques to split sentences into phrases.
The tool also provides immediate previews so a user can visualize a portion of the data before and after transformation. This gives a user the ability to modify a recipe before applying it to the entire dataset.
After datasets have been cleaned and normalized, they can output into a file format that can then be used for analytics, reporting, and visualizations. The output file might even be used as the input or training data for machine learning.
Mapping Data Lineage
Glue DataBrew also has the capability to visually map the lineage of a user’s data. This allows the user to better understand the various data sources and transformation steps that have been applied to a given dataset.
The data lineage map is generally presented as an easy-to-read visual flowchart. It shows the progression from an input data file broken down by specific datasets. From there it shows the jobs or recipes that have been applied to the datasets followed by the output file. It allows the user to view the various stages of data transformation from start to end.
Basically, the map visually tracks all the data flows and changes to the data allowing a user to find the root cause in case there are errors in the output.
Automating Future Processes
Once complex data transformations are fully developed from projects, users are able to save and use them at scale with AWS EMR service that provides a platform for rapidly processing, analyzing, and applying ML to big data. All cleaning and normalization recipes built in Glue DataBrew can be automatically used in future incoming data operations.
These automated operations can save data analysts, data scientists, and ETL engineers an enormous amount of time when processing new data. If changes need to be made to the workflow, it is relatively easy to update the steps contained within a recipe. Changes will be automatically applied to new data run through them.
Unlike, some other tools, Glue DataBrew is very easy to operationalize and build repeatable workflows. It will work at scale without any code-based heavy lifting making it ideal for users of all technical levels.
The ability to effectively automate these processes aligns perfectly with the way data itself is now trending. Data is growing exponentially, originating from new and increasingly diverse sources, and used by many people with varied skillsets.
It was mentioned earlier that preparing and cleaning data can take up as much as 80% of the time spent on a data project. The Pareto principle or 80/20 rule states for many outcomes, roughly 80% of the consequences come from 20% of the causes known as the “vital few.”
Glue DataBrew’s capabilities and functionality are laying some serious groundwork to remove those “vital few” causes in data preparation for data analysts, data scientists, and ETL engineers. The result is likely to be a significant reduction in the time it takes to perform ETL while expanding the group of people that can directly clean and normalize data without coding in the future.
The JBS Quick Launch Lab
1/2 Day Assessment
Quantify what it will take to implement your next big idea!
Our intensive 1/2 day session will deliver tangible timelines, costs, high-level requirements, and recommend architectures that will work best. Let JBS show you why over 20 years of experience matters.