As mentioned in part one of this series, we'll be taking a look at the various data-centric ETL services that are offered on AWS. It isn't always obvious which service is the most appropriate for a particular use case or need, and this series of blog posts will hopefully make the use of these services clearer. Part one focused on AWS's Database Migration Service (DMS) and the kinds of scenarios it works well for. Part two focused on AWS Data Pipeline, and in this article, we will talk about AWS Glue.
AWS Glue is a fully managed, serverless ETL service. One key concept in that definition is the idea of Glue being serverless - at no point in using Glue do you provision any compute resources. There is no infrastructure at all that needs to be set up and managed - unlike Data Pipeline, for example. Resources are provisioned on demand based on what kind of work Glue is doing. Under the hood, Glue leverages AWS's EMR service for all of its compute and resource needs.
While Glue can be used for almost any general-purpose ETL task, it really excels at sourcing, transforming, and enriching operational or other kinds of enterprise data needed for the creation and management of data lakes and data warehouses, whether via batch or real-time streams. For example, AWS's data lake creation and management service, Lake Formation, makes very heavy use of Glue behind the scenes. Glue is also very strong at handling unstructured data. To handle these kinds of scenarios, Glue's four major components consist of a metadata repository known as the Glue Data Catalog, an ETL engine that generates either Scala or Python code, Triggers, which are a highly flexible way to schedule your ETL jobs, and Workflows, which allow you to create complex ETL pipelines. Let's take a look at each of these four areas.
The Glue Data Catalog is a central repository for storing all the metadata about the data stores you use as sources and targets of your ETL jobs. This metadata consists of all the schema information Glue needs to connect to and read your data sources. The most important objects in the Glue catalog are:
- Databases: These are containers for tables, and can consist of tables from various data sources.
- Tables: Contain all the necessary metadata to define a table that will store data from various data stores. Glue will also take care of table partitioning if needed.
- Connections: This object stores all the connection information needed to connect to a specific data source.
- Crawlers: Crawlers are the Glue objects that populate the catalog with tables, and are generally the preferred method for doing so. You point a crawler to a data source (or multiple data sources), where it will crawl that source (similar to how a web crawler indexes web sites) and retrieve all the schema information needed to populate the Glue catalog. The crawler will also note any schema changes and perform metadata updates as needed. Crawlers can be scheduled or run on demand. Crawlers have far too many options and configuration properties to cover in detail here; please check them out in the official documentation (link below).
Keep in mind, the catalog doesn't store the actual data itself, just the schema information that will be used by the actual ETL processes. So for a table, it will store the column names, data types, and so on. If you're familiar with Hadoop, the Glue catalog serves a similar function as the Hive Metastore.
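To make the crawler concept concrete, here is a minimal sketch of how a crawler can be defined through Glue's API via boto3. The bucket path, role ARN, and names below are hypothetical placeholders, not real resources:

```python
# Sketch: a crawler definition as a boto3 create_crawler payload.
# All names, ARNs, and paths are illustrative placeholders.
crawler_definition = {
    "Name": "sales-data-crawler",                               # hypothetical crawler name
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder IAM role
    "DatabaseName": "sales_catalog_db",                         # catalog database to populate
    "Targets": {
        "S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]
    },
    # Run every night at 2 AM UTC (Glue schedules use cron syntax)
    "Schedule": "cron(0 2 * * ? *)",
    # Update tables in place on schema changes; log (rather than delete)
    # tables whose underlying objects have disappeared
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}

# With AWS credentials configured, this would be submitted as:
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_definition)
```

Once created, the crawler can be started on demand with `start_crawler(Name=...)` or left to run on its schedule.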
Once you have your catalog populated, you can start authoring and implementing your ETL jobs. In Glue, a Job contains all the business logic and transforms that your ETL process will include - it extracts data from sources, transforms it, and then loads it into your targets. Jobs can be run on a schedule, in response to events, or on demand.
Authoring a basic job in Glue is relatively straightforward and consists of several steps:
- Choose the tables and data sources used for your job. These must be defined and exist in the Glue catalog. This also includes the defined connections used by these sources that are in the catalog as well.
- Once the data sources have been defined, you can choose your job's target. Unlike the sources, the target tables do not need to exist in the catalog beforehand - they can be created at run time.
- Based on your sources/targets (and some other options you can set up), Glue will generate either a PySpark or Scala script for you. Under the covers, this is the script that Glue will submit to the EMR cluster that it automatically provisions for you.
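The same job registration can be done through the API. Below is a hedged sketch of a job definition as a boto3 create_job payload; the job name, role, and script location are hypothetical:

```python
# Sketch: registering a Glue job via boto3. Names and ARNs are placeholders.
job_definition = {
    "Name": "orders-etl-job",                                  # hypothetical job name
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",      # placeholder IAM role
    "Command": {
        "Name": "glueetl",                                     # Spark ETL job type
        "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    # Enable job bookmarks so reruns only process new data (discussed below)
    "DefaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
    "GlueVersion": "3.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 2,
}

# With credentials configured:
#   glue = boto3.client("glue")
#   glue.create_job(**job_definition)
#   glue.start_job_run(JobName="orders-etl-job")   # run it on demand
```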
Once the ETL script has been generated, you can edit it as needed, just as you would any other Python/Scala script. However, one of the most compelling features of Glue is the library of powerful data transformations that come right out of the box. Common scenarios such as dropping null fields, flattening JSON (or other types of semi-structured data) into table structures, filtering records, and joining data from two completely different data sources are just some of the scenarios - among many others - that Glue can handle for you out of the box. (For the full list of transforms, please check out the official docs.) In addition to working with or editing the script that Glue generates for you, you can also write your own Python/Scala scripts from scratch. Another important feature that Glue jobs support is Bookmarks. Bookmarks allow your jobs to track what data they've already processed via timestamps. This helps when determining start and stop points for incremental workloads, and retry points as part of Glue's fault-tolerance mechanisms. Aside from what has been discussed here, Glue offers a variety of advanced functionality when it comes to writing ETL scripts, making it an incredibly robust and feature-rich tool for creating ETL pipelines.
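To illustrate the flattening scenario mentioned above, here is a plain-Python sketch of the kind of work Glue's built-in Relationalize transform does to nested JSON. In a real Glue job this would operate on a DynamicFrame rather than plain dicts; this is just a minimal illustration of the idea:

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted column names,
    the way semi-structured JSON gets mapped onto a flat table schema."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

# A nested record, as it might arrive from a JSON source
nested = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Paris"}}}

print(flatten(nested))
# {'id': 7, 'customer.name': 'Ada', 'customer.address.city': 'Paris'}
```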
With Glue, a Trigger is how you start your jobs and crawlers. There are a few different kinds of triggers you can fire:
- Scheduled triggers, in a similar fashion to AWS's Data Pipeline service, allow you to create a cron-based schedule for your job to run. You can set the frequency, time, day of week, etc. when creating a schedule. Once set, the trigger will kick off your job based on the defined schedule.
- An on-demand trigger fires once, as soon as you activate it. It's the most basic type of trigger and is very helpful during the testing and debugging phase of writing your scripts.
- Conditional triggers fire when certain statuses are reached. When a job (or crawler) runs, it passes through various statuses, such as SUCCEEDED, STOPPED, FAILED, or TIMEOUT. You can configure a conditional trigger to watch for jobs that hit these statuses, and when they do, the trigger will execute your specified job in response. It's possible to set up complex pipelines for your jobs based on conditional triggers.
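As a rough sketch, here is what a scheduled trigger and a conditional trigger look like as boto3 create_trigger payloads. The job and trigger names are hypothetical:

```python
# Sketch: the two most common trigger shapes. Names are placeholders.

# Scheduled: run the ETL job at 01:30 UTC every day
scheduled_trigger = {
    "Name": "nightly-run",
    "Type": "SCHEDULED",
    "Schedule": "cron(30 1 * * ? *)",
    "Actions": [{"JobName": "orders-etl-job"}],
    "StartOnCreation": True,
}

# Conditional: fire a reporting job only after the ETL job reaches SUCCEEDED
conditional_trigger = {
    "Name": "run-report-after-etl",
    "Type": "CONDITIONAL",
    "Predicate": {
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "orders-etl-job",
            "State": "SUCCEEDED",
        }]
    },
    "Actions": [{"JobName": "reporting-job"}],
}

# With credentials configured, each would be created via:
#   boto3.client("glue").create_trigger(**scheduled_trigger)
```

Chaining several conditional triggers like the second one is how the complex pipelines mentioned above get built.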
A Workflow in Glue is basically a container for a set of ETL scripts involving multiple jobs, crawlers, and triggers that come together to create a complex ETL pipeline. Workflows are Glue's main tool for creating data pipelines that have a highly sophisticated set of steps or dependencies, containing multiple triggers and sub-pipelines. Workflows can be created, edited, and managed in the AWS console. From the console, you can design and build out your workflow visually, which makes it easier to see the complex inter-dependencies and logic between triggers, jobs, and crawlers. Workflow run properties are a mechanism through which you can store shared state among a workflow's various components - they allow these components to communicate with each other and share data during their execution. Another really nice feature of workflows is the rich monitoring and logging that are available. While a workflow is running, you can visually inspect any of its components and view all the metrics being emitted from each component in the workflow. In addition to being able to see into running workflows visually, workflows also integrate with AWS CloudWatch for additional logging and monitoring.
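Here is a hedged sketch of how run properties let one job in a workflow pass state to another via boto3. Inside a real Glue job the workflow name and run ID arrive as job arguments (WORKFLOW_NAME / WORKFLOW_RUN_ID); the values below are placeholders:

```python
# Sketch: sharing state between workflow components via run properties.
# Workflow name and run ID are illustrative placeholders.
workflow_name = "orders-pipeline"
run_id = "wr_0123456789abcdef"

# An upstream job could publish the partition it just wrote...
publish = {
    "Name": workflow_name,
    "RunId": run_id,
    "RunProperties": {"latest_partition": "dt=2021-06-01"},
}
#   boto3.client("glue").put_workflow_run_properties(**publish)

# ...and a downstream job in the same workflow would read it back:
#   props = glue.get_workflow_run_properties(Name=workflow_name, RunId=run_id)
#   partition = props["RunProperties"]["latest_partition"]
```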
Other notable features and functionality
In addition to the main features above, Glue has much more to offer - here are some other features worth calling out:
- Security: Glue offers encryption at rest for all your crawlers, jobs, and data catalog objects. This is handled through the AWS Key Management Service, with which Glue has very tight integration. For data in transit, Glue uses SSL to make sure your data stays encrypted during any kind of data movement within your ETL pipelines. Of course, Glue integrates with IAM, and IAM users, groups, resources, and policies are fully integrated into all of Glue's functionality.
- CloudWatch and CloudTrail integration: Although CloudWatch was briefly mentioned while discussing workflows, both CloudWatch and CloudTrail are holistically integrated into the Glue service as a whole. All of Glue's objects and components emit various metrics and metadata that can be monitored, logged, and audited in real time during workflow, job, or crawler execution.
- Glue Schema Registry: This is a newer Glue feature that allows for streaming schema management. Using a known, formal schema for your streaming data has many benefits (especially around data governance), and now Glue can centrally store and manage all of the schemas used by your streaming ETL workloads and jobs. Furthermore, it supports Avro and integrates with Kafka right out of the box.
- Glue Studio: Glue Studio is a fully featured graphical IDE that allows you to easily build and author complex ETL pipelines. Glue Studio also works natively with semi-structured data. In addition to being able to graphically incorporate complex data transformations and visually design your workflows, Glue Studio comes out of the box with a job performance and monitoring dashboard that gives you access to your jobs' runtime statistics and other key metrics.
- Glue DataBrew: DataBrew is another visual tool in the Glue ecosystem, geared specifically towards data preparation. Taking raw data and then cleansing and normalizing it for further use downstream is what DataBrew excels at. By allowing this part of the data pipeline to be done in a simple, straightforward, and visual fashion, the amount of time needed for these activities is drastically reduced (according to AWS, by as much as 80%). DataBrew comes out of the box with over 250 transformations that can be immediately applied to your data. DataBrew can also make recommendations and suggestions that help identify data quality issues, which in many cases may be difficult to find and fix. Leveraging DataBrew's suggestions is yet another way to reduce the time spent on the data cleaning/preparation step.