As mentioned in part one of this series, we're taking a look at the various data-centric ETL services offered on AWS. It isn't always obvious which service is the most appropriate for a particular use case or need, and this series of blog posts will hopefully make choosing between them clearer. Part one focused on AWS's Database Migration Service (DMS) and the kinds of scenarios it works well for. This article will focus on a service that handles far more sophisticated ETL scenarios and is a step up in complexity - AWS's Data Pipeline service.
Data Pipeline is a fully managed service based on a "pipeline" model that allows for various data movement and transformation tasks. Pipelines can be thought of as workflows that can incorporate various kinds of business logic or rules to help transform data as it moves to its ultimate destination. Data Pipeline allows these workflows to be scheduled in a flexible cron like manner, and also allows for more complex or branching logic setups based on preconditions or success/failure scenarios. Retry and alerting logic can also be configured for each of the various tasks that comprise a specific pipeline. For development, Pipelines can be authored programmatically through the AWS SDKs, through JSON pipeline definition files, or graphically through the AWS Console.
A pipeline consists of a pipeline definition, which is made up of pipeline components, task runners, and scheduling information. The pipeline definition contains all the instructions and logic Data Pipeline needs to execute the workflow of specific tasks (more on these tasks below). As mentioned, the AWS console provides a GUI for authoring definitions, where you can select different pipeline objects and components from a menu, and then wire up the necessary inputs, outputs, and properties to define your workflow. Pipeline definitions can also be constructed as JSON files, which allow for more flexibility and provide an easy way to store all your pipeline definitions in a source control system. Any pipeline authored in the GUI console can also be exported as a JSON file.
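To make the JSON format concrete, here is a minimal sketch of a definition file containing just the shared `Default` object and a daily schedule. The bucket name and IAM role names are illustrative placeholders, not values from a real account:

```json
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "cron",
      "schedule": { "ref": "DailySchedule" },
      "failureAndRerunMode": "CASCADE",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "pipelineLogUri": "s3://example-bucket/pipeline-logs/"
    },
    {
      "id": "DailySchedule",
      "name": "Run once per day",
      "type": "Schedule",
      "period": "1 day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    }
  ]
}
```

Every other object in the definition inherits the properties set on `Default`, which keeps per-activity configuration short.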
The heart of each pipeline definition, and the bulk of the work done when authoring pipelines, is selecting and configuring the proper pipeline components and objects. These components essentially contain all the ETL logic needed for each pipeline. There are two main types of components - Activities and Data Nodes. Activities determine what kind of work should be executed, and data nodes describe all the source data inputs and target data outputs. Most pipelines are designed in such a way that data flows from one activity to the next via linked data node inputs and outputs.
Common Data Pipeline Activities
Here are brief descriptions of some of the most common Data Pipeline activities:
- CopyActivity: This activity copies data from one location to another. A CopyActivity can be configured with either a SqlDataNode or an S3DataNode (more information on nodes below). As an example, you can set up a CopyActivity with a SqlDataNode that points to a PostgreSQL database as the input, and an S3DataNode for a specific S3 bucket configured as the output. This will then copy the data specified in the SqlDataNode to a file in the specified S3 bucket, based on the configured schedule options.
- EmrActivity: This activity runs work on an AWS EMR cluster; this can be an existing cluster, or Data Pipeline can provision (and manage) a cluster on your behalf.
- RedshiftCopyActivity: This activity will copy data from either S3 or DynamoDB to AWS Redshift. The data can be output to either a new or an existing table in the targeted Redshift cluster.
- ShellCommandActivity: This activity runs a shell script (stored in an S3 bucket) or a one-off shell command. These commands can also be given their own schedule, independent of the schedule for the pipeline itself.
- SqlActivity: This runs a SQL query or script on the targeted data source.
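Putting the CopyActivity scenario described above into JSON form, the fragment below sketches a PostgreSQL-table-to-S3 copy. The table name, bucket path, and the referenced `SourceDatabase` and `Ec2Instance` objects are hypothetical and would need to be defined elsewhere in the same file:

```json
{
  "objects": [
    {
      "id": "SourceOrdersTable",
      "type": "SqlDataNode",
      "table": "orders",
      "selectQuery": "select * from orders",
      "database": { "ref": "SourceDatabase" }
    },
    {
      "id": "OrdersOutputFile",
      "type": "S3DataNode",
      "filePath": "s3://example-bucket/exports/orders.csv"
    },
    {
      "id": "CopyOrdersToS3",
      "type": "CopyActivity",
      "input": { "ref": "SourceOrdersTable" },
      "output": { "ref": "OrdersOutputFile" },
      "runsOn": { "ref": "Ec2Instance" }
    }
  ]
}
```

The `input`/`output` references are what wire data nodes to activities; the same pattern applies to the other activity types.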
While all of these activities have specific configuration options relevant to the nature of each activity, many of them share common configuration options for handling the operational aspects of the workflow. Logging, retry mechanisms, timeouts, and on-success and on-failure actions are some of the most common operational properties that can be configured for each of these tasks, and there are options for configuring alerts and notifications as well. Configuring these options helps create the logic the workflow needs to execute (branching logic, parallel flows, etc.)
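As a brief illustration of these operational properties, the fragment below attaches retries, a timeout, and an on-failure SNS notification to an activity. The activity and node IDs carry over the earlier hypothetical example, and the SNS topic ARN is a placeholder:

```json
{
  "objects": [
    {
      "id": "CopyOrdersToS3",
      "type": "CopyActivity",
      "input": { "ref": "SourceOrdersTable" },
      "output": { "ref": "OrdersOutputFile" },
      "runsOn": { "ref": "Ec2Instance" },
      "maximumRetries": "3",
      "attemptTimeout": "1 hour",
      "onFail": { "ref": "NotifyOnFailure" }
    },
    {
      "id": "NotifyOnFailure",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
      "subject": "Orders copy failed",
      "message": "The CopyOrdersToS3 activity failed after all retries."
    }
  ]
}
```

An `onSuccess` action can be configured the same way, which is one way to chain notification or branching behavior off a task's outcome.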
Common Data Pipeline Data Nodes

Activities provide the business logic and define the flow for each of our pipelines, but Data Nodes provide all of the sources and destinations. Here are some of the most common types of Data Nodes that Data Pipeline includes out of the box:
- DynamoDBDataNode: This type of node defines a DynamoDB data source, which in Data Pipeline, is going to typically be used as an input for an EmrActivity.
- RedshiftDataNode: This defines a Redshift cluster and table name for use by various kinds of activities as either input or output.
- S3DataNode: This type of node allows for defining S3 buckets as inputs or outputs to the various types of available activities. The S3DataNode is configured to use server-side encryption by default (this can be disabled though).
- SqlDataNode: This node enables running a select query against a table on a configured database resource (the database is defined as part of the node's configuration). Along with the S3DataNode, this is one of the most commonly used nodes in production.
By configuring different activities with different data nodes, you will be able to construct fairly complex ETL or business logic for each of the pipelines you are authoring.
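As one more example of combining activities with data nodes, the fragment below sketches a RedshiftCopyActivity loading staged S3 files into a Redshift table. The bucket path, table name, and the referenced `RedshiftDatabase` and `Ec2Instance` objects are illustrative assumptions:

```json
{
  "objects": [
    {
      "id": "StagedFiles",
      "type": "S3DataNode",
      "directoryPath": "s3://example-bucket/staging/orders/"
    },
    {
      "id": "OrdersTargetTable",
      "type": "RedshiftDataNode",
      "tableName": "public.orders",
      "database": { "ref": "RedshiftDatabase" }
    },
    {
      "id": "LoadOrdersToRedshift",
      "type": "RedshiftCopyActivity",
      "input": { "ref": "StagedFiles" },
      "output": { "ref": "OrdersTargetTable" },
      "insertMode": "TRUNCATE",
      "runsOn": { "ref": "Ec2Instance" }
    }
  ]
}
```

The `insertMode` property controls how the load treats existing rows in the target table (e.g., truncate-and-load versus append).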
Expressions & Functions
Another important feature available for use with activities and data nodes is expressions and functions. Expressions allow values to be shared across pipeline objects and are evaluated by Data Pipeline at runtime; they let you refer to other objects and their properties while the pipeline runs. For example, a file name might be generated dynamically at runtime, and at authoring time you can use an expression so that other tasks refer to that file. In addition to expressions, most pipeline object properties support a small library of functions that can be used to build expression values or to supply properties that need dynamic values at runtime. Math, string, and date/time functions are available.
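To show what this looks like in a definition, the hypothetical data node below uses an expression to build a date-stamped S3 path from the runtime property `@scheduledStartTime` and the built-in `format` date/time function:

```json
{
  "id": "DailyOrdersOutput",
  "type": "S3DataNode",
  "filePath": "s3://example-bucket/exports/#{format(@scheduledStartTime, 'YYYY-MM-dd')}/orders.csv"
}
```

Each scheduled run of the pipeline resolves the `#{...}` expression to that run's start date, so every execution writes to its own dated prefix.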
Data Pipeline occupies an odd space in AWS's ETL lineup. It was one of the very first ETL/workflow tools AWS released, and as such it is showing its age. A lot of its functionality has also been subsumed by more recent AWS services (Glue, Step Functions, Managed Workflows for Apache Airflow). However, Data Pipeline is still a robust, capable tool for many ETL scenarios. One major benefit is the ease it brings to authoring and coordinating scheduled workflow tasks. Another nice feature is that every pipeline you create is fault-tolerant, repeatable, and highly available. If your workflows are mainly geared around getting data into and out of RDS, S3, and Redshift in a scheduled or parallel manner, or around using these systems in a non-data-lake architecture, then Data Pipeline may be the right choice for you.
Next on our agenda will be AWS Glue...