AWS has an incredible array of data-centric services. Some of these services have functionality that overlaps with one another and some services may take two different approaches for the same result. Of course, there are also services that have very clearly defined use cases and it's obvious when to use them vs. others. AWS has various services that can perform ETL operations, or be included as part of an overall larger ETL process and we've noticed this has caused some confusion over which one might be appropriate to use or which service might be ideal for a given scenario or situation. This article hopes to cut through some of this confusion and to provide a high-level overview of some of these available services and high-level guidance as to which one would be appropriate to use for a given need or scenario. The first service we'll be discussing will be AWS's Data Migration Service.
Data Migration Service
Out of all the options we'll be exploring, AWS DMS (Database Migration Service), is the simplest to set up and operate. It's also the least flexible among the set of ETL-like tools that AWS offers. The name says it all - Database Migration... its purpose is to take data in a source database and move it to a target database - and that's it. If you need to do serious data transformations, workflow pipelines, or other processes with your data, then this is not the appropriate service for you. For pure migratory tasks, however, this is the ideal tool in the AWS ecosystem.
Essentially, DMS works by setting up a replication process between a source database and a target database. In DMS terms, the source/target data sources are called Endpoints, and the replication process is called a Replication Task (of which you may have more than one). One thing of note, however, is that at least one of the endpoints must be in AWS on RDS or EC2 - DMS does not support on-prem to on-prem migrations. Aside from that one stipulation, DMS is flexible in what it allows for endpoints. DMS supports quite a wide range of data stores (Oracle, MySQL, SQL Server, PostgreSQL, etc.) and replication scenarios - you can replicate from any supported datastore from one to one another for both homogeneous (same source/target platforms) and heterogeneous (differing source/target platforms) scenarios.
Replication tasks in DMS work in a very straightforward fashion. They consist of a series of phases that ultimately end when all data has been migrated in a satisfactory fashion. The first phase is a full data load from the source tables to the destination tables. During the initial full data load (especially with larger tables), it is possible for the source table's data to change. When this happens, DMS will cache those changes and capture the data so they can be applied to the destination tables. This is what the second phase of a replication task handles. At this point, the tables will be in sync, and the third phase can start - ongoing replication. This phase is responsible for all ongoing data synchronization between the source and destination tables and is also known as CDC (change data capture). At this time you can continue to keep the endpoints in sync or fully move onto the target system and decommission the source. DMS also allows you to enable a Multi-AZ option during your replication tasks enabling fault tolerance for important data.
To help ease the migration process, DMS also allows for both table selection rules and transformation rules. Table selection rules allow for deciding which tables you want to migrate and the order they can be migrated in, helping with dependent data. They can also be used to establish filters that determine which records to migrate over - similar to a SQL Where clause. Transformation rules allow for simple expressions that can be used to help transform tables or views. Schema and table renaming, column renaming, and adding columns are just some of the options available for transformation rules. By using expressions with transformation rules, you can create some simple data expressions at the column level - string concatenation, basic math expressions, and other simple data transformations.
Schema Migration Tool
Another important part of DMS is schema, view, stored procedure, and other secondary database object migration. For migrating this information DMS uses the AWS Schema Migration Tool (SMT). This is a tightly integrated tool with DMS that handles these tasks. (However, if your endpoints are the same platform, you can also use that platform's native tools.) A nice feature of this tool is that it will automatically translate flavors of SQL between two different endpoint platforms. Where it can't do the conversion automatically, it will flag that particular piece of SQL, and allow manual intervention. SMT can also help with migrating table indexes, keys, and constraints in a heterogeneous scenario as well.
Here are some other important aspects of note for AWS DMS:
- Ongoing replication can start at any CDC point - it doesn't need to start right after or be
a part of a full replication process.
- Security - DMS uses AWS Key Encryption Service for your data at rest, and in-flight data can be
secured using SSL.
- DMS works with LOB and BLOB data too and has a variety of options for handling those data types
- DMS integrates with CloudWatch, CloudTrail, and SNS so it's easy to monitor and log what your migration
is doing during its operations. This also includes task statuses and updates in the DMS AWS console.
- DMS also works with source data in flat files on S3, and can also use S3 as a target. In addition, you
can migrate to the following non-relational database endpoints: Redshift, DynamoDB, ElasticSearch
service, Kinesis Data Streams, DocumentDB, Neptune, and AWS Managed Streaming for Kafka.
- DMS also has a JSON-based rule-set for data validation. This allows a pretty robust set of options
for making sure all data migrated has been properly validated and is correct.
- DMS can take data from multiple source endpoints and replicate it to a single target endpoint. This is a great
way to consolidate various existing databases into one data store.
To wrap up with DMS, here is a summary of typical use cases (as taken from AWS):
- Homogeneous database migrations. These are migrations where the source and target endpoints are the same or
compatible platforms. Examples of this could be an on-prem Oracle database to an Oracle database hosted on AWS
RDS, or an on-prem MySQL database to an AWS Aurora database.
- Heterogeneous database migrations. These are migrations where the source and target database engines are different.
An example of this could be going from an on-prem SQL Server database to an AWS RDS PostgreSQL instance. Unlike
the scenario above, these scenarios will also leverage the AWS Schema Conversion Tool.
- Development and testing. DMS makes it incredibly easy to set up development or test instances of databases or help
build staging and QA environments. This can also be used to get data into and out of the cloud in a way that's easy,
fast and secure.
- Database consolidation. As mentioned, you can set up multiple source endpoints for one target endpoint. This allows
for simple and fast database consolidation - whether you're migrating to the cloud or on-prem. This also enables a
smaller operational footprint and easier database management.
- Continuous data replication. Continuous replication is a great way to make sure data sources are kept in sync. Dev/test
environments, disaster recovery, and synchronizing data between geographic regions are just some of the many use
cases for this scenario.
For further and more detailed information on AWS DMS, please take a look at the following links:
Next week, we'll be taking a look at AWS Data Pipeline...