As engineers working with enterprise hybrid cloud architectures, we occasionally must work with legacy corporate IT systems. Oftentimes these integrations are not elegant or ideal - dealing with NFS or SFTP in application code is few people's idea of fun or productive work. The protocols used in legacy systems can be large and arcane, and the open-source implementations (where they exist) are typically single-author, single-purpose, unmaintained, and far from robust. The error messages can be cryptic, and troubleshooting failed connections or transfers can lead to many sleepless nights and premature grey hairs.
Sometimes we even write bespoke code for something that should ideally be an infrastructure concern, increasing our tech debt and maintenance burden. When working with something like NFS, overall network security takes a hit as well. Connecting to an internal corporate NFS share from an AWS network requires both sides to poke many holes in their firewalls, across both TCP and UDP, and most corporate security teams will not be pleased about that.
The problem is compounded for those of us who enjoy more modern systems architectures like event-driven Lambdas. Where would you mount an NFS share while running a Lambda? How do you handle large file listings or exceptionally large file transfers without your Lambda hitting the execution timeout? What if you need to transfer many thousands of files quickly?
The complexity can quickly spiral out of control.
Luckily, Amazon has come to our rescue! DataSync is a new offering that provides secure, robust, and simple file transfer solutions for integrating with legacy and modern systems alike. It provides the ability to transfer entire directory trees and their metadata between S3 buckets, NFS shares, EFS volumes, SMB shares, and Amazon FSx. It supports incremental file transfers, deleted file handling, and scheduling with cron-like syntax. It is also amazingly fast; the agents are multithreaded applications that are likely to be far more performant and robust than a quick-and-dirty solution worked out under a tight deadline.
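To make the incremental-sync and scheduling features concrete, here is a minimal sketch of what a DataSync task definition can look like when driven from Python with boto3. The ARNs and cron expression are placeholders, not values from our actual setup:

```python
# Sketch of a scheduled, incremental DataSync task definition.
# The location ARNs below are hypothetical placeholders; in practice you
# would first create an NFS source location and an S3 destination location.

def build_task_request(source_arn, dest_arn, cron_expression):
    """Build the parameters for DataSync's CreateTask API call."""
    return {
        "SourceLocationArn": source_arn,
        "DestinationLocationArn": dest_arn,
        "Options": {
            # CHANGED = incremental: only transfer new or modified files
            "TransferMode": "CHANGED",
            # Handle deletions: remove files at the destination that were
            # deleted at the source
            "PreserveDeletedFiles": "REMOVE",
            # Verify integrity of the files that were transferred
            "VerifyMode": "ONLY_FILES_TRANSFERRED",
        },
        # Cron-like scheduling, handled by DataSync itself
        "Schedule": {"ScheduleExpression": cron_expression},
    }

request = build_task_request(
    "arn:aws:datasync:us-east-1:123456789012:location/loc-nfs-source",
    "arn:aws:datasync:us-east-1:123456789012:location/loc-s3-dest",
    "cron(0 2 * * ? *)",  # nightly at 02:00 UTC
)
# With boto3 installed and credentials configured:
#   boto3.client("datasync").create_task(**request)
```

Once the task exists, DataSync runs it on the schedule with no Lambda, cron box, or custom transfer code on our side.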
Our team recently implemented DataSync for a hybrid data processing workflow that was previously handled by flaky NFS libraries, and the results are amazingly robust and performant. Our old workflow consisted of Lambdas that would run on a CloudWatch schedule, connect to an NFS server, and copy files to an S3 bucket to be further processed by other Lambdas. It was written in Python and used an open-source implementation of NFS.
Sounds simple enough, right? Unfortunately, it required us to support a fork of this mostly unmaintained library with changes needed for our use case, open many obscure ports in our on-premise corporate firewall, and it was plagued with performance issues. Troubleshooting connection and transfer errors was a herculean effort, requiring the involvement of many teams and wire-level network traces.
With DataSync now on the scene, we were able to install a single agent in our on-premise network and replace our crusty old NFS library and attendant Lambda code with a single, simple job that is far more robust and performant than we could've reasonably achieved ourselves. DataSync simply runs on a schedule and incrementally syncs an on-premise directory structure into an S3 bucket, where our processing Lambda picks it up. No obscure legacy protocols, no hard-to-diagnose issues, less code to support - what's not to love?
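The "picks it up" step can be as simple as a Lambda subscribed to `s3:ObjectCreated` events on the destination bucket. The handler below is a hypothetical sketch, not our production code; the event shape is the standard S3 notification format:

```python
# Hypothetical sketch of a downstream processing Lambda, assuming it is
# triggered by s3:ObjectCreated notifications on the bucket that DataSync
# syncs into.
import urllib.parse


def handler(event, context):
    """Collect the (bucket, key) pairs for the objects DataSync delivered."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes keys in event notifications; decode before use
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
        # ...fetch and process each file here, e.g. with boto3's s3 client...
    return objects
```

Because DataSync only transfers new or changed files on each run, the Lambda sees exactly the fresh objects rather than re-listing the whole tree.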
If you have a hybrid cloud architecture, check it out and see if it fits your needs. We have found that it works very well for our enterprise archival and data processing workflow needs.