The landscape of platforms and runtimes keeps expanding. There are more ways of packaging and running code than anyone can keep up with. Engineers are now required to learn highly specific workflows in order to remain relevant, which divides the talent pool. When a business chooses specific technologies, it effectively chooses its available talent pool as well.
These specific workflows have a way of bleeding into the business logic itself, creating a highly coupled system. These workflows can also bleed into the hiring process, limiting growth and complicating on-boarding.
How can we architect in such a way that when a shop needs to expand, the staff and business logic stand the test of time? A simple universal solution may not exist when considering all the varying shop stacks. We at least know that it requires full-stack expertise and real collaboration between several information technology disciplines.
As an old mentor would say, "An infinitely complex problem takes an infinite amount of time to solve." So we will narrow the problem down to a well-known space: a subset of common development environments. The four environments we will focus on are serverless, Docker (or other) containers, local CLI, and unit tests (yes, unit tests count as their own environment).
For our problem space, a function is considered "portable" if it can run without alteration as a serverless instance (AWS Lambda, Azure Functions, etc.), as a container (Docker or another flavor), from your local dev machine via CLI, and within unit tests.
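As a rough illustration of that definition, here is a minimal sketch in which one unmodified function is reachable from a serverless handler, a CLI (which could also serve as a container's command), and a unit test. The names `process`, `lambda_handler`, and `main`, as well as the payload shape, are hypothetical and chosen only for this example; they are not taken from the repository.

```python
import argparse
import json


def process(payload: dict) -> dict:
    """Hypothetical business logic: knows nothing about the runtime it is in."""
    return {"name": payload["name"], "greeting": f"hello {payload['name']}"}


def lambda_handler(event, context):
    """Serverless entry point; AWS Lambda invokes handlers as handler(event, context)."""
    return process(event)


def main(argv=None):
    """CLI entry point; a container could use the same command."""
    parser = argparse.ArgumentParser(description="Run the function locally")
    parser.add_argument("--payload", required=True, help="JSON-encoded input")
    args = parser.parse_args(argv)
    print(json.dumps(process(json.loads(args.payload))))
    return 0


# In a real module this would be followed by the usual guard:
#   if __name__ == "__main__":
#       raise SystemExit(main())
```

A unit test would simply import `process` and call it directly, with no event, container, or shell involved.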
In order to dramatically scope down this article, we will go with AWS Lambda for serverless and Docker for containerization.
This repository will be a full example of the pattern. There will be an explanation of most of the more complex files and how they tie together. The directory structure is chosen first as a way to create the ideal location for individual concepts and their management. Any consequences of the directory structure are handled with the "glue code" or deployment scripts. The high-level directory layout should be straightforward and easy to maintain, so that the three major ingredients of the stack can be discerned at a glance.
Really, only four directories are required: config, infra, src, and tests.
High-level directory structure:
    config/
    docs/
    infra/
    src/
        common/
        func/
        lib/
    tests/
        mock/
Shop Stack Choices
There are many IaC (Infrastructure as Code) solutions out there, and even more programming and scripting languages. Here we have chosen Terraform, Makefile, and Python. These choices are not the only way to accomplish the pattern; rather, they offer a clear example of how the three interact with one another. Here are a few features to consider when making stack choices for a shop.
Infrastructure as Code
- First-class CLI support
- Idempotent runs
- Diff detection
- Strong active community
- Allows for clean folder organization
- No 1000+ line infrastructure mega-file
- Understands dependencies
- Easy to install
- No dependencies
- Bonus if preinstalled _(like bash or Makefile)_
- Can be run locally (doesn't depend on interpreter in the cloud)
- Supported by major CI solutions
Language of Choice
- Easier to find affordable talent
- Hint: don't choose COBOL
As an example, we will use a simple worker script that takes data from one place and puts it in another. We won't call it an ETL job, because we are all just sick of seeing those. We are also sick of seeing non-testable functions. By "non-testable" we mean blocks of code that require an impractical amount of mocking to emulate I/O (network, disk) activity. Below is such an example: in order to test it, we would need to hijack the `boto3` and `pymysql` libraries or create an entire environment for testing. This function is not portable, because it will not run within the unit-test setting (libraries like rewire do not count, even though they are a nice workaround).
(Just pretend Python works on IBM mainframes.)
    import os

    import boto3
    import pymysql.cursors

    records = []

    # Read the input file straight from S3 (network I/O at module level)
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='in-bucket', Key='new-file.txt')
    raw = obj['Body'].read().decode('utf-8')  # read() returns bytes

    # Parse each comma-separated line into a tuple, skipping blank lines
    for line in raw.split('\n'):
        if line:
            records.append(tuple(line.split(',')))

    insert_statement = 'insert into `clicks` values(%s, %s, %s);'

    # Connection settings come from the environment
    conn = pymysql.connect(
        host=os.environ.get('DB_HOST'),
        db=os.environ.get('DB_NAME'),
        user=os.environ.get('DB_USER'),
        password=os.environ.get('DB_PASS'),
        charset='utf8',
        cursorclass=pymysql.cursors.DictCursor
    )

    # Write every record to MySQL in a single batch
    with conn.cursor() as cursor:
        cursor.executemany(insert_statement, records)
    conn.commit()
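For contrast, here is one possible shape of a testable version. It is only a sketch and not necessarily the refactor this series arrives at: the parsing logic is pulled into a pure function, and the I/O is passed in as callables. The names `parse_records`, `run`, `read_text`, and `write_records` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, ...]


def parse_records(raw: str) -> List[Record]:
    """Pure logic: turn comma-separated lines into tuples, skipping blank lines."""
    return [tuple(line.split(",")) for line in raw.splitlines() if line.strip()]


def run(
    read_text: Callable[[], str],
    write_records: Callable[[Iterable[Record]], None],
) -> int:
    """Hypothetical worker: all I/O arrives as injected callables."""
    records = parse_records(read_text())
    write_records(records)
    return len(records)


# In a unit test, in-memory stand-ins replace S3 and MySQL.
stored: List[Record] = []
count = run(lambda: "a,b,c\nd,e,f\n", stored.extend)
```

In production the same `run` could be handed a callable that reads from S3 and another that writes to MySQL, so the business logic never imports either library.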
Legacy ETL Process Rant
The ibm-mainframe-worker.py code above is inspired by some IBM mainframe code encountered last year, where severe limitations prevented modern practices. Integrations with legacy ETL processes usually mean reading from and writing to an FTP/sFTP server. More "sophisticated" processes can handle multiple files at a time, since some sort of dynamic naming scheme is used. Less sophisticated processes use the exact same filename (because why not). Fast forward to the modern day, and we have entire companies that compete to produce near-mindless WYSIWYG ETL job processors. We could put a pin in ETLs and jump to the conclusion that it is a "solved problem", since it is certainly a well-known problem at the very least. Despite the well-known nature of the problem, shops often still need this kind of work done regularly.
So what kind of problem is best suited for this pattern?
Although this specific example uses a well-known problem space, the pattern is not limited to ETL; it is best suited to a Publisher-Subscriber (pub-sub) model. Any function that works independently on a single unit of work should be able to run concurrently alongside other processes without clashing. It also doesn't hurt if the function is idempotent, meaning that processing the same request multiple times has no additional effect. Idempotency is especially important if the message broker might deliver duplicate messages (AWS SQS can duplicate messages on occasion).
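To make the idempotency point concrete, here is a minimal sketch of a wrapper that ignores a message it has already handled. The message shape (an `id` field) and the in-memory `seen` set are assumptions for illustration; a real deployment would need a durable, shared store for the IDs.

```python
from typing import Callable, Dict, Set


def make_idempotent(
    handler: Callable[[Dict], None], seen: Set[str]
) -> Callable[[Dict], bool]:
    """Wrap a handler so that each message ID is processed at most once."""

    def wrapped(message: Dict) -> bool:
        message_id = message["id"]  # hypothetical message shape
        if message_id in seen:
            return False  # duplicate delivery: no additional effect
        handler(message)
        seen.add(message_id)  # mark as done only after the handler succeeds
        return True

    return wrapped


processed = []
handle = make_idempotent(lambda m: processed.append(m["body"]), set())
results = [
    handle({"id": "1", "body": "x"}),
    handle({"id": "1", "body": "x"}),  # duplicate, skipped
    handle({"id": "2", "body": "y"}),
]
```

Recording the ID only after the handler returns means a failed attempt can be retried, at the cost of possibly repeating partial work.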
Taking a look at the diagram, we see that vendor(s) produce data that the shop needs. When the raw data appears in the system (via S3), a message is created and sent to any subscribers (SQS in this setup). SQS will house any pending work, acting as a message broker. If there is work, any number of workers can take aim at the queue, pop off a unit of work, and be on their problem-solving way. As a bonus, AWS offers a simple integration between Lambda and SQS, so the polling logic for the Lambda context can be handled in the infrastructure.
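On the Lambda side of that flow, each invocation receives a batch of SQS messages whose bodies are the S3 notifications. The sketch below shows how a worker might pull the bucket and key out of such an event. It assumes S3 notifies SQS directly (no SNS envelope in between), and `iter_s3_objects` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterator, Tuple
from urllib.parse import unquote_plus


def iter_s3_objects(sqs_event: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for each S3 object referenced in an SQS-triggered event."""
    for message in sqs_event.get("Records", []):
        notification = json.loads(message["body"])  # SQS bodies are strings
        # Messages without S3 records (such as a configuration test event) are skipped.
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            yield bucket, key


# A trimmed-down sample event, keeping only the fields the helper reads.
sample_body = json.dumps(
    {"Records": [{"s3": {"bucket": {"name": "in-bucket"},
                         "object": {"key": "new+file.txt"}}}]}
)
sample_event = {"Records": [{"messageId": "1", "body": sample_body}]}
objects = list(iter_s3_objects(sample_event))
```

Because the helper only takes a dictionary, the same code can be fed a hand-written event from the CLI or a unit test.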
A feedback loop is all of the events that take place between changing a line of code and experiencing that change. The loop is measured in time and complexity. Ideally, the loop is short and simple. Feedback from unit test runs, local instance runs, and deployed runs should be near effortless and fast. It is fair that deployed runs take much longer than unit tests, given that they are more complex. Still, if your unit tests are too slow, then your deploys will lag as well.
The `test/local` feedback loop
The above legacy ETL ibm-mainframe-worker.py is difficult to test, so it naturally has a poor feedback loop. It could take some time to coordinate with other engineers via chat or email: "is the file in bucket xyz okay to be overwritten for my local test?" Additionally, opening business logic files and editing the code in order to test it is not ideal. Changes like that could accidentally end up in source control. Workflows like this damage the test feedback loop. Even though the actual execution time of the code is fast, the prep beforehand is manual, repetitive, and risky.
The `deploy` feedback loop
The experience an engineer has while maintaining and extending an app is crucial to keeping morale solid. If a deploy feedback loop takes more than a few minutes, engineers are likely to waste a large amount of time on process. An engineer should experience the fastest loop, a few seconds, locally; a tester ought to wait no more than a few minutes for deployed changes. Finally, a stakeholder should be able to experience a change within the hour. This is not to say that all changes need to be completely rolled out within an hour, just that it needs to be within the realm of possibility.
Continue reading Portable Function Pattern for the Scaling Business Part 2.