Zero to DevOps in Two Weeks

Introduction

“Take two weeks and make a new standard”

I’ve had the good fortune in the recent past to work in an environment with some really strong DevOps slash automated infrastructure folks. As a consumer of such great systems I became a little opinionated about easily repeatable testing and deployment of code. So when I started here at JBS and was given the assignment to compose an automated deployment system from scratch, I jumped at the opportunity. “Take like two weeks, and make every project deploy the same way,” they said. Well, with that guidance and a bit more than a month gone by, here is the state of things.

Goals

I had a few goals in mind when starting this project and a few that became self-evident as I went along. I’ll address solutions to each of these in detail in the sections below.

  • Testing automation – Every project should have an automated test suite.
  • Cattle not Kittens – No one cries for dead cattle.
  • Immutable build artifacts – In order to prove that code being tested /is/ the code deployed to production, I opted for an approach with a simple constraint: “The system that gets tested should be exactly the one deployed to production.”
  • Minimize manual interaction – Anything that people do is subject to error and mistakes. Following a checklist is not a sustainable process. Automate and code everything, where possible. Which leads to these two sub-goals:
  • Build and deployment as code – In order to review and validate the steps required to build, test, and deploy the project, the config required should be part of the project’s codebase.
  • Infrastructure as code – Again to support peer review and validation of the deployment approach, a description of the required resources should be committed to the code base or a separate repository.

I have a few additional future goals which have not yet been addressed, but you’d do well to keep them in mind if you start traveling this road.

  • Sensitive data should be encrypted
  • Variable data should be centralized when it makes sense.

Automated Testing

Tox is the standard bearer for test execution in Python. It ensures that your project will install from its setup file and that your test suite will run on any number of selected Python versions, in a separate virtualenv for each. In my sample project I’ve used pytest as my underlying test runner, but any such tool could be used.

Automated unit tests attempt to ensure that new code does not change the behavior of previously established routines. That is generally considered a GoodThing™.
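
As a minimal sketch of the kind of test tox would run through pytest, assume the project has a small helper called slugify. That helper is hypothetical and only here for illustration; it is not part of the sample project. pytest collects any function whose name starts with test_:

```python
import re


def slugify(title):
    """Hypothetical project helper: turn a title into a URL-friendly slug."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def test_slugify_replaces_spaces_and_punctuation():
    assert slugify("Zero to DevOps, in Two Weeks!") == "zero-to-devops-in-two-weeks"


def test_slugify_of_empty_title_is_empty():
    assert slugify("") == ""
```

If someone later changes slugify in a way that alters its output, these tests fail before the change ever reaches a deployed environment.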

Cattle Not Kittens

I think the days of running applications on special servers (kittens) built of hand picked hardware with carefully installed and manually configured libraries and supporting services are ready to end. This is an endless nightmare of realtime support, devs logging into prod, and many other sins committed in the name of “just get it running again.”

In the modern era, we have a sea of resources (cattle) called “the cloud.” We test upfront to know that our application runs, soup to nuts, with a hands-free, automated deployment script. And we sleep easy at night knowing that if something goes wrong (and it will: hardware fails, load scales, and software is still unpredictable) automated contingencies already in place can handle adding horizontal scale or replacing misbehaving app servers.

You don’t have to pet cattle or know their names. When one dies and is replaced, no tears are shed. I aim to create an environment where no crying is done for infrastructure.

Immutable Build Artifacts

There are a few possibilities here with varying tradeoffs of consistency, build and deployment time, and vendor lock-in.

One typical approach to build artifacts is using a tool such as Packer to build a virtual machine from the ground up each time there is a code change. This has the unfortunate effect of taking a long time, requiring a VM to boot and packages to be installed/upgraded before your code can even be checked out or tested.

There is a land mine lurking in this approach as well: blindly updating system packages may cause instability and version conflicts. By always installing the most recent packages you will update libraries and dependencies to versions you have not previously vetted or tested, leading to confusing errors that can be a challenge to track down. Because of the time required just to build this system, you may be tempted to run your test suite in a different environment – polluting it with your requirements, and possibly testing against different versions as explained above.

Another approach is to leverage some fairly recent kernel features (cgroups, LXC, and kernel namespaces) by way of a tool called Docker. You can think of a Docker container as a cross between a lightweight VM and a chroot environment. A Docker container does not use any virtualization, though; instead it makes system calls directly to the kernel of the host system. Thanks to the features mentioned above, it can only see resources within its own process space and those explicitly shared with it.

A bit of terminology: a Docker image is a file system at rest, ready to be launched; a Docker container, on the other hand, is a running process group with its own PID 1 handling all the functions that Linux requires of an init system. You may in fact launch the same image into multiple containers at the same time, and because of the layered copy-on-write filesystem, the containers share the image’s layers rather than each taking a full copy on disk.

But what does that mean for deployment and build artifacts?

It’s all about speed! Because they are not virtual machines, Docker containers launch at roughly the speed of a fork/exec call. And because the layered filesystem caches every build step that hasn’t changed, you can build exactly your production image and run your tests on it in a reasonable amount of time. Because Docker images don’t change, once you’ve built a base system with a particular version of a library you can be sure it won’t inadvertently be upgraded, and you won’t have to spend the time rebuilding it.
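
To make the caching idea concrete, here is a toy model of it. This is not Docker’s actual implementation, only a sketch under the assumption that each layer is identified by a hash of its parent layer, its instruction, and the content it copies in. The step contents (such as the pinned requirement) are made up for the example.

```python
import hashlib


def layer_key(parent_key, instruction, content=""):
    """Cache key for one build step: depends on everything before it."""
    data = "\n".join([parent_key, instruction, content]).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def build(steps, cache):
    """Run steps in order; return the instructions that had to be rebuilt.

    steps is a list of (instruction, content) pairs, where content stands in
    for the files an ADD step copies. cache is a set of known layer keys.
    """
    rebuilt = []
    key = ""
    for instruction, content in steps:
        key = layer_key(key, instruction, content)
        if key not in cache:
            cache.add(key)
            rebuilt.append(instruction)
    return rebuilt


cache = set()
steps = [
    ("FROM base", ""),
    ("ADD requirements.txt", "django==1.9"),   # hypothetical file content
    ("RUN pip install -r requirements.txt", ""),
    ("ADD .", "app code v1"),
]
first = build(steps, cache)            # cold cache: every step is built

steps[3] = ("ADD .", "app code v2")    # only the application code changes
second = build(steps, cache)           # only the final step is rebuilt
```

Because each key includes its parent’s key, a change to an early step invalidates everything after it, which is why slow, rarely changing steps belong near the top of the build.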

Deployment

Now that I’ve built the Docker image, it has to run somewhere. Popular options include Docker Swarm, Kubernetes, and Amazon ECS. Since we mostly run on AWS I’ve chosen to focus on ECS as the orchestration layer. See InfoQ for a good primer on ECS. Long story short, the orchestration layer handles scheduling your containers to run on an array of hosts. This allows for guided or automatic horizontal scaling, error detection, and self healing. Typically a container that starts failing health checks is killed, and a new one is launched to replace it. The orchestration layer also handles seamless deployments, where updated containers are brought up while the old ones are shut down.
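
The self-healing behavior can be sketched as a small reconciliation loop. This is a simplified illustration, not how ECS or any other scheduler is implemented; launch and reconcile are hypothetical names, and containers are modeled as plain dictionaries.

```python
import itertools

_ids = itertools.count(1)


def launch():
    """Stand-in for starting a container; returns a new healthy container."""
    return {"id": next(_ids), "healthy": True}


def reconcile(containers, desired_count):
    """One scheduler pass: drop unhealthy containers, then top up to the desired count."""
    running = [c for c in containers if c["healthy"]]
    while len(running) < desired_count:
        running.append(launch())
    return running[:desired_count]


service = reconcile([], desired_count=3)        # initial launch of three containers
service[1]["healthy"] = False                   # one starts failing health checks
service = reconcile(service, desired_count=3)   # it is dropped and replaced
```

Scaling out is the same operation with a larger desired_count, which is why orchestration layers can treat scaling and recovery as one mechanism.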

Solution

By having a Dockerfile in our project, we can describe exactly how to build a single image which can run our unit tests and be launched as a dev, qa, or prod environment. A project may also choose to provide a docker-compose file, which describes all the required resources to run a project with additional docker images (e.g. memcache, postgres, redis, etc…). This is generally to assist the developer in launching a local development environment that closely mimics the production environment.

Dockerfile

This is a Dockerfile for a python/django webapp:

FROM baxeico/django-uwsgi-nginx
 
MAINTAINER Aaron McMillin
 
# Expose ports
EXPOSE 80
 
# No more console warnings!
ENV DEBIAN_FRONTEND noninteractive
 
# Set the default directory where CMD will execute
WORKDIR /home/docker/code/app
 
# For PIL and mysql, clean afterwards to keep image size down
RUN apt-get update && apt-get install -qq -y libjpeg-dev libmysqlclient-dev && apt-get clean
 
# Jenkins will run tests as UID 1000
RUN groupadd -g 1000 tox && useradd -g 1000 -u 1000 -s /bin/bash -d /home/docker/code/app tox
# Requirements.txt alone for caching, only reinstalls requirements when they change
# Otherwise Docker will use the cached layer
ADD requirements.txt /home/docker/code/app/requirements.txt
RUN pip install -r /home/docker/code/app/requirements.txt
 
# Add media files on their own for layer caching (assumes they live in ./media)
ADD media /home/docker/code/app/media
# Line up media files with the preconfigured nginx
RUN mkdir /home/docker/persistent && \
    mv /home/docker/code/app/media /home/docker/persistent && \
    chown -R www-data:www-data /home/docker/persistent/media
# Add the whole project
ADD . /home/docker/code/app
# Collect the static assets, the IN_DOCKER env variable lets my django app know which settings to use
RUN IN_DOCKER=build /home/docker/code/app/manage.py collectstatic --noinput
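
On the Django side, the IN_DOCKER variable might be consumed along these lines. This is a sketch under assumptions: the settings module names (webapp.settings.build and so on) and the settings_module helper are hypothetical, and the real project may organize its settings differently.

```python
import os


def settings_module(environ=None):
    """Pick a Django settings module name from the IN_DOCKER variable."""
    environ = os.environ if environ is None else environ
    mode = environ.get("IN_DOCKER")
    if mode is None:
        # Not running in a container: assume a plain local checkout.
        return "webapp.settings.local"
    # IN_DOCKER=build -> webapp.settings.build, IN_DOCKER=dev -> webapp.settings.dev
    return "webapp.settings.{}".format(mode)
```

In manage.py this could be wired up with os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module()), so the same image behaves differently at build time, in development, and in production without any change to the code inside it.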

docker-compose.yml

And here is the docker-compose.yml file. The developer only needs to run docker-compose up to have a fully functional development environment. This will launch the official mysql version 5.7 Docker image from Docker Hub alongside the web application.

version: '2'
 
services:
  db:
    image: "mysql:5.7"
    environment:
      MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}
      MYSQL_USER: webapp
      MYSQL_PASSWORD: webapp
      MYSQL_DATABASE: webapp
 
  web:
    build: .
    volumes:
      - .:/home/docker/code/app/
    ports:
      - "8080:80"
    depends_on:
      - db
    environment:
      IN_DOCKER: dev
      DB_NAME: webapp
      DB_PASSWORD: webapp
      DB_PORT: 3306
      DB_HOST: db
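
The DB_* variables above are only useful if the application reads them. A minimal sketch of how Django settings could do so follows; database_settings is a hypothetical helper, and since the compose file sets no DB_USER, the sketch assumes the user matches MYSQL_USER.

```python
import os


def database_settings(environ=None):
    """Build a Django DATABASES dict from the variables set in docker-compose.yml."""
    environ = os.environ if environ is None else environ
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": environ["DB_NAME"],
            # DB_USER is not set in the compose file; assume it matches MYSQL_USER.
            "USER": environ.get("DB_USER", "webapp"),
            "PASSWORD": environ["DB_PASSWORD"],
            "HOST": environ["DB_HOST"],
            "PORT": environ["DB_PORT"],
        }
    }
```

A settings file would then contain DATABASES = database_settings(). Note that DB_HOST is simply db: docker-compose makes each service reachable from the others by its service name.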

Minimize Manual Interaction

To truly have a repeatable and dependable process every step needs to be automated. Better yet, executing those steps and evaluating any decision points should be automated as well. To that end we’ll look at automation for building and testing a project, and also for getting that built project into production.

Continuous Integration

The “Hitchhiker’s Guide to Python” makes this straightforward recommendation:

Jenkins CI is an extensible continuous integration engine. Use it.

You may have personal feelings about Java (I do), but Jenkins is THE tool for continuous integration and deployment. If you want “stuff” to happen when you push a branch, Jenkins is the [free] way to do that. Imagine a world where you cut a feature branch for the ticket you’re currently working on, commit your changes, then push the branch to GitHub. Immediately, something (Jenkins) checks out your code, assembles your build artifacts, runs unit tests on them, and if they pass deploys a test environment for QA to review. Imagine further that when QA green lights the test, the code is merged to master, the branch is closed, and the production environment is updated with no downtime. This is a world you could live in. This is a world I want to live in!

You don’t get it all for free though, and for a long time Jenkins required either lots of manual interaction through its web interface, or additional tooling to write XML job descriptions to build jobs like the one described above. But no more! With Jenkins 2.0 and the Pipeline plugin, you can version your pipeline as code. The steps described above can be kept in a Jenkinsfile in your code repository, and then Jenkins is just told to scan your repository for any branch that contains a Jenkinsfile.

Jenkinsfile

Here’s an example Jenkinsfile to get you started with the Docker based deployment system I’m describing. You’ll have to fill in your AWS account number for ECR users, or change the repo to reflect your docker hub account.

node {
    stage 'checkout'
    git branch: env.BRANCH_NAME, changelog: false, poll: false, url: 'github:MyAccount/MyProject'
 
    def repo = "<<AWS_ACCOUNT_NUMBER>>.dkr.ecr.us-east-1.amazonaws.com/webapp"
    def image = "${repo}:${env.BRANCH_NAME}"
    docker.withRegistry("https://${repo}"){
        stage 'docker build'
        // Setting withRegistry adjusts the tag
        def myEnv = docker.build "webapp:${env.BRANCH_NAME}"
 
        stage 'docker test'
        myEnv.inside {
            sh 'tox'
        }
 
        // Collect the test results for display
        step([$class: 'JUnitResultArchiver', testResults: '**/junit-*.xml'])
        // TODO: Coverage report, cobertura doesn't work with pipeline yet
 
        stage 'docker publish'
        // Get creds from aws cli
        sh "eval \$(aws ecr get-login --region=us-east-1)"
        myEnv.push()
    }
}

Automated Infrastructure

There is more to applications than just software. The hardware and network services have to come from somewhere, and that too can be kept in a repository for revision and review. For this I’m using another relatively new tool, Terraform. They call this approach “infrastructure as code”.

If you’re content being limited to the services AWS provides you may choose to stick with their CloudFormation Templates. This is a verbose JSON format which describes all the AWS resources you’d like to provision to build a “stack”. Since there are no comments in JSON it can be a little hard to follow along.

Since I wanted to keep my options open I chose Terraform, which fills the same niche and more. It works with multiple providers including AWS, DigitalOcean, OpenStack, Google Cloud, and many more. With Terraform you could use a single command to deploy an infrastructure that is fully redundant across multiple providers. There is no layer of abstraction though, so if you want to move from AWS EC2 to OpenStack or from Route53 to DNSimple you will have to rewrite parts of your config to match the requirements of the particular backend. I think this is a small price to pay for the ability to continue using a single tool to deploy across all providers.

I based my work on these templates made by CapGemini and you can see my fork, which has diverged fairly substantially.

With a simple terraform apply I have Jenkins running from a Docker image in an ECS cluster in an ASG, behind an ELB, with an ECR repository ready to hold the Docker images that Jenkins will build for my project. There’s already a database provisioned for my project, along with the ECS task and service definitions to run it as soon as an image is built. Again the service is behind an ELB. There are knobs for autoscaling as well, but those are not enabled yet. Just as easily, terraform destroy will remove all these resources.

End Results

[Figure: diagram of the AWS infrastructure provisioned by Terraform]

So at the end of the day, I have:

  • Two load balancers, one for Jenkins and one for my application.
  • Each with their own friendly names provided by Route53 (not shown)
  • A scalable group of EC2 instances to serve as a cluster for ECS
  • A Jenkins service (i.e. container running in ECS) which scans my git repository for Jenkinsfiles and builds docker images which get loaded into the ECR, Amazon’s docker repository.
  • A Webapp service which finds the image in ECR and launches it to run the web application.
  • Not pictured: security groups and IAM roles which limit access to all the non-public resources.

Scale

Docker is built for horizontal scale. You’re encouraged to limit containers to a single service that can be deployed in parallel and load balanced across. To deploy 100 web servers all you’d need to do is set the count param on the group for your ECS cluster servers to 100, and set the number of desired service tasks to 100. Due to the way port mapping works, you can’t run an image that maps to a host port multiple times on the same host, but I can’t think of a good reason to do that anyway.
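
The port mapping constraint is easy to see in a small model. This is a hypothetical sketch of the placement rule only, not of ECS’s scheduler: a task that binds a fixed host port can land at most once on each host, so the number of hosts caps the number of running tasks.

```python
def place_tasks(hosts, task_count, host_port):
    """Assign tasks to hosts, allowing one task per host for a fixed host port.

    Returns (placements, unplaced), where placements maps host name to the port
    it now has bound and unplaced is the number of tasks with nowhere to run.
    """
    placements = {}
    for host in hosts:
        if len(placements) == task_count:
            break
        # The host port can be bound only once per host.
        placements[host] = host_port
    return placements, task_count - len(placements)


hosts = ["host-{}".format(i) for i in range(3)]
placements, unplaced = place_tasks(hosts, task_count=5, host_port=80)
# Three tasks are placed, one per host; the other two wait for more hosts.
```

This is why the instance count and the desired task count are raised together in the example above.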

Flexibility

If we take Jenkins and the Jenkinsfile as a given, you’re technically free beyond that to build and deploy however you’d like. Want Packer to build AMIs and use CloudFormation to deploy them? Nothing is stopping you. Want to keep your Docker images in Docker Hub rather than ECR? Go for it.

If you want to use something like Ansible, Puppet, or Salt to herd kittens for you, that’s an option as well. Just use Jenkins to kick off the process, and you’ve still got some of the benefits of automation.

Future Plans

My next steps include:

  • Using Vault to store and share sensitive data like DB and third party API passwords.
  • Having Jenkins manage the Terraform templates so that I can run one stack per branch
  • Once Jenkins is managing the Terraform state, I can use the interactive steps that pipelines allow to wait for approval from QA before merging to master, tearing down the dev stack, and possibly updating the prod environment.
  • Amazon just released EFS, Elastic File System, which gives you a low latency NFS mount you can share among multiple machines. I can probably leverage this to add horizontal scaling to Jenkins if needed.
  • Look at Kubernetes to replace ECS as it is more cloud agnostic, but since I still like running my databases in RDS and not in Docker it might not be the best fit.

References

Summary of Tools

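That might be a lot of information to digest, so here’s a quick list of all the tools I’m using:

  • Tox and pytest – test execution
  • Docker and docker-compose – build artifacts and local development environments
  • Jenkins 2.0 with the Pipeline plugin – continuous integration
  • Terraform – infrastructure as code
  • AWS – hosting, including ECS, ECR, and the other services listed below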

AWS Alphabet Soup

There were a number of AWS services referenced in this post. Here’s the quick glossary.

  • AMI – Amazon Machine Image, root volume snapshot used to launch an EC2 instance.
  • ASG – Auto Scaling Group, launch or terminate EC2 instances based on policies.
  • CloudFormation – CloudFormation Templates, JSON templates that describe AWS resources that can be repeatedly provisioned.
  • EC2 – Elastic Compute Cloud, dynamically provisioned virtual machines (instances) and the surrounding services.
  • ECR – EC2 Container Registry, a repository for multiple tags of a single Docker image, can be private or public.
  • ECS – EC2 Container Service, Docker container management service, distributes tasks to select EC2 instances designated as cluster members.
  • ELB – Elastic Load Balancer, distributes traffic across multiple EC2 instances for scalability and fault tolerance.
  • IAM – Identity and Access Management, users, groups and roles to control access to various AWS resources.
  • RDS – Relational Database Service, hosted solution for RDBMS including MySQL, PostgreSQL, Oracle, and more.
  • Route53 – Hosted DNS service.
  • Security Groups – Virtual firewall that operates within a VPC.
  • VPC – Virtual Private Cloud, lets you provision a logically isolated section of the Amazon Web Services (AWS) cloud.
