Modern infrastructure is hard. It shouldn’t be, but it is.
DevOps as a culture remains one of the hardest skills for an engineering team to learn, and as a result DevOps often gets turned into a role, or siloed into a separate function, which invariably turns into a huge organisational bottleneck.
In this post, I will show how we approached and solved the DevOps problem at Clause, by providing the right tooling to ease the cultural transition.
Let’s first take a moment to look at the nature of the beast …
It all started back in 2006, when Amazon realised that many of the web-scale hosting problems it had solved for itself could be commoditised.
The result was Amazon Web Services (AWS).
The aim was to make infrastructure more scalable and easier to manage than ever before, by abstracting it as virtual services behind web interfaces. This type of flexible DIY hosting has since become popular enough to form its own category of business, known as infrastructure-as-a-service (IaaS).
Effects of Change
The most disruptive effects of successful innovation rarely lie in its first-order offerings, the obvious ones. The second-, third- and higher-order effects are always much more interesting, and are the ones that truly change the course of history.
An effect of the steam engine was the layout of modern cities, which could suddenly grow in places that didn’t make sense without efficient mass transport.
An effect of the Second World War was the Digital Revolution. The machines necessary to break the Enigma code in a short enough time to still produce actionable information turned out to be useful for more things than breaking codes.
An Nth-order effect of the Telephone was the Internet.
Similarly, when Amazon made it possible to create (and destroy) virtual computers of any size at the touch of a button (or merely by sending the right network packets to their services), the effect was not to put system administrators out of work. Quite the opposite. It opened a world of possibilities where system architectures started increasing in complexity to the point where now, in most cases, they exceed the complexity of computer architectures.
And so the complexity of managing hardware topologies did not shrink, as was no doubt the intention, but grew.
A Simple Solution?
One solution to the complexity of IaaS lies in a paradigm called infrastructure-as-code (IaC). IaC dictates that all infrastructure changes should be made as code in a code repository — usually in some kind of domain-specific declarative language.
AWS’s IaC offering comes in the form of CloudFormation: a behemoth that encapsulates every single nuance of every possible resource that can be created in AWS. It is (or at least aims to be) feature complete, and as such is meant to be able to serve as the sole interface to all of AWS’s offerings.
And therein lies the first problem:
The CloudFormation vernacular is immensely rich and verbose. Consider the following CloudFormation template snippet, which specifies a container task definition for a service on ECS (AWS’s container orchestration service, its answer to Kubernetes). The snippet does not include the Service itself, the Load Balancer, Security Groups, Listener or Target Group, all of which are required to end up with a running ECS service:
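For a sense of scale, here is a minimal sketch of such a task definition. It is an illustration rather than Clause’s actual template: it is written as a Python dictionary and serialised to the JSON form that CloudFormation accepts, and the logical IDs and parameters it refers to (`ServiceName`, `ImageUri`, `EnvironmentName`, `ExecutionRole`, `TaskRole`, `LogGroup`) are assumptions that would have to be defined elsewhere in the template.

```python
import json

# Illustrative only: logical IDs, parameter names and sizes are assumptions.
task_definition = {
    "Type": "AWS::ECS::TaskDefinition",
    "Properties": {
        "Family": {"Fn::Sub": "${ServiceName}-task"},
        "Cpu": "256",
        "Memory": "512",
        "NetworkMode": "awsvpc",
        "RequiresCompatibilities": ["FARGATE"],
        "ExecutionRoleArn": {"Fn::GetAtt": ["ExecutionRole", "Arn"]},
        "TaskRoleArn": {"Fn::GetAtt": ["TaskRole", "Arn"]},
        "ContainerDefinitions": [
            {
                "Name": {"Ref": "ServiceName"},
                "Image": {"Ref": "ImageUri"},
                "Essential": True,
                "PortMappings": [{"ContainerPort": 8080, "Protocol": "tcp"}],
                "Environment": [
                    {"Name": "ENVIRONMENT", "Value": {"Ref": "EnvironmentName"}}
                ],
                "LogConfiguration": {
                    "LogDriver": "awslogs",
                    "Options": {
                        "awslogs-group": {"Ref": "LogGroup"},
                        "awslogs-region": {"Ref": "AWS::Region"},
                        "awslogs-stream-prefix": {"Ref": "ServiceName"},
                    },
                },
            }
        ],
    },
}

# Wrap the resource in a template fragment and render it as JSON.
template_fragment = {"Resources": {"TaskDefinition": task_definition}}
rendered = json.dumps(template_fragment, indent=2)
print(rendered)
```

Even this stripped-down version runs to some forty lines for a single container, before any of the surrounding resources are defined.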
Although CloudFormation is well documented, even good documentation doesn’t overcome the sheer complexity it describes. Learning a programming language is a comparatively fast process, followed by a lifetime of learning programming patterns and styles that mostly carry over between paradigms and languages. CloudFormation is quite the opposite: learning all the options, nuances and edge cases specific to every service is an ongoing process.
This is, by the way, by no means a solved problem, and despite it CloudFormation still, in my opinion, stands head and shoulders above competitors like Terraform.
The second problem lies in the awkwardness of patterns of code reuse.
Since it is code, you’d expect infrastructure-as-code to benefit automatically from principles of reuse, like the DRY (don’t repeat yourself) principle. Except it doesn’t. Even worse, it purports to do exactly that.
I believe this problem to be a fundamental one.
Code reuse works well in a pure, side-effect-free environment. This is the environment we usually strive for in software. In such an environment a function or object behaves in the same way, no matter where it is used. That is the base condition for code reuse.
Infrastructure doesn’t quite do that. A virtual server (or disk, or database) depends to a fundamental degree on its environment. Anyone who has tried establishing a library of reusable CloudFormation nested stacks will know the pain of maintaining an exploding tree of forks off of every single instance of what was originally meant to be a “reusable” template.
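The contrast can be sketched in a few lines of Python. This is a toy model, not real infrastructure code: `stack_name`, `Environment` and `database_settings` are hypothetical names invented for the illustration, and the settings dictionary is a simplified stand-in for a real resource definition.

```python
from dataclasses import dataclass
from typing import Tuple


def stack_name(service: str, stage: str) -> str:
    """Pure: the result depends only on the arguments, wherever it is called."""
    return f"{service}-{stage}".lower()


@dataclass(frozen=True)
class Environment:
    """Hypothetical bundle of things a 'reusable' template quietly depends on."""
    account_id: str
    vpc_id: str
    subnet_ids: Tuple[str, ...]
    multi_az: bool


def database_settings(name: str, env: Environment) -> dict:
    """Same 'template', different resource in every environment."""
    return {
        "identifier": f"{name}-{env.account_id}",
        "vpc": env.vpc_id,
        "subnets": list(env.subnet_ids),
        "multi_az": env.multi_az,
    }


dev = Environment("111111111111", "vpc-dev", ("subnet-a",), False)
prod = Environment("222222222222", "vpc-prod", ("subnet-x", "subnet-y"), True)

# The pure function is trivially reusable; the database definition is not,
# because every environment detail leaks into the result.
print(stack_name("Orders", "Dev"))
print(database_settings("orders", dev) == database_settings("orders", prod))
```

Every field added to `Environment` is one more way for two uses of the same template to diverge, which is where the forks start.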
To solve the Infrastructure problem at Clause once and for all, we’ve created a service called The Forge.
The Forge is an API that creates code pipelines that create microservices.
The Forge is so meta that it has created itself.
Jokes aside, version 2 of The Forge was created using version 1, and it has been continuously updated by its own code pipeline ever since.
A crucial part of The Forge is powered by GitHub Template Repositories. A Template Repository can be used as the starting point for any new code repository, and we have created a palette of Template Repositories that captures all of the types of service we use in our stack. Each template repository contains a fully functional CloudFormation template for a service of that type, a build specification to build whatever software packages need to be deployed to power the service, and a “Hello World” service stub, to serve as a starting point for the service itself.
The most common type of service, for instance, is a Private REST API. Most of our back-end services are REST APIs that are meant to communicate with each other, and their contents only get exposed to the outside world through more strictly controlled Public REST API services.
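As a rough sketch of how a new repository can be generated from a template, the snippet below builds (but does not send) a request to GitHub’s REST endpoint for creating a repository from a template. The mapping of service types to template repositories, the organisation name and the service name are all hypothetical; this is not The Forge’s actual code.

```python
import json
import urllib.request

# Hypothetical mapping from service type to template repository name.
TEMPLATE_REPOS = {
    "private-rest-api": "template-private-rest-api",
    "public-rest-api": "template-public-rest-api",
}


def build_generate_request(
    org: str, service_name: str, service_type: str, token: str
) -> urllib.request.Request:
    """Prepare a 'create repository from template' request without sending it."""
    template = TEMPLATE_REPOS[service_type]
    url = f"https://api.github.com/repos/{org}/{template}/generate"
    body = {"owner": org, "name": service_name, "private": True}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_generate_request(
    "example-org", "invoice-service", "private-rest-api", "<token>"
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` with a valid token would create the repository; that call is left out here so the sketch has no side effects.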
When creating a new service in The Forge, a user has to specify two things: a name for the service, and what type of service it is.
The Forge then does the following:
- It creates a new repository using the Template Repository that matches the specified type.
- It then runs a CloudFormation template that creates a code pipeline with the following steps:
  - A build step, running off a build specification in the code repository, that creates one build to deploy to all environments (this is crucial, as it ensures code equivalence in all our environments).
  - A deployment step for each environment (Dev, Staging and Production, each in its own AWS account), which runs a CloudFormation template contained in the code repository to create and update the stack of resources (this includes deploying code changes).
  - After each CloudFormation deployment, a series of integration tests, also contained in the code repository, which must pass before the pipeline continues to the next environment. A failure of the integration tests in Dev, for example, prevents deployment to Staging and Production.
- When The Forge has finished building the pipeline, it hooks it up to the master branch of the GitHub repository created in the first step, so that all commits to master get deployed automatically. (Yes, master. We use a trunk-based git workflow, but that’s a topic for a different post.)
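The control flow of the pipeline described above can be sketched as a small simulation. It is a model of the ordering only, assuming stand-in `build`, `deploy` and `integration_tests` callables; the real pipeline is defined in CloudFormation, not Python, and all names here are hypothetical.

```python
from typing import Callable, List, Tuple

ENVIRONMENTS = ("dev", "staging", "production")


def run_pipeline(
    build: Callable[[], str],
    deploy: Callable[[str, str], None],
    integration_tests: Callable[[str], bool],
) -> List[str]:
    """Build once, then deploy and test each environment in order.

    Returns the environments that were deployed to before the pipeline
    finished or was stopped by a failing test run.
    """
    artifact = build()  # one build, reused for every environment
    deployed: List[str] = []
    for env in ENVIRONMENTS:
        deploy(env, artifact)
        deployed.append(env)
        if not integration_tests(env):
            break  # later environments are never touched
    return deployed


# Example run with stand-in steps: the integration tests fail in dev.
deployments: List[Tuple[str, str]] = []

result = run_pipeline(
    build=lambda: "build-001",
    deploy=lambda env, artifact: deployments.append((env, artifact)),
    integration_tests=lambda env: env != "dev",
)
print(result)  # ['dev']
```

Because the artifact is produced once, outside the loop, every environment that does get deployed receives exactly the same build.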
This is more or less what it looks like on paper:
Infrastructure is currently only slightly beyond where computers were in the 1960s: if you used one, it was custom-built for what you used it for, and it lacked the common architecture that would enable common languages and truly global code reuse. We are still in a world where we have to hand-code in assembler for every distinct vendor; there are no compilers yet that unify it all under a single vocabulary.
As such, it remains a hard problem, and one that scales badly and expensively.
At Clause we have found a way to manage this complexity by treating the CD environment as an infrastructure-as-code problem in itself. (Most modern CI/CD pipelines are created manually or are scripted.)
Creating and updating The Forge through a pipeline that it created itself has established a unified pattern across the stack, one that lets us focus on code and deal with infrastructure as little as possible.