Plans Are Useless, Automation Is Essential

Zsolt Dollenstein
Prezi Engineering
Nov 27, 2014


In an ideal world, I wouldn’t have to care about infrastructure automation or configuration management. As long as I know what everything does, how everything works, and I have total control over all of my infrastructure, I don’t have anything to worry about. If it weren’t for the fact that I don’t know what everything does, I don’t know how everything works, I don’t have total control over our infrastructure, and I’m not the only one that has to worry about it, this blog post would have ended with the last sentence. Yet, I’m still typing.

It takes a team to operate an application that millions of people across the globe use daily. No one person can be responsible for that task, and no one person should be crucial to it either. But more often than not, that’s how startups do it. You have one guy that’s responsible for setting up servers, even if they’re in EC2, deploying the application to those servers, and making sure that everything’s running smoothly. Typically you call this guy a DevOps so you can call yourselves a DevOps shop. It’s only after spending some time with this system that you realize how risky a move it is to have engineering throw an application over the fence to operations, and how that fence-throwing metaphor is even more apt than you ever before realized.

That’s what happened to us. In the beginning, we outsourced technical operations to another company. Our technical team comprised engineers, and their only concern was to build the application. This is a more common arrangement than you think, and it’s also an extreme example of a division of concerns between development and operations.

Two years ago we realized that this arrangement wouldn’t allow our engineering team to scale as our company grew. The division is not just one of concerns. It’s also one of information, of knowledge of how the parts fit together to make the whole.

We started transitioning operations to an internal responsibility slowly. We took over a service and ran it on a few EC2 instances we managed ourselves. As we made progress, we wrote and saved small scripts to help us install and configure individual components of the system. One for Apache, another for Ruby, etc. About every week we'd deploy the latest version of the service, and we'd chain these scripts together to create a fresh instance.
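A hedged reconstruction of what one of those per-component scripts might have looked like; the package, paths, and file names are illustrative, not our actual setup:

```shell
#!/bin/sh
# Illustrative per-component setup script for an Apache-backed service.
set -e

apt-get update
# Unpinned install: the version you get depends on what the mirror
# carries on the day the script runs.
apt-get install -y apache2

a2enmod rewrite
cp ./conf/myservice.conf /etc/apache2/sites-available/myservice.conf
a2ensite myservice.conf
service apache2 restart
```

Chaining a handful of scripts like this, one per component, was how a fresh instance got built.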

In theory, this should be enough. Whenever we need a new instance on which to run a service, we can create the instance, run the scripts, and then we're ready to go. In practice it's not that simple. Running those setup scripts on one instance, and then running those same scripts on another instance at some other point in time, would not always produce the same results. The resulting instance would be usable, but package versions installed on each instance would differ in subtle ways. Our automation scripts were producing snowflake instances.

The problem with snowflakes is that they do work, but they are not good building blocks for a robust system. Each version of a package behaves differently than all the other versions of the package. When you don't control for versions, you don't control for the package's behavior. You end up assuming that a package has a particular behavior, when in reality that behavior is specific to a particular version of the package. That assumption, more often than not, is written nowhere, and you only discover it once you deploy on an instance that has the wrong package version.
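Pinning the version turns that silent assumption into something written down. On a Debian-style system that's a one-line change (the version string here is illustrative):

```shell
# Pinned install: every instance gets the same build of the package,
# instead of whatever the mirror happens to carry today.
apt-get install -y apache2=2.2.22-1ubuntu1
```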

Identifying package versions as the cause of an application's errant behavior is hard enough, but standardizing versions once the problem is discovered can be a challenge in and of itself. At one point we had services running three different versions of Apache. Once we began the work of standardizing our instances, we realized that each service required a unique upgrade strategy to account for the differences in Apache across those three versions.

To fix the snowflake problem, we first had to take an inventory. We had to document the configuration of each server, and find a tool that would help us concisely describe that configuration. Then, we needed to find a way to take that documentation and reproduce snowflakes in a non-production environment, figure out an upgrade path to standardize their configurations according to the services they were running, and finally upgrade them. The tool we found to help us with this task, and the tool that drives our automation to this day, was Chef.

What’s on the Menu

Chef is a configuration management system in three parts and several food-related metaphors. The three parts alone are confusing to beginners and the metaphors don’t help. So forget about the metaphors and forget about Chef. I’m going to explain its architecture without referring to either, and omit some details that are irrelevant to getting the gist of it.

First, you need a configuration server. This server can live anywhere on the internet. It keeps track of types that you define, where a type is a set of 3-tuples consisting of a software package, a version, and configuration information for that software package.

Second, you need a computer that you want to configure. On that computer, run a configuration daemon, assign it one or many types, and point it to your configuration server. The configuration server keeps the daemon up-to-date with changes that you make to the type definitions. The sole purpose of the daemon is to ensure that when type definitions change, the instance is in the state that is defined by the types you assign to the instance.

Finally, you need types. You define types locally, on your workstation, in files. It’s good practice to keep the definitions in version control, so that’s what you do. Once you finish defining or updating your types, you make the configuration server aware of that by using a configuration agent. The agent transfers the type definitions to the server, and the server then notifies the configuration daemons of the new type definitions. The daemons then do their thing.

Chef calls the configuration daemon the client, the configuration server the server, and the configuration agent knife. Types are called recipes. In any given configuration management system, the goal is to define types and assign them to instances. Sometimes the daemon and the server are one and the same, sometimes the agent and the server are one and the same. But all configuration management systems, logically, have these three parts.
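Mapped back into Chef's vocabulary, one entry of a (package, version, configuration) tuple becomes a resource declaration in a recipe. A minimal sketch, with illustrative names and version:

```ruby
# One "type" entry, expressed as a Chef recipe: the software package,
# a pinned version, and its configuration, declared together.
package 'apache2' do
  version '2.2.22-1ubuntu1'   # illustrative version pin
end

# The configuration part of the tuple: render the config file
# from a template shipped alongside the recipe.
template '/etc/apache2/apache2.conf' do
  source 'apache2.conf.erb'
  owner  'root'
  group  'root'
  mode   '0644'
end
```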

Institute Some Knowledge

Chef recipes are a specification of your infrastructure's components and their corresponding state. They describe the state each component should be in, not how it should come to be in that state. Instead of a pile of brittle scripts, recipes become a repository of institutional knowledge, accessible to everybody, whose representations are disentangled from their implementation. Chef just happens to know how to use that specification to put your system in the state it describes. Recipes are runnable documentation. They're code.
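The state-not-steps distinction is easiest to see in a single resource (a sketch, using a hypothetical service):

```ruby
# Declarative: state the desired end state. Chef works out the steps,
# and running this twice is safe -- if the service is already enabled
# and running, nothing happens.
service 'apache2' do
  action [:enable, :start]
end

# An imperative script, by contrast, encodes the steps themselves
# and breaks whenever the starting state differs from what it assumed.
```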

Whether it’s right or wrong, documentation and code almost never agree. When you know the code, this isn’t a big deal. But when you hire somebody new or bring somebody on from another team, they rely on documentation while they’re getting familiar with the code base. The more time they spend figuring stuff out, the more time they spend not being productive. And when the documentation fails them, they ask you questions that you wish the documentation could answer for you.

With Chef you can never run into this situation. The documentation of your infrastructure and the code you use to set it up are one and the same. If your new team member needs a development environment, he can use Chef to set it up. He won't understand how everything works together initially, and he doesn't need to if the priority is just to get up and running. And what he stands up will be exactly what everybody else is developing against, and exactly what's running in production. When he needs to understand more, because he needs to modify something or create something new, he can look at the recipes his environment uses and see how they fit together.
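Standing up such an environment can be a single command. A sketch using knife's bootstrap subcommand, where the host name, user, and run-list entries are all illustrative:

```shell
# Converge a fresh machine against the same run-list everybody
# else uses; the resulting environment matches production.
knife bootstrap dev-box.example.com \
  --ssh-user ubuntu \
  --sudo \
  --run-list 'recipe[base],recipe[myservice]'
```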

Write It Once, Reuse It

A recipe is reusable. You don’t write recipes for a single instance, or even a single instance type. You write recipes for individual software packages and system services that fit together to produce an operational application. For example, if you’re creating a new service, the responsible DevOps thing to do is add basic monitoring to any machine you run the service on. You can encapsulate all the package requirements and configuration into a single recipe that anybody on your team can reuse in their own services. By using recipes, everybody on your team benefits from reuse. And if anybody wants to improve on the recipe, everybody benefits from that as well.
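A hypothetical shared monitoring cookbook makes this concrete; the package and file names are illustrative:

```ruby
# cookbooks/monitoring/recipes/default.rb
# Baseline monitoring that any service can pull in with one line.
package 'collectd'

template '/etc/collectd/collectd.conf' do
  source   'collectd.conf.erb'
  notifies :restart, 'service[collectd]'
end

service 'collectd' do
  action [:enable, :start]
end

# In any service's own recipe, reuse is a single line:
#   include_recipe 'monitoring'
```

Improve the template or swap the monitoring agent once, and every service that includes the recipe picks up the change.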

Once you’ve been using Chef for a while, you reach the point where setting up a new service is less about writing recipes and more an exercise in combining existing ones to accomplish your goal. Things like OS configuration, monitoring, and language runtime setup are all recipes that you can just include. All that’s left is to write a recipe that installs your application and its dependencies, and configures them properly. In effect, Chef lets you turn your infrastructure into a Platform as a Service (PaaS). Like deploying on Heroku or AppFog, this setup allows you to focus on building your application rather than worrying about infrastructure. But because you control the PaaS, you can respect that abstraction boundary and use what’s already there, or you can customize and change it however you wish.
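In Chef, that kind of composition is often captured in a role. A sketch, with hypothetical recipe names:

```ruby
# roles/myservice.rb
# A new service is mostly composition: only the last recipe
# in this run-list had to be written from scratch.
name 'myservice'
run_list(
  'recipe[base]',        # OS configuration
  'recipe[monitoring]',  # shared monitoring setup
  'recipe[ruby]',        # language runtime
  'recipe[myservice]'    # the application itself
)
```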

The Essence of Automation

Automation is more than scripts that speed up deployment. When done right, automation is a framework that your engineering organization uses to define, document, and share the state of your system. It becomes the substrate of shared knowledge that everything feeds from: your infrastructure, your applications, even your team.

This story was originally published on the Prezi engineering blog on April 10, 2013.
