Background

I’ve been a practitioner of infra-as-code for my entire coding career. This means using some software/library/framework to define, in code or config, what real world objects (usually on the cloud) are desired. A large chunk of my work was done in Terraform (it was an internal joke to call it terrorform). The downside of Terraform is that you use its configuration language, rather than a programming language, to define the desired state of the world. Hence, on one of my teams, we wrapped Terraform with our own code generator to integrate better with how we define the state of our infra. All this was before Pulumi came onto the scene. Pulumi essentially replaced that code generator layer, and it aims to be an SDK for all the popular languages. This means people can manipulate Terraform resources in their favourite programming language. However, fundamentally, the concepts and goals of managing infra as code are unaffected by the tooling.

In this post, I write about why such tools are far from perfect.

Key Principle

Infra as code abstracts away the real cloud resources that are being deployed.

The key principle is to represent the state of the world within the modelling universe of the software/library/framework. There are typically three concepts to understand:

  • The real world objects.
  • The desired state of the world, expressed in a programming language (or configuration language).
  • A state file, describing what the modelling universe thinks the state of the real world is.

For example, if you have a real world EC2 Instance (VM instance), you could have some code expressing that there is an EC2 instance, and a state file storing a JSON object with the ID to identify the real world EC2 instance.
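The state file idea can be sketched in a few lines. This is a minimal, illustrative shape, not the real Terraform or Pulumi state format; the field names and the instance ID are made up.

```typescript
// A toy state file entry: what the modelling universe remembers about
// one real world resource. Real state formats are richer than this.
interface StateEntry {
  type: string;                        // resource type in the modelling universe
  id: string;                          // real world identifier (e.g. the EC2 instance ID)
  attributes: Record<string, string>;  // last known real world attributes
}

const state: StateEntry = {
  type: "aws:ec2/instance:Instance",
  id: "i-0abc123def456",               // hypothetical instance ID
  attributes: { instanceType: "t3.micro", state: "running" },
};
```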

When you want to make changes to the real world, you don’t do it directly. Instead, you make changes to the code, then review and apply them via a “modelling executor”, e.g. running pulumi update. Such an executor calculates the required changes to the real world, most of the time without actually talking to the real world, by comparing the state file with the desired state. It then translates the desired changes into the real API calls that need to be made.

For example, if I wanted to stop the EC2 instance, I would change the instance state of the EC2 instance in code, then run the executor, which would figure out from the state file and the desired state that it needs to call the “StopInstances” API.
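The executor’s job can be sketched as a pure function: compare the recorded state with the desired state and emit a plan of API calls. This is a toy version under obvious simplifications; real engines handle dependency graphs, partial failures, and far more resource types.

```typescript
// A toy "modelling executor" for a single EC2 instance: diff the state
// file against the desired state and plan the API calls to make.
type InstanceModel = { id: string; state: "running" | "stopped" };

function planApiCalls(current: InstanceModel, desired: InstanceModel): string[] {
  const calls: string[] = [];
  if (current.state === "running" && desired.state === "stopped") {
    calls.push(`StopInstances(${current.id})`); // stop a running instance
  } else if (current.state === "stopped" && desired.state === "running") {
    calls.push(`StartInstances(${current.id})`); // start a stopped instance
  }
  return calls; // empty plan means state already matches desire
}
```

Note that the plan is computed without any call to the cloud: it trusts that the state file is an accurate picture of the real world, which is exactly where drift comes in.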

Drift

It is important to understand the concept of drift, the bane of such modelling. The model is useless if the real world state differs from our state file. When this happens, we say there is drift.

This is not great because the model is no longer useful to reason about the state of the world. Imagine my code and the state file say that I have a single cheap EC2 instance. But in the real world, I have two beefy EC2 instances. If left unfixed, I would get a very shocking bill at the end of the month!
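Drift detection is conceptually just a comparison between the two pictures of the world. A sketch, with made-up instance IDs and a deliberately simplified snapshot type:

```typescript
// Toy drift detector: map of resource id -> instance type, one snapshot
// from the state file and one from asking the cloud directly.
type Snapshot = Record<string, string>;

function findDrift(stateFile: Snapshot, realWorld: Snapshot): string[] {
  const drifted: string[] = [];
  const ids = new Set([...Object.keys(stateFile), ...Object.keys(realWorld)]);
  for (const id of ids) {
    // Missing on either side, or differing attributes, both count as drift.
    if (stateFile[id] !== realWorld[id]) drifted.push(id);
  }
  return drifted;
}
```

In the shocking-bill scenario, the state file knows about one cheap instance while the cloud reports two beefy ones, so the extra instance surfaces as drift.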

Infra as Code is Still Just a Model

The above description sounds pretty utopian if we can prevent drift. But like all abstractions and models, it fails to represent the real world perfectly.

I’ll illustrate this with a few examples where it fails.

Two Step Changes

As mentioned above, changes to the real world are made via API calls. However, these endpoints may not have been implemented with infra-as-code in mind, so not all operations are RESTful or resource oriented (which is essentially the model described above). One class of issues is changes that require multiple steps.

Changing EC2 Instance Types

In the EC2 instance case, I might want to change the instance type. There is no single API to do this on AWS’s end. AWS expects you to do it with three API calls:

  1. Stop the instance
  2. Modify the instance type
  3. Start the instance
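The mismatch between these steps and a resource-oriented diff can be sketched as two plans. The API action names here match the real EC2 actions (StopInstances, ModifyInstanceAttribute, StartInstances, TerminateInstances, RunInstances), but this code only builds plan descriptions; it does not call AWS.

```typescript
// The three-step, in-place change AWS actually expects.
function inPlaceTypeChange(id: string, newType: string): string[] {
  return [
    `StopInstances(${id})`,
    `ModifyInstanceAttribute(${id}, instanceType=${newType})`,
    `StartInstances(${id})`,
  ];
}

// The destroy-and-recreate plan a naive resource diff proposes instead,
// which incurs downtime and a brand new instance ID.
function naiveReplace(id: string, newType: string): string[] {
  return [`TerminateInstances(${id})`, `RunInstances(instanceType=${newType})`];
}
```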

If I were to specify this change in code, the proposed change would instead be a teardown of the old resource followed by the creation of a new one. If someone inexperienced were to apply this without thinking, it would cause unnecessary downtime.

The above example is a bit contrived, because the diff would highlight that the instance type change is a destructive operation (destroy + recreate), which might alert the operator. But consider the following instead.

Changing RDS Instance Types

Imagine having an RDS database instance instead. Same story as before: we want to change the instance type. Because RDS was designed to be highly available, there is actually an API call to change the instance type while the database is still running. And by some design choices, Terraform and Pulumi decide to map to this API when the code indicates a need to change the instance type.

However, if you were to apply the changes, you would realize that the database hasn’t changed instance size. At this point, our model has drifted from the real world. And if you were to run the apply action again, you would notice that it is once again saying it will change the DB’s instance size, as if nothing had happened before! If you were inexperienced, you might at this point be violently pulling your hair out trying to understand what’s going on.

The explanation lies in understanding the underlying API. The modify-instance API call schedules the change for the DB maintenance period. So after applying the code change, the state of the real world is still the same as before, except that a side effect has been scheduled. This “schedule” is not captured in our modelling universe. Eventually, when the side effect has finished (the maintenance window is typically a particular day of the week), the real world will reconcile with our model. But until then, there will be drift. Even if we tracked this “schedule” object to prevent drift, we would have the converse problem after the schedule has passed: we would be tracking a nonexistent schedule.

At one of my previous teams, where we generated code, we worked around this by having two fields in code: one tracking the desired instance size, and one holding the current, read-only instance size. We then added some logic to generate code that ignored the instance type field, suppressing the drift.

Changing EC2 Userdata

Userdata is configuration or scripts that an instance can be configured with at boot time. The API to change it only works while the instance is in the stopped state. One obscure gotcha with Pulumi (and presumably Terraform) is that userdata changes do not show up in the diff, but can still cause drift, proposing an update to the instance. This means you could spend time scratching your head trying to decipher why Pulumi is proposing changes when none of the underlying fields appear to have changed.

GitHub Membership

Membership of a team on GitHub is a two step process: the member is invited via the API, and only once they accept the invite are they part of the team. Until the invite is accepted, Pulumi will report drift.
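The root cause is that the real world has a state the model cannot express. A sketch, with deliberately simplified state names:

```typescript
// The model only knows two states, but GitHub's real state machine has
// a third, "pending", which matches neither model state.
type ModelState = "member" | "absent";
type GithubState = "member" | "absent" | "pending";

function isDrift(model: ModelState, real: GithubState): boolean {
  // "pending" always compares unequal, so it always reads as drift.
  return model !== real;
}
```

Until the human clicks accept, every refresh sees `pending` against a model that says `member`, and reports drift through no fault of the operator.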

Updates Are Not Resources

Another class of gotchas is APIs that effect idempotent, last-write-wins updates rather than managing distinct resources. For example, imagine an API that sets the value of a variable x. If I have the following two resources in code:

const r1 = new XValue(1)
const r2 = new XValue(2)

After applying the change, I have two resources pointing at the same real world value x. It is not possible to deduce the real value of x, because it depends on which API call finished last.
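The indeterminacy is easy to demonstrate: the final value of x depends solely on call order, which the state file has no way of knowing. A sketch:

```typescript
// Two "resources" both set the same real world variable x. Each API
// call simply overwrites it, so the last call to land wins.
function applyAll(calls: Array<{ resource: string; value: number }>): number {
  let x = 0; // the single real world variable
  for (const call of calls) x = call.value;
  return x;
}
```

Run r1 then r2 and x ends up 2; run them the other way and it ends up 1. Both orderings are consistent with the same code and the same state file.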

AWS ECR Replication Config

ECR is an image container registry. It is common to replicate some images to other regions. This is done with “replication rules”, rule := (region, [repo matching pattern]), which say: replicate repositories matching the pattern to the specified region. In Pulumi, you can create a resource for this replication configuration. However, Pulumi doesn’t stop you from specifying multiple of these resources. For example, if I wanted to replicate to eu-west-1 and eu-west-2, I could create two resources, one per region.

However, that would be the wrong thing to do. What would happen is that only one region’s replication would be set up, and my Pulumi code would constantly be in drift.

What I should do instead is have a single resource configured with both rules.

What happened here is that the two resources each end up calling the API to create their rule, and the AWS API happily replaces the existing replication configuration each time.
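This replace-not-merge behaviour can be simulated. The function name below mirrors the real ECR action (PutReplicationConfiguration), which operates on the whole registry-level config, but this is a toy model of its semantics, not a call to AWS:

```typescript
// The replication config is registry-level, and putting it replaces
// the whole thing rather than appending to it.
type Rule = { region: string };

function putReplicationConfiguration(
  registry: { rules: Rule[] },
  rules: Rule[],
): void {
  registry.rules = rules; // replaces, never merges
}

const registry = { rules: [] as Rule[] };

// Two separate "resources", one rule each: the second call clobbers
// the first, leaving only eu-west-2 replicated.
putReplicationConfiguration(registry, [{ region: "eu-west-1" }]);
putReplicationConfiguration(registry, [{ region: "eu-west-2" }]);

// One resource carrying both rules: both regions survive.
putReplicationConfiguration(registry, [
  { region: "eu-west-1" },
  { region: "eu-west-2" },
]);
```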

AWS Security Group

AWS Security Groups are objects that define rules for ingress and egress traffic. They are the bread and butter of preventing internal servers from being accidentally exposed to the internet.

In Pulumi, you can create a SecurityGroup resource, and then create SecurityGroupAttachment resources, each specifying a rule. Now, if you have two attachments specifying the same rule and attached to the same security group, you would be creating two resources for a single real world object. As in the cases before, you would be faced with constant drift.

Update conflicts

Another common source of pain is runtime errors. Because the diff is calculated only from the stored state and the desired state, you won’t really know what will happen until the real API is called. For example, you could add a new GitHub member in code. The preview would essentially say “I know an API I can call to make this happen”. But only when applying it for real would you hit a 4xx error because you have run out of GitHub seats.
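The gap between preview and apply can be sketched as two functions: the preview reasons only over the two state pictures, while apply hits a server-side constraint the model knows nothing about. The seat limit and error text below are made up stand-ins for the real GitHub behaviour:

```typescript
// Hypothetical seat limit enforced only by the server, invisible to
// the diff algorithm.
const SEAT_LIMIT = 2;

// Preview: a pure diff. It sees a member to create and nothing else.
function preview(currentMembers: string[], desired: string[]): string[] {
  return desired.filter((m) => !currentMembers.includes(m));
}

// Apply: the real API call, which can fail for reasons the preview
// could never have predicted.
function apply(currentMembers: string[], toAdd: string[]): void {
  for (const m of toAdd) {
    if (currentMembers.length >= SEAT_LIMIT) {
      throw new Error("422: seat limit reached"); // surfaces only at apply time
    }
    currentMembers.push(m);
  }
}
```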

What Can We Do?

I hope I have convinced you there are headaches even with this seemingly powerful tool. But I want to cover some mitigations.

Don’t Trust Preview. Validate Changes.

Maintain a healthy skepticism even when the preview is all green. Just because the diff algorithm has identified which API calls are necessary, it does not mean those API calls will do what you want. If your organization can afford it, set up a staging environment that developers can safely deploy changes to before merging.

There are SaaS products out there that integrate with GitHub PR workflows to run the equivalent of pulumi update before merging the PR. This means a failed update can block a PR from merging, thereby preventing drift.

Set up locks

If a human is needed to nurse the real world into the desired state, it is helpful to have a guard preventing other unwanted automatic updates to the stack. In one of my previous teams, this was implemented as an S3 lockfile: Terraform would not apply updates as long as the lockfile existed. The lockfile can also carry a payload explaining who is holding the lock and why.
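The guard logic is simple enough to sketch. Here a Map stands in for the S3 bucket, and the key layout and payload format are made up for illustration:

```typescript
// Lockfile guard: refuse to apply while a lock object exists for the
// stack. In the real setup the lock lived in S3; a Map stands in here.
const bucket = new Map<string, string>(); // object key -> lock payload

function acquireLock(stack: string, payload: string): void {
  bucket.set(`locks/${stack}`, payload);
}

function releaseLock(stack: string): void {
  bucket.delete(`locks/${stack}`);
}

function applyUpdate(stack: string): string {
  const lock = bucket.get(`locks/${stack}`);
  if (lock !== undefined) {
    // The payload tells whoever hit this wall who to talk to and why.
    throw new Error(`stack is locked: ${lock}`);
  }
  return "applied";
}
```

The payload doubles as documentation at exactly the moment someone needs it: the failed apply prints who holds the lock and why, instead of a bare refusal.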

Conclusion

Infrastructure as code is a very powerful paradigm for managing large scale infrastructure. Being able to treat infrastructure as programmable code that can be easily shared and reused within the organization gives platform teams a powerful tool to build paved roads. However, it’s not a silver bullet, and it is definitely riddled with gotchas and leaky abstractions. In my experience, a successful team is one where the process is flexible enough that Pulumi changes can be applied for the purpose of testing that they work, then merged to keep things consistent. Having regular drift detection, either manually or automated via SaaS, is important for identifying drift early.