Update, June 25, 2024: This blog post series is now also available as a book called Fundamentals of DevOps and Software Delivery: A hands-on guide to deploying and managing production software, published by O’Reilly Media!
This is Part 2 of the Fundamentals of DevOps and Software Delivery series. In Part 1, you learned how to deploy your app using PaaS and IaaS, but it required a lot of manual steps clicking around a web UI. This is fine while you’re learning and experimenting, but if you manage everything at a company this way—what’s sometimes called ClickOps—it quickly leads to problems:
- Deployments are slow and tedious
-
So you can’t deploy too often.
- Deployments are error-prone
-
So you end up with lots of bugs, outages, and late-night debugging sessions.
- Only one person knows how to deploy
-
So that person is overloaded, they never have time for long-term improvements, and if they were to leave or get hit by a bus, everything would grind to a halt.[10]
Fortunately, these days, there is a better way to do things: you can manage your infrastructure as code (IaC). Instead of clicking around manually, you use code to define, deploy, update, and destroy your infrastructure. This represents a key insight of DevOps: most tasks that you used to do manually can now be automated using code, as shown in Table 6.
Task | How to manage as code | Example | Part |
---|---|---|---|
Provision servers | Provisioning tools | Use OpenTofu to deploy a server | This blog post |
Configure servers | Server templating tools | Use Packer to create an image of a server | This blog post |
Configure apps | Configuration files and services | Read configuration from a JSON file | |
Configure networking | Software-defined networking | Use Kubernetes networking | |
Build apps | Build systems | Build your app with NPM | |
Test apps | Automated tests | Write automated tests using Jest | |
Deploy apps | Automated deployment | Do a rolling deployment with Kubernetes | |
Scale apps | Auto scaling | Set up auto scaling policies in AWS | |
Recover from outages | Auto healing | Set up liveness probes in Kubernetes | |
Manage databases | Schema migrations | Use Knex.js to update your database schema | |
Test for compliance | Policy as code | Check compliance using Open Policy Agent | |
If you search around, you’ll quickly find that there are many tools out there that allow you to manage your infrastructure as code, including Chef, Puppet, Ansible, Pulumi, Terraform, OpenTofu, CloudFormation, Docker, Packer, and so on. Which one should you use? Many of the comparisons you find online between these tools do little more than list the general properties of each tool and make it sound like you could be equally successful with any of them. And while that’s true in theory, it’s not true in practice. There are considerable differences between these tools, and your odds of success go up significantly if you know how to pick the right tool for the job.
This blog post will help you navigate the IaC space by introducing you to the four most common categories of IaC tools:
- Ad hoc scripts: e.g., use a Bash script to deploy a server.
- Configuration management tools: e.g., use Ansible to deploy a server.
- Server templating tools: e.g., use Packer to build an image of a server.
- Provisioning tools: e.g., use OpenTofu to deploy a server.
You’ll work through examples where you deploy the same infrastructure using each of these approaches, which will allow you to see how different IaC categories perform across a variety of dimensions (e.g., verbosity, consistency, error handling, and so on), so that you can pick the right tool for the job.
Before digging into the details of various IaC tools, it’s worth asking, why bother? Learning and adopting new tools has a cost, so what are the benefits of IaC that make this worthwhile? This is the focus of the next section.
The Benefits of IaC
When your infrastructure is defined as code, you are able to use a wide variety of software engineering practices to dramatically improve your software delivery processes, including the following:
- Speed and safety
-
If the deployment process is automated, it will be significantly faster, since a computer can carry out the deployment steps far faster than a person, and safer, given that an automated process will be more consistent, more repeatable, and not prone to manual error.
- Documentation
-
If your infrastructure is defined as code, then the state of your infrastructure is in source files that anyone can read, rather than locked away in a single person’s head. In other words, IaC acts as documentation, allowing everyone in the organization to understand how things work.
- Version control
-
Storing your IaC source files in version control (which you’ll do in Part 4) makes it easier to collaborate on your infrastructure, debug issues (e.g., by checking the version history to find out what changed), and resolve issues (e.g., by reverting to a previous version).
- Validation
-
If the state of your infrastructure is defined in code, for every single change, you can perform a code review, run a suite of automated tests, and pass the code through static analysis tools—all practices that are known to significantly reduce the chance of defects (you’ll see examples of all of these practices in Part 4).
- Self-service
-
If your infrastructure is defined in code, developers can kick off their own deployments, instead of relying on others to do it.
- Reuse
-
You can package your infrastructure into reusable modules so that instead of doing every deployment for every product in every environment from scratch, you can build on top of known, documented, battle-tested pieces.
- Happiness
-
There is one other important, and often overlooked, reason for why you should use IaC: happiness. Manual deployments are repetitive and tedious. Most people resent this type of work, since it involves no creativity, no challenge, and no recognition. You could deploy code perfectly for months, and no one will take notice—until that one day when you mess it up. IaC offers a better alternative that allows computers to do what they do best (automation) and developers to do what they do best (creativity).
Now that you have a sense of why IaC is so valuable, in the following sections, you’ll explore the most common categories of IaC tools, starting with ad hoc scripts.
Ad Hoc Scripts
The first approach you might think of for managing your infrastructure as code is to use an ad hoc script. You take whatever task you were doing manually, break it down into discrete steps, and use your favorite scripting language (e.g., Bash, Ruby, Python) to capture each of those steps in code. When you run that code, it can automate the process of creating infrastructure for you. The best way to understand this is to try it out, so let’s go through an example of an ad hoc script written in Bash.
Example: Deploy an EC2 Instance Using a Bash Script
Example Code
As a reminder, you can find all the code examples in the blog post series’s sample code repo in GitHub.
As an example, let’s create a Bash script that automates all the manual steps you did in Part 1 to deploy a simple Node.js app in AWS. Head into the fundamentals-of-devops folder you created in Part 1 to work through the examples in this blog post series, and create a new subfolder for this part and the Bash script:
$ cd fundamentals-of-devops
$ mkdir -p ch2/bash
Copy the exact same user data script from Part 1 into a file called user-data.sh within the ch2/bash folder:
$ cp ch1/ec2-user-data-script/user-data.sh ch2/bash/
Next, create a Bash script called deploy-ec2-instance.sh, with the contents shown in Example 3:
#!/usr/bin/env bash

# Exit immediately if any command fails
set -e

export AWS_DEFAULT_REGION="us-east-2"

# Read the user data script that sits next to this script
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
user_data=$(cat "$SCRIPT_DIR/user-data.sh")

# (1)
security_group_id=$(aws ec2 create-security-group \
  --group-name "sample-app" \
  --description "Allow HTTP traffic into the sample app" \
  --output text \
  --query GroupId)

# (2)
aws ec2 authorize-security-group-ingress \
  --group-id "$security_group_id" \
  --protocol tcp \
  --port 80 \
  --cidr "0.0.0.0/0" > /dev/null

# (3)
instance_id=$(aws ec2 run-instances \
  --image-id "ami-0900fe555666598a2" \
  --instance-type "t2.micro" \
  --security-group-ids "$security_group_id" \
  --user-data "$user_data" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=sample-app}]' \
  --output text \
  --query 'Instances[0].InstanceId')

# Quote the --query expressions so the shell never treats [0] or [*] as glob patterns
public_ip=$(aws ec2 describe-instances \
  --instance-ids "$instance_id" \
  --output text \
  --query 'Reservations[*].Instances[*].PublicIpAddress')

# (4)
echo "Instance ID = $instance_id"
echo "Security Group ID = $security_group_id"
echo "Public IP = $public_ip"
If you’re not an expert in Bash syntax, all you have to know about this script is that it uses the AWS Command Line Interface (CLI) to do the following:
1 | Create a security group. |
2 | Update the security group to allow inbound HTTP requests on port 80. |
3 | Deploy an EC2 instance that uses that security group and runs the Node.js app on boot in a user data script. |
4 | Output the IDs of the security group and EC2 instance and the public IP of the EC2 instance. |
Watch out for snakes: these are simplified examples for learning, not for production
The examples in this blog post are still simplified for learning and not suitable for production usage, due to the security concerns and user data limitations explained in Watch out for snakes: these examples have several problems. You’ll see how to work around some of these limitations starting in the next blog post.
If you want to try the script out, you’ll first need to give the script execute permissions:
$ cd ch2/bash
$ chmod u+x deploy-ec2-instance.sh
Next, authenticate to AWS as described in Authenticating to AWS on the command line, and run the script as follows:
$ ./deploy-ec2-instance.sh
Instance ID = i-0335edfebd780886f
Security Group ID = sg-09251ea2fe2ab2828
Public IP = 52.15.237.52
After the script finishes, give the EC2 instance a minute or two to boot up, and then try opening up http://<Public IP> in your web browser, where <Public IP> is the IP address the script outputs at the very end.
You should see:
Hello, World!
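If you’d rather not keep refreshing the browser while the instance boots, you could poll the app from the command line instead. The following is a minimal sketch, not part of the sample code repo: the function name wait_for_app and its default retry settings are assumptions made for illustration, and it relies only on curl being installed.

```shell
#!/usr/bin/env bash

# Poll a URL until the response contains the expected text, or give up.
# Usage: wait_for_app <url> <expected-text> [max-attempts] [sleep-seconds]
# (wait_for_app is a hypothetical helper, not part of the deploy script.)
wait_for_app() {
  local url="$1"
  local expected="$2"
  local max_attempts="${3:-30}"
  local sleep_seconds="${4:-10}"
  local attempt body

  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # --silent hides the progress meter; --max-time caps each request at 5 seconds.
    # "|| true" keeps going when the server is not accepting connections yet.
    body=$(curl --silent --max-time 5 "$url" || true)
    if [[ "$body" == *"$expected"* ]]; then
      echo "App is up after $attempt attempt(s)"
      return 0
    fi
    sleep "$sleep_seconds"
  done

  echo "App did not respond after $max_attempts attempts" >&2
  return 1
}
```

For example, you could add wait_for_app "http://$public_ip" "Hello, World!" as the last line of deploy-ec2-instance.sh, so the script only returns once the app is actually serving traffic.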
Congrats, you are now managing your infrastructure as code! Well, sort of. This script, and most ad hoc scripts, have quite a few drawbacks in terms of using them to manage infrastructure, as discussed in the next section.
Get your hands dirty
Here are a few exercises you can try at home to go deeper:
When you’re done experimenting with this script, you should manually undeploy the EC2 instance by finding it in the EC2 Console: check the top right corner to make sure you’re in the same region used by the script (us-east-2), then on the Instances page, look for the instance ID the script outputs at the end, click "Instance state," and choose "Terminate instance" in the drop-down, as shown in Figure 19. This ensures that your account doesn’t start accumulating any unwanted charges.
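If you prefer the command line to the console, the same cleanup can be sketched with the AWS CLI. This is an illustrative sketch rather than part of the sample code: the undeploy function name is an assumption, and you pass it the instance ID and security group ID that deploy-ec2-instance.sh printed at the end.

```shell
#!/usr/bin/env bash
set -euo pipefail

export AWS_DEFAULT_REGION="us-east-2"

# Terminate the EC2 instance, then delete the security group it used.
# Usage: undeploy <instance-id> <security-group-id>
# (undeploy is a hypothetical helper name chosen for this example.)
undeploy() {
  local instance_id="$1"
  local security_group_id="$2"

  aws ec2 terminate-instances --instance-ids "$instance_id" > /dev/null

  # Block until the instance is fully terminated: AWS refuses to delete a
  # security group while an instance is still using it.
  aws ec2 wait instance-terminated --instance-ids "$instance_id"

  aws ec2 delete-security-group --group-id "$security_group_id" > /dev/null

  echo "Terminated $instance_id and deleted $security_group_id"
}
```

You would call it as undeploy i-0335edfebd780886f sg-09251ea2fe2ab2828, substituting your own IDs. Note that you have to remember those IDs yourself, because the script keeps no record of what it created.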
How Ad Hoc Scripts Stack Up
Below is a list of criteria, which I’ll refer to as the IaC category criteria in this blog post, that you can use to compare different categories of IaC tools. In this section, I’ll flesh out how ad hoc scripts stack up according to the IaC category criteria; in later sections, you’ll see how the other IaC categories perform along the same criteria, giving you a consistent way to compare the different options.
- CRUD
-
CRUD stands for create, read, update, and delete. To manage infrastructure as code, you typically need that code to support all four of these operations, whereas most ad hoc scripts only handle create. For example, this script can create a security group and EC2 instance, but if you run this script a second or third time, the script doesn’t know how to "read" the state of the world, so it has no awareness that the security group and EC2 instance already exist, and will always try to create new infrastructure from scratch. Likewise, this script has no built-in support for deleting any of the infrastructure it creates (which is why you had to terminate the EC2 instance manually). So while ad hoc scripts make it much faster to create infrastructure, they don’t really help you manage it.
- Scale
-
Solving the CRUD problem in an ad hoc script for a single EC2 instance is hard enough, but a real architecture may contain hundreds of instances, plus databases, load balancers, networking configuration, and so on, and there’s no easy way to scale up scripts to keep track of and manage so much infrastructure.
- Deployment strategies
-
In real-world architectures, you typically need to use various deployment strategies to roll out updates, such as zero-downtime rolling deployments, blue-green deployments, canary deployments, and so on (you’ll learn more about deployment strategies in Part 5). With ad hoc scripts, you’d have to write the logic for each deployment strategy from scratch.
- Idempotency and error handling
-
To manage infrastructure, you typically want code that is idempotent, which means it can be re-run multiple times and still produce the desired result. Most ad hoc scripts are not idempotent and do not handle errors gracefully. If you hit an error partway through running this script, it just exits, leaving work in a partially completed state, but retaining no memory of what the script got done. If you then try to re-run the script, you’ll often get a different error because some of the partially completed work will now interfere with the new work the script is trying to do. For example, perhaps you ran the script the first time, and it created the security group called "sample-app" successfully, but when it tried to create the EC2 instance, AWS was out of capacity, and you got an error. If you wait until AWS has more capacity and try to re-run the script, you’ll now get an error as the script tries to create a security group called "sample-app" again, which isn’t allowed, as AWS requires security group names to be unique within a VPC.
- Consistency
-
The great thing about ad hoc scripts is that you can use any programming language you want, and you can write the code however you want. The terrible thing about ad hoc scripts is that you can use any programming language you want, and you can write the code however you want. I wrote the Bash script one way; you might write it another way; your coworker may choose a different language entirely. If you’ve ever had to maintain a large repository of ad hoc scripts, you know that it almost always devolves into a mess of unmaintainable spaghetti code. As you’ll see shortly, tools that are designed specifically for managing infrastructure as code often provide a single, idiomatic way to solve each problem, so that your codebase tends to be more consistent and easier to maintain.
- Verbosity
-
The Bash script to launch a simple EC2 instance, plus the user data script, add up to around 80 lines of code—and that’s without the code for CRUD, deployment strategies, idempotency, and error handling. An ad hoc script that handles all of these properly would be hundreds or thousands of lines of code. And we’re talking about just one EC2 instance; your production infrastructure may include hundreds of instances, plus databases, load balancers, network configurations, and much more. The amount of custom code it takes to manage all of this with ad hoc scripts quickly becomes untenable. As you’ll see shortly, tools that are designed specifically for managing infrastructure as code typically provide APIs that are more concise for accomplishing common infrastructure tasks.
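To get a feel for how much extra code idempotency takes, here is a minimal sketch of making just the security group step safe to re-run by "reading" before "creating." The function names find_security_group and ensure_security_group are made up for this example, and the sketch assumes the group lives in a single VPC, so a name lookup returns at most one match.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Look up a security group by name. Prints its ID, or nothing if it doesn't exist.
# (Hypothetical helper; assumes the name matches at most one group.)
find_security_group() {
  local name="$1"
  local id
  id=$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=$name" \
    --query 'SecurityGroups[0].GroupId' \
    --output text)
  # With --output text, the AWS CLI prints "None" when the query matches nothing.
  if [[ "$id" != "None" ]]; then
    echo "$id"
  fi
}

# Create the security group only if it is missing, and print its ID either way.
ensure_security_group() {
  local name="$1"
  local id
  id=$(find_security_group "$name")
  if [[ -z "$id" ]]; then
    id=$(aws ec2 create-security-group \
      --group-name "$name" \
      --description "Allow HTTP traffic into the sample app" \
      --query GroupId \
      --output text)
  fi
  echo "$id"
}
```

With this in place, security_group_id=$(ensure_security_group "sample-app") returns the same ID no matter how many times you run it. That is roughly 30 lines for one resource, and it still doesn’t cover the ingress rule, the EC2 instance, updates, or deletion, which is the verbosity problem in a nutshell.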
Ad hoc scripts have always been, and will always be, a big part of software delivery. They are the glue and duct tape of the DevOps world. However, they are not the best choice as a primary tool for managing infrastructure as code.
Key takeaway #1
Ad hoc scripts are great for small, one-off tasks, but not for managing all your infrastructure as code.
If you’re going to be managing all of your infrastructure as code, you should use an IaC tool that is purpose-built for the job, such as one of the ones discussed in the next several sections.
Configuration Management Tools
After trying out ad hoc scripts, and hitting all the issues mentioned in the previous section, the software industry moved on to configuration management tools, such as Chef, Puppet, and Ansible. These tools first started to appear before cloud computing was ubiquitous, so the way they were originally designed was to assume someone else had done the work of setting up the hardware (e.g., your Ops team racked the servers in your own data center), and the primary purpose of these tools was to handle the software, including configuring the operating system, installing dependencies, deploying and updating apps, and so on.
Each configuration management tool has you write code in a different domain-specific language (DSL): for example, with Chef, you write code in a DSL built on top of Ruby; with Puppet, you write code in a custom declarative language specifically designed for Puppet; with Ansible, you write code in a DSL built on top of YAML. Once you’ve written the code, most configuration management tools use a mutable infrastructure paradigm, where you have long-running servers that the configuration management tools update (mutate) over and over again, over many years. In order to update your servers, configuration management tools rely on the following two items:
- Master servers
-
You run one or more master servers (Chef Server, Puppet Server, or Ansible Automation Controller[11]), which are responsible for communicating with the rest of your servers, tracking the state of those servers, and running a reconciliation loop that continuously ensures the configuration of each server matches your desired configuration. The master servers also typically provide a central UI and API that you can use to see the state of your servers, perform various operations, and generate reports.
- Agents
-
Chef and Puppet require you to install custom agents (Chef Client and Puppet Agent) on each server, which are responsible for connecting to and authenticating with the master servers. You can configure the master servers to either push changes to these agents, or to have the agents pull changes from the master servers. Ansible, on the other hand, pushes changes to your servers over SSH, which is pre-installed on most servers by default (you’ll learn more about SSH in Part 7). Whether you rely on agents or SSH, this leads to a chicken-and-egg problem: in order to be able to configure your servers (with configuration management tools), you first have to configure your servers (install agents or set up SSH authentication). Solving this chicken-and-egg problem requires either manual intervention or external tools (e.g., you’ll see an example shortly of how you can use AWS APIs to configure SSH access for Ansible).
The best way to understand configuration management is to see it in action, so let’s go through an example of using Ansible.
Example: Deploy an EC2 Instance Using Ansible
To be able to use configuration management, the first thing you need is a server. If you have an existing server you can use—e.g., a physical server on-prem or a virtual server in the cloud—and you have SSH access to that server, you can skip this section, and go to the next one.
If you don’t have a server you can use, this section will show you how to deploy an EC2 instance using Ansible. Note that deploying and managing servers (hardware) is not really what configuration management tools were designed to do—later in this blog post, you’ll see how provisioning tools are typically a better fit for this task—but for spinning up a single server for learning and testing, Ansible is good enough.
Create a new folder called ansible:
$ cd fundamentals-of-devops
$ mkdir -p ch2/ansible
$ cd ch2/ansible
Inside the ansible folder, create an Ansible playbook called create_ec2_instance_playbook.yml, with the contents shown in Example 4:
- name: Deploy an EC2 instance in AWS
  hosts: localhost
  gather_facts: no
  environment:
    AWS_REGION: us-east-2
  tasks:
    - name: Create security group                       # (1)
      amazon.aws.ec2_security_group:
        name: sample-app-ansible
        description: Allow HTTP and SSH traffic
        rules:
          - proto: tcp
            ports: [8080]
            cidr_ip: 0.0.0.0/0
          - proto: tcp
            ports: [22]
            cidr_ip: 0.0.0.0/0
      register: aws_security_group

    - name: Create a new EC2 key pair                   # (2)
      amazon.aws.ec2_key:
        name: ansible-ch2
        file_name: ansible-ch2.key                      # (3)
      no_log: true
      register: aws_ec2_key_pair

    - name: Create EC2 instance with Amazon Linux 2023  # (4)
      amazon.aws.ec2_instance:
        name: sample-app-ansible
        key_name: "{{ aws_ec2_key_pair.key.name }}"
        instance_type: t2.micro
        security_group: "{{ aws_security_group.group_id }}"
        image_id: ami-0900fe555666598a2
        tags:
          Ansible: ch2_instances                        # (5)
Instead of a general-purpose programming language (GPL), such as Bash or Ruby or Python, Ansible uses a DSL defined on top of YAML. The YAML in the preceding playbook does the following:
1 | Create a security group: Allow inbound HTTP requests on port 8080 and inbound SSH requests on port 22. |
2 | Create an EC2 key pair: An EC2 key pair is a public/private key pair that can be used to authenticate to an EC2 instance. |
3 | Save the private key: Store the private key of the EC2 key pair locally in a file called ansible-ch2.key. You’ll use this private key in the next section to authenticate to the EC2 instance. |
4 | Deploy an EC2 instance: The instance uses the security group and public key from the previous steps. |
5 | Tag the instance: This sets the Ansible tag on the instance to "ch2_instances." You’ll use this tag in the next section. |
To run this Ansible playbook, install Ansible, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following:
$ ansible-playbook -v create_ec2_instance_playbook.yml
You should get log output that looks something like this (truncated for readability):
PLAY [Deploy an EC2 instance in AWS]

TASK [Create security group]
changed: [localhost] => {"changed": true, "description": "..."}

TASK [Create a new EC2 key pair]
changed: [localhost] => {"censored": "...", "changed": true}

TASK [Create EC2 instance with Amazon Linux 2023]
changed: [localhost] => {"changed": true, "instance_ids": ["..."]}

PLAY RECAP
localhost: ok=3 changed=3 unreachable=0 failed=0
Now that you have a server to work with, you can see what configuration management tools are really designed to do: configuring servers to run software.
Example: Configure a Server Using Ansible
In order for Ansible to be able to configure your servers, you have to provide an inventory, which is a file that specifies which servers you want configured, and how to connect to them. If you have a set of physical servers on-prem, you can put the IP addresses of those servers in an inventory file, as shown in Example 5:
webservers:
  hosts:
    10.16.10.1:
    10.16.10.2:
dbservers:
  hosts:
    10.16.20.1:
    10.16.20.2:
    10.16.20.3:
The preceding file organizes your servers into groups: the webservers group has two servers in it and the dbservers group has three servers. You’ll then be able to write Ansible playbooks that target specific groups.
If you are running servers in the cloud, where servers come and go often, and IP addresses change more frequently, you’re better off using an inventory plugin that can dynamically discover your servers. For example, if you deployed an EC2 instance in AWS in the previous section, you can use the aws_ec2 inventory plugin by creating a file called inventory.aws_ec2.yml with the contents shown in Example 6:
plugin: amazon.aws.aws_ec2
regions:
  - us-east-2
keyed_groups:
  - key: tags.Ansible       # (1)
    leading_separator: ''   # (2)
This code does the following:
1 | Create groups based on the Ansible tag of the instance. In the previous section, you set this tag to "ch2_instances," so that will be the name of the group. |
2 | By default, Ansible adds a leading underscore to group names. This disables it so the group name matches the tag name. |
For each group in your inventory, you can also specify group variables to configure how to connect to the servers in that group. You define these variables in YAML files in the group_vars folder, with the name of the file set to the name of the group. For example, for the EC2 instance in the ch2_instances group, you should create a file in group_vars/ch2_instances.yml with the contents shown in Example 7:
ansible_user: ec2-user                         # (1)
ansible_ssh_private_key_file: ansible-ch2.key  # (2)
ansible_host_key_checking: false               # (3)
The variables this file defines are:
1 | Use "ec2-user" as the username to connect to the EC2 instance. This is the username you need to use with Amazon Linux AMIs. |
2 | Use the private key at ansible-ch2.key to authenticate to the instance. This is the private key of the EC2 key pair the playbook created in the previous section. |
3 | Skip host key checking so you don’t get interactive prompts from Ansible. |
Alright, with the inventory stuff out of the way, you can now create a playbook to configure your server to run the Node.js sample app. Create a file called configure_sample_app_playbook.yml with the contents shown in Example 8:
- name: Configure the EC2 instance to run a sample app
  hosts: ch2_instances   # (1)
  gather_facts: true
  become: true
  roles:
    - sample-app         # (2)
This playbook does two things:
1 | Target the servers in the ch2_instances group, which should be a group with the EC2 instance you deployed in the previous section. If you are configuring some other server (e.g., your own servers on-prem), update this to the name of the group to target in your inventory file. |
2 | Configure the servers using an Ansible role called sample-app, as discussed next. |
An Ansible role is a structured way to organize tasks, templates, files, and other configuration you might want to apply to a server. The standard folder structure for Ansible roles looks like this:
roles
└── <role-name>
    ├── defaults
    │   └── main.yml
    ├── files
    │   └── foo.txt
    ├── handlers
    │   └── main.yml
    ├── tasks
    │   └── main.yml
    ├── templates
    │   └── foo.txt.j2
    └── vars
        └── main.yml
Each folder has a specific purpose: e.g., the tasks folder defines tasks to run on a server; the files folder has files to copy to the server; the templates folder lets you use Jinja templates to dynamically fill in data in files; and so on. Having this standardized structure makes it easier to navigate and understand an Ansible code base.
To create the sample-app role for this playbook, create a roles/sample-app folder in the same directory as configure_sample_app_playbook.yml:
.
├── configure_sample_app_playbook.yml
├── group_vars
├── inventory.aws_ec2.yml
└── roles
    └── sample-app
        ├── files
        │   └── app.js
        └── tasks
            └── main.yml
Within roles/sample-app, you should create files and tasks subfolders, which are the only parts of the standardized role folder structure you’ll need for this simple example. Copy the Node.js sample app you saw earlier in Section 1.2.1 into files/app.js:
$ cp ../../ch1/sample-app/app.js roles/sample-app/files/
Next, create tasks/main.yml with the code shown in Example 9:
- name: Add Node packages to yum   # (1)
  shell: curl -fsSL https://rpm.nodesource.com/setup_21.x | bash -

- name: Install Node.js
  yum:
    name: nodejs

- name: Copy sample app            # (2)
  copy:
    src: app.js
    dest: app.js

- name: Start sample app           # (3)
  shell: nohup node app.js &
This code does the following:
1 | Install Node.js: Use the shell module to run a command on the server to add Node packages to yum, and then use the yum module to install Node.js. |
2 | Copy the sample app: Use the copy module to copy app.js to the server. |
3 | Start the sample app: Use the shell module to execute the node binary to run the app in the background. |
To run this playbook, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following command:
$ ansible-playbook -v -i inventory.aws_ec2.yml configure_sample_app_playbook.yml
You should get log output for each step, including a recap at the end that looks something like this:
PLAY RECAP
xxx.us-east-2.compute.amazonaws.com : ok=5 changed=4 failed=0
The value on the left, "xxx.us-east-2.compute.amazonaws.com," is a domain name you can use to access the instance. Open http://xxx.us-east-2.compute.amazonaws.com:8080 (note it’s port 8080 this time, not 80) in your web browser, and you should see:
Hello, World!
Congrats, you’re now using a configuration management tool to manage your infrastructure as code!
Get your hands dirty
Here are a few exercises you can try at home to go deeper:
When you’re done experimenting with Ansible, you should manually undeploy the EC2 instance by finding it in the EC2 Console: check the top right corner to make sure you’re in the same region used by the Ansible playbook (us-east-2), then on the Instances page, look for the instance ID shown in the playbook’s log output, click "Instance state," and choose "Terminate instance" in the drop-down, as shown in Figure 19. This ensures that your account doesn’t start accumulating any unwanted charges.
How Configuration Management Tools Stack Up
Here is how configuration management tools stack up using the IaC category criteria:
- CRUD
-
Most configuration management tools support three of the four CRUD operations: they can create the initial configuration, read the current configuration to see if it matches the desired configuration, and if not, update the existing configuration. That said, support for read and update is a bit hit or miss. It works well for reading and updating the configuration within a server (if you use tasks that are idempotent, as you’ll see shortly), but for managing the servers themselves, or any other type of cloud infrastructure, it only works if you remember to assign each piece of infrastructure a unique name or tag, which is easy to do with just a handful of resources, but becomes more challenging at scale. Most configuration management tools do not support delete (which is why you had to undeploy the EC2 instance manually).
- Scale
-
Most configuration management tools are designed specifically for managing multiple remote servers. For example, you could easily update the preceding Ansible code to deploy 3 EC2 instances, and Ansible will automatically configure all 3 to run the web server (you’ll see an example of this in Part 3).
- Deployment strategies
-
Some configuration management tools have built-in support for deployment strategies. For example, Ansible has built-in support for rolling deployments, so if you deployed 20 servers, then updated the configuration in the Ansible role (e.g., to deploy a new version of the app) and re-ran Ansible, it could roll out the change in batches (e.g., updating 5 servers at a time), with zero downtime.
- Idempotency and error handling
-
Some tasks you do with configuration management tools are idempotent, some are not. For example, the yum task in Ansible is idempotent: it only installs the software if it’s not installed already, so it’s safe to re-run that task as many times as you want. On the other hand, arbitrary shell tasks may or may not be idempotent, depending on what shell commands you execute. For example, the preceding playbook uses a shell task to directly execute the node binary, which is not idempotent. After the first run, subsequent runs of this playbook will fail, as the Node.js app is already running and listening on port 8080, so you’ll get an error about conflicting ports. In Part 3, you’ll see a better way of running apps with Ansible that is idempotent.
- Consistency
-
Most configuration management tools enforce a consistent, predictable structure to the code, including documentation, file layout, clearly named parameters, secrets management, and so on. While every developer organizes their ad hoc scripts in a different way, most configuration management tools come with a set of conventions that makes it easier to navigate and maintain the code, as you saw with the folder structure for Ansible roles.
- Verbosity
-
Most configuration management tools provide a DSL for specifying server configuration that is more concise than the equivalent in an ad hoc script. For example, you saw Ansible’s YAML-based DSL. At first, it might not seem like the code is any shorter than the Bash script: in fact, it’s roughly equal, with around 80 lines of Bash code (script to deploy EC2 instance plus user data script) versus about 80 lines of YAML with Ansible (playbook plus role). However, the 80 lines of Ansible code are doing considerably more: the Ansible code supports most CRUD operations, deployment strategies, idempotency, scaling operations to many servers, and consistent code structure. An ad hoc script that supported all of this would be many times the length.
Configuration management tools brought a number of advantages over ad hoc scripts, but they also introduced their own drawbacks. One big drawback is that some configuration management tools have a considerable setup cost: e.g., you may need to set up master servers and ways to connect to all your other servers (agents or SSH). A second big drawback is that most configuration management tools were designed for a mutable infrastructure paradigm: this can be problematic due to configuration drift, where over time, your long-running servers can build up unique histories of changes, so each server is subtly different from the others, which can make it hard to reason about what’s deployed and debug issues.
As cloud and virtualization become more and more ubiquitous, it’s becoming more common to use an immutable infrastructure paradigm, where instead of long-running physical servers, you use short-lived virtual servers that you replace every time you do an update. This is inspired by functional programming, where variables are immutable, so after you’ve set a variable to a value, you can never change that variable again, and if you need to update something, you create a new variable. Because variables never change, it’s a lot easier to reason about your code.
The idea behind immutable infrastructure is similar: once you’ve deployed a server, you never make changes to it again. If you need to update something, such as deploying a new version of your code, you deploy a new server. Because servers never change after being deployed, it’s a lot easier to reason about what’s deployed. The typical analogy used here (my apologies to vegetarians and animal lovers), is cattle vs pets: with mutable infrastructure, you treat your servers like pets, giving each one its own unique name, taking care of it, and trying to keep it alive as long as possible; with immutable infrastructure, you treat your servers like cattle, each one more or less indistinguishable from the others, with random or sequential IDs instead of names, and you kill them off and replace them regularly.
Key takeaway #2
Configuration management tools are great for managing the configuration of servers, but not for deploying the servers themselves, or other infrastructure.
While it’s possible to use configuration management tools with immutable infrastructure patterns, it’s not what they were originally designed for, and that led to new approaches, as discussed in the next section.
Server Templating Tools
An alternative to configuration management that has been growing in popularity recently is to use server templating tools, such as virtual machines and containers. Instead of launching a bunch of servers and configuring them by running the same code on each one, the idea behind server templating tools is to create an image of a server that captures a fully self-contained "snapshot" of the operating system (OS), the software, the files, and all other relevant details. You can then use some other IaC tool (e.g., provisioning tools, as you’ll see in the next section) to install that image on all of your servers.
As shown in Figure 21, there are two categories of tools for working with images:
- Virtual machines
-
A virtual machine emulates an entire computer system, including the hardware. You run a hypervisor, such as VMware vSphere, VirtualBox, or Parallels, to virtualize (i.e., simulate) the underlying CPU, memory, hard drive, and networking. The benefit of this is that any VM image that you run on top of the hypervisor can see only the virtualized hardware, so it’s fully isolated from the host machine and any other VM images, and it will run exactly the same way in all environments (e.g., your computer, a QA server, a production server). The drawback is that virtualizing all this hardware and running a totally separate OS for each VM incurs a lot of overhead in terms of CPU usage, memory usage, and startup time. You can define VM images as code using tools such as Packer (which you typically use to create images for production servers) and Vagrant (which you typically use to create images for local development).
- Containers
-
A container emulates the user space of an OS.[12] You run a container engine, such as Docker or cri-o, to isolate processes, memory, mount points, and networking. The benefit of this is that any container you run on top of the container engine can see only its own user space, so it’s isolated from the host machine and other containers, and will run exactly the same way in all environments (your computer, a QA server, a production server, etc.). The drawback is that all of the containers running on a single server share that server’s OS kernel and hardware, so it’s much more difficult to achieve the level of isolation and security you get with a VM.[13] However, because the kernel and hardware are shared, your containers can boot up in milliseconds and have virtually no CPU or memory overhead. You can define container images as code using tools such as Docker.
You’ll go through an example of using container images with Docker in Part 3. In this blog post, let’s go through an example of using VM images with Packer.
Example: Create a VM Image Using Packer
As an example, let’s take a look at using Packer to create a VM image for AWS called an Amazon Machine Image (AMI). First, create a folder called packer:
$ cd fundamentals-of-devops
$ mkdir -p ch2/packer
$ cd ch2/packer
Next, copy the Node.js sample app you saw earlier in Section 1.2.1 into the packer folder:
$ cp ../../ch1/sample-app/app.js .
Create a Packer template called sample-app.pkr.hcl, with the contents shown in Example 10:
packer {
  required_plugins {
    amazon = {
      version = ">= 1.3.1"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

source "amazon-ebs" "amazon_linux" { # (1)
  ami_name        = "sample-app-packer-${uuidv4()}"
  ami_description = "Amazon Linux 2023 AMI with a Node.js sample app."
  instance_type   = "t2.micro"
  region          = "us-east-2"
  source_ami      = "ami-0900fe555666598a2"
  ssh_username    = "ec2-user"
}

build { # (2)
  sources = ["source.amazon-ebs.amazon_linux"]

  provisioner "file" { # (3)
    source      = "app.js"
    destination = "/home/ec2-user/app.js"
  }

  provisioner "shell" { # (4)
    inline = [
      "curl -fsSL https://rpm.nodesource.com/setup_21.x | sudo bash -",
      "sudo yum install -y nodejs"
    ]
    pause_before = "30s"
  }
}
You create Packer templates using the HashiCorp Configuration Language (HCL) in files with a .pkr.hcl extension. The preceding template does the following:
(1) Source images: Packer will start a server running each source image you specify. The preceding code will result in Packer starting an EC2 instance running the Amazon Linux AMI you saw in the Bash and Ansible examples.
(2) Build steps: Packer then connects to the server (e.g., via SSH) and runs the build steps in the order you specified. When all the build steps have finished, Packer will take a snapshot of the server and shut the server down. The preceding example runs two build steps, as described in (3) and (4), and the snapshot it creates is an AMI that has everything installed and configured to run the sample app.
(3) File provisioner: The first build step runs a file provisioner to copy files to the server. The preceding code uses this to copy the Node.js sample app code in app.js to the server.
(4) Shell provisioner: The second build step runs a shell provisioner to execute shell commands on the server. The preceding code uses this to install Node.js.
So this Packer template is nearly identical to the Bash script and Ansible playbook, except the result of executing Packer is not a server running your app, but the image of a server with your app and all its dependencies installed. The idea is to use other IaC tools to launch one or more servers running that image; you’ll see an example later in this blog post of using OpenTofu to launch an EC2 instance running this AMI.
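One thing the template hard-codes is the source_ami ID, which is specific to a region and goes stale as new Amazon Linux releases come out. As a rough sketch of an alternative, the amazon-ebs source also accepts a source_ami_filter block that looks up the base image at build time. The name pattern below is an assumption about how Amazon Linux 2023 AMIs are named, so check it against the AMIs available in your account before relying on it:

```hcl
source "amazon-ebs" "amazon_linux" {
  ami_name        = "sample-app-packer-${uuidv4()}"
  ami_description = "Amazon Linux 2023 AMI with a Node.js sample app."
  instance_type   = "t2.micro"
  region          = "us-east-2"
  ssh_username    = "ec2-user"

  # Replaces the hard-coded source_ami. The name pattern is an assumption
  # about Amazon Linux 2023 naming; adjust it to match what you see in AWS.
  source_ami_filter {
    filters = {
      name                = "al2023-ami-2023.*-x86_64"
      virtualization-type = "hvm"
      root-device-type    = "ebs"
    }
    most_recent = true       # pick the newest image that matches the filters
    owners      = ["amazon"] # only consider images published by Amazon
  }
}
```

The trade-off is reproducibility: with a filter, two builds run a month apart may start from different base images, whereas a pinned source_ami always starts from the same one.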
If you want to try the Packer template out, install Packer, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following commands:
$ packer init sample-app.pkr.hcl
$ packer build sample-app.pkr.hcl
The first command, packer init, installs any plugins used in this Packer template. Packer can create images for many cloud providers (e.g., AWS, GCP, Azure), and the code for each of these providers lives not in the Packer binary itself, but in separate plugins that you install via the init command. The second command, packer build, kicks off the build process. When the build is done, which typically takes 3-5 minutes, you should see some log output that looks like this:
==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs.amazon_linux: AMIs were created:
us-east-2: ami-0ee5157dd67ca79fc
Congrats, you’re now using a server templating tool to manage your server configuration as code! The ami-xxx value is the ID of the AMI that was created from this template. Save the value somewhere, as later in this post, you’ll see an example of how to deploy this AMI.
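If you’d rather not copy the AMI ID out of the log output by hand, one option is Packer’s manifest post-processor, which writes the build artifacts to a JSON file. This is a minimal sketch of how the build block could be extended; the file name manifest.json is an arbitrary choice, not something the original template uses:

```hcl
build {
  sources = ["source.amazon-ebs.amazon_linux"]

  # ... (file and shell provisioners from the template above omitted) ...

  # After a successful build, record the resulting artifacts (including
  # the region and AMI ID) in a local JSON file.
  post-processor "manifest" {
    output = "manifest.json" # hypothetical file name; pick any path you like
  }
}
```

After packer build finishes, the AMI ID appears in the artifact_id field of the most recent entry in that file, prefixed with the region.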
Get your hands dirty
Here are a few exercises you can try at home to go deeper:
How Server Templating Tools Stack Up
How do server templating tools stack up using the IaC category criteria?
- CRUD
-
Server templating only needs to support the create operation in CRUD. This is because server templating is a key component of the shift to immutable infrastructure: if you need to roll out a change, instead of updating an existing server, you use your server templating tool to create a new image, and deploy that image on a new server. So, with server templating, you’re always creating totally new images; there’s never a reason to read, update, or delete. That said, server templating tools aren’t used in isolation; you need some other tool to deploy these images (e.g., a provisioning tool, as you’ll see shortly), and you typically want that tool to support all CRUD operations.
- Scale
-
Server templating tools scale very well, as you can create an image once, and then roll that same image out to 1 server or 1,000 servers, as necessary.
- Deployment strategies
-
Server templating tools only create images; you use other tools and whatever deployment strategies those tools support to roll the new images out.
- Idempotency and error handling
-
Server templating tools are idempotent by design. Since you create a new image every time, the tool just executes the exact same steps every time. If you hit an error part of the way through, just re-run, and try again.
- Consistency
-
Most server templating tools enforce a consistent, predictable structure to the code, including documentation, file layout, clearly named parameters, secrets management, and so on.
- Verbosity
-
Because server templating tools don’t have to deal with most CRUD operations and are idempotent "for free," the amount of code you need is typically pretty small. Moreover, server templating tools provide concise DSLs. As a result, the code tends to be fairly short.
Key takeaway #3
Server templating tools are great for managing the configuration of servers with immutable infrastructure practices.
As I mentioned a few times, server templating tools are powerful, but they don’t work by themselves. You need another tool to actually deploy and manage the images you create, such as provisioning tools, which are the focus of the next section.
Provisioning Tools
Whereas configuration management and server templating define the code that runs on each server, provisioning tools such as OpenTofu, Terraform, CloudFormation, OpenStack Heat, and Pulumi are responsible for creating the servers themselves. In fact, you can use provisioning tools to create not only servers but also databases, caches, load balancers, queues, monitoring, subnet configurations, firewall settings, routing rules, TLS certificates, and many other aspects of your infrastructure.
Under the hood, most provisioning tools work by translating the code you write into API calls to the cloud provider you’re using. For example, if you write OpenTofu code to create a server in AWS (which you will do in the next section), when you run OpenTofu, it will parse your code, and based on the configuration you specify, make a number of API calls to AWS to create an EC2 instance, security group, etc.
That means that, unlike with configuration management tools, you don’t have to do any extra work to set up master servers or connectivity. All of this is handled using the APIs and authentication mechanisms already provided by the cloud you’re using. Let’s see this in action by going through an example with OpenTofu.
Example: Deploy an EC2 Instance Using OpenTofu
Terraform versus OpenTofu
Terraform is a popular provisioning tool that HashiCorp open sourced in 2014 under the Mozilla Public License (MPL) 2.0. In 2023, HashiCorp switched Terraform to the non-open source Business Source License (BSL). As a result, the community created OpenTofu, a fork of Terraform that remains open source under the MPL 2.0 license, and is managed by the Linux Foundation. I prefer to use open source tools whenever possible, so this blog post series will use OpenTofu for all example code, but most of the examples should work with Terraform as well.
As an example of using a provisioning tool, let’s create an OpenTofu module that can deploy an EC2 instance. You write OpenTofu modules in HCL (the same language you used with Packer), in configuration files with a .tf extension. OpenTofu will find all files with the .tf extension in a folder, so you can name the files whatever you want, but there are some standard conventions, such as putting the main resources in main.tf, input variables in variables.tf, and output variables in outputs.tf.
First, create a new tofu/ec2-instance folder for the module:
$ cd fundamentals-of-devops
$ mkdir -p ch2/tofu/ec2-instance
$ cd ch2/tofu/ec2-instance
Within the tofu/ec2-instance folder, create a file called main.tf, with the contents shown in Example 11:
provider "aws" { # (1)
  region = "us-east-2"
}

resource "aws_security_group" "sample_app" { # (2)
  name        = "sample-app-tofu"
  description = "Allow HTTP traffic into the sample app"
}

resource "aws_security_group_rule" "allow_http_inbound" { # (3)
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 8080
  to_port           = 8080
  security_group_id = aws_security_group.sample_app.id
  cidr_blocks       = ["0.0.0.0/0"]
}

resource "aws_instance" "sample_app" { # (4)
  ami                    = var.ami_id # (5)
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.sample_app.id]
  user_data              = file("${path.module}/user-data.sh") # (6)

  tags = {
    Name = "sample-app-tofu"
  }
}
The code in main.tf does something very similar to the Bash script and Ansible playbook from earlier in the blog post:
(1) Configure the AWS provider: OpenTofu works with many providers, such as AWS, GCP, Azure, and so on. This code configures the AWS provider to use the us-east-2 (Ohio) region. AWS has datacenters all over the world, grouped into regions and availability zones. An AWS region is a separate geographic area, such as us-east-2 (Ohio), eu-west-1 (Ireland), and ap-southeast-2 (Sydney). Within each region, there are multiple isolated datacenters known as availability zones (AZs), such as us-east-2a, us-east-2b, and so on.
(2) Create a security group: For each type of provider, there are many different kinds of resources that you can create, such as servers, databases, and load balancers. The general syntax for creating a resource in OpenTofu is resource "<PROVIDER>_<TYPE>" "<NAME>" { [CONFIG ...] }, where PROVIDER is the name of a provider (e.g., aws), TYPE is the type of resource to create in that provider (e.g., security_group), NAME is an identifier you can use throughout the OpenTofu code to refer to this resource (e.g., sample_app), and CONFIG consists of one or more arguments that are specific to that resource. The preceding code creates an aws_security_group resource named sample_app, with the name "sample-app-tofu" and a short description.
(3) Allow HTTP requests: Use the aws_security_group_rule resource to add a rule to the security group from (2) that allows inbound HTTP requests on port 8080.
(4) Deploy an EC2 instance: Use the aws_instance resource to create an EC2 instance that uses the security group and sets the Name tag to "sample-app-tofu."
(5) Set the AMI: The EC2 instance sets the AMI to var.ami_id. This is a reference to an input variable defined in variables.tf, as shown in Example 12.
(6) Set the user data: The EC2 instance configures a user data script by reading in the user-data.sh file shown in Example 13.
In the same folder as main.tf, create a file called variables.tf to define input variables, as shown in Example 12:
variable "ami_id" {
  description = "The ID of the AMI to run."
  type        = string
}
As you’ll see shortly, this input variable will allow you to pass in the ID of a custom AMI to run in the EC2 instance: namely, the AMI you built from the Packer template in the previous section. You should also create a file called user-data.sh, which contains the user data script shown in Example 13:
#!/usr/bin/env bash
nohup node /home/ec2-user/app.js &
Note how this user data script is much shorter than the one you saw in the Bash code. That’s because all the dependencies (Node.js) and code (app.js) are already installed in the AMI by Packer. So the only thing this user data script does is start the sample app. This is a more idiomatic way to use user data.
Finally, create a file called outputs.tf with the contents shown in Example 14:
output "instance_id" {
  description = "The ID of the EC2 instance"
  value       = aws_instance.sample_app.id
}

output "security_group_id" {
  description = "The ID of the security group"
  value       = aws_security_group.sample_app.id
}

output "public_ip" {
  description = "The public IP of the EC2 instance"
  value       = aws_instance.sample_app.public_ip
}
The preceding code defines output variables, which you can use to log and share values between modules. Here, the outputs are the EC2 instance ID, the security group ID, and the EC2 instance’s public IP.
If you want to try the OpenTofu code out, install OpenTofu, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following command:
$ tofu init
The init command installs any providers used in this Tofu configuration. OpenTofu works with many cloud providers, including AWS, which is used in the preceding example, as well as Azure, GCP, Alibaba Cloud, OCI, and so on. The code for each provider doesn’t live in the tofu binary, but in separate provider binaries, which you download via the init command.
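The example module leaves it to init to work out which provider to download from the provider block. If you want to make that choice explicit, a common pattern is to declare it in a required_providers block. The following is a sketch, not part of the original module, and the version constraint is an assumption you would adjust to whatever provider version you have tested against:

```hcl
terraform {
  required_providers {
    aws = {
      # Registry address of the AWS provider that init should download.
      source = "hashicorp/aws"
      # Assumed constraint for illustration: any 5.x release.
      version = "~> 5.0"
    }
  }
}
```

Pinning a version range like this means that a future run of init on another machine is less likely to pull in a provider release with breaking changes.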
Once init has completed, run the apply command to start the deployment process:
$ tofu apply
The first thing the apply command will do is prompt you for the ami_id value:
var.ami_id
  The ID of the AMI to run.

  Enter a value:
You can paste in the ID of the AMI you built using Packer in the previous section and hit Enter. Alternatively, if you don’t want to be prompted interactively, you can instead use the -var flag when running apply:
$ tofu apply -var ami_id=<YOUR_AMI_ID>
You can also set the value for any input variable foo using the environment variable TF_VAR_foo:
$ export TF_VAR_ami_id=<YOUR_AMI_ID>
$ tofu apply
The second thing the apply command will do is show you the execution plan (just plan for short), which will look something like this (truncated for readability):
OpenTofu will perform the following actions:

  # aws_instance.sample_app will be created
  + resource "aws_instance" "sample_app" {
      + ami           = "ami-0ee5157dd67ca79fc"
      + instance_type = "t2.micro"
      ... (truncated) ...
    }

  # aws_security_group.sample_app will be created
  + resource "aws_security_group" "sample_app" {
      + description = "Allow HTTP traffic into the sample app"
      + name        = "sample-app-tofu"
      ... (truncated) ...
    }

  # aws_security_group_rule.allow_http_inbound will be created
  + resource "aws_security_group_rule" "allow_http_inbound" {
      + from_port = 8080
      + protocol  = "tcp"
      + to_port   = 8080
      + type      = "ingress"
      ... (truncated) ...
    }

Plan: 3 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + instance_id       = (known after apply)
  + public_ip         = (known after apply)
  + security_group_id = (known after apply)
The plan lets you see what OpenTofu will do before actually making any changes, and prompts you for confirmation before continuing. This is a great way to sanity-check your code before unleashing it onto the world. The plan output is similar to the output of the diff command that is part of Unix, Linux, and git: anything with a plus sign (+) will be created, anything with a minus sign (-) will be deleted, and anything with a tilde sign (~) will be modified in place. Every time you run apply, OpenTofu will show you this execution plan; you can also generate the execution plan without applying any changes by running tofu plan instead of tofu apply.
In the preceding plan output, you can see that OpenTofu is planning on creating an EC2 instance, security group, and security group rule, which is exactly what you want. Type yes and hit Enter to let OpenTofu proceed. You should see log output that looks like this:
Do you want to perform these actions?
  OpenTofu will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_security_group.sample_app: Creating...
aws_security_group.sample_app: Creation complete after 2s
aws_security_group_rule.allow_http_inbound: Creating...
aws_security_group_rule.allow_http_inbound: Creation complete after 0s
aws_instance.sample_app: Creating...
aws_instance.sample_app: Still creating... [10s elapsed]
aws_instance.sample_app: Still creating... [20s elapsed]
aws_instance.sample_app: Creation complete after 22s

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Outputs:

instance_id = "i-0a4c593f4c9e645f8"
public_ip = "3.138.110.216"
security_group_id = "sg-087227914c9b3aa1e"
You can see the three output variables from outputs.tf at the end, including the public IP address in public_ip. Wait a minute or two for the EC2 instance to boot up, copy the public_ip, open http://<public_ip>:8080 in your web browser, and you should see:
Hello, World!
Congrats, you’re using a provisioning tool to manage your infrastructure as code!
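One optional refinement: instead of pasting the AMI ID in by hand, you could have OpenTofu look it up. The following is a minimal sketch, not part of the original module, that uses the AWS provider’s aws_ami data source to find the most recent AMI in your own account whose name starts with the sample-app-packer- prefix from the Packer template. It assumes you built the AMI in the same account and region that the provider is configured for:

```hcl
# Look up the newest AMI owned by this AWS account whose name matches
# the prefix used in the Packer template (ami_name = "sample-app-packer-...").
data "aws_ami" "sample_app" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["sample-app-packer-*"]
  }
}

resource "aws_instance" "sample_app" {
  # Use the looked-up AMI instead of var.ami_id.
  ami = data.aws_ami.sample_app.id

  # ... (other params omitted) ...
}
```

The downside is that the AMI is no longer pinned: every new Packer build changes what the next apply deploys, so an explicit ami_id variable may still be the better fit when you want deployments to be deliberate.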
Example: Update and Destroy Infrastructure Using OpenTofu
One of the big advantages of provisioning tools is that they support not just deploying infrastructure, but also updating and destroying it. For example, now that you’ve deployed an EC2 instance using OpenTofu, make a change to the configuration, such as adding a new Test tag with the value "update," as shown in Example 15:
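The change amounts to one extra line in the tags block of the aws_instance resource in main.tf. A minimal sketch, consistent with the plan output that follows (the other parameters stay exactly as they were in Example 11):

```hcl
resource "aws_instance" "sample_app" {
  # ... (other params omitted) ...

  tags = {
    Name = "sample-app-tofu"
    Test = "update" # the newly added tag
  }
}
```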
Run the apply command again, and you should see output that looks like this (truncated for readability):
$ tofu apply
aws_security_group.sample_app: Refreshing state...
aws_security_group_rule.allow_http_inbound: Refreshing state...
aws_instance.sample_app: Refreshing state...
OpenTofu used the selected providers to generate the following execution plan.
Resource actions are indicated with the following symbols:
~ update in-place
OpenTofu will perform the following actions:
# aws_instance.sample_app will be updated in-place
~ resource "aws_instance" "sample_app" {
id = "i-0738de27643533e98"
~ tags = {
"Name" = "sample-app-tofu"
+ "Test" = "update"
}
# (31 unchanged attributes hidden)
# (8 unchanged blocks hidden)
}
Plan: 0 to add, 1 to change, 0 to destroy.
Every time you run OpenTofu, it records information about what infrastructure it created in an OpenTofu state file. OpenTofu manages state using backends; if you don’t specify a backend, the default is to use the local backend, which stores state locally in a terraform.tfstate file in the same folder as the OpenTofu module (you’ll see how to use other backends in Part 5). This file contains a custom JSON format that records a mapping from the OpenTofu resources in your configuration files to the representation of those resources in the real world.
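To make the default concrete, the following sketch spells out the local backend that OpenTofu falls back to when you configure nothing. You don’t need to add this to the example; it’s only an illustration, and the path simply restates the default file name mentioned above:

```hcl
terraform {
  # Equivalent to omitting the backend block entirely: state is kept in a
  # file on the local disk, next to the module's .tf files.
  backend "local" {
    path = "terraform.tfstate"
  }
}
```

If you do change backend settings in a real project, you need to re-run tofu init so OpenTofu can pick up the new configuration.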
When you run apply the first time on the ec2-instance module, OpenTofu records in the state file the IDs of the EC2 instance, security group, security group rules, and any other resources it created. When you run apply again, you can see "Refreshing state" in the log output, which is OpenTofu updating itself on the latest status of the world. As a result, the new plan output that you see is the diff between what’s currently deployed in the real world and what’s in your OpenTofu code. The preceding diff shows that OpenTofu wants to create a single tag called Test, which is exactly what you want, so type yes and hit Enter, and you’ll see OpenTofu perform an update operation, updating the EC2 instance with your new tag.
When you’re done testing, you can run tofu destroy to have OpenTofu undeploy everything it deployed earlier, which should give you log output that looks something like this (log output truncated for readability):
$ tofu destroy
OpenTofu will perform the following actions:
# aws_instance.sample_app will be destroyed
- resource "aws_instance" "sample_app" {
- ami = "ami-0ee5157dd67ca79fc" -> null
- associate_public_ip_address = true -> null
- id = "i-0738de27643533e98" -> null
... (truncated) ...
}
# aws_security_group.sample_app will be destroyed
- resource "aws_security_group" "sample_app" {
- id = "sg-066de0b621838841a" -> null
... (truncated) ...
}
# aws_security_group_rule.allow_http_inbound will be destroyed
- resource "aws_security_group_rule" "allow_http_inbound" {
- from_port = 8080 -> null
- protocol = "tcp" -> null
- to_port = 8080 -> null
... (truncated) ...
}
Plan: 0 to add, 0 to change, 3 to destroy.
Changes to Outputs:
- instance_id = "i-0738de27643533e98" -> null
- public_ip = "18.188.174.48" -> null
- security_group_id = "sg-066de0b621838841a" -> null
When you run destroy, OpenTofu shows you a destroy plan, which tells you about all the resources it’s about to delete. This gives you one last chance to check that you really want to delete this stuff before you actually do it. It goes without saying that you should rarely, if ever, run destroy in a production environment, as there’s no "undo" for the destroy command. If everything looks good, type yes and hit Enter, and in a minute or two, OpenTofu will clean up everything it deployed.
Get your hands dirty
Here are a few exercises you can try at home to go deeper:
Example: Deploy an EC2 Instance Using an OpenTofu Module
One of OpenTofu’s more powerful features is that the modules are reusable. In a general purpose programming language (e.g., JavaScript, Python, Java), you put reusable code in a function; in OpenTofu, you put reusable code in a module. You can then use that module multiple times to spin up many copies of the same infrastructure, without having to copy/paste the code.
So far, you’ve been using the ec2-instance module as a root module, which is any module on which you run apply directly. However, you can also use it as a reusable module, which is a module meant to be included in other modules (e.g., in other root modules) as a means of code re-use.
Let’s give it a shot. First, create a folder called modules to store your reusable modules:
$ cd fundamentals-of-devops
$ mkdir -p ch2/tofu/modules
Next, move the ec2-instance module into the modules folder:
$ mv ch2/tofu/ec2-instance ch2/tofu/modules/ec2-instance
Create a folder called live to store your root modules, as these modules configure your live environments:
$ mkdir -p ch2/tofu/live
Inside the live folder, create a new folder called sample-app, which will house the new root module you’ll use to deploy the sample app:
$ mkdir -p ch2/tofu/live/sample-app
$ cd ch2/tofu/live/sample-app
In the live/sample-app folder, create a main.tf file with the initial contents shown in Example 16:
module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
}
To use one module from another, all you need is the following:
- A module block.
- A source parameter that contains the file path of the module you want to use. The preceding code sets source to the relative file path of the ec2-instance module in the modules folder.
- If the module defines input variables, you can set those as parameters within the module block. The ec2-instance module defines an input variable called ami_id, which you’ll need to set to the ID of the AMI you built in the server templating section earlier in this blog post.
If you were to run apply on this code, it would use the ec2-instance module code to create a single EC2 instance. But the beauty of code reuse is that you can use the module multiple times, as shown in Example 17:
module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
}

module "sample_app_2" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
}
This code has two module blocks, so if you run apply on it, it will create two EC2 instances. If you had three module blocks, it would create three EC2 instances; four module blocks would create four EC2 instances; and so on. And, of course, you can mix and match different modules, include modules in other modules, and so on. It’s not unusual for modules to be reused dozens or hundreds of times across a company, so that you put in the work to create a module that meets your company’s needs once, and then use it over and over again.
However, there are two changes you need to make to the ec2-instance module in order for it to work effectively as a reusable module.
The first change is to namespace all the resources created by the ec2-instance module. Currently, it hard-codes all names, such as the name of the security group, to "sample-app-tofu." AWS requires security group names to be unique within a VPC, so if you ran apply on these two module blocks, you’d get an error due to the name conflicts. To fix this, introduce a name input variable in modules/ec2-instance/variables.tf, as shown in Example 18:
Example 18: name input variable for the ec2-instance module (ch2/tofu/modules/ec2-instance/variables.tf)
variable "name" {
  description = "The base name for the instance and all other resources"
  type        = string
}
Next, update the ec2-instance module to use the name input variable everywhere that was hard-coded to "sample-app-tofu," including the aws_security_group resource and the tags in the aws_instance resource, as shown in Example 19:
Example 19: name input variable in the ec2-instance module (ch2/tofu/modules/ec2-instance/main.tf)
resource "aws_security_group" "sample_app" {
  name        = var.name
  description = "Allow HTTP traffic into ${var.name}"
}

resource "aws_instance" "sample_app" {
  # ... (other params omitted) ...

  tags = {
    Name = var.name
  }
}
Now, back in sample-app/main.tf, you can set the name parameter to different values in each of the module blocks, as shown in Example 20:
Example 20: name input set to different values (ch2/tofu/live/sample-app/main.tf)
module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-1"
}

module "sample_app_2" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-2"
}
Now you’ll get two EC2 instances, one with all resources named "sample-app-tofu-1," and the other with all resources named "sample-app-tofu-2."
The second change is to remove the provider block from the ec2-instance module. Having a provider block inside a module isn’t wrong per se, but typically, reusable modules do not declare provider blocks, and instead inherit those from the root module. This allows the provider block to be configured in different ways in different usages of the module. For example, one usage might configure the provider to use a different region, another usage might configure it to a different AWS account, and so on. All you need to do is to move the provider block from the ec2-instance (reusable) module to the sample-app (root) module, as shown in Example 21:
Example 21: Move the provider block to the sample-app root module (ch2/tofu/live/sample-app/main.tf)

provider "aws" {
  region = "us-east-2"
}

module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-1"
}

module "sample_app_2" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-2"
}
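To make the "different region" case concrete, here is a minimal sketch of how a root module might hand differently configured providers to the same reusable module. The west alias, the us-west-2 region, the module names, and the second AMI ID are assumptions for illustration and are not part of the sample code. AMI IDs are region-specific, so the second usage needs its own AMI.

```hcl
# Default provider configuration, inherited by any module that does not
# ask for something else.
provider "aws" {
  region = "us-east-2"
}

# A second configuration of the same provider, distinguished by an alias.
# The alias name and region are hypothetical.
provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

module "sample_app_east" {
  source = "../../modules/ec2-instance"

  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-east"
}

module "sample_app_west" {
  source = "../../modules/ec2-instance"

  # Override the inherited provider for this one usage of the module.
  providers = {
    aws = aws.west
  }

  # Placeholder: replace with an AMI ID that exists in us-west-2.
  ami_id = "ami-0123456789abcdef0"
  name   = "sample-app-tofu-west"
}
```

This only works because the ec2-instance module no longer declares its own provider block: the module code stays identical, and the root module decides where it gets deployed.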
One last step: create an outputs.tf file in the sample-app folder with the contents shown in Example 22:
Example 22: Proxy output variables from the ec2-instance module (ch2/tofu/live/sample-app/outputs.tf)

output "sample_app_1_public_ip" {
  description = "The public IP of the sample-app-1 instance"
  value       = module.sample_app_1.public_ip
}

output "sample_app_2_public_ip" {
  description = "The public IP of the sample-app-2 instance"
  value       = module.sample_app_2.public_ip
}

output "sample_app_1_instance_id" {
  description = "The ID of the sample-app-1 instance"
  value       = module.sample_app_1.instance_id
}

output "sample_app_2_instance_id" {
  description = "The ID of the sample-app-2 instance"
  value       = module.sample_app_2.instance_id
}
The preceding code "proxies" the output variables from the underlying ec2-instance module usages so that you can see those outputs when you run apply on the sample-app root module.
OK, you’re finally ready to run this code:
$ tofu init
$ tofu apply
When apply completes, you should have two EC2 instances running, and the output variables should show their IPs and instance IDs. If you wait a minute or two for the instances to boot up, and open http://<IP>:8080 in your browser, where <IP> is the public IP of either instance, you should see the familiar "Hello, World!" text. When you’re done experimenting, run tofu destroy to clean everything up again.
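The proxied outputs in Example 22 only resolve if the ec2-instance module itself exposes outputs named public_ip and instance_id. As a reminder of what that contract looks like, here is a minimal sketch of the module’s outputs file, assuming the aws_instance resource is named sample_app as in Example 19. The descriptions are illustrative wording.

```hcl
# modules/ec2-instance/outputs.tf (sketch)

output "instance_id" {
  description = "The ID of the EC2 instance"
  value       = aws_instance.sample_app.id
}

output "public_ip" {
  description = "The public IP of the EC2 instance"
  value       = aws_instance.sample_app.public_ip
}
```

A root module can only read values that a child module declares as outputs, which is why the root module has to re-export them under its own names to make them visible after apply.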
Example: Deploy an EC2 Instance Using an OpenTofu Module from GitHub
There’s one more trick with OpenTofu modules: the source parameter can be set not only to a local file path, but also to a URL. For example, this blog post series’s sample code repo in GitHub includes an ec2-instance module that is more or less identical to your own ec2-instance module. You can use the module from the series’s sample code repo by setting the source parameter to a URL, as shown in Example 23:
Example 23: Set source to GitHub URLs (ch2/tofu/live/sample-app/main.tf)

module "sample_app_1" {
  source = "github.com/brikis98/devops-book//ch2/tofu/modules/ec2-instance"

  # ... (other params omitted) ...
}
The preceding code sets source to a GitHub URL. Note the intentional use of two slashes (//): the part to the left of the two slashes specifies the GitHub repo and the part to the right specifies the subfolder within that repo.
Run init on this code one more time:
$ tofu init
Initializing the backend...
Initializing modules...
Downloading git::https://github.com/brikis98/devops-book.git...
Downloading git::https://github.com/brikis98/devops-book.git...
Initializing provider plugins...
The init command is responsible for downloading provider code and module code, and you can see in the preceding output that, this time, it downloaded the module code from GitHub. If you now run apply, you should get the exact same two EC2 instances as before. When you’re done experimenting, run destroy to clean everything up.
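One caveat with URL sources: without a version, init fetches whatever is on the repo’s default branch at that moment, so two people running init on different days can end up with different module code. A common way to make this repeatable is to pin the source to a Git ref with the ref query parameter. The sketch below assumes a tag named v1.0.0, which is a hypothetical value; substitute a real tag, branch, or commit hash from the repo you are using.

```hcl
module "sample_app_1" {
  # The ?ref= suffix pins the module to a specific Git ref.
  # "v1.0.0" is a hypothetical tag used for illustration only.
  source = "github.com/brikis98/devops-book//ch2/tofu/modules/ec2-instance?ref=v1.0.0"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-1"
}
```

After changing the ref, you need to run tofu init again so that the new version of the module is downloaded.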
You’ve now seen the power of reusable modules. A common pattern at many companies is for the Ops team to define and manage a library of vetted, reusable OpenTofu modules—e.g., one module to deploy servers, another to deploy databases, another to configure networking, and so on—and for the Dev teams to use these modules as a self-service way to deploy and manage the infrastructure they need for their apps.
This blog post series will make use of this pattern in future blog posts: instead of writing every line of code from scratch, you’ll be able to use modules directly from this series’s sample code repo to deploy the infrastructure you need for each post.
Get your hands dirty
Here are a few exercises you can try at home to go deeper:
How Provisioning Tools Stack Up
So, how do provisioning tools stack up using the IaC category criteria from before?
- CRUD
-
Most provisioning tools fully support all four CRUD operations. For example, you just saw OpenTofu create an EC2 instance, read the EC2 instance state, update the EC2 instance (to add a tag), and delete the EC2 instance.
- Scale
-
Provisioning tools scale very well. For example, the self-service approach mentioned in the previous section—where you have a library of reusable modules managed by Ops and used by Devs to deploy the infrastructure they need—can scale to thousands of developers and tens of thousands of resources, something that would be a nightmare to manage with ad hoc scripts.
- Deployment strategies
-
Provisioning tools typically let you use whatever deployment strategies are supported by the underlying infrastructure. For example, OpenTofu allows you to use instance refresh to do a zero-downtime, rolling deployment for groups of servers in AWS; you’ll try out an example of this in Part 3.
- Idempotency and error handling
-
Whereas most ad hoc scripts are procedural, where you specify step by step how to achieve some desired end state, most provisioning tools are declarative, where you specify the end state you want, and the provisioning tool automatically figures out how to get you from your current state to that desired end state. As a result, most provisioning tools are idempotent and can handle errors automatically. For example, you already saw in the CRUD discussion that you can re-run OpenTofu multiple times, and it will refresh its state, and come up with an execution plan to try to make the state of the world match the desired state in your code, handling changes in your code, changes in the outside world, and errors along the way.
- Consistency
-
Most provisioning tools enforce a consistent, predictable structure to the code, including documentation, file layout, clearly named parameters, secrets management, and so on.
- Verbosity
-
The declarative nature of provisioning tools and the custom DSLs they provide typically result in concise code, especially considering that code supports all CRUD operations, deployment strategies, scale, idempotency, and error handling out-of-the-box. The OpenTofu code for deploying an EC2 instance is about half the length of the Bash code, even though it does considerably more, and the more complex the infrastructure you’re managing, the larger this gap becomes.
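As a rough illustration of the deployment strategies point above, the sketch below shows how a rolling replacement of servers might be declared with the AWS provider’s aws_autoscaling_group resource and its instance_refresh block. The resource names, the group size, the instance type, the healthy percentage, and both input variables are assumptions for illustration; Part 3 walks through the approach the series actually uses.

```hcl
variable "ami_id" {
  description = "AMI to run on each server (hypothetical input)"
  type        = string
}

variable "subnet_ids" {
  description = "Subnets to launch servers into (hypothetical input)"
  type        = list(string)
}

# Describes what each server in the group should look like.
resource "aws_launch_template" "sample_app" {
  name_prefix   = "sample-app-"
  image_id      = var.ami_id
  instance_type = "t2.micro"
}

resource "aws_autoscaling_group" "sample_app" {
  name                = "sample-app"
  min_size            = 3
  max_size            = 3
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.sample_app.id
    version = aws_launch_template.sample_app.latest_version
  }

  # When the launch template changes (e.g., a new AMI), replace instances
  # gradually instead of all at once.
  instance_refresh {
    strategy = "Rolling"

    preferences {
      min_healthy_percentage = 90
    }
  }
}
```

Note that the code only declares the end state and the replacement policy. The tool and the underlying AWS service work out which instances to replace and in what order.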
Provisioning tools should be your go-to option for managing infrastructure. Moreover, many provisioning tools can manage not only traditional infrastructure (e.g., servers), but many other aspects of software delivery as well. For example, you can use OpenTofu to manage your version control system (e.g., using the GitHub provider), metrics (e.g., using the Grafana provider), and your on-call rotation (e.g., using the PagerDuty provider), tying them all together with code.
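For instance, a Git repository can be declared in much the same way as a server. This is a minimal sketch using the GitHub provider’s github_repository resource. The repository name and description are made up, the version constraint is an assumption, and the provider is assumed to pick up its credentials from the GITHUB_TOKEN environment variable.

```hcl
terraform {
  required_providers {
    github = {
      source  = "integrations/github"
      version = "~> 6.0" # assumed constraint; pin to whatever you have tested
    }
  }
}

# With no arguments, the provider reads its settings from the environment.
provider "github" {}

# A hypothetical repository, managed with the same plan/apply workflow
# as the EC2 instances earlier in this post.
resource "github_repository" "example" {
  name        = "example-service"
  description = "Repository managed as code"
  visibility  = "private"
}
```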
Key takeaway #4
Provisioning tools are great for deploying and managing servers and infrastructure.
Although I’ve been comparing IaC tools this entire blog post, the reality is that you’ll probably need to use multiple IaC tools together, as discussed in the next section.
Using Multiple IaC Tools Together
Each of the tools you’ve seen in this blog post has strengths and weaknesses. No one of them can do it all, so for most real-world scenarios, you’ll need to use several different tools, and it’s your job to pick the right tool(s) for the job.
Key takeaway #5
You usually need to use multiple IaC tools together to manage your infrastructure.
The following sections show three common combinations I’ve seen work well at a number of companies.
Provisioning Plus Configuration Management
Example: OpenTofu and Ansible. You use OpenTofu to deploy all the underlying infrastructure, including the network topology, data stores, load balancers, and servers. You then use Ansible to deploy your apps on top of those servers, as depicted in Figure 22:
This is an easy approach to get started with and there are many ways to get Ansible and OpenTofu to work together (e.g., OpenTofu adds tags to your servers, and Ansible uses an inventory plugin to automatically discover servers with those tags). The major downside is that using Ansible typically means mutable infrastructure, rather than immutable, so as your codebase, infrastructure, and team grow, maintenance and debugging can become more difficult.
Provisioning Plus Server Templating
Example: OpenTofu and Packer. You use Packer to package your apps as VM images. You then use OpenTofu to deploy servers with these VM images and the rest of your infrastructure, including the network topology, data stores, and load balancers, as illustrated in Figure 23:
This is also an easy approach to get started with. In fact, you already had a chance to try this combination out earlier in this post. Moreover, this is an immutable infrastructure approach, which will make maintenance easier. The main drawback is that VMs can take a long time to build and deploy, which slows down iteration speed.
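One way to connect the two tools without copying AMI IDs by hand is to have OpenTofu look up the most recent image that Packer built. The sketch below uses the AWS provider’s aws_ami data source. The sample-app-packer-* name pattern is an assumption about how your Packer template names its images, and the module name and name value are illustrative, so adjust them to match your setup.

```hcl
# Find the newest AMI in your own account whose name matches the pattern.
data "aws_ami" "sample_app" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["sample-app-packer-*"] # assumed naming convention
  }
}

module "sample_app" {
  source = "../../modules/ec2-instance"

  # Use the looked-up image instead of a hard-coded AMI ID.
  ami_id = data.aws_ami.sample_app.id
  name   = "sample-app-tofu"
}
```

The trade-off is that building a new image changes what the next apply will do, so some teams prefer to keep passing an explicit AMI ID and update it deliberately.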
Provisioning Plus Server Templating Plus Orchestration
Orchestration tools, such as Kubernetes, Nomad, and OpenShift, help you deploy and manage apps on top of your infrastructure. You’ll do a deep-dive on orchestration in Part 3.
Example: OpenTofu, Packer, Docker, and Kubernetes. You use Packer to create a VM image that has Docker and Kubernetes agents installed. You then use OpenTofu to deploy a cluster of servers, each of which runs this VM image, and the rest of your infrastructure, including the network topology, data stores, and load balancers. Finally, when the cluster of servers boots up, it forms a Kubernetes cluster that you use to run and manage your Dockerized applications, as shown in Figure 24:
The advantage of this approach is that Docker images build fairly quickly, you can run and test them on your local computer, and you can take advantage of all of the built-in functionality of Kubernetes, including various deployment strategies, auto healing, auto scaling, and so on. The drawback is the added complexity, both in terms of extra infrastructure to run (Kubernetes clusters are difficult and expensive to deploy and operate, though most major cloud providers now provide managed Kubernetes services, which can offload some of this work) and in terms of several extra layers of abstraction (Kubernetes, Docker, Packer) to learn, manage, and debug.
Adopting IaC
At the beginning of this blog post, you heard about all the benefits of IaC (self-service, speed and safety, code reuse, and so on), but it’s important to understand that adopting IaC has significant costs, too. Not only do your team members have to learn new tools and techniques, they also have to get used to a totally new way of working. It’s a big shift to go from the old-school sysadmin approach of spending all day managing infrastructure manually and directly (e.g., connect to a server and update its configuration) to the new DevOps approach of spending all day coding and making changes indirectly (e.g., write some code and let an automated process apply the changes).
Key takeaway #6
Adopting IaC requires more than just introducing a new tool or technology: it also requires changing the culture and processes of the team.
Changing culture and processes is a significant undertaking, especially at larger companies. Because every team’s culture and processes are different, there’s no one-size-fits-all way to do it, but here are a few tips that will be useful in most situations:
- Focus on the most important problems
-
It might be slightly heretical for the author of a book on DevOps to say this, but not every team needs IaC. Adopting IaC has a relatively high cost, and although it will pay off in the long term for some scenarios, it won’t for others. For example, if your team is spending all of its time dealing with bugs and outages that result from a manual deployment process, then it might make sense to prioritize IaC, but if you’re at a tiny startup where one person can easily manage all your infrastructure, or you’re working on a prototype or side project that might be thrown away in a few months, managing infrastructure by hand may be the right choice.
Don’t adopt IaC, or any other practice, just because you read somewhere that it’s a "best practice." Instead, identify the problems your team has, and always focus on solving the most important ones. As you saw in Section 1.1.2, at a certain scale, most companies face problems that are best solved by IaC, but until you get to that scale and start hitting those problems, it’s OK to focus on other priorities.
- Work incrementally
-
Even if you do prioritize adopting IaC, or any other practice, don’t try to do it all in one massive step. Instead, whenever you adopt any new practice, do it incrementally, as you learned in Part 1: break up the work into small steps, each of which brings value by itself. For example, don’t try to do one giant project where you try to migrate all of your infrastructure to IaC by writing tens of thousands of lines of code. Instead, use an iterative process where you identify the most problematic part of your infrastructure (e.g., the part that is causing the most bugs and outages), fix the problems in that part (e.g., perhaps by migrating that part to IaC), and repeat.
- Give your team the time to learn
-
If you want your team to adopt IaC, then you need to be willing to dedicate sufficient time and resources to it. If your team doesn’t get the time and resources that it needs, then your IaC migration is likely to fail. One scenario I’ve seen many times is that no one on the team has any clue how to use IaC properly, so you end up with a jumble of messy, buggy, unmaintainable code that causes more problems than it solves; another common scenario is that part of the team knows how to do IaC properly, and they write thousands of lines of beautiful code, but the rest of the team has no idea how to use it, so they continue making changes manually, which invalidates most of the benefits of IaC. If you decide to prioritize IaC, then I recommend that (a) you get everyone bought in, (b) you make learning resources available, such as classes, documentation, video tutorials, and, of course, this blog post series, and (c) you provide sufficient dedicated time for team members to ramp up before you start using IaC everywhere.
- Get the right people on the team
-
If you want to be able to use infrastructure as code, you have to learn how to write code. In fact, as you saw at the beginning of the blog post, a key shift with modern DevOps is managing more and more as code, so as a company adopts more DevOps practices, strong coding skills become more and more important. If you have team members who are not strong coders, be aware that some will be able to level up (given sufficient time and resources, as per the previous point), but some will not, which means you may have to hire new developers with coding skills for your team.
Conclusion
You now understand how to manage your infrastructure as code. Instead of clicking around a web UI, which is tedious and error-prone, you can automate the process, making it faster and more reliable. Moreover, whereas manual deployments always require someone at your company to do the busywork, with IaC, you can reuse code written by others, including both open source code (e.g., Ansible Galaxy, Docker Hub, Terraform Registry) and commercial code (e.g., Gruntwork Infrastructure as Code Library). This also includes the examples in the rest of this blog post series, most of which will be defined as code: you’ll see snippets of the code in the series itself, and you can find the fully working examples in the sample code repo in GitHub.
To help you pick the right category of IaC tool, here are the 6 key takeaways you’ve seen throughout the blog post:
-
Ad hoc scripts are great for small, one-off tasks, but not for managing all your infrastructure as code.
-
Configuration management tools are great for managing the configuration of servers, but not for deploying the servers themselves, or other infrastructure.
-
Server templating tools are great for managing the configuration of servers with immutable infrastructure practices.
-
Provisioning tools are great for deploying and managing servers and infrastructure.
-
You usually need to use multiple IaC tools together to manage your infrastructure.
-
Adopting IaC requires more than just introducing a new tool or technology: it also requires changing the culture and processes of the team.
If the job you’re doing is provisioning infrastructure, you’ll probably want to use a provisioning tool. If the job you’re doing is configuring servers, you’ll probably want to use a server templating or configuration management tool. And as most real-world software delivery setups require you to do multiple jobs, you’ll most likely have to combine several tools together: e.g., provisioning plus server templating.
It’s worth remembering that there is also a lot of variety within an IaC category: e.g., there are big differences between Ansible and Chef within the configuration management category, and between OpenTofu and CloudFormation within the provisioning tool category. For a more detailed comparison, have a look at this comparison of Chef, Puppet, Ansible, Pulumi, CloudFormation, and Terraform/OpenTofu.
Going deeper on OpenTofu / Terraform
Many of the examples in the rest of this blog post series involve provisioning infrastructure, and I use OpenTofu as the provisioning tool for most of these examples, so you may want to become more familiar with this toolset. The best way to do that, with apologies for a bit of self-promotion, is to grab a copy of my other book, Terraform: Up & Running.
Being able to use code to run a server is a huge advantage over managing it manually, but a single server is also a single point of failure. What if it crashes? What if the load exceeds the capacity of a single server? How do you roll out changes without downtime? These topics are the focus of Part 3, How to Deploy Your Apps.