Early on in our engineering journey at Tia, we decided to scrap our manual AWS console configuration process in favor of infrastructure as code, or IaC. After going through this transition, we believe IaC should be an early consideration for any engineering team because of the consistency, speed, and documentation it provides.

How is infrastructure managed?

Before AWS, Azure, GCP, and other cloud services, setting up infrastructure for web services was an extremely expensive, manual process. Individuals had to physically install and manually configure server and networking hardware and then deploy applications on it. Not only was this tedious, it was also expensive: businesses had to pay for the equipment, the salaries of the networking engineers, and the lease and utilities of the data center real estate the equipment was deployed in. This created a massive barrier for smaller businesses like Tia to develop web services.

Today those barriers are significantly reduced with services like AWS’s Elastic Beanstalk, which allows you to deploy a web application in minutes without ever touching a server rack. These services offer user interfaces to configure and manage the resources powering your application. This is considered manual configuration, as a person is responsible for manually configuring each resource. If you were to deploy the same web application again, that person would have to read a runbook of configuration instructions (if someone was kind enough to write them) and make the necessary adjustments.

Infrastructure as code, or IaC, is an infrastructure management process where resources are defined in human-readable files rather than configured manually in a console. This format allows engineers to use familiar processes and tools like pull requests and source code repositories to develop their infrastructure in a way that enables speed and consistency while enforcing documentation and change management.
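To make that concrete, here is a minimal sketch of what "resources defined in human-readable files" looks like in AWS CloudFormation (the bucket name is purely illustrative):

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal example of an AWS resource defined as code

Resources:
  # A versioned S3 bucket, fully described in text rather than
  # clicked together in the console
  AssetsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-assets-staging  # hypothetical name
      VersioningConfiguration:
        Status: Enabled
```

Because this lives in a text file, redeploying the same bucket in another environment is a matter of applying the same template again, not re-reading a runbook.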

Why use IaC?

Consistency and change management

In software development, teams utilize several environments for developing, validating, and releasing software to their customers, typically called the development, staging, and production environments. These environments allow engineers to push new code into a realistic environment and test changes without impacting real customers. When the code is approved, it will be released to customers with high confidence that the new code will not break the existing features or system because that code has already been proven in the internal environments.

The reason this deployment process works is that we enforce consistency across environments. We use a process called a “Pull Request” where updates to the code are compared side by side with the current revision, documented, and approved by other engineers. Once approved, the new code is automatically merged into the existing code to create a new revision and then deployed. Without this merging process, we’d almost guarantee inconsistency across our environments due to the human error involved in manually applying code updates to each environment.

Manually configuring infrastructure is like reading a pull request to promote some new code, then opening an editor, tediously applying every line-by-line update to the code in the promotion environment, and then bravely deploying it. You just wouldn’t do it, or at least not successfully. Infrastructure as code gives us a way to define infrastructure just like we define the code that runs on it. Because infrastructure as code is defined in human-readable text files, engineers can use the familiar processes of pull requests and code review to promote infrastructure changes to our environments.

As a healthcare company, change management isn’t just nice to have, it’s required. In order to protect our patients’ information, we need every change to be audited and approved before it reaches our patients and providers. With manual configuration, this is far more difficult to achieve, as change management isn’t baked into the process.

Before transitioning to IaC we couldn’t guarantee consistency across our environments. In one situation we found that a service was duplicating jobs, but only in staging. After debugging we realized the issue had to do with horizontal scaling, where the service would replicate jobs based on how many instances of the service were running. In staging there were two instances, but in development there was only a single instance. In this case there was both an issue with the application and the infrastructure, making the root cause trickier to identify. If we had been using IaC, we could have ensured consistency across environments and narrowed the problem down to the application; any differences between the running environment and the expected configuration would surface as drift, which AWS’s drift detection tool can identify.

Speed and simplicity

With manual configuration, infrastructure is deployed using scripts and runbooks where an individual or team of engineers tediously runs scripts and makes manual adjustments in the user interface to configure the environment. This is an extremely time-consuming and error-prone process.

Utilizing infrastructure as code simplifies and speeds up the provisioning and configuration of resources. Because the instructions to build the infrastructure are fully defined in code, there is no need for manual configuration. Full environments can be deployed with an IaC template file and a few clicks in the AWS console.

With IaC we’ve seen drastic improvements in the speed and simplicity of deploying our infrastructure. When I joined Tia, we only had two environments: staging and production. With our QA team running tests against staging, we quickly realized that we would want a third environment for engineers to test new features without breaking the end-to-end tests. This required one of our engineers to spend a week manually replicating the production environment to create a new development environment. After completing our transition to IaC, we’re able to provision and configure our main infrastructure in under 30 minutes.

Documentation

What if all of your infrastructure configuration was carried out by a single person? That would be crazy, right? In a small, growing team this is very often the case: a single person is responsible for building all of the environments. This is known as a bus factor of one. If that person were “hit by a bus” you’d be in a tough situation where no one fully understands why the infrastructure was built the way it was. Whoever picks it up is left to reverse engineer the infrastructure from other environments or from CloudTrail logs with little-to-no context.

Infrastructure as Code effectively increases the bus factor. Yes, you could still have a single engineer responsible for the configuration of the IaC resource files; however, knowledge transfer, auditability, and documentation are all enforced through the change management process. If that engineer were to leave, your team still has all the resource files that describe your environment, as well as the context for each individual update in the pull requests and commits.

Poor documentation of infrastructure was a problem that we faced early on at Tia. In our case, this manifested itself in our engineers having a strong aversion to making any changes in the AWS console. The initial resources were undocumented, poorly named, or not named at all, so making a change meant either asking everyone on the team if they knew the purpose of a particular resource or leaving it as-is for fear of breaking something. This hesitation was crippling for an organization that needed to move quickly to reach its goals, and costly too, as cruft never got removed but continued to be billed.

Transforming our infrastructure with IaC

In this section, I’ll explain the steps we took to make our transformation possible, in the hope that it will be helpful for other teams going through a similar process. As our infrastructure is hosted on AWS, I’ll be referring to CloudFormation, AWS’s IaC service for provisioning and managing resources. The three main concepts we followed were: know your architecture, name intentionally, and think in layers.

Know your architecture

This seems like a pretty obvious first step, as every engineering team has an architecture diagram. What I’ve found, however, is that very few engineering teams have a diagram that reflects their real architecture. Usually these diagrams are updated a few times a year, generally around the time a new employee is being onboarded and you realize that the new microservice built last quarter never made it into the diagram. Before diving in, it was incredibly important to have a well-documented architecture diagram at two levels: something high level showing how clients connect to our backend services, and something low level showing the networking intricacies of how each resource interacts with the others. In our case, we were undergoing a shift from Elastic Beanstalk to Fargate, so we were already generating new diagrams to understand our networking requirements.

We realized that our architecture diagram would serve as the blueprint for all of our IaC templates, so we stressed the importance of making the diagram as close to reality as feasible. We utilized a diagramming tool that included a pack of AWS resource icons to keep the diagram grounded in the actual services we used. There are many tools available for this, like Visio, Draw.io, and Lucidchart. There are even tools that can generate architecture diagrams from IaC templates; however, we didn’t find the generated diagrams particularly useful.

Name intentionally

One of the biggest frustrations we had before moving to IaC was the lack of clarity about what resources were being used for. If you’ve ever looked at your EC2 security groups and seen launch-wizard-1 through launch-wizard-10 with no distinguishable name, you know what I’m talking about. “What are you? *click*... *click*... *click*... Can I delete you?” As a lean engineering team, the last thing you want to be doing is paying for resources left over from some science experiment six months ago, or accidentally deleting a cornerstone resource that no one on the team remembered was there. Naming and tagging resources intentionally is a core feature in AWS and something we wanted to ensure we got right the second time around.

When we defined the names of our resources we followed the general naming convention of ${resource}-${environment} e.g. rds-sg-production, or if referenced canonically, ${environment}/${resource} as in Secrets Manager and S3. This created a very easy-to-navigate experience within the console and opened some key advantages with IAM roles for granting access to resources in specific environments or explicitly denying others. This also made it extremely easy to reference resources in other templates, since the naming convention was consistent and the environment name could be passed in as a parameter. Additionally, we made sure to tag everything possible with at least the environment name. By tagging each resource, we could take advantage of advanced billing features to analyze our costs in each environment. For foolproof separation of environments, separate AWS accounts are the way to go; however, this was an effective alternative.
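In a CloudFormation template, that convention might look something like this (a sketch with illustrative names, not our actual template), where the environment name comes in as a parameter and flows into each resource’s name and tags:

```yaml
Parameters:
  Environment:
    Type: String
    AllowedValues: [development, staging, production]

Resources:
  RdsSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      # Renders as e.g. rds-sg-production
      GroupName: !Sub "rds-sg-${Environment}"
      GroupDescription: Allows application access to RDS
      Tags:
        - Key: environment
          Value: !Ref Environment
```

Because every name is derived from the same parameter, an IAM policy can match resources by a `*-production` suffix or an `environment` tag rather than enumerating them one by one.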

Think in layers

The final core concept was thinking in layers. AWS CloudFormation has two main concepts: resources and stacks. A resource is any AWS entity, including S3 buckets, EC2 instances, security group ingress rules, etc. A stack is a collection of AWS resources defined in a CloudFormation YAML or JSON file. Additionally, CloudFormation has the concept of a nested stack, where you can declare a stack within another stack to promote reusable components. This mirrors modularity in software engineering and forced us to think about separation of concerns and modular stack reuse.

When writing templates we started to think about our architecture in layers of dependencies. The main goal here was to reduce deployment time by enabling the parallelization of the provisioning of resources. We broke down our templates into several layers including the VPC, security groups, ECS, routing, database, caching, and services layers. Each layer was orchestrated by the main stack, which controlled the order of creation based on dependencies between layers. This allowed us to deploy our layers iteratively during development. Just like in application development, it’s very rare to write a template correctly the first time. As we were developing the templates we often found errors that would automatically roll back our full deployment. With layers, we could minimize the scope of the rollback so we weren’t starting from zero each deployment.
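A main stack that orchestrates layers as nested stacks might be sketched like this (template URLs and output names are hypothetical; `DependsOn` and `!GetAtt` references control the creation order):

```yaml
Resources:
  VpcStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/example-templates/vpc.yml

  SecurityGroupsStack:
    Type: AWS::CloudFormation::Stack
    DependsOn: VpcStack
    Properties:
      TemplateURL: https://s3.amazonaws.com/example-templates/security-groups.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId

  # Once security groups exist, the remaining layers can deploy in parallel
  RoutingStack:
    Type: AWS::CloudFormation::Stack
    DependsOn: SecurityGroupsStack
    Properties:
      TemplateURL: https://s3.amazonaws.com/example-templates/routing.yml

  DatabaseStack:
    Type: AWS::CloudFormation::Stack
    DependsOn: SecurityGroupsStack
    Properties:
      TemplateURL: https://s3.amazonaws.com/example-templates/database.yml
```

CloudFormation creates any stacks whose dependencies are satisfied concurrently, so only the truly sequential layers (VPC, then security groups) gate the rest of the deployment.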

The largest dependency, and the first layer, was the VPC. This layer defined resources like the VPC itself, subnets, gateways, and route tables. After the VPC stack was deployed, we kicked off the deployment of our security groups and ECS clusters.

The security groups were another significant dependency, as each of the resources in the subsequent layers would reference these groups in its own stack, e.g. if we wanted the Fargate services to have access to RDS, we’d specify ingress from the Fargate security group in the rules for the RDS security group. We put all of the security group information in the same template so there would be a single template to update any time access between different resources needed to change.
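That Fargate-to-RDS rule can be sketched like this (a simplified fragment, not our production template; the Postgres port is an assumption):

```yaml
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id

Resources:
  FargateSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Attached to the Fargate service tasks
      VpcId: !Ref VpcId

  RdsSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Attached to the RDS instance
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        # Allow database traffic only from the Fargate services
        - IpProtocol: tcp
          FromPort: 5432
          ToPort: 5432
          SourceSecurityGroupId: !Ref FargateSecurityGroup
```

Referencing the source security group rather than a CIDR range means the rule keeps working as Fargate tasks come and go with new IP addresses.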

After the security groups were created, the rest of the deployment could run in parallel. Reducing sequential dependencies was a key strategy to reduce the runtime of the deployment. We created the routing, database, and caching stacks. The routing stack consisted of the resources to build the API gateway, load balancers, and private namespace for internal networking. The database and caching stacks provisioned a database and cache from snapshots indicated in the CloudFormation parameters. Using parameters was key to keeping sensitive information out of the code repository. Parameters can be stored and referenced in S3, where proper access control can be enforced, or manually entered into the console and hidden.
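As a sketch (parameter and resource names are illustrative), a snapshot identifier and a secret can be declared as parameters, with `NoEcho` keeping the secret out of console and API output:

```yaml
Parameters:
  DbSnapshotIdentifier:
    Type: String
    Description: Snapshot to restore the database from

  DbMasterPassword:
    Type: String
    NoEcho: true  # masked in the console and in describe-stacks output

Resources:
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      # Restoring from a snapshot keeps data out of the template itself
      DBSnapshotIdentifier: !Ref DbSnapshotIdentifier
      DBInstanceClass: db.t3.medium
      Engine: postgres
```

The parameter values themselves live outside the repository, supplied at deploy time from an access-controlled S3 location or typed into the console.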

After all the infrastructure was deployed, it was time to deploy our services. We made our services environment aware, meaning they were able to deploy themselves into an existing environment, e.g. adding themselves to a load balancing target group. It was in this step that we gathered commonality between our services to build reusable templates with resource sets that could be configured through parameters. Here we realized sweeping gains with minimal effort, e.g. adding a nested stack to all services for alerting, observability, and logging. Now all new services at Tia have a starting point, and deploying them is as easy as defining the parameters, checking them into source control, and running a CloudFormation update.
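A reusable service template along those lines might be sketched as follows (names, URLs, and parameters are hypothetical; required properties like the task definition are omitted for brevity):

```yaml
Parameters:
  ServiceName:
    Type: String
  Environment:
    Type: String

Resources:
  Service:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Sub "${ServiceName}-${Environment}"
      # ... task definition, target group, and networking omitted

  # Shared nested stack that gives every service alerting, dashboards,
  # and log groups without per-service effort
  Observability:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/example-templates/observability.yml
      Parameters:
        ServiceName: !Sub "${ServiceName}-${Environment}"
```

Because the observability stack is nested rather than copied, improving an alarm or dashboard in one place upgrades every service on the next update.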

The future of IaC at Tia

In the future, we’re looking forward to leveraging IaC for our local environments and development processes. A dream for any engineer is to have a full-fledged cloud infrastructure running on their local machine so they can test their code as they develop in a more realistic environment. With IaC and tools like Docker and LocalStack, this is a reality. Additionally, having the ability to deploy an entire environment with a script creates the possibility of deploying full environments for feature development, so features can be independently validated without being mixed with changes from other engineers on the team. This is also close to reality, but will require us to update how we create and tear down stacks, as today we have explicit rules not to remove certain sensitive resources as a safety mechanism.

With a solid foundation deployed our engineering team feels well prepared for infrastructure changes in the future. We would recommend any engineering team to consider Infrastructure as Code early on in their infrastructure journey. It was an investment in Tia’s future that is already paying dividends.
