Cloud Governance Using Infrastructure as Code

A model for how to rationalize cloud environments and ensure consistency - reducing costs and increasing deployment speed and security.

Jan 24, 2023

Combating Resource Sprawl

As enterprises move their workloads to the cloud, there is usually an adoption journey that starts with pilot projects and smaller teams and ends with the platform exposed to the broader organization. The first hurdle they face is how to automate the deployment of new resources (VMs, networks, databases) and achieve consistent deployments. Thankfully, this is a well-known problem that can be addressed by Infrastructure as Code (IaC) tools, which convert deployment steps into code for repeatable use.

However, it is very common for these organizations and teams to adopt IaC tools in a haphazard way: different teams using different tools, no defined standards, little to no code reuse across teams, and no consistent security and compliance enforcement.

In this blog post we will propose a governance model leveraging Terraform, an open source IaC tool created by HashiCorp that meets the needs of enterprise environments by being modular, cloud agnostic and automation friendly.

But before diving in, let’s do a quick review of the different IaC tools and when to use them.

Infrastructure as Code Tools

As mentioned previously, IaC tools allow converting deployment steps into code for repeatable use. Here are the most popular ones, and in which situation they fit best:

Each tool serves a specific need. Ansible in particular can be a great complement to an IaC workflow: in a separate workflow it can configure VMs and generate “golden images”, which can be used to achieve a higher degree of standardization and immutability in cloud environments. Since our aim for this post is to cover standardization and combating resource sprawl in large organizations, we will focus on Terraform and the broader IaC workflow, and will cover the VM image creation workflow in a separate post.

There are plenty of in-depth resources online about Terraform that are relevant from a practitioner’s point of view. In the next section we will provide only a high-level overview of some of its key characteristics, so that we can focus on how it can be used to support a Cloud Governance Model.

Terraform Overview

Terraform was first released in 2014 by HashiCorp and is an open source tool with an enterprise version that has additional functionality. It is declarative, meaning that a developer describes the desired infrastructure as code and it is Terraform’s job to work out the sequence of operations needed to deploy it. It is also stateful, meaning that once infrastructure is deployed, a state file is created to help track the infrastructure’s lifecycle.

Terraform code can be grouped into modules, which are reusable blocks of code that help promote standardization and best practices. Modules published by vendors and the community can be found in the Terraform Registry. Terraform also supports multiple providers, which are plugins that let it manage resources on a given platform. Examples include all major cloud platforms like AWS and GCP, but also SaaS platforms and products such as DNSimple, Palo Alto and Artifactory. Finally, it supports multiple backends to store the state files. This flexibility is important, since access to the state file is required to execute changes on a specific deployment. Examples include a local file, S3, Postgres and others.

The following diagram highlights a traditional Terraform workflow:

Once the infrastructure declaration is captured in HCL, the init command downloads dependencies (modules and providers), the plan command shows the changes Terraform expects to make, and the apply command executes them, creating the infrastructure. From there, the code can be changed and applied again as many times as needed over the infrastructure’s lifecycle. At the end, the destroy command removes all the resources associated with the deployment.
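As a rough illustration, the sequence above could be scripted as follows. This is a minimal sketch, not a recommended pipeline: the function names and the tfplan file name are made up for this post, and it assumes the terraform binary is on the PATH.

```python
import subprocess
from typing import List


def workflow_commands(destroy: bool = False) -> List[List[str]]:
    """Return the Terraform CLI calls for one pass through the workflow."""
    commands = [
        # Download the providers and modules the code depends on.
        ["terraform", "init", "-input=false"],
        # Preview the changes and save them to a plan file (name is arbitrary).
        ["terraform", "plan", "-input=false", "-out=tfplan"],
        # Apply exactly the plan that was previewed.
        ["terraform", "apply", "-input=false", "tfplan"],
    ]
    if destroy:
        # End of the lifecycle: remove everything tracked in the state file.
        commands.append(["terraform", "destroy", "-auto-approve"])
    return commands


def run_workflow(workdir: str, destroy: bool = False) -> None:
    """Run each command in the directory holding the Terraform code."""
    for argv in workflow_commands(destroy):
        # check=True stops the sequence as soon as one command fails.
        subprocess.run(argv, cwd=workdir, check=True)
```

Saving the plan to a file and applying that file means the changes that were reviewed are the ones that get executed.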

An Initial Terraform Adoption Example

To illustrate with a real-world example, let’s consider a developer who wants to deploy a virtual machine on GCP. To keep it simple and to start using modules, they could use the project factory module to deploy a new project and enable GCP APIs, and use the compute instance resource with minimum configuration and the default network. They would need their GCP credentials, so following a tutorial they might generate the GCP API user keys associated with their account, set them as an environment variable, and use Terraform 0.13 locally because that is what they had installed. They might even configure the Terraform backend to use an existing GCP bucket to store the state file. Once this is codified, they can execute terraform init, followed by terraform plan, validate that everything looks good, and execute terraform apply. After a while they can go back to the code, change the disk in this VM and execute another plan and apply sequence, or decide that they don’t need the VM anymore and execute terraform destroy.

The above example highlights the power of Terraform - the developer was able to create needed resources by themselves without having to file a ticket and wait days, leveraged existing modules to reduce coding and was able to validate deployment before creating resources.

However, it also exposes the challenges large enterprises face as teams start to adopt IaC organically and without governance. Here are a few of the security and governance risks illustrated above:

Depending on an externally hosted module can be a security risk
Not enforcing module version tags can break future deployments
Not enforcing security guardrails, such as preventing the use of GCP’s default network, can be a security risk or result in excess costs
Depending on manual IaC deployments instead of a centralized, automated pipeline can cause inconsistencies and increase support requests
Not standardizing the Terraform version can break future deployments
Not having centralized state management can be a security risk, besides leading to untracked resources

Would it be possible to leverage Terraform’s strengths while using an approach suited for enterprise governance? Yes! Let’s see how in the next section.

Cloud Governance Model Using Terraform

The above diagram illustrates the three workflows used in the Cloud Governance Model. It aims to achieve standardization of versions, centralized visibility of deployments, and enforcement of best practices. I have personally worked with organizations that have implemented it successfully and seen a decrease of support tickets, increased speed of developer onboarding and reduction of cloud costs.

Let’s go over each workflow in the next sections.

1- Infrastructure CICD Workflow

This workflow captures the creation of the infrastructure and the management of its lifecycle by a team of developers. Unlike in the “Initial Terraform Adoption” example, here we see the use of a CICD pipeline, which can be combined with a GitOps workflow.

It starts with the developers, who store shared code in a Git repository (GitHub, GitLab, cloud-native source control, etc.) for version control. Once there are updates to the code, the repository triggers an event which notifies the CICD tool that there is a new deployment ready. After executing a preconfigured set of instructions, the CICD tool can automatically execute terraform apply, or alert specific individuals to approve or deny the deployment. For a deeper dive into the GitOps workflow, check here.

As with Git repositories, there are many options in the market that can serve the CICD tool role: Terraform Cloud, Jenkins, CircleCI and cloud-native solutions. Each has the concept of “pipelines” (or “jobs”, or “workspaces”), representing a collection of resources that will be tracked together and pointing to a specific Git repository. Once there are changes to the code and the CICD tool receives a notification, it can execute instructions such as linting, unit tests, guardrail checks and other validation steps. Most tools have the option to auto-apply code that passes the checks, or to notify users through Slack, email or other communication channels.
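To make the pipeline logic concrete, here is a minimal sketch of how a CICD tool might sequence its validation steps and then decide between auto-apply and manual approval. The Step class, the run_pipeline function and the decision labels are hypothetical names for this post, not the API of any of the tools above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # returns True when the step passes


def run_pipeline(steps: List[Step], auto_apply: bool) -> Tuple[str, List[str]]:
    """Run the steps in order and stop at the first failure.

    Returns the decision and the names of the steps that actually ran.
    """
    ran: List[str] = []
    for step in steps:
        ran.append(step.name)
        if not step.check():
            # A failed lint, test or guardrail blocks the deployment.
            return "rejected", ran
    # All checks passed: either apply directly or hand over to a human approver.
    return ("apply" if auto_apply else "await_approval"), ran


if __name__ == "__main__":
    # Stand-in checks; a real pipeline would call a linter, tests and guardrails.
    steps = [
        Step("lint", lambda: True),
        Step("unit tests", lambda: True),
        Step("guardrails", lambda: True),
    ]
    print(run_pipeline(steps, auto_apply=False))
```

Stopping at the first failed step keeps feedback fast and ensures that code which fails a check never reaches terraform apply.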

The advantages of this centralized workflow include:

Standardizing on a Terraform version per pipeline, controlled at the CICD level
Visibility of all IaC deployments in the company’s cloud environment
Centralized state management
Quicker developer onboarding
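As an example of the first advantage, a pipeline could verify the Terraform version before running anything else. The sketch below parses the output of terraform version -json, which reports the running version in its terraform_version field. The function name and the version numbers are illustrative assumptions.

```python
import json


def terraform_version_ok(version_json: str, required: str) -> bool:
    """Compare the version reported by `terraform version -json` with the
    version this pipeline has standardized on."""
    info = json.loads(version_json)
    return info.get("terraform_version") == required


if __name__ == "__main__":
    # Sample output, trimmed to the fields used here; the values are made up.
    sample = '{"terraform_version": "1.3.7", "platform": "linux_amd64"}'
    print(terraform_version_ok(sample, required="1.3.7"))
```

An exact match is the strictest policy. A pipeline could just as well accept a range of versions.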

Ideally, the code created by developers references internal modules, which are created and maintained in the following workflow.

2- Modules Workflow

This workflow captures the creation of Terraform modules and the management of their lifecycle by a team of cloud architects. Usually these are individuals who have greater experience with Terraform and with the specific cloud platforms used.

Each Terraform module codifies repeatable infrastructure patterns and helps standardize deployments. For example, an organization can define a module to quickstart deployments - perhaps creating a project, a VPN, 2 subnets, a bastion host and another backend VM, and configuring networking and firewalls with sensible values. This would ensure that every new app deployed in the organization’s cloud platform follows the same set of best practices.

Alternatively, modules can also be used for opinionated deployments, enforcing strict configuration options such as a storage bucket configured for private-only access, which can then be validated by using guardrails, as we will see in the next section.

In the previous “Initial Terraform Adoption” example, our developer used a module hosted in the Terraform Registry - a public module repository with contributions from vendors and the open source community. For enterprise environments, the ideal workflow is for cloud architects to fork the original module in Git and publish it to an internal repository that is privately managed by the organization. By default, Terraform will attempt to retrieve registry modules from the public registry. Private sources can be specified in the module declaration, which supports Git repos, cloud bucket storage and other sources.

Finally, modules also allow version tagging. This is particularly useful in enterprise settings because it allows the cloud architects to codify improvements and manage the lifecycle of existing modules without impacting existing deployments that leverage such modules. Note that a change to the module won’t impact existing deployments. However, if an untagged module is referenced in a new deployment, Terraform will retrieve the latest version, which can cause unintended consequences.
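To sketch how the use of version tags could be checked automatically, the function below inspects the configuration section of a plan exported as JSON (for example with terraform show -json) and lists module calls that carry neither a version constraint nor a Git ref. The function name, the example sources and the version numbers are illustrative. The sketch only looks at the root module, and it treats any ref as pinned even though a ref can also point at a moving branch.

```python
from typing import Dict, List


def unpinned_modules(plan: Dict) -> List[str]:
    """Return the names of root module calls without a pinned version."""
    calls = (
        plan.get("configuration", {})
        .get("root_module", {})
        .get("module_calls", {})
    )
    unpinned = []
    for name, call in calls.items():
        source = call.get("source", "")
        if source.startswith("./") or source.startswith("../"):
            continue  # local modules are versioned together with the calling code
        has_constraint = bool(call.get("version_constraint"))  # registry modules
        has_git_ref = "ref=" in source                          # Git sources
        if not (has_constraint or has_git_ref):
            unpinned.append(name)
    return sorted(unpinned)


# Hand-written stand-in for the JSON plan; only the fields used above are included.
EXAMPLE_PLAN = {
    "configuration": {"root_module": {"module_calls": {
        "project": {
            "source": "terraform-google-modules/project-factory/google",
            "version_constraint": "~> 14.0",
        },
        "network": {"source": "git::https://git.example.com/modules/network.git"},
        "bastion": {"source": "git::https://git.example.com/modules/bastion.git?ref=v1.2.0"},
        "labels": {"source": "./modules/labels"},
    }}}
}

if __name__ == "__main__":
    print(unpinned_modules(EXAMPLE_PLAN))
```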

Like the opinionated deployments described above, enforcing the use of version tags when using modules is one of the foundations to achieve cloud governance. Both can be implemented with the use of guardrails, which we will cover in the next section.

3- Guardrails Workflow

The guardrails workflow captures the creation of guardrails and the management of their lifecycle by a team of security and compliance specialists. Guardrails are a specialized type of test, focused on risk and compliance controls that can cover security, cost control and overall IaC best practices.

There can be a debate about whether this workflow is truly needed, since it can be argued that this level of governance can be achieved with opinionated modules (or blueprints!) and, as of 0.13, Terraform custom validation rules. However, this is a great example of “better together”, where strict configuration options and validation rules complement the guardrails. Guardrails ensure that, as long as the CICD pipeline is enforced, the security controls won’t be bypassed. There is also a benefit to decoupling the validation logic and having it maintained by a specialist team, who can focus on the policy’s lifecycle independently of the module and deployment lifecycles. Security specialists can release a new guardrail policy without requiring that a new module version be adopted by all dev teams.

There are many tools that offer guardrail functionality. They range from Sentinel, available as a paid feature of Terraform Cloud/Enterprise, which is great for organizations requiring corporate support and SLAs, to OPA (Open Policy Agent), an open source policy engine which works for organizations that are comfortable with community-supported solutions. Here is a GitHub example of guardrails using OPA policies and workflow, which will be featured in an upcoming blog post.

Examples of guardrail checks range from limiting the type of VMs that can be created in an environment (for example, preventing expensive GPUs in dev envs) and the traditional checks against exposed firewall ports, to ensuring that only certain versions of a module or provider are accepted.
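Tools such as Sentinel and OPA express these checks in their own policy languages. Purely to illustrate the logic, here is a minimal Python sketch that applies two of the checks above to a plan exported as JSON (for example with terraform show -json). The allowlist and the function name are assumptions for this post, and the firewall check only looks for rules open to any source address.

```python
from typing import Dict, List

# Hypothetical allowlist of machine types for a dev environment.
ALLOWED_DEV_MACHINE_TYPES = {"e2-micro", "e2-small", "e2-medium"}


def guardrail_violations(plan: Dict) -> List[str]:
    """Return one message per planned resource that breaks a guardrail."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # ignore no-ops and pure deletions
        after = change.get("after") or {}  # planned values after the change
        address = rc.get("address", "<unknown>")

        if rc.get("type") == "google_compute_instance":
            machine_type = after.get("machine_type")
            if machine_type not in ALLOWED_DEV_MACHINE_TYPES:
                violations.append(
                    f"{address}: machine type {machine_type} is not allowed in dev"
                )

        if rc.get("type") == "google_compute_firewall":
            if "0.0.0.0/0" in (after.get("source_ranges") or []):
                violations.append(f"{address}: firewall rule is open to 0.0.0.0/0")
    return violations


# Hand-written stand-in for the JSON plan; only the fields used above are included.
EXAMPLE_PLAN = {
    "resource_changes": [
        {"address": "google_compute_instance.web", "type": "google_compute_instance",
         "change": {"actions": ["create"], "after": {"machine_type": "e2-small"}}},
        {"address": "google_compute_instance.train", "type": "google_compute_instance",
         "change": {"actions": ["create"], "after": {"machine_type": "a2-highgpu-1g"}}},
        {"address": "google_compute_firewall.ssh", "type": "google_compute_firewall",
         "change": {"actions": ["create"], "after": {"source_ranges": ["0.0.0.0/0"]}}},
    ]
}

if __name__ == "__main__":
    for message in guardrail_violations(EXAMPLE_PLAN):
        print(message)
```

Because the check runs against the plan rather than the code, it sees the values Terraform intends to deploy, including those set inside modules.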

Guardrails can also be an excellent tool to reduce the regulatory compliance burden for standards such as PCI DSS, FedRAMP, HIPAA and others. If the CICD deployment pipelines are enforced and the controls tracked by these standards are mapped to policy guardrails, an organization gains evidence that the regulatory controls are met. This can save time and resources during an audit, besides helping maintain a state of continuous compliance. There are many efforts to map these regulatory controls to a guardrail format, unfortunately without a universally harmonized approach, but they can serve as a vision of the possible. Examples include Palo Alto’s Prisma Cloud policies, Azure native controls, and the Cloud Native Computing Foundation (CNCF) Security Controls Catalogue initiative.

Usually guardrails are stored in a Git repository, and version tracking is not as relevant as with modules, since the latest version should be the one applied - ideally after tests to ensure it won’t break future deployments. The guardrails workflow adds a new step to the original Terraform workflow:

In the Cloud Governance Model, calling the guardrail execution is the responsibility of the CICD tool. As described in the first workflow, this tool allows a sequence of instructions to be executed once there are updates to the Git repository being tracked. Some guardrail tools provide different levels of alerts and associated logic depending on the test results. For example, a failed check can be bypassed (perhaps a stricter test not relevant to a dev environment), it can generate a warning (perhaps just a note on an upcoming deprecation), or it can fail an infrastructure deployment which does not meet the required conditions.
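The three outcomes described above can be sketched as enforcement levels attached to each check. The level names and the evaluate function below are made up for illustration. Real tools use their own terms, such as Sentinel’s advisory, soft-mandatory and hard-mandatory levels.

```python
from enum import Enum
from typing import List, Tuple


class Level(Enum):
    BYPASS = "bypass"  # failure is ignored in this environment
    WARN = "warn"      # failure is reported but does not block
    FAIL = "fail"      # failure blocks the deployment


def evaluate(results: List[Tuple[str, Level, bool]]) -> Tuple[bool, List[str]]:
    """Take (check name, level, passed) tuples and return whether the
    deployment may proceed, plus the messages to show the user."""
    allowed = True
    messages = []
    for name, level, passed in results:
        if passed or level is Level.BYPASS:
            continue
        if level is Level.WARN:
            messages.append(f"WARNING: {name}")
        else:
            messages.append(f"FAILED: {name}")
            allowed = False
    return allowed, messages


if __name__ == "__main__":
    results = [
        ("strict tagging policy", Level.BYPASS, False),
        ("provider version nearing deprecation", Level.WARN, False),
        ("no public storage buckets", Level.FAIL, True),
    ]
    print(evaluate(results))
```

Keeping the level separate from the check itself lets the same guardrail be advisory in a dev environment and blocking in production.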

The combination of the automation and centralization achieved by the CICD workflow, the flexibility and reuse provided by the modules workflow, and the governance and security brought by guardrails results in a powerful Cloud Governance Model. However, as with any endeavor in an enterprise setting, there are a few caveats, discussed in the next section.

Risks With The Model

The above Cloud Governance Model follows best practices used by organizations that have successfully adopted IaC in their cloud journey. However, here are a few caveats that deserve attention to ensure successful adoption:

Lack of developer adoption or friction
Cloud Architects must understand that developers can have different maturity levels with IaC, and might have strong opinions about which tool or workflow is best. To facilitate adoption, architects should devise a clear communication plan articulating the benefits to the developers (code reuse, standardization, speed), and take steps for continuous onboarding and education of the developer teams.
Maturity level disconnect
This risk is similar to the one above, but arises when there is a more severe gap in technological maturity between Cloud Architects and developers. In some organizations developers still rely heavily on manual processes and prefer a more visual UI approach instead of coding. This audience might think that Gitflow discussions are alien to their reality, and can perhaps be categorized as “cloud users” instead of developers. Ideas to address this disconnect include creating a separate environment (GCP folder, AWS organization, etc.) for these users and restricting their permissions, leveraging integrations such as Terraform Cloud’s integration with ServiceNow, which provides a more “ticket-based” deployment experience, or using solutions such as AWS Service Catalog calling a Terraform server.
Reduction of speed because of locked down options
Cloud Architects and the Security and Compliance team must strike the right balance between security and enforcement vs developer speed and flexibility. Terraform and other IaC tools excel at allowing developers to self-serve infrastructure, which in an enterprise environment should be guarded and governed, but within reason. One way of achieving this balance is ensuring both teams are not establishing rules in a vacuum, by inviting developers to the decision process and by creating modules and guardrails based on patterns already in use by the developers.
IaC all the things
As seen in the previous sections, Infrastructure as Code in general, and Terraform specifically, have features that allow for a successful Cloud Governance Model. However, even though Terraform supports a variety of providers besides the cloud platforms, in my personal opinion it is not a great fit for Kubernetes workflows. Yes, one should use it to deploy Kubernetes clusters, but the cluster configuration and other operations should ideally be done in a separate workflow not covered by this model. Kubernetes is a universe in itself, and it even has its own guardrail tools such as Gatekeeper, Falco and others. The challenge when these workflows start to entwine is that Terraform is heavily reliant on the state file. Any workflow that manages its own state and lifecycle, such as Kubernetes, other app orchestrators such as Nomad, golden image creation, and to some extent serverless functions, should therefore be tracked separately from the IaC workflows described in the above Cloud Governance Model.

Next Steps

Organizations vary in their technology maturity level, and to some the model and workflows presented here might seem overwhelming. Similar to a sedentary person planning to run a marathon, the ideal approach is to map your journey, establish milestones and collect quick wins.

Here are a few suggestions of where to start:

Identify teams already using Terraform in the organization and start standardizing modules, providers and Terraform versions
Create modules based on recurring patterns
Schedule meetings with the three teams - architects, developers and security - to plan the journey
Start CICD with pilot teams for early wins
Communicate, communicate, communicate!

What about you, have you experienced similar workflows in your organization? How was the adoption journey? Do you have strong feelings about one of the IaC tools discussed here?

Feel free to add your thoughts, questions and suggestions in the comment section below!

Compelling Cloud Substack
