Navigating Safety: A Beginner's Guide to Implementing Terraform Guardrails with OPA
A gentle introduction to developing OPA policies for Terraform deployments
Terraform is a powerful tool for provisioning infrastructure in the cloud, but it can also be a source of risk if not used properly. One way to mitigate this risk is by using Open Policy Agent (OPA) and Rego, a policy language that is natively supported by OPA, as guardrails for your Terraform deployments.
In this blog post I will go over the fundamentals of creating OPA policies, focusing on protecting cloud infrastructure deployments. The aim is to keep things simple while achieving three outcomes:
Write code to prevent unsafe Terraform deployment
Use test-driven development
If possible, specify enforcement levels
While the concept of policy as code and the above goals might seem straightforward to some, in the past I have found it surprisingly unintuitive to figure out how to create and execute policies that meet these requirements. OPA is very robust, and there are many related projects and ways of applying Rego policies that can lead down rabbit holes.
Topics such as querying external sources for dynamic values, drift detection, optimized CICD use, client/server policy registries and container guardrails are all outside the scope of this post, but references to them can be found in the last section.
With the goals defined, let’s go over the basic aspects of OPA.
OPA and Rego Overview
Despite sharing its name with a catchy Eurovision tune, a 1959 British Parliament act and a discontinued Toyota car, OPA stands for "Open Policy Agent", a project incubated by the Cloud Native Computing Foundation (CNCF) that has since graduated to the "mature" stage.
It is an open source, general-purpose policy engine that can be used to enforce rules and constraints across a variety of systems and platforms, including cloud infrastructure provisioned with Terraform. Rego, on the other hand, is a declarative policy language that is natively supported by OPA and can be used to express complex rules and constraints in a clear and readable way.
Together, OPA and Rego can be used to enforce a wide range of policies for your Terraform deployments, including:
Enforcing naming conventions for resources
Enforcing resource limits
Enforcing security best practices
Enforcing compliance with regulatory requirements
One of the key benefits of using OPA and Rego with Terraform is that it allows you to codify your policies and constraints in a way that is easy to understand and maintain. This makes it much easier to keep your infrastructure deployments in compliance with your organization's policies and standards, and to quickly identify and remediate any issues that may arise.
In summary: although the two names are often used interchangeably, OPA is the general-purpose policy engine, and Rego is the declarative language in which the policies that OPA ingests are written.
Now let’s write a policy!
Rego Policies
Rego policies are declarative: they navigate the data structure hierarchy of a Terraform plan to validate specified values.
There are many ways of writing a policy. The following commented example uses the structure:
Package definition
Rules
Helper functions
# storage_bucket_private.rego

# 1. Package definition
# Defines the package name that will contain this policy.
# Any name can be given; it will be used when calling this policy,
# to create namespaces and avoid naming conflicts
package terraform

# 2. Rule
# First rule; any rule name is acceptable, "deny" is the convention.
# This rule checks that a Bucket Access Control entry does not grant
# access to "allUsers", which would make the bucket public
deny[msg] {
    changeset := input.resource_changes[_]
    is_create_or_update(changeset.change.actions)
    changeset.type == "google_storage_bucket_access_control"
    changeset.change.after.entity == "allUsers"
    msg := sprintf("%-40s :: GCS buckets must not be PUBLIC", [
        changeset.name
    ])
}

# 2. Rule
# Another rule, same idea as above but checking a different
# resource and parameter.
# Checks google_storage_bucket_acl for public predefined ACLs
deny[msg] {
    bad_acls := ["publicRead", "publicReadWrite"]
    changeset := input.resource_changes[_]
    is_create_or_update(changeset.change.actions)
    changeset.type == "google_storage_bucket_acl"
    changeset.change.after.predefined_acl == bad_acls[_]
    msg := sprintf("%-40s :: GCS buckets must not use predefined ACL '%s'", [
        changeset.name, changeset.change.after.predefined_acl
    ])
}

# 3. Helper functions
# is_create_or_update succeeds when the planned Terraform actions
# include "create" or "update", so the rules above only fire for
# resources that are being created or modified (not destroyed)
is_create_or_update(actions) { actions[_] == "create" }
is_create_or_update(actions) { actions[_] == "update" }
This example highlights the importance of understanding the platform that will be used, in order to accurately translate risk and security controls into a policy. Here, the original control could be "cloud storage buckets in GCP cannot be public". To validate this requirement, a security dev creating the policy should know that there are two ways a GCP Cloud Storage bucket can be made public - with Bucket Access Control entries and with predefined ACLs.
The above template can be updated to match other resources and other values. For example, to ensure a GCP VM is only deployed in the us-central1-a zone, one would use
...
changeset.type == "google_compute_instance"
changeset.change.after.zone != "us-central1-a"
...
The values for available resources and parameter names can be found in the provider’s documentation.
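Filling that fragment into the full template from above, a complete rule for this control might look like the following sketch (the message text is illustrative):
deny[msg] {
    changeset := input.resource_changes[_]
    is_create_or_update(changeset.change.actions)
    changeset.type == "google_compute_instance"
    changeset.change.after.zone != "us-central1-a"
    msg := sprintf("%-40s :: compute instances must be deployed in us-central1-a", [
        changeset.name
    ])
}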
For easier maintenance, it is best practice to separate distinct policies as separate files - one to prevent public cloud buckets, another to block open firewall ports, a third to limit VM machine size, etc.
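For example, a policy folder might look something like this (the file names beyond the ones used in this post are illustrative):
policies/
    storage_bucket_private.rego         # deny public buckets (this post)
    storage_bucket_private_test.rego    # tests for the policy above
    storage_bucket.mock.json            # mocked terraform plan output
    compute_zone_restricted.rego        # hypothetical: restrict VM zones
    firewall_no_public_ingress.rego     # hypothetical: block open firewall ports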
Once a policy is created it is time to test it before bringing it into production. Tests are covered in the next section.
Testing a Policy
Test-driven development is a great practice: it not only ensures that the policies work as intended, it also serves as documentation of what is allowed in an environment and provides visibility into the percentage of code covered by tests. Many open source projects and organizations do not accept contributions that do not include tests.
There are two components to testing an OPA policy - the tests to be executed, and the mocked data.
Here is a template for a test of the above policy:
# storage_bucket_private_test.rego

# Package declaration
package terraform

# Tests
# "deny" is the rule name defined in the policy
# "with input as ..." swaps the input for a block of mocked data
#   (note the extra .tfplan level, matching the mock file's structure)
# "count(result)" counts how many violations the deny rule returned.
# Based on how the policy was written, violations are only returned
# for non-compliant input
test_valid_acl {
    result = deny with input as data.mock.valid_bucket_acl.tfplan
    count(result) == 0 # given mock, expect no violations
}

test_invalid_acl {
    result = deny with input as data.mock.invalid_bucket_acl.tfplan
    count(result) == 1 # given mock, expect one violation
}

test_valid_control {
    result = deny with input as data.mock.valid_access_control.tfplan
    count(result) == 0 # given mock, expect no violations
}

test_invalid_control {
    result = deny with input as data.mock.invalid_access_control.tfplan
    count(result) == 1 # given mock, expect one violation
}
These tests verify that the policy behaves as expected both when the Terraform deployment meets the requirement (private bucket) and when it does not (public bucket). Since there are two ways to make a bucket public, we end up with four tests.
What about the mock data? This is just the json formatted output of a Terraform plan, obtained by executing
# Generates plan result in binary format
terraform plan --out tfplan.binary
# Converts from binary format to json
terraform show -json tfplan.binary > tfplan.json
For organizations with larger teams and following the Cloud Governance Model, it can be challenging for the Security and Compliance Team to create these mocks from large Terraform repositories. A solution is to isolate the blocks of code that matter in the original Terraform, and run the plan separately. Alternatively, smaller Terraform deployments can be created to generate the desired output for the tests.
Here is a mock data example for the private bucket tests (storage_bucket.mock.json). Note the "mock" and "tfplan" wrapper levels around the plan output, which the tests above reference as data.mock.<case>.tfplan:
{
  "mock": {
    "valid_bucket_acl": {
      "tfplan": {
        "format_version": "0.1",
        "terraform_version": "0.14.4",
        "resource_changes": [
          {
            "address": "google_storage_bucket_acl.test_acl",
            "mode": "managed",
            "type": "google_storage_bucket_acl",
            "name": "test_acl",
            "provider_name": "registry.terraform.io/hashicorp/google",
            "change": {
              "actions": [
                "create"
              ],
              "before": null,
              "after": {
                "bucket": "pg-gcs-1",
                "default_acl": null,
                "predefined_acl": "authenticatedRead"
              },
              "after_unknown": {
                "id": true,
                "role_entity": true
              }
            }
          }
        ]
      }
    },
    "invalid_bucket_acl": {
      "tfplan": {
        "format_version": "0.1",
        "terraform_version": "0.14.4",
        "resource_changes": [
          {
            "address": "google_storage_bucket_acl.test_acl",
            "mode": "managed",
            "type": "google_storage_bucket_acl",
            "name": "test_acl",
            "provider_name": "registry.terraform.io/hashicorp/google",
            "change": {
              "actions": [
                "create"
              ],
              "before": null,
              "after": {
                "bucket": "pg-gcs-1",
                "default_acl": null,
                "predefined_acl": "publicRead"
              },
              "after_unknown": {
                "id": true,
                "role_entity": true
              }
            }
          }
        ]
      }
    },
    "valid_access_control": {
      "tfplan": {
        "format_version": "0.1",
        "terraform_version": "0.14.4",
        "resource_changes": [
          {
            "address": "google_storage_bucket_access_control.private_rule",
            "mode": "managed",
            "type": "google_storage_bucket_access_control",
            "name": "private_rule",
            "provider_name": "registry.terraform.io/hashicorp/google",
            "change": {
              "actions": [
                "create"
              ],
              "before": null,
              "after": {
                "bucket": "pg-gcs-1",
                "entity": "user-jane@example.com",
                "role": "READER",
                "timeouts": null
              },
              "after_unknown": {
                "domain": true,
                "email": true,
                "id": true
              }
            }
          }
        ]
      }
    },
    "invalid_access_control": {
      "tfplan": {
        "format_version": "0.1",
        "terraform_version": "0.14.4",
        "resource_changes": [
          {
            "address": "google_storage_bucket_access_control.public_rule",
            "mode": "managed",
            "type": "google_storage_bucket_access_control",
            "name": "public_rule",
            "provider_name": "registry.terraform.io/hashicorp/google",
            "change": {
              "actions": [
                "create"
              ],
              "before": null,
              "after": {
                "bucket": "pg-gcs-1",
                "entity": "allUsers",
                "role": "READER",
                "timeouts": null
              },
              "after_unknown": {
                "domain": true,
                "email": true,
                "id": true
              }
            }
          }
        ]
      }
    }
  }
}
Looking at this mock data we can see the parameters being referenced in the Rego policy, using the . (dot) notation to navigate the hierarchy, such as
change.after.predefined_acl
This also highlights the flexibility of the language: Rego supports comparisons, set membership and other logical operations on any parameter exposed in the plan, so guardrails are not limited to simple equality checks.
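As an illustrative sketch (the approved machine types below are arbitrary, and the rule would live in the same terraform package so it can reuse the is_create_or_update helper), a rule could combine set membership and negation to restrict VM sizes:
# Approved VM sizes (illustrative values)
allowed_machine_types := {"e2-small", "e2-medium"}

deny[msg] {
    changeset := input.resource_changes[_]
    is_create_or_update(changeset.change.actions)
    changeset.type == "google_compute_instance"
    not allowed_machine_types[changeset.change.after.machine_type]
    msg := sprintf("%-40s :: machine type '%s' is not approved", [
        changeset.name, changeset.change.after.machine_type
    ])
}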
Having seen how to create a policy and its associated tests, let’s check how to execute them next.
CLI commands
OPA is available as source code, as a compiled CLI binary and as a Go library; download instructions can be found in the official OPA documentation. We will focus on the CLI. Make sure you have the binary on the PATH, otherwise change the following commands to "./opa" 🙂
Like any modern CLI, you can execute the following command for help with available commands:
opa -h
For this blog post we will only use two commands:
opa eval
opa test
These are the commands to run the test suite and to evaluate the output of a terraform plan, respectively.
Let’s start with the tests:
1. Go to the folder containing the three files - policy, test and mock data
2. Execute
opa test . -v
This will run the four tests listed in storage_bucket_private_test.rego and list the results. Using the “.” (dot) parameter on step 2 tells OPA to run the tests on all files present in this folder. Alternatively, you can specify which tests you want executed.
To execute the policies on a terraform deployment:
# Generates the terraform output that OPA will evaluate
terraform plan --out tfplan.binary
terraform show -json tfplan.binary > tfplan.json
# Execute the policy checks
# Here, "terraform" is the package name defined in the policy
# "deny" is the name of the function to be called
# "policies" is the folder containing the policies
# "input" is the json file created in the previous step
opa eval 'data.terraform.deny[x]' --data policies/ --input tfplan.json --format raw
Here "eval" runs the specified rule and returns any results. Following the same pattern as the tests, in a CICD pipeline we can wrap this call in logic that checks the return value and takes the appropriate action (for example, opa eval's --fail-defined flag makes the command exit with a non-zero code whenever the query returns results, which is convenient for scripting the check).
A Note on Violation Levels
By default, OPA and Rego do not offer different enforcement levels such as "Warning", "Soft Deny" and "Hard Deny". Such levels can be useful for adjusting requirements per environment (perhaps the dev environment is more permissive than prod), for documenting upcoming deprecations, or for nudging developers toward best practices.
Some creative logic can be used by defining rules with different names, such as "warn" alongside "deny", and having the calling pipeline treat their results differently depending on the use case. This is also a great example of the many rabbit holes found when learning OPA; tools with built-in enforcement levels are listed in the last section.
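As a minimal sketch of the "warn" idea (it reuses the is_create_or_update helper from the earlier policy; the uniform_bucket_level_access check is just an illustrative best-practice nudge, not a requirement from this post):
# Warn (rather than deny) when uniform bucket-level access is not
# enabled on a bucket - a nudge toward best practice, not a hard block
warn[msg] {
    changeset := input.resource_changes[_]
    is_create_or_update(changeset.change.actions)
    changeset.type == "google_storage_bucket"
    changeset.change.after.uniform_bucket_level_access == false
    msg := sprintf("%-40s :: consider enabling uniform bucket-level access", [
        changeset.name
    ])
}
The pipeline can then evaluate data.terraform.warn[x] in a separate opa eval call and surface the messages without failing the build.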
OPA in a CICD Pipeline
What we covered above was meant as a gentle introduction to OPA and Rego. In the real world, instead of manually calling Terraform and OPA, you should codify these commands in an automated CICD pipeline to achieve speed and consistency of deployments.
Any CICD tool will suffice - Jenkins, Circle CI, Cloud Build, even serverless functions in the cloud. At a minimum, the pattern of execution will be the same:
1. Automatically trigger execution from a Git hook
2. Execute terraform plan, outputting as a json file
3. Execute OPA checks
4. Logic to take appropriate action
The "appropriate action" in step 4 can vary - completely stop the deployment upon any violation, ignore the results, notify someone to decide what action should be taken, or execute a terraform apply (using "terraform apply --auto-approve" to avoid the prompt).
In the next section we will start wrapping up and see how OPA can be used by companies looking to achieve compliance.
Using OPA to Reduce Audit Toil
Guardrails in general, and OPA specifically, can be used to achieve a state of continuous compliance with industry regulations, such as PCI DSS, FedRAMP or HIPAA, as well as with internal security controls. This is done by mapping a regulatory control to an OPA policy that validates that the Terraform execution meets the specified requirements.
One way of tracking controls is to use OPA's support for annotations and document the control numbers addressed by a policy as part of its METADATA comment block. This approach can be particularly useful for organizations that must comply with multiple standards - after a harmonization exercise, controls from different standards can be mapped to the same OPA policy.
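For example, the bucket policy from earlier could carry its mapped controls in a METADATA annotation block (the control identifiers under "custom" are made up for illustration):
# METADATA
# title: GCS buckets must not be public
# description: Blocks Terraform changes that would expose a bucket to allUsers
# custom:
#   controls:
#     - INTERNAL-SEC-042   # hypothetical internal control number
#     - PCI-DSS-1.2.3      # hypothetical mapping, for illustration only
deny[msg] {
    changeset := input.resource_changes[_]
    is_create_or_update(changeset.change.actions)
    changeset.type == "google_storage_bucket_access_control"
    changeset.change.after.entity == "allUsers"
    msg := sprintf("%-40s :: GCS buckets must not be PUBLIC", [changeset.name])
}
These annotations can then be extracted programmatically when generating compliance reports.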
Another impactful use of OPA goes beyond deployment guardrails: using it to continuously validate the infrastructure itself. An organization further along in its cloud adoption journey might have fully shifted security to the left, ensuring that no infrastructure creation or changes happen outside the CICD pipeline - in this case OPA as guardrails, as described in the Cloud Governance Model, should suffice.
However, for companies still on the modernization journey, there is often a combination of automated and manual creation and updates to their infrastructure. In those cases, a CRON job or scheduled cloud function can be used to re-run the CICD pipeline at regular intervals and detect drift from the original Terraform code.
This approach is still limited because it only validates what is tracked in the Terraform state, but it serves as a starting point for a drift detection and remediation system.
The bottom line: in an ideal scenario there is no drift to detect and remediate, because creation and changes are not possible outside of the CICD pipeline; in reality, however, a useful level of drift detection and remediation can be achieved by accepting some compromises.
Having said that, in this blog post we presented the most basic use of OPA policies. In the next section you will find links to solutions that have ready-made policies, use more sophisticated architectures, provide enterprise support and cover additional types of workloads.
Additional Resources
Conftest, a related project in the OPA ecosystem with different enforcement levels (deny, violation and warn)
Kics by Checkmarx: OPA wrapper with ready-made policies, remediation and CICD integration
Easy-Infra: Kics wrapper with Docker images and additional functionalities
OPA APIs: For remote policy repository and log collection in a distributed architecture
OPA FAQ: Documenting keywords, operator precedence and best practices
CIS Controls: Great reference for ideas for cloud security policies
Vendors with solutions in the space: Wiz.io, Concourse Labs, Palo Alto PRISMA, Mondoo, Terraform Enterprise (Sentinel)
Gatekeeper: OPA for Kubernetes
Feel free to post your questions or thoughts in the comment section below!