Automatically Remediate AWS Control Tower Drift with the Async Multi-Account Factory Module

Background: The Problem of Drift in AWS Control Tower

When managing AWS accounts via Control Tower and Service Catalog, you may encounter an issue where OpenTofu/Terraform detects drift in your infrastructure state. This is particularly common when:

  • A new version of the Account Factory Provisioning Artifact is published
  • You move an account between Organizational Units (OUs)
  • Manual changes are made in the AWS Console or via API

In all of these cases, the provisioned_product_id changes behind the scenes, but OpenTofu/Terraform isn’t aware of it. When you next apply your infrastructure code, it attempts to reconcile this drift by updating every affected provisioned product, even if nothing else has changed.

This becomes a major problem at scale:

  • The update process is slow, especially for large organizations
  • AWS imposes a hard limit of 5 concurrent updates, so you're throttled quickly
  • OpenTofu/Terraform updates can take hours to complete
  • You risk timeouts, failed updates, and broken infrastructure state

The Fix: Introducing the Async Multi-Account Factory Module

To solve this, we’ve introduced a new module: control-tower-multi-account-factory-async

Instead of managing provisioned_product_id drift directly via OpenTofu/Terraform, this module uses an asynchronous workflow built from AWS native services:

Component | Role
EventBridge Rule | Listens for Service Catalog API calls like UpdateProvisioningArtifact and UpdateProvisionedProduct
Ingest Lambda | Finds outdated provisioned products and queues them for update
SQS FIFO Queue | Stores update jobs with strict ordering and deduplication
Worker Lambda | Applies the update and launches Step Functions
AWS Step Functions state machine | Monitors the update process and confirms success or failure
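
To make the first component more concrete, here is a minimal sketch of what an EventBridge rule matching those two API calls could look like. It is not taken from the module's source: the resource names, the rule name, and the ingest_lambda_arn variable are hypothetical, and the pattern assumes the calls reach EventBridge as CloudTrail management events.

```hcl
# Hypothetical input: ARN of the Lambda that should receive matching events.
variable "ingest_lambda_arn" {
  type = string
}

# Matches the Service Catalog API calls named in the table above,
# as recorded by CloudTrail and delivered to the default event bus.
resource "aws_cloudwatch_event_rule" "service_catalog_updates" {
  name        = "service-catalog-updates" # assumed name
  description = "Service Catalog artifact and provisioned product updates"

  event_pattern = jsonencode({
    source        = ["aws.servicecatalog"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["servicecatalog.amazonaws.com"]
      eventName = [
        "UpdateProvisioningArtifact",
        "UpdateProvisionedProduct",
      ]
    }
  })
}

# Routes matching events to the ingest Lambda.
resource "aws_cloudwatch_event_target" "ingest" {
  rule = aws_cloudwatch_event_rule.service_catalog_updates.name
  arn  = var.ingest_lambda_arn
}
```

A real deployment would also need an aws_lambda_permission resource allowing EventBridge to invoke the function. The async module creates its own rules, so this sketch is only useful for understanding what is being listened for.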

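This async approach operates as follows: the EventBridge rule picks up the relevant Service Catalog API calls and hands them to the ingest Lambda, which finds outdated provisioned products and queues them in the SQS FIFO queue. The worker Lambda then applies each update and starts a Step Functions execution that monitors it through to success or failure.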

Why is this better?

  • Drift is resolved outside OpenTofu/Terraform
  • Updates happen automatically, with no user action
  • Concurrency is controlled to avoid throttling
  • Your OpenTofu/Terraform applies stay fast and clean

Step-by-Step: Switching to the Async Module

  1. Update your terragrunt.hcl to use the new module

Replace this:

terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory?ref=VERSION"
}

With this:

terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory-async?ref=VERSION"
}

Note: No state migration is needed — this is a drop-in replacement.

  2. Apply your changes

Run terragrunt apply either directly or through GitHub Actions. This will deploy:

  • The new Lambda functions
  • SQS FIFO queue + DLQ
  • EventBridge rules for Service Catalog API monitoring
  • AWS Step Functions state machine
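
For readers unfamiliar with the queue setup in that list, the following is a rough sketch of an SQS FIFO queue paired with a dead-letter queue. The module provisions these for you, so this is illustration only; the resource names, queue names, and the maxReceiveCount value are assumptions rather than the module's actual settings.

```hcl
# Dead-letter queue. A FIFO queue's DLQ must itself be FIFO,
# and FIFO queue names must end in ".fifo".
resource "aws_sqs_queue" "account_updates_dlq" {
  name       = "account-updates-dlq.fifo" # assumed name
  fifo_queue = true
}

# Main queue holding update jobs in strict order.
resource "aws_sqs_queue" "account_updates" {
  name       = "account-updates.fifo" # assumed name
  fifo_queue = true

  # Deduplicate messages with identical bodies within the
  # SQS deduplication window.
  content_based_deduplication = true

  # After 3 failed receives (an assumed threshold), move the
  # message to the dead-letter queue instead of retrying forever.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.account_updates_dlq.arn
    maxReceiveCount     = 3
  })
}
```

Messages that land in the DLQ represent updates that repeatedly failed, so it is a sensible place to look first when an account does not get remediated.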

After apply, drifted provisioned_product_id values will be remediated whenever UpdateProvisioningArtifact or UpdateProvisionedProduct API calls occur.

Note: If your environment is already in a drifted state, you will need to trigger UpdateProvisioningArtifact or UpdateProvisionedProduct to initiate drift remediation. The simplest way to do this is to deactivate and reactivate the current provisioning artifact version.

Optional: Control Concurrency with lambda_worker_max_concurrent_operations

AWS Service Catalog currently enforces a hard limit of 5 concurrent account-related operations. That total covers provisioning, updating, and enrolling accounts. Exceeding this limit may result in throttling errors or failed updates.

To avoid hitting that limit (and prevent failed updates), you can configure the number of concurrent updates with the lambda_worker_max_concurrent_operations variable. Example:

inputs = {
  lambda_worker_max_concurrent_operations = 3 # example value; the module default is 4
}

This variable tells the worker Lambda never to initiate more than the configured number of updates at a time (3 in the example above), which can be used to leave headroom for other processes (like provisioning new accounts) to succeed.

Value | Behavior
5 | Max concurrency allowed by AWS (use with caution)
4 | The default set by the async module
<5 | Safe concurrency with headroom for other ops
1 | Serialized updates, safest but slowest
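
One way to keep the headroom reasoning explicit is to derive the value in terragrunt.hcl rather than hard-coding it. The sketch below assumes you want to reserve two of the five slots for other operations; the local names and the reserved count are hypothetical and not part of the module.

```hcl
locals {
  # Hard limit on concurrent account-related operations described above.
  service_catalog_operation_limit = 5

  # Slots kept free for other work, such as provisioning new accounts.
  # The value 2 is an assumption; pick what suits your organization.
  reserved_operations = 2
}

inputs = {
  # 5 - 2 = 3 concurrent updates for the worker Lambda.
  lambda_worker_max_concurrent_operations = local.service_catalog_operation_limit - local.reserved_operations
}
```

A terragrunt.hcl file has a single inputs attribute, so if yours already defines one, add the variable to the existing map instead of declaring a second.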