Automatically Remediate AWS Control Tower Drift with the Async Multi-Account Factory Module
Background: The Problem of Drift in AWS Control Tower
When managing AWS accounts via Control Tower and Service Catalog, you may encounter an issue where OpenTofu/Terraform detects drift in your infrastructure state. This is particularly common when:
- A new version of the Account Factory Provisioning Artifact is published
- You move an account between Organizational Units (OUs)
- Manual changes are made in the AWS Console or via API
In all of these cases, the provisioned_product_id changes behind the scenes, but OpenTofu/Terraform isn’t aware of it. When you next apply your infrastructure code, it attempts to reconcile this drift by updating every affected provisioned product, even if nothing else has changed.
This becomes a major problem at scale:
- The update process is slow, especially for large organizations
- AWS imposes a hard limit of 5 concurrent updates, so you're throttled quickly
- OpenTofu/Terraform updates can take hours to complete
- You risk timeouts, failed updates, and broken infrastructure state
The Fix: Introducing the Async Multi-Account Factory Module
To solve this, we’ve introduced a new module: control-tower-multi-account-factory-async
Instead of managing provisioned_product_id drift directly via OpenTofu/Terraform, this module uses an asynchronous workflow built from AWS native services:
Component | Role |
---|---|
EventBridge Rule | Listens for Service Catalog API calls like UpdateProvisioningArtifact and UpdateProvisionedProduct |
Ingest Lambda | Finds outdated provisioned products and queues them for update |
SQS FIFO Queue | Stores update jobs with strict ordering and deduplication |
Worker Lambda | Applies the update and launches Step Functions |
AWS Step Functions state machine | Monitors the update process and confirms success or failure |
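To make the first row of the table concrete, here is a minimal sketch of an EventBridge rule that matches the two Service Catalog API calls as they arrive via CloudTrail. It is an illustration only, not the module's actual source: the resource names, the rule name, and the ingest_lambda_arn variable are all hypothetical.

```hcl
# Hypothetical input: ARN of the Lambda that should receive matching events.
variable "ingest_lambda_arn" {
  type = string
}

# Match Service Catalog API calls recorded by CloudTrail.
resource "aws_cloudwatch_event_rule" "service_catalog_updates" {
  name        = "service-catalog-update-calls" # hypothetical name
  description = "Fires on Service Catalog artifact and product updates"

  event_pattern = jsonencode({
    source        = ["aws.servicecatalog"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["servicecatalog.amazonaws.com"]
      eventName = [
        "UpdateProvisioningArtifact",
        "UpdateProvisionedProduct",
      ]
    }
  })
}

# Route matching events to the ingest Lambda.
resource "aws_cloudwatch_event_target" "ingest_lambda" {
  rule = aws_cloudwatch_event_rule.service_catalog_updates.name
  arn  = var.ingest_lambda_arn
}
```

A real deployment would also need an aws_lambda_permission that allows events.amazonaws.com to invoke the function; it is omitted here to keep the sketch short.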
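This async approach operates as follows: the EventBridge rule triggers the ingest Lambda, which finds outdated provisioned products and queues them in the SQS FIFO queue. The worker Lambda then applies each update and starts a Step Functions execution that monitors it through to success or failure.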
Why is this better?
- Drift is resolved outside OpenTofu/Terraform
- Updates happen automatically, with no user action
- Concurrency is controlled to avoid throttling
- Your OpenTofu/Terraform applies stay fast and clean
Step-by-Step: Switching to the Async Module
- Update your terragrunt.hcl to use the new module
Replace this:
terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory?ref=VERSION"
}
With this:
terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory-async?ref=VERSION"
}
Note: No state migration is needed — this is a drop-in replacement.
- Apply your changes
Run terragrunt apply either directly or through GitHub Actions. This will deploy:
- The new Lambda functions
- SQS FIFO queue + DLQ
- EventBridge rules for Service Catalog API monitoring
- AWS Step Functions state machine
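For readers unfamiliar with the queue pairing in that list, the following is a rough sketch of what an SQS FIFO queue with a dead-letter queue looks like in OpenTofu/Terraform. The module creates its own queues, so you do not need to write this yourself; the queue names and the maxReceiveCount of 3 are assumptions chosen for illustration.

```hcl
# Dead-letter queue. A FIFO queue can only redrive to another FIFO queue.
resource "aws_sqs_queue" "updates_dlq" {
  name       = "provisioned-product-updates-dlq.fifo" # hypothetical name
  fifo_queue = true
}

# Main queue holding update jobs in strict order.
resource "aws_sqs_queue" "updates" {
  name       = "provisioned-product-updates.fifo" # hypothetical name
  fifo_queue = true

  # Deduplicate on message body so the same update is not queued twice
  # within the deduplication window.
  content_based_deduplication = true

  # After 3 failed receives (an assumed value), move the message to the DLQ.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.updates_dlq.arn
    maxReceiveCount     = 3
  })
}
```

Messages that land in the DLQ are a useful signal that an update repeatedly failed and needs manual attention.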
After apply, drifted provisioned_product_id values will be remediated whenever UpdateProvisioningArtifact or UpdateProvisionedProduct API calls occur.
Note: If your environment is already in a drifted state, you will need to trigger UpdateProvisioningArtifact or UpdateProvisionedProduct to initiate drift remediation. The simplest way to do this is to deactivate and reactivate the current provisioning artifact version.
Optional: Control Concurrency with lambda_worker_max_concurrent_operations
AWS Service Catalog currently enforces a hard limit of 5 concurrent account-related operations, which include provisioning, updating, and enrolling accounts. Exceeding this limit may result in throttling errors or failed updates.
To stay under that limit, you can configure the number of concurrent updates with the lambda_worker_max_concurrent_operations variable. Example:
inputs = {
  # Replace X with an integer from 1 to 5; the default is 4.
  lambda_worker_max_concurrent_operations = X
}
This variable tells the worker Lambda never to initiate more than X updates at a time, which leaves headroom for other processes (like provisioning new accounts) to succeed.
Value | Behavior |
---|---|
5 | Max concurrency allowed by AWS (use with caution) |
4 | The default set by the async module |
<5 | Safe concurrency with headroom for other ops |
1 | Serialized updates, safest but slowest |
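Putting the pieces together, a terragrunt.hcl that uses the async module and lowers concurrency might look like the sketch below. The value 3 is only an example chosen to leave two of the five slots free for other operations; VERSION remains a placeholder for the release you pin, and any inputs you already pass to the module would stay alongside the new one.

```hcl
terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory-async?ref=VERSION"
}

inputs = {
  # ... your existing inputs stay here unchanged ...

  # Example value: at most 3 updates in flight, leaving headroom
  # under the Service Catalog limit of 5 concurrent operations.
  lambda_worker_max_concurrent_operations = 3
}
```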