Using Terragrunt for large scale disaster recovery
Do you have any examples or case studies on using Terragrunt for a large-scale disaster recovery operation? For example: deploying 100+ resources to a new cloud region in the event of a disaster in another region. What is your recommendation for how to use Terragrunt for such an event?
We can't provide advice on disaster recovery in general, as it will be VERY use case specific. E.g., What data do you have? How are you replicating it? What sort of consistency requirements do you have? How do you handle DNS failover? And so on.

All we can comment on here is, in relation to Terragrunt, if you have an environment `env1` deployed, how can you structure your code so it's easy to quickly spin up a duplicate environment `env2` (e.g., for disaster recovery)? Here's the rough idea:

1. Create `env1` using Terragrunt (i.e., folders + `terragrunt.hcl` files), with each `source` URL pointing to a specific version of your modules.
2. Ensure that any parameters that must be set differently in each environment (e.g., domain names, CIDR blocks) are extracted into top-level variables, such as an `env.hcl` file at the root of the environment.
3. When you need to spin up an `env2`, create a copy of the `env1` folder named `env2`.
4. Go into the `env.hcl` file in `env2` and update the environment-specific values.
5. Take care of any manual steps that are necessary before spinning up the new environment. These depend on your use case, but it's common to have certain things managed outside of your infrastructure code, by design: e.g., writing secrets to your secrets store (e.g., AWS Secrets Manager or Vault); buying a new domain name; etc.
6. Run `terragrunt run-all apply` from the `env2` folder.
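
As a rough illustration of steps 1 and 2, an `env.hcl` at the root of the environment might look something like the sketch below. The variable names and values are hypothetical, not taken from any real setup; the point is only that everything environment-specific lives in one file.

```hcl
# env1/env.hcl
# Hypothetical example: every value that differs between environments goes here.
locals {
  env_name    = "env1"
  aws_region  = "us-east-1"
  domain_name = "env1.example.com"
  vpc_cidr    = "10.0.0.0/16"
}
```

Each module folder can then read those values with Terragrunt's built-in `read_terragrunt_config` and `find_in_parent_folders` functions. The module repository URL, the `v1.2.0` tag, and the input names below are assumptions made for the example; your modules will define their own variables.

```hcl
# env1/vpc/terragrunt.hcl
terraform {
  # Pin to a specific module version so env2 is built from the same code as env1.
  # The repository URL and tag are placeholders.
  source = "git::git@github.com:example-org/modules.git//vpc?ref=v1.2.0"
}

locals {
  # Walk up the folder tree to find this environment's env.hcl and load it.
  env = read_terragrunt_config(find_in_parent_folders("env.hcl"))
}

inputs = {
  # Hypothetical module inputs, all derived from env.hcl.
  name       = "vpc-${local.env.locals.env_name}"
  cidr_block = local.env.locals.vpc_cidr
  aws_region = local.env.locals.aws_region
}
```

With a layout along these lines, copying `env1` to `env2` (step 3) leaves the module folders untouched, and step 4 reduces to editing the handful of values in `env2/env.hcl`. If you keep remote state configuration in the tree as well, check that the copied environment resolves to its own state location rather than reusing the one from `env1`.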