The lifecycle of terraform plan/apply usage in CI pipelines for pull requests
Hey, I'd like to ask a more general question from the Terraform/Terragrunt world. I expect to learn something, since what I was able to find on the internet kind of contradicts my everyday experience.

In the application world, on a feature branch one just builds the thing, deploys it somewhere, tests it, iterates, gets it approved, and then it's merged. Some additional testing and deploying may be done on the main branch.

When it comes to Terraform code, everything I was able to find talks about running `terraform plan` in the PR. Only once it's approved and merged to the main branch is `terraform apply` run.

In the real world, however, a successful plan is nowhere near a guarantee that apply will succeed. Possible reasons for an OK plan but a failed apply:

- Remote APIs often run many custom validations
  - health check interval cannot be less than 5s, resource name too long, config missing, invalid combination of settings, etc.
  - sure, one learns the most common ones over time, but with 100s of different resource types for every cloud provider this is bound to happen even to the most senior engineers
- Resource cannot be created at all because some external restriction is hit
  - S3 bucket with this name already exists, quota for number of instances in a region reached, etc.
- Plan may be run against a different state than apply
  - between the last plan run and the merge, many hours/days can pass
  - unless there is always a single PR being worked on at any given time
- If it is the first time a resource is being re-created, it might error out
  - dependency between resources is not configured properly (first creation is usually gradual, the second one is done all at once by TF)
  - configs like `create_before_destroy` might be missing

If any of these happens after a merge, it means just one thing: open another PR and continue the work. In other words, **running apply is often an important part** of creating the code to be merged. This already goes against what the internet tells me about plan on branch and apply after merge.
My fuzzy feelings:

Non-production environments

- Ones with low availability requirements: where new infra things are tested before prod, or where a dev instance of the app is running and 10 minutes of downtime is not a big deal, etc.
- Here I would **just run apply on the branch**
  - locally or in CI through an optional manual trigger
  - the CI job for the main branch should just check that the tf plan is empty and fail otherwise
- Wait, but this will affect all other people working on parallel PRs!
  - Yes, it will. But unless we have some very good solutions to the apply issues listed above, there is not going to be any parallel infra development anyway
- Wait, but this leads to a semi-applied, possibly broken state of my infrastructure!
  - Yes, it does. But from the infrastructure point of view it does not matter whether apply was called by CI from the main branch or by me from a branch. The difference is that on a branch I can iterate faster to fix stuff
  - And also, as pointed out below, I would not do this for critical environments

Business-critical infra

- Typically production
- Here it is generally desired that more sets of eyes sign off on a change before it is even attempted.
- Also, changes made here have usually already been tested in other, less strict envs (by module promotion using Terragrunt, for example)
- Here I would go for apply only after approve/merge

<hr>

- Would you agree that plan-on-branch-apply-after-merge is not the sole correct way of using terra(form/grunt/mate/whatever) in CI?
- Have you implemented some other approach in your company/project?
- Or is there some strategy for dealing with all the issues listed at the beginning which I am missing entirely, so that plan-on-branch-apply-after-merge is a good fit for everything after all?

I hope this is not too abstract and out of scope for this discussion community.
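To make the "main branch only checks that the plan is empty" idea above a bit more concrete, here is a rough sketch of such a CI check. It relies on the `-detailed-exitcode` flag of `terraform plan` (exit status 0 = no changes, 1 = error, 2 = changes present). The function names and the `workdir` argument are made up for illustration, and the sketch assumes the `terraform` binary is on the PATH of the CI runner.

```python
import subprocess

# Exit statuses of `terraform plan -detailed-exitcode`:
#   0 = plan succeeded, no changes
#   1 = plan failed with an error
#   2 = plan succeeded, changes are present
PLAN_STATUS = {0: "clean", 1: "error", 2: "drift"}


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit status to a label; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def check_plan_is_empty(workdir: str) -> str:
    """Run the plan in `workdir` (hypothetical path) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return classify_plan_exit_code(result.returncode)


def ci_exit_status(plan_status: str) -> int:
    """The CI job passes only when main matches what is already applied."""
    return 0 if plan_status == "clean" else 1
```

In a pipeline the job would end with something like `sys.exit(ci_exit_status(check_plan_is_empty(".")))`, so any non-empty plan on main shows up as a red build.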
I am looking forward to learning what other people think about this :-)

---

<ins datetime="2022-06-02T07:24:18Z">
  <p><a href="https://support.gruntwork.io/hc/requests/108704">Tracked in ticket #108704</a></p>
</ins>
Our thoughts on this are laid out in the Core Concepts section of [this guide](https://docs.gruntwork.io/guides/build-it-yourself/pipelines/). Specifically, the relevant section starts at [Types of Infrastructure Code](https://docs.gruntwork.io/guides/build-it-yourself/pipelines/core-concepts/types-of-infrastructure-code).

The key insight here is that, as much as possible, you want robustly tested Infrastructure Modules that run through an apply-validate-destroy cycle to ensure the component can be deployed correctly. The result acts as a tested module artifact that you can deploy with confidence by the time it is rolled out to your existing environments. This is similar to the "golden image" concept of AMIs in immutable infrastructure patterns.

Basically, you should have thoroughly tested your infrastructure in sandbox environments with [automated testing](https://www.infoq.com/presentations/automated-testing-terraform-docker-packer/) even before you get to rolling it out live. Rolling out live should be more a matter of rolling out a "golden image" of the infra code (a "golden terraform module").

Note that although live infra config doesn't have testing, you can simulate testing by using a promotion workflow in this model. That is, you can promote the "golden terraform module" from `dev`, to `stage`, to `prod`, validating what's deployed along the way.
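As a rough illustration of the apply-validate-destroy cycle described above, the sketch below drives the Terraform CLI from Python. It is not taken from the linked guide: the helper names (`terraform_runner`, `apply_validate_destroy`) and the `validate` callback are hypothetical, and dedicated test libraries such as Terratest (Go) cover the same cycle with far more features. It assumes the module under test lives in a sandbox account where creating and destroying resources is safe.

```python
import subprocess
from typing import Callable, List

# A runner takes terraform arguments, e.g. ["apply", "-auto-approve"].
Runner = Callable[[List[str]], None]


def terraform_runner(workdir: str) -> Runner:
    """Return a runner that executes terraform subcommands in `workdir`."""
    def run(args: List[str]) -> None:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(["terraform", *args], cwd=workdir, check=True)
    return run


def apply_validate_destroy(run: Runner, validate: Callable[[], None]) -> None:
    """Deploy the module, validate it, and always tear it down afterwards."""
    run(["init", "-input=false"])
    try:
        run(["apply", "-auto-approve", "-input=false"])
        # Hypothetical check, e.g. an HTTP request against the deployed service.
        validate()
    finally:
        # Destroy even if apply or validation failed, so the sandbox stays clean.
        run(["destroy", "-auto-approve", "-input=false"])
```

A test would call it as `apply_validate_destroy(terraform_runner("examples/my-module"), my_check)`, where both the path and `my_check` are placeholders. Taking the runner as a parameter keeps the cycle itself testable without touching real infrastructure.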