
Reference Architecture: Handling CI/CD and EKS deployments

Answer

We already have the Reference Architecture installed with EKS, but we suspect we will run into trouble in the future.

1 - Trying to understand the recent infrastructure-pipeline CI/CD better: we see that a change to envcommon (for all stages) is supposed to trigger an `apply`. But if we have multiple changes in one PR (e.g. adding an EKS node group plus application changes such as Helm affinity settings) and only one of them fails (simple Helm changes seem to take too long, so even when the change goes through, a timeout occurs and makes the `apply` fail), we are stuck in a state where the code is in the architecture repo but the changes are not applied. I saw it was suggested to keep changes as small as possible (https://github.com/gruntwork-io/terragrunt/issues/720), but that does not seem to fit a fast test-and-deploy workflow, since a single CI run takes at least 20 minutes (Deploy Runner boot-up is only 2 minutes; the rest is the plan and deploy steps, excluding AWS timeouts and waits).

2 - Terraform does state locking during plan and apply. If two CI deployments (even for independent modules) run close to each other in time (think of two K8s services updating their ECR Docker images at the same time), one of them will likely fail. How can we solve this (using GitLab CI)? Can the Deploy Runner wait for the previous run to finish (it seems to be stateless, with no backing DB)? This may not be a big issue for 1-2 services, but it will be once there are tens of Docker images using multiple ECR repositories.

3 - Related to the second question: we mainly use Docker images for multiple applications (frontend + backend with a public side, or backend-only within the VPC), and they may all have different CPU/memory and even latency requirements, and may need specific affinities for targeting node groups. Wrapping each of them in `k8s-service` limits our options, since they may be used as plain Helm charts, and instead of another wrapper we may want to use YAML files directly.
Is there a proper/suggested way to better organize multiple Helm deployments (instead of relying on `override_chart_inputs`, which is not guaranteed to convert to proper YAML)? Thanks in advance. --- <ins datetime="2022-05-06T09:18:37Z"> <p><a href="https://support.gruntwork.io/hc/requests/108560">Tracked in ticket #108560</a></p> </ins>

> 1 - (e.g. simple Helm changes seem to take too long, so even when the change goes through, a timeout occurs and makes the `apply` fail)

You can tweak the waiting behavior of `helm` deployments using the `wait` and `wait_timeout` input variables of the `k8s-service` module: https://github.com/gruntwork-io/terraform-aws-service-catalog/blob/master/modules/services/k8s-service/variables.tf#L570-L582. This may help stabilize your deployments.

> we are stuck in a state where the code is in the architecture repo but the changes are not applied.

If a deployment fails, the idea is that something needs to change (either the cloud or the code). The resolution path will differ depending on the nature of the error. In this case, you probably want to retry the CI job so that it attempts to `apply` the code again and reach steady state. Note that you can also have `terragrunt` automatically retry on errors using the `retryable_errors` attribute in the config: https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#retryable_errors. Basically, even if you skip the error, you will still have undeployed code because the `helm_release` object will be tainted in `terraform`, so at a minimum a retry is necessary.

> 2 - Terraform does state locking during plan and apply. If two CI deployments (even for independent modules) run close to each other in time (think of two K8s services updating their ECR Docker images at the same time), one of them will likely fail. How can we solve this (using GitLab CI)?
>
> Can the Deploy Runner wait for the previous run to finish (it seems to be stateless, with no backing DB)?
>
> This may not be a big issue for 1-2 services, but it will be once there are tens of Docker images using multiple ECR repositories.

Unfortunately, we don't have the ability to bake locking mechanisms into the ECS Deploy Runner. I filed https://github.com/gruntwork-io/terraform-aws-ci/issues/440 to track this feature request.
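As a concrete sketch, a `terragrunt.hcl` for one `k8s-service` deployment could combine both suggestions: a longer Helm wait timeout via the module's `wait`/`wait_timeout` inputs, and automatic retries on transient errors via `retryable_errors`. The specific values, error regexes, and version ref below are illustrative assumptions, not recommendations:

```hcl
# terragrunt.hcl for a single k8s-service deployment (values are illustrative)
terraform {
  source = "git::git@github.com:gruntwork-io/terraform-aws-service-catalog.git//modules/services/k8s-service?ref=<VERSION>"
}

# Re-run terraform automatically when its output matches one of these regexes.
retryable_errors = [
  "(?s).*Error acquiring the state lock.*",      # another deployment holds the state lock
  "(?s).*timed out waiting for the condition.*", # helm release wait timeout
]

inputs = {
  # Wait for the Helm release to become ready, but allow more time than the
  # default before declaring the apply failed (timeout assumed to be seconds).
  wait         = true
  wait_timeout = 600

  # ... the rest of the service's inputs (image, replicas, etc.) ...
}
```

Whether a retry on the Helm timeout is safe depends on the failure: if the release is merely slow, a retry converges, but if the pods can never become healthy, the retry will fail again and the underlying issue still needs fixing.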
In the meantime, as a workaround, you can probably handle this in `terragrunt` using the `retryable_errors` mechanism mentioned above. That is, you can have terragrunt automatically retry when it gets the "could not obtain lock" error from terraform. With that said, this should only be an issue if you have many commits changing the same service. In that situation, there is a greater risk that one change will unintentionally undo another, even with waiting involved. I recommend rearchitecting your terragrunt code to minimize overlapping changes as much as possible.

> 3 - Related to the second question: we mainly use Docker images for multiple applications (frontend + backend with a public side, or backend-only within the VPC), and they may all have different CPU/memory and even latency requirements, and may need specific affinities for targeting node groups.
>
> Wrapping each of them in `k8s-service` limits our options, since they may be used as plain Helm charts, and instead of another wrapper we may want to use YAML files directly.
>
> Is there a proper/suggested way to better organize multiple Helm deployments (instead of relying on `override_chart_inputs`, which is not guaranteed to convert to proper YAML)?

I am a bit confused as to what you want to accomplish here, but assuming you want to deploy the different services as a single unit, the best approach would be to define your own service module that makes the necessary calls to `k8s-service` for each of your services.
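As a rough sketch of that wrapper idea (the module path, variable names, and input shapes below are assumptions for illustration, not the module's actual interface), you could drive one `k8s-service` instance per entry in a map of per-service settings using `for_each`:

```hcl
# modules/my-services/main.tf -- hypothetical wrapper around k8s-service

variable "services" {
  description = "Map from service name to its per-service settings."
  type = map(object({
    # Per-service knobs: image, resource requirements, node affinity, etc.
    # The exact attributes here are placeholders for whatever your
    # services need to vary.
    settings = map(any)
  }))
}

module "service" {
  source   = "git::git@github.com:gruntwork-io/terraform-aws-service-catalog.git//modules/services/k8s-service?ref=<VERSION>"
  for_each = var.services

  # Map each service's settings onto the k8s-service inputs here: the
  # application name, container image, replica count, and any affinity or
  # resource overrides. Consult the module's variables.tf for the real
  # input names and types.
  application_name = each.key
  # ...
}
```

This keeps each service's configuration as plain HCL data (which Terraform serializes for you) rather than hand-written YAML overrides, while still deploying the whole set as one unit.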