I received an email from AWS regarding shared NAT Gateways, what should I do?
I received the following email from AWS regarding NAT Gateways. As a Reference Architecture customer and Gruntwork VPC module user, I would like to know if there is anything I need to do.

> We have observed that your Amazon VPC resources are using a shared NAT Gateway across multiple Availability Zones (AZ). To ensure high availability and minimize inter-AZ data transfer costs, we recommend utilizing separate NAT Gateways in each AZ and routing traffic locally within the same AZ. Each NAT Gateway operates within a designated AZ and is built with redundancy in that zone only. As a result, if the NAT Gateway or AZ experiences failure, resources utilizing that NAT Gateway in other AZ(s) also get impacted. Additionally, routing traffic from one AZ to a NAT Gateway in a different AZ incurs additional inter-AZ data transfer charges.
>
> We recommend choosing a maintenance window for architecture changes in your Amazon VPC.
>
> The following is a list of your VPCs and NAT Gateways that are shared across AZ(s), in the format: ‘VPC | NAT Gateway’:
>
> vpc-xxxxxx | nat-xxxxxx
>
> Please refer to the AWS public documentation on how to create a NAT Gateway [1], and how to configure routes for different NAT Gateway use cases [2]. Should you have any questions or concerns, please reach out to the AWS Support team [3].
>
> [1] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html#nat-gateway-working-with
>
> [2] https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-scenarios.html
>
> [3] https://aws.amazon.com/support

---

<ins datetime="2023-03-10T19:52:39Z">
  <p><a href="https://support.gruntwork.io/hc/requests/109981">Tracked in ticket #109981</a></p>
</ins>

AWS recently sent an email notification about shared NAT Gateways in VPCs, recommending a separate NAT Gateway in each Availability Zone (AZ) to ensure high availability and minimize inter-AZ data transfer costs.

### How serious is this?

There is no security implication, no underlying module bug, and nothing critical here. If you received the notification, AWS identified a way to make your network configuration more resilient, and we agree with their recommendation.

The main downside of not following the recommendation is that your apps might become unavailable if just one AZ fails (the one hosting the shared NAT Gateway). With a NAT Gateway in each AZ, multiple AZs would have to fail before your apps become unavailable. The net effect of the change is higher AWS costs in exchange for more resilience.

### Gruntwork Reference Architecture Customers

We found a sub-optimal configuration in the default Reference Architecture: the `vpc-app` module was deployed with the `num_nat_gateways` variable set to 1, even for a production environment. It should have been set to 3. We therefore recommend updating the prod environment of your Reference Architecture to use multiple NAT Gateways for `vpc-app`.

To make this change, modify the `num_nat_gateways` input in your Terragrunt configuration to deploy the desired number of NAT Gateways for your application VPC. We recommend at least 3 NAT Gateways, or one per AZ. The Gruntwork VPC Module automatically distributes the requested number of NAT Gateways across the available AZs and configures your private app subnet route tables accordingly. Always test and validate any changes before deploying to production environments.

The module itself does not have a bug. The mistake was deploying a Reference Architecture with the wrong value for `num_nat_gateways`.

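
As a rough illustration of the Terragrunt change described above, the edit might look like the following excerpt. This is a minimal sketch, not your actual file: it assumes the configuration has a top-level `inputs` block that already sets `num_nat_gateways`, and every other input is omitted here.

```hcl
# Excerpt of a Terragrunt configuration for the application VPC.
# Assumption: only the num_nat_gateways line changes; all other
# existing inputs stay exactly as they are in your file.
inputs = {
  # Previously 1 (a single NAT Gateway shared by every AZ).
  # 3 assumes the VPC spans three AZs, giving one NAT Gateway per AZ.
  num_nat_gateways = 3
}
```

Running `terragrunt plan` in a lower environment first lets you review the NAT Gateways and route changes Terraform proposes before applying anything to prod.
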
We agree with Amazon's recommendation, and one NAT Gateway per AZ is a best practice. Without it, your apps might become unavailable if just one AZ fails, rather than only when multiple AZs fail. The change costs more but buys more resilience.

Find the relevant application VPC Terragrunt configuration in the `inputs` block of `_envcommon/networking/vpc-app.hcl`. If you also want to add NAT Gateways to your management VPC, find the relevant Terragrunt configuration in the `inputs` block of `_envcommon/mgmt/vpc-mgmt.hcl`.

We tested adding NAT Gateways and found that Terraform uses the ReplaceRoute API to update the route table rules. We observed no downtime or network interruptions while the route table rules were being updated.

As AWS mentioned in the email, choose a maintenance window for architecture changes in your VPC. Always test and validate changes in lower environments before deploying to production.

**Also see:**

* [Gruntwork Docs: Making changes to your infrastructure: Terragrunt](https://docs.gruntwork.io/reference/services/intro/make-changes-to-your-infrastructure/#making-changes-to-terragrunt-code)
* [Gruntwork Docs: `num_nat_gateways`](https://docs.gruntwork.io/reference/modules/terraform-aws-vpc/vpc-app/#num_nat_gateways)

### Gruntwork VPC Module Users

To configure additional NAT Gateways in a VPC created with the Gruntwork VPC Module, specify the number of gateways to create with the `num_nat_gateways` input. The VPC module automatically distributes the requested NAT Gateways across the available AZs and configures your private subnet route table rules accordingly.

We tested adding NAT Gateways and found that Terraform uses the ReplaceRoute API to update the route table rules. We observed no downtime or network interruptions while the route table rules were being updated.

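
For module users calling `vpc-app` directly from Terraform, the change might look like the sketch below. The module label `vpc_app`, the `<VERSION>` placeholder, and the source URL are illustrative assumptions; keep the source and version your code already uses, along with all of your existing inputs.

```hcl
module "vpc_app" {
  # Assumed source for illustration; keep the source and pinned
  # version that your code already references.
  source = "git::git@github.com:gruntwork-io/terraform-aws-vpc.git//modules/vpc-app?ref=<VERSION>"

  # ... your existing inputs stay unchanged ...

  # One NAT Gateway per AZ; 3 assumes the VPC spans three AZs.
  num_nat_gateways = 3
}
```

Reviewing the output of `terraform plan` before applying shows the additional NAT Gateways and the route table updates that will be made.
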
As AWS mentioned in the email, choose a maintenance window for architecture changes in your VPC. Always test and validate changes in lower environments before deploying to production.

**Also see:**

* [Gruntwork Docs: Making changes to your infrastructure: Terraform](https://docs.gruntwork.io/reference/services/intro/make-changes-to-your-infrastructure/#making-changes-to-vanilla-terraform-code)
* [Gruntwork Docs: `num_nat_gateways`](https://docs.gruntwork.io/reference/modules/terraform-aws-vpc/vpc-app/#num_nat_gateways)