Deploying EKS control plane to mgmt VPC or app VPC
In following [the guide](https://docs.gruntwork.io/docs/guides/build-it-yourself/kubernetes-cluster/deployment-walkthrough/configure-the-control-plane) for EKS deployment, it was initially unclear whether the control plane was deployed to the app VPC or the management VPC, because the management VPC peering had just been set up. The management VPC does not really seem to be used in the guide, which I think contributed to my confusion, but I've since placed a bastion there.

The docs are also missing a step for establishing peering prior to adding the DNS resolver. (Establishing peering is omitted entirely, but the DNS resolver cannot be created without it, due to errors about the security groups being in different VPCs.)

I initially had an issue where the control plane Terraform never seemed to finish:

```
module.eks_cluster.null_resource.wait_for_api: Still creating... [19m20s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [19m30s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [19m40s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [19m50s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [20m0s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [20m10s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [20m20s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [20m30s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [20m40s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [20m50s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [21m0s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [21m10s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [21m20s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [21m30s elapsed]
module.eks_cluster.null_resource.wait_for_api: Still creating... [21m40s elapsed]
module.eks_cluster.null_resource.wait_for_api (local-exec): [] time="2021-12-21T14:45:48-05:00" level=warning msg="Error retrieiving info from endpoint: Head \"https://REDACTED.eks.amazonaws.com\": dial tcp REDACTED:443: connect: operation timed out" name=kubergrunt
module.eks_cluster.null_resource.wait_for_api (local-exec): [] time="2021-12-21T14:45:48-05:00" level=warning msg="Marking api server as not ready" name=kubergrunt
module.eks_cluster.null_resource.wait_for_api (local-exec): [] time="2021-12-21T14:45:48-05:00" level=warning msg="EKS cluster arn:aws:eks:REDACTED:cluster/REDACTED Kubernetes api server is not active yet" name=kubergrunt
module.eks_cluster.null_resource.wait_for_api (local-exec): [] time="2021-12-21T14:45:48-05:00" level=info msg="Waiting for 15s..." name=kubergrunt
module.eks_cluster.null_resource.wait_for_api: Still creating... [21m50s elapsed]
```

It turned out I needed to install `kubergrunt`. Having done that, I still got errors:

```
ERROR: Get
"https://REDACTED.eks.amazonaws.com/apis/apps/v1/namespaces/kube-system/daemonsets/kube-proxy":
dial tcp REDACTED:443: i/o timeout
```

The issue here is that the guide says to make the API endpoint private, but the templates require a public endpoint in order to terraform the cluster.

I have since completed the guide, but our nodes are not registered with the cluster. We're a little disappointed by how much the documentation and guides have diverged from the recent modules; while we've been able to figure things out, there's been a significant time investment to get everything working properly. The above are just some of the issues we've run into. It'd be very helpful to keep the docs and guides in sync with the modules.

```
[] INFO[2021-12-27T21:45:51-05:00] Not all nodes are registered yet  name=kubergrunt
[] INFO[2021-12-27T21:45:51-05:00] Waiting for 15s...  name=kubergrunt
[] INFO[2021-12-27T21:46:06-05:00] Checking if nodes ready  name=kubergrunt
[] INFO[2021-12-27T21:46:06-05:00] Not all nodes are registered yet  name=kubergrunt
[] INFO[2021-12-27T21:46:06-05:00] Waiting for 15s...  name=kubergrunt
[] INFO[2021-12-27T21:46:21-05:00] Checking if nodes ready  name=kubergrunt
[] INFO[2021-12-27T21:46:21-05:00] Not all nodes are registered yet  name=kubergrunt
[] INFO[2021-12-27T21:46:21-05:00] Waiting for 15s...  name=kubergrunt
[] INFO[2021-12-27T21:46:36-05:00] Checking if nodes ready  name=kubergrunt
[] INFO[2021-12-27T21:46:36-05:00] Not all nodes are registered yet  name=kubergrunt
```

Do we need to provision additional IAM roles and set the mapping in the cluster in order for the nodes to be registered? Do we need to run some script? Did the registration script invoked in the `user-data` fail to run? The docs do not address these issues or how to proceed. What should we be checking for node registration issues?

r:terraform-aws-eks
Hello, apologies for the frustration and challenges with using the guide. We are aware of how out of date the guide is and intend to overhaul both the guide contents and our process to ensure it stays up to date.

Regarding the issues with node registration: you should not need to do anything beyond making sure the worker ASG IAM roles are included in the `eks_worker_iam_role_arns` attribute in the call to the `eks-k8s-role-mapping` module. I suspect there were some issues with the IAM role mapping creation when you ran into the private API endpoint problems. I would check the following to troubleshoot this issue:

- Inspect the `aws-auth` ConfigMap to make sure it includes the worker IAM role. You can use `kubectl` to retrieve the ConfigMap directly from the cluster: `kubectl describe configmap aws-auth -n kube-system`.
- If the ConfigMap is correct, then SSH into the running nodes and inspect the `kubelet` logs for more info. You should be able to find the error logs in syslog or `/var/log/messages` (e.g., try running `sudo tail /var/log/messages | grep kubelet`). This should give you some insight into what is causing the issue.

---

If you still have issues deploying with the guide, you can try provisioning the cluster using an alternative approach. A recommended alternative to the guide is [our Service Catalog module](https://github.com/gruntwork-io/terraform-aws-service-catalog/tree/master/modules/services/eks-cluster). The Service Catalog module offers less configuration freedom, since you are relying on prebuilt `infrastructure-modules` modules, but it may work better as a starting point. You can deploy using the Service Catalog by doing the following:

1. Build the AMI using the provided [packer template](https://github.com/gruntwork-io/terraform-aws-service-catalog/blob/master/modules/services/eks-workers/eks-node-al2.pkr.hcl).
   To do so, git clone the service catalog repo and run `cd modules/services/eks-workers && packer build -var="version_tag=v0.68.7" -var="service_catalog_ref=v0.68.7" -var="aws_region=YOUR_AWS_REGION" eks-node-al2.pkr.hcl`. Note that you may want to pass in additional `-var` inputs depending on your needs.

2. Use the following updated Terragrunt config, with all the `<>` placeholders replaced with the real values for your environment:

```hcl
terraform {
  source = "git::git@github.com:gruntwork-io/terraform-aws-service-catalog.git//modules/services/eks-cluster?ref=v0.68.7"
}

include {
  path = find_in_parent_folders()
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "<YOUR_AWS_REGION>"
}
EOF
}

inputs = {
  cluster_name                  = "eks-stage"
  cluster_instance_keypair_name = "stage-services-us-east-1-v1"

  vpc_id                       = "<APP_VPC_ID>"
  control_plane_vpc_subnet_ids = ["<LIST_OF_PRIVATE_APP_SUBNET_IDS>"]

  allow_inbound_api_access_from_cidr_blocks = ["0.0.0.0/0"]
  allow_private_api_access_from_cidr_blocks = [
    "<CIDR_BLOCK_OF_APP_VPC>",
    "<CIDR_BLOCK_OF_MGMT_VPC>",
  ]
  endpoint_public_access = true # Set to false for a private API endpoint

  # Fill in the ID of the AMI you built from the Packer template
  cluster_instance_ami = "<AMI_ID>"

  # Set the max size to double the min size so the extra capacity can be used to do a zero-downtime deployment of
  # updates to the EKS cluster nodes (e.g., when you update the AMI). For docs on how to roll out updates to the
  # cluster, see:
  # https://github.com/gruntwork-io/terraform-aws-eks/tree/master/modules/eks-cluster-workers#how-do-i-roll-out-an-update-to-the-instances
  autoscaling_group_configurations = {
    asg = {
      min_size          = 3
      max_size          = 6
      asg_instance_type = "t2.small"
      subnet_ids        = ["<LIST_OF_PRIVATE_APP_SUBNET_IDS>"]
    }
  }

  # If your IAM users are defined in a separate AWS account (e.g., in a security account), pass in the ARN of an IAM
  # role in that account that ssh-grunt on the worker nodes can assume to look up IAM group membership and public
  # SSH keys
  external_account_ssh_grunt_role_arn = "arn:aws:iam::111122223333:role/allow-ssh-grunt-access-from-other-accounts"

  # Configure your role mappings
  iam_role_to_rbac_group_mappings = {
    # Give anyone using the full-access IAM role admin permissions
    "arn:aws:iam::444444444444:role/allow-full-access-from-other-accounts" = ["system:masters"]

    # Give anyone using the developers IAM role developer permissions. Kubernetes will automatically create this
    # group if it doesn't exist already, but you're still responsible for binding permissions to it!
    "arn:aws:iam::444444444444:role/allow-dev-access-from-other-accounts" = ["developers"]
  }
}
```

Note that, like the guide, you will want to deploy with `endpoint_public_access = true` first and then switch to `endpoint_public_access = false`, due to the network access issues you ran into. Alternatively, you can deploy through a VPN connection that allows you to VPN into the mgmt VPC.

---

Side note: I believe you can now deploy the VPC without the DNS resolvers. This used to be a requirement for accessing the Kubernetes API endpoint on EKS clusters with private access over a VPC peering connection, but as far as I know, AWS has since updated the networking infrastructure so it is no longer needed.
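(To make the two-phase endpoint rollout mentioned above concrete, the only input that changes between the first and second `terragrunt apply` is the endpoint flag; this is just a sketch against the config above:)

```hcl
# Phase 1: leave the API endpoint public so Terraform (and kubergrunt) can reach
# the cluster from your workstation while the cluster is being created.
endpoint_public_access = true

# Phase 2: once you have private connectivity to the cluster (VPC peering or a
# VPN into the mgmt VPC), flip the flag and re-apply to lock the endpoint down.
# endpoint_public_access = false
```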
The only reason I mention the DNS resolvers is that they can add up to be quite pricey (approximately $500/month), so you may want to consider removing them if you are on a tight budget.

Alternatively, you can consider omitting the mgmt VPC altogether and deploying the bastion/VPN server into the public subnets of the app VPC. The `mgmt` VPC architecture is most useful/recommended if you intend to have more than one VPC for your applications; otherwise, it can be unnecessary overhead. It is fairly straightforward to introduce one after the fact, so you may want to consider a single-VPC architecture if you don't have the networking needs.
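P.S. As a reference point for the node-registration troubleshooting above: when the role mapping has been applied correctly, the `aws-auth` ConfigMap should contain a `mapRoles` entry for each worker IAM role, roughly like the following (the role ARN and name here are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::444444444444:role/<WORKER_ASG_IAM_ROLE_NAME>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

If the worker role ARN is missing from `mapRoles`, the kubelet on the nodes will not be authorized to join the cluster, which matches the "Not all nodes are registered yet" symptom you saw.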