Graceful Node Termination: Managing Node Shutdown with Node Termination Handler

Our customer, a leading fintech business in the micro banking sector, recently migrated their infrastructure to Amazon Elastic Kubernetes Service (EKS) for managing their containerized applications. As their business expanded rapidly, they faced challenges during node termination in AWS EKS, resulting in service disruptions and data loss.

After thorough research and analysis, our team discovered that a Node Termination Handler (NTH) could effectively handle the graceful shutdown of nodes in AWS EKS. The NTH would monitor node health, ensure proper pod eviction, and coordinate load balancer redirection for a seamless transition.

What is NTH?

Graceful node termination with an NTH is used when you want to ensure a smooth, controlled shutdown process for nodes in a Kubernetes cluster.

The AWS NTH project ensures that the Kubernetes control plane responds appropriately to events that can make your EC2 instances unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG Scale-In, ASG AZ Rebalance, and EC2 instance termination via the API or console.

Advantages:

  • The NTH allows for the graceful termination of Kubernetes nodes.
  • With the NTH, you have the flexibility to define custom termination logic specific to your application needs.
  • By gracefully terminating pods and ensuring they complete their tasks, the NTH helps maintain high application availability.

Disadvantages:

  • Implementing and managing the NTH adds an extra layer of complexity to your Kubernetes environment.
  • The NTH may consume additional resources, such as CPU and memory, on each node to execute the termination logic.
  • Proper configuration and maintenance of the NTH are crucial to ensure its effectiveness.
  • The NTH may rely on external services, such as the Instance Metadata Service (IMDS), to retrieve information about the node termination event.

The aws-node-termination-handler (NTH) can operate in two different modes:

Instance Metadata Service (IMDS)

In IMDS mode, the termination handler is deployed as a Kubernetes DaemonSet. The chart installs a ServiceAccount, ClusterRole, ClusterRoleBinding, and DaemonSet into your cluster; all four of these Kubernetes constructs are required for the termination handler to run properly.
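Once installed, you can verify that all four constructs exist. A quick check, assuming the chart's standard labels and the kube-system namespace:

kubectl get daemonset,serviceaccount -n kube-system -l app.kubernetes.io/name=aws-node-termination-handler
kubectl get clusterrole,clusterrolebinding -l app.kubernetes.io/name=aws-node-termination-handler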

In this mode, NTH monitors the EC2 instance metadata for Spot Instance termination notifications, scheduled maintenance events, and instance rebalance recommendations.
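These are the same metadata paths you can query by hand from a node. For example, a minimal IMDSv2 check for a Spot interruption notice (the endpoint returns a 404 until a termination notice has actually been issued):

# Fetch an IMDSv2 session token, then query the Spot instance-action path
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action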

Helm install

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining="true" \
  --set enableRebalanceMonitoring="true" \
  --set enableScheduledEventDraining="false" \
  oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler --version $CHART_VERSION

The enable* configuration flags above enable or disable IMDS monitoring paths.

Running Only On Specific Nodes:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set nodeSelector.lifecycle=spot \
  oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler --version $CHART_VERSION

Webhook Configuration:

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set webhookURL=https://hooks.slack.com/services/YOUR/SLACK/URL \
  oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler --version $CHART_VERSION

Queue Processor

The queue-processor workflow consists of the following high-level steps:

  • The Auto Scaling EC2 instance-terminate event is sent to the SQS queue.
  • The NTH Pod monitors the SQS queue for new messages.
  • The NTH Pod receives each new message and does the following (approximated as manual commands in the sketch after this list):
      - Cordons the node so that no new pods are scheduled on it.
      - Drains the node so that the existing pods are evicted.
      - Sends a lifecycle hook signal to the Auto Scaling group so that the node can be terminated.
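For illustration, here is roughly what those steps look like as manual commands. This is a sketch only; NTH performs the equivalent through the Kubernetes and AWS APIs, and the node name, hook name, ASG name, and instance ID below are hypothetical:

# Cordon: mark the node unschedulable so no new pods land on it
kubectl cordon ip-10-0-1-23.ec2.internal
# Drain: evict the existing pods (DaemonSet pods stay in place)
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data
# Complete the lifecycle hook so the ASG can proceed with termination
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name my-k8s-term-hook \
  --auto-scaling-group-name my-k8s-asg \
  --lifecycle-action-result CONTINUE \
  --instance-id i-0123456789abcdef0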

You will need the following AWS infrastructure components:

  • Amazon Simple Queue Service (SQS) Queue
  • AutoScaling Group Termination Lifecycle Hook
  • Amazon EventBridge Rule
  • IAM Role for the aws-node-termination-handler Queue Processing Pods

Prerequisites:

  • Create the IAM policy for NTH with the permissions below.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:CompleteLifecycleAction",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeTags",
                "ec2:DescribeInstances",
                "sqs:DeleteMessage",
                "sqs:ReceiveMessage"
            ],
            "Resource": "*"
        }
    ]
}
  • Attach the IAM policy to the IAM role you created, and add the trust policy given below.
  • Update the trust policy with your Account ID, Region, and OpenID Connect provider URL.
  • Ensure the IAM role’s ARN is added as an annotation on the service account in values-aws.yaml (see the CLI sketch after the trust policy).
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<Account ID>:oidc-provider/<OIDC URL>"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "<OIDC URL>:sub": "system:serviceaccount:kube-system:<Service Account Name>",
                    "<OIDC URL>:aud": "sts.amazonaws.com"
                }
            }
        }
    ]
}
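As a rough sketch, the policy and role can be created with the AWS CLI; the policy name, role name, and file names below are hypothetical placeholders:

# Create the permissions policy from the JSON shown earlier
aws iam create-policy \
  --policy-name nth-queue-processor-policy \
  --policy-document file://nth-policy.json

# Create the role with the trust policy above, then attach the permissions policy
aws iam create-role \
  --role-name nth-queue-processor-role \
  --assume-role-policy-document file://nth-trust-policy.json
aws iam attach-role-policy \
  --role-name nth-queue-processor-role \
  --policy-arn arn:aws:iam::<Account ID>:policy/nth-queue-processor-policy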
  • Create an SQS Queue: Create an SQS queue with the access policy below (see the create-queue sketch after the policy).
{
  "Version": "2012-10-17",
  "Id": "__default_policy_ID",
  "Statement": [
    {
      "Sid": "__owner_statement",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "sqs.amazonaws.com",
          "events.amazonaws.com"
        ]
      },
      "Action": "SQS:*",
      "Resource": "SQS ARN"
    }
  ]
}
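One way to create the queue with this policy attached. A sketch, where the queue name is hypothetical and queue-attributes.json carries the policy JSON above as an escaped string under the "Policy" key:

aws sqs create-queue \
  --queue-name MyK8sTermQueue \
  --attributes file://queue-attributes.json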

Update the SQS queue URL in values-aws.yaml as queueURL: <SQS Queue URL>.

  • Create an ASG Termination Lifecycle Hook

Here is the AWS CLI command to create a termination lifecycle hook on an existing ASG when using EventBridge.

Update --auto-scaling-group-name=my-k8s-asg with the name of your existing ASG.

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name=my-k8s-term-hook \
  --auto-scaling-group-name=my-k8s-asg \
  --lifecycle-transition=autoscaling:EC2_INSTANCE_TERMINATING \
  --default-result=CONTINUE \
  --heartbeat-timeout=300

  • Tag the Instances

By default, the aws-node-termination-handler only manages terminations for instances tagged with the key aws-node-termination-handler/managed; the value of the key does not matter. To tag an ASG and propagate the tag to its instances:

Update ResourceId=my-auto-scaling-group with your Auto Scaling group name.

aws autoscaling create-or-update-tags \
  --tags ResourceId=my-auto-scaling-group,ResourceType=auto-scaling-group,Key=aws-node-termination-handler/managed,Value=,PropagateAtLaunch=true

To tag an individual EC2 instance:

Make sure to update --resources with your instance ID.

aws ec2 create-tags \
    --resources i-12**********f0 \
    --tags 'Key="aws-node-termination-handler/managed",Value='

  • Create Amazon EventBridge Rules

Create Amazon EventBridge rules so that ASG termination events, Spot Interruptions, Instance state changes, Rebalance Recommendations, and AWS Health Scheduled Changes are sent to the SQS queue.

Update the SQS ARN in --targets "Id"="1","Arn"="<SQS ARN>".

$ aws events put-rule \
  --name MyK8sASGTermRule \
  --event-pattern "{\"source\":[\"aws.autoscaling\"],\"detail-type\":[\"EC2 Instance-terminate Lifecycle Action\"]}"

$ aws events put-targets --rule MyK8sASGTermRule \
  --targets "Id"="1","Arn"="<SQS ARN>"

$ aws events put-rule \
  --name MyK8sSpotTermRule \
  --event-pattern "{\"source\":[\"aws.ec2\"],\"detail-type\":[\"EC2 Spot Instance Interruption Warning\"]}"

$ aws events put-targets --rule MyK8sSpotTermRule \
  --targets "Id"="1","Arn"="<SQS ARN>"

$ aws events put-rule \
  --name MyK8sRebalanceRule \
  --event-pattern "{\"source\":[\"aws.ec2\"],\"detail-type\":[\"EC2 Instance Rebalance Recommendation\"]}"

$ aws events put-targets --rule MyK8sRebalanceRule \
  --targets "Id"="1","Arn"="<SQS ARN>"

$ aws events put-rule \
  --name MyK8sInstanceStateChangeRule \
  --event-pattern "{\"source\":[\"aws.ec2\"],\"detail-type\":[\"EC2 Instance State-change Notification\"]}"

$ aws events put-targets --rule MyK8sInstanceStateChangeRule \
  --targets "Id"="1","Arn"="<SQS ARN>"

$ aws events put-rule \
  --name MyK8sScheduledChangeRule \
  --event-pattern "{\"source\":[\"aws.health\"],\"detail-type\":[\"AWS Health Event\"],\"detail\":{\"service\":[\"EC2\"],\"eventTypeCategory\":[\"scheduledChange\"]}}"

$ aws events put-targets --rule MyK8sScheduledChangeRule \
  --targets "Id"="1","Arn"="<SQS ARN>"

Installing the Helm chart

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining="true" \
  -f values-aws.yaml \
  oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler --version $CHART_VERSION
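After the install, you can confirm that the queue-processor deployment is up and watching the queue. A quick check, assuming the default release name:

kubectl get deployment -n kube-system aws-node-termination-handler
kubectl logs -n kube-system deployment/aws-node-termination-handler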

Benefits delivered:

  • Minimized Downtime by 80%: By implementing the NTH, our team reduced the average downtime during node termination from 10 minutes to just 2 minutes. This translates into significant improvements in service availability and customer satisfaction.
  • Seamless User Experience: With the NTH’s load balancer redistribution and graceful shutdown process, our team achieved a seamless user experience during node termination. User requests are efficiently redirected to other healthy nodes, preventing service disruptions and reducing latency by 70%.
  • Enhanced Scalability: The NTH enables the customer team to efficiently scale their infrastructure by automating the replacement node deployment process. This reduced the time required to add new nodes by 50%, enabling faster response to increased service demands.
  • Data Integrity and Protection: Through data replication and persistence, the NTH ensures the integrity and protection of critical data during node termination. Our team achieved a 99.9% reduction in data loss incidents, significantly enhancing data reliability and customer trust.
  • Cost Optimization by 30%: The implementation of the NTH allowed our team to optimize costs associated with node termination. With reduced downtime, improved resource utilization, and faster replacement node deployment, the company achieved a 30% reduction in operational expenses.

If you would like to read more about our work, check out the Case Studies page on our website. To know how we can help you, write to us at sales@cloudifyops.com.
