Our customer in the healthcare industry runs microservices-based applications on Amazon EKS. They had multiple non-production environments dedicated to regression testing, sandboxes, customer demos, and so on. Ensuring high availability of the application while keeping the infrastructure cost-effective were the customer's primary concerns.
To solve this problem, CloudifyOps proposed and implemented a solution using Horizontal Pod Autoscaling, Karpenter, and the AWS Node Termination Handler on EC2 Spot Instances. Our solution saved around 40% of the customer's infrastructure cost.
In this blog, we will walk you through the solution we implemented and how we overcame the challenges.
The entire non-production cluster was running on on-demand instances. Since the number of nodes varied with how heavily each environment was used, the customer could reserve only a base number of instances and ended up paying more for the on-demand ones. To address this, we introduced Spot Instances.
Then came the problem of effective scaling. Since these are Spot Instances, we had to anticipate losing nodes at any time.
Since Spot nodes are ephemeral, we needed a mechanism to gracefully drain the pods before the node disappears and until a replacement node comes up. We used the AWS Node Termination Handler (NTH) to achieve this. AWS sends an interruption warning event two minutes before the actual interruption, and NTH drains the node as soon as it receives the event. We configured NTH in IMDS mode, which polls the EC2 Instance Metadata Service for these events.
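For reference, here is a minimal sketch of the Helm values we would use to run NTH in IMDS mode. This assumes the aws-node-termination-handler chart from the eks-charts repository; the exact keys may vary by chart version.

# values.yaml for the aws-node-termination-handler Helm chart (IMDS mode)
# A sketch under assumptions; key names follow the eks-charts chart and may differ by version
enableSpotInterruptionDraining: true   # drain the node on the 2-minute Spot interruption notice
enableScheduledEventDraining: true     # drain ahead of scheduled EC2 maintenance events
enableRebalanceMonitoring: true       # also react to EC2 rebalance recommendations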
We leveraged Karpenter's autoscaling capability to address this scenario. Karpenter uses provisioners to handle node scaling. In our setup, the provisioner spins up a Spot node by default, using one of the instance types listed in the provisioner configuration. Spot availability varies by Region, Availability Zone, and instance class; when Spot capacity is unavailable, Karpenter falls back to an on-demand node.
# This provisioner provisions general-purpose (t2/t3/t3a) instances
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: general-purpose
spec:
  requirements:
    # Include general-purpose instance types
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - t3.xlarge
        - t3.2xlarge
        - t3a.xlarge
        - t3a.2xlarge
        - t2.xlarge
        - t2.2xlarge
    # Prefer Spot capacity; fall back to on-demand when Spot is unavailable
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
        - on-demand
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: topology.kubernetes.io/zone
      operator: In
      values: [us-east-1a, us-east-1b]
  providerRef:
    name: default
Here, you can see that we list the capacity types as spot and then on-demand, which does the magic for us: Karpenter prefers Spot and falls back to on-demand.
One of the concerns was how to handle node taints. A newly provisioned node should also carry the taints so that only pods with matching tolerations are scheduled onto it. This was handled by adding the taints configuration to the Karpenter provisioner, as shown below.
spec:
  taints:
    - effect: NoSchedule
      key: workload/cpu-accelerated
      value: "true"
This makes sure the node comes up tainted. Make sure you create two provisioners: one for general-purpose workloads (nodes with no taints) and another dedicated to the tainted nodes. A matching toleration on the workload side is sketched below.
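For illustration, a workload that should land on these tainted nodes needs a toleration matching the taint above. Here is a minimal pod sketch; the pod name and image are hypothetical, not from the customer's setup.

apiVersion: v1
kind: Pod
metadata:
  name: cpu-accelerated-job   # hypothetical workload name
spec:
  tolerations:
    # Matches the NoSchedule taint applied by the dedicated provisioner
    - key: workload/cpu-accelerated
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
      command: ["sleep", "3600"]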
Karpenter constantly monitors the cluster for any need for nodes. When there is a shortage of nodes, it spins up a Spot node for us. When a Spot interruption happens, NTH identifies it from the instance metadata and starts draining the node. As the node drains, its pods go into a pending state, Karpenter gets notified, and hooyah! we have a new node in place. We kept the EKS managed node group at just two nodes, tainted to accommodate only the critical workloads, including the add-ons; a sketch of such a node group follows.
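As a sketch, a tainted two-node group like this could be declared with an eksctl ClusterConfig similar to the one below. The cluster name, instance type, and taint key are assumptions for illustration, not the customer's actual values.

# eksctl ClusterConfig excerpt (illustrative only)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: nonprod-cluster        # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: critical-workloads
    instanceType: t3a.xlarge   # assumed instance type
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
    taints:
      # Keeps everything except tolerating (critical) workloads off these nodes
      - key: CriticalAddonsOnly   # assumed taint key
        value: "true"
        effect: NoSchedule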
This solution has been running in the customer's environment and working its charm for the past eight months and counting. We were able to bring down the non-production EKS infrastructure cost by 40%. We also reduced the number of Availability Zones (AZs) from three to two to cut inter-AZ data transfer costs; a single AZ would have saved even more, but we had to ensure high availability at the same time.