From basic autoscaling to advanced multi-cluster orchestration - everything you need to know about scaling workloads on Kubernetes
Introduction
Kubernetes has become the de facto standard for container orchestration, but one of its most powerful yet complex capabilities remains autoscaling. As workloads grow increasingly diverse - from stateless microservices to GPU-intensive ML models - understanding the full spectrum of scaling strategies is essential for any platform engineer or DevOps practitioner.
This comprehensive guide covers 17 distinct scaling strategies, their version requirements, use cases, and practical implementation guidance for 2025.
Table of Contents
- Understanding Kubernetes Scaling Fundamentals
- Horizontal Pod Autoscaling (HPA)
- Vertical Pod Autoscaling (VPA)
- Cluster Autoscaling
- Karpenter: Next-Generation Node Provisioning
- Event-Driven Scaling with KEDA
- Scheduled and Cron-Based Scaling
- Serverless and Scale-to-Zero
- Burst Scaling with Virtual Nodes
- Multi-Cluster and Geographic Scaling
- Stateful Workload Scaling
- Batch Processing and Job Scaling
- Streaming Workload Scaling
- GPU and ML Workload Scaling
- Cost-Based and FinOps Scaling
- Pod Priority and Preemption-Based Scaling
- Service Mesh Traffic Scaling
- Advanced Patterns: MPA, Descheduler, and Proportional Scaling
- Spot Instance Scaling
- Version Compatibility Reference
- Choosing the Right Strategy
- Comprehensive Strategy Selection Guide
1. Understanding Kubernetes Scaling Fundamentals
Before diving into specific strategies, it's important to understand the three dimensions of Kubernetes scaling:
| Dimension | What Scales | Tools |
|---|---|---|
| Horizontal | Number of pods | HPA, KEDA, Manual |
| Vertical | Pod resources (CPU/memory) | VPA, In-Place Resize |
| Cluster | Number of nodes | Cluster Autoscaler, Karpenter |
Each dimension addresses different scaling challenges, and modern production environments typically combine multiple approaches.
Kubernetes Version Timeline (2024-2025)
Understanding version history is crucial for planning scaling implementations:
| Version | Release | Key Scaling Features |
|---|---|---|
| v1.27 | Apr 2023 | In-Place Pod Resize (Alpha) |
| v1.30 | Apr 2024 | Enhanced autoscaling responsiveness |
| v1.32 | Dec 2024 | DRA Structured Parameters, Pod-level resources |
| v1.33 | Apr 2025 | In-Place Pod Resize (Beta), HPA Configurable Tolerance |
| v1.35 | Dec 2025 | In-Place Pod Resize (GA), VPA InPlaceOrRecreate (Beta) |
2. Horizontal Pod Autoscaling (HPA)
Minimum Version: v1.1 (basic), v1.23 (recommended for v2 API)
HPA is the most widely used autoscaling mechanism in Kubernetes. It automatically adjusts the number of pod replicas based on observed metrics.
Benefits
| Benefit | Description |
|---|---|
| Native Kubernetes | No external dependencies, built into every cluster |
| Proven & Stable | Battle-tested in production for 10+ years |
| Low Overhead | Minimal resource consumption |
| Flexible Metrics | CPU, memory, custom, and external metrics |
| Easy to Implement | Simple YAML configuration |
Ideal Workloads
| Workload Type | Why HPA Works Well |
|---|---|
| Stateless Web Applications | Easy to add/remove replicas without state concerns |
| REST APIs | Request volume correlates with CPU usage |
| Microservices | Independent scaling of each service |
| Frontend Applications | Traffic-driven scaling |
| GraphQL Endpoints | Query complexity reflected in resource usage |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Traffic correlates with CPU/memory | Highly recommended |
| Stateless application architecture | Highly recommended |
| Need quick, reactive scaling | Recommended |
| Predictable request patterns | Recommended |
| Already using Kubernetes | Default choice |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Stateful applications | Replicas need coordination | VPA, Operators |
| Event-driven workloads | Metrics don't reflect queue depth | KEDA |
| Scale-to-zero needed | HPA minimum is 1 | KEDA, Knative |
| Long-running batch jobs | Jobs complete rather than scale | Kueue, YuniKorn |
| GPU workloads | CPU metrics don't reflect GPU usage | Custom metrics HPA |
API Evolution
```
v1.1  → autoscaling/v1       # CPU only
v1.6  → autoscaling/v2beta1  # Custom metrics
v1.12 → autoscaling/v2beta2  # External metrics, behavior
v1.23 → autoscaling/v2       # GA - USE THIS
v1.26 → v2beta2 REMOVED
```
Important: If you're still using autoscaling/v2beta2, migrate immediately. It was removed in v1.26.
Basic HPA Configuration
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
New in v1.33+: Configurable Tolerance
Historically, HPA used a fixed 10% tolerance globally. This often caused issues:
- Sensitive workloads couldn't scale when needed
- Other workloads oscillated unnecessarily
Starting with Kubernetes v1.33 (alpha) and v1.34 (beta), you can configure tolerance per-HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  behavior:
    scaleUp:
      tolerance: 0.05   # 5% tolerance for scaling up
    scaleDown:
      tolerance: 0.15   # 15% tolerance for scaling down
```
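Under the hood, HPA derives the desired replica count from the ratio of the observed metric to its target, and the tolerance band simply suppresses changes when that ratio is close to 1. A minimal Python sketch of the calculation (function and variable names are illustrative, not part of any Kubernetes API):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Sketch of the HPA core formula: scale by the ratio of observed
    metric to target, skipping changes that fall within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas  # within tolerance: no scaling action
    return math.ceil(current_replicas * ratio)

# 4 pods at 90% CPU against a 70% target -> scale up to 6
print(desired_replicas(4, 90, 70))   # 6
# 4 pods at 72% CPU is inside the default 10% band -> no change
print(desired_replicas(4, 72, 70))   # 4
```

Lowering the tolerance (as the v1.33+ field allows) makes the second case scale as well, which is exactly the trade-off between responsiveness and oscillation described above.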
HPA Best Practices
- Always set resource requests - HPA cannot calculate utilization without them
- Use multiple metrics - Combine CPU with memory or custom metrics
- Configure scale-down behavior - Prevent thrashing with stabilization windows
- Avoid conflicts with VPA - Don't use both on the same metrics
3. Vertical Pod Autoscaling (VPA)
Minimum Version: v1.9 (external), v1.33+ (in-place recommended)
VPA automatically adjusts the CPU and memory requests/limits of pods based on historical usage patterns.
Benefits
| Benefit | Description |
|---|---|
| Right-sizing | Automatically finds optimal resource allocation |
| Cost Reduction | Eliminates over-provisioning (20-50% savings typical) |
| Performance | Prevents OOM kills and CPU throttling |
| Zero Manual Tuning | No guesswork for resource requests |
| In-Place Resize (v1.33+) | No pod restarts needed |
Ideal Workloads
| Workload Type | Why VPA Works Well |
|---|---|
| Databases (PostgreSQL, MySQL) | Consistent memory needs, hard to horizontally scale |
| Caches (Redis, Memcached) | Memory-bound, fixed instance count |
| Message Brokers | Predictable resource patterns |
| Monolithic Applications | Single instance optimization |
| Java/JVM Applications | Heap sizing optimization |
| Long-running Services | Builds accurate usage profile over time |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Stateful applications | Highly recommended |
| Unknown resource requirements | Highly recommended |
| Memory-intensive workloads | Recommended |
| Applications with variable load phases | Recommended |
| Cost optimization priority | Recommended |
| K8s v1.33+ available | Use InPlaceOrRecreate mode |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Already using HPA on same metric | Feedback loops | Choose one or use MPA |
| Ephemeral/short-lived pods | Not enough data | HPA or manual |
| Real-time latency requirements | Pod recreation causes downtime | In-place resize (v1.33+) |
| Rapid scaling needed | VPA is slow to react | HPA |
VPA Update Modes
| Mode | Description | Restart Required | Best For |
|---|---|---|---|
| Off | Recommendations only | No | Analysis, planning |
| Initial | Apply on pod creation | No (new pods) | Gradual rollout |
| Auto | Apply automatically | Yes (recreates) | Non-critical workloads |
| InPlaceOrRecreate | Try in-place first (v1.33+) | Maybe | Production databases |
The In-Place Pod Resize Revolution
One of the most anticipated features in Kubernetes history, In-Place Pod Resize eliminates the need to restart pods when adjusting resources:
| Version | Status | Feature Gate |
|---|---|---|
| v1.27 | Alpha | InPlacePodVerticalScaling=true (manual) |
| v1.28-v1.32 | Alpha | InPlacePodVerticalScaling=true (manual) |
| v1.33 | Beta | Enabled by default |
| v1.35 | GA | Always enabled |
VPA Configuration (v1.33+)
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "InPlaceOrRecreate"  # New in v1.33+
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
```
VPA Limitations
- Cannot run with HPA on the same metrics - This creates feedback loops
- Recommendations based on historical data - May not react fast enough to spikes
- Metrics Server dependency - 15-60 second polling intervals
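The "historical data" point can be made concrete. VPA's recommender keeps a decaying histogram of observed usage and targets a high percentile plus a safety margin; a simplified Python sketch of that idea (percentile choice, margin, and sample data are illustrative, not VPA's actual internals):

```python
def vpa_recommendation(usage_samples, safety_margin=0.15):
    """Sketch of a VPA-style recommender: take roughly the 90th
    percentile of observed usage and add a safety margin. The real
    recommender uses a decaying histogram; this shows the idea only."""
    samples = sorted(usage_samples)
    idx = int(0.9 * (len(samples) - 1))   # ~90th percentile index
    p90 = samples[idx]
    return p90 * (1 + safety_margin)

# Observed CPU usage in millicores, including one spike
cpu_millicores = [120, 150, 180, 200, 210, 220, 240, 260, 300, 350, 900]
print(vpa_recommendation(cpu_millicores))
```

Note how the single 900m spike does not dominate the recommendation; this is also why VPA reacts slowly to sudden sustained load changes.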
4. Cluster Autoscaling
Minimum Version: v1.8
The Cluster Autoscaler adjusts the number of nodes in your cluster based on pending pods and node utilization.
Benefits
| Benefit | Description |
|---|---|
| Infrastructure Elasticity | Automatic node provisioning/termination |
| Cost Optimization | Scale down unused nodes |
| Hands-off Operation | No manual node management |
| Cloud Integration | Works with all major cloud providers |
| Predictable Behavior | Well-understood algorithms |
Ideal Workloads
| Workload Type | Why Cluster Autoscaler Works Well |
|---|---|
| Variable Traffic Applications | Nodes scale with demand |
| Multi-tenant Clusters | Shared infrastructure elasticity |
| Development/Staging Environments | Scale down when unused |
| Traditional Enterprise Apps | Stable, predictable scaling |
| Regulated Industries | Well-audited, mature solution |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Using node groups/ASGs | Required approach |
| Multi-cloud or on-prem | Recommended (supported everywhere) |
| Conservative scaling approach | Recommended |
| Compliance requirements | Recommended (mature, auditable) |
| Using GKE Standard mode | Default choice |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Need sub-minute provisioning | 3-5 minute scale-up time | Karpenter |
| Highly diverse workloads | Node groups are rigid | Karpenter |
| Spot instance optimization | Basic spot handling | Karpenter |
| Bin-packing optimization | Limited consolidation | Karpenter |
| AWS with dynamic requirements | Slower than alternatives | Karpenter |
How It Works
- Scale Up: When pods can't be scheduled due to insufficient resources
- Scale Down: When nodes are underutilized for a configurable period
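The scale-down side of that loop is easy to sketch: a node becomes a removal candidate when the sum of its pods' requests stays below the utilization threshold for the configured unneeded time. A minimal Python illustration (the `Node` structure and values are invented for the example; the real autoscaler also simulates pod rescheduling before removing anything):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_requested: float        # sum of pod CPU requests on this node
    cpu_allocatable: float
    underutilized_minutes: int  # how long utilization has been low

def scale_down_candidates(nodes, threshold=0.5, unneeded_minutes=10):
    """Sketch of the Cluster Autoscaler scale-down check: a node is a
    candidate when requested/allocatable stays below the threshold for
    at least --scale-down-unneeded-time."""
    return [n.name for n in nodes
            if n.cpu_requested / n.cpu_allocatable < threshold
            and n.underutilized_minutes >= unneeded_minutes]

nodes = [Node("node-a", 3.2, 4.0, 0),    # busy
         Node("node-b", 0.8, 4.0, 15),   # idle long enough
         Node("node-c", 1.0, 4.0, 5)]    # idle, but not long enough
print(scale_down_candidates(nodes))   # ['node-b']
```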
Configuration Example (AWS)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=1:10:my-node-group
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --skip-nodes-with-local-storage=false
```
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| --scale-down-delay-after-add | 10m | Wait time after scale-up |
| --scale-down-unneeded-time | 10m | How long node must be underutilized |
| --scale-down-utilization-threshold | 0.5 | Utilization below which node is unneeded |
| --max-node-provision-time | 15m | Maximum time for node to become ready |
5. Karpenter: Next-Generation Node Provisioning
Minimum Version: v1.21 (AWS), expanding to Azure and others
Karpenter represents a paradigm shift in Kubernetes node management. Unlike Cluster Autoscaler, it provisions nodes in seconds rather than minutes.
Benefits
| Benefit | Description |
|---|---|
| Speed | 30-60 second node provisioning |
| Optimal Selection | Chooses best instance type for workload |
| No Node Groups | Eliminates rigid infrastructure |
| Bin-packing | Advanced consolidation reduces waste |
| Spot Native | Built-in interruption handling |
| Multi-arch | Automatic ARM64/AMD64 selection |
Ideal Workloads
| Workload Type | Why Karpenter Works Well |
|---|---|
| Microservices at Scale | Rapid, diverse node needs |
| ML/AI Training | GPU instance optimization |
| Batch Processing | Quick provisioning, cost optimization |
| CI/CD Pipelines | Fast runners, terminate when done |
| Spot-tolerant Workloads | Native interruption handling |
| Multi-arch Applications | Automatic Graviton selection |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| AWS EKS deployment | Highly recommended |
| Need fast scaling (<1 min) | Highly recommended |
| Cost optimization priority | Highly recommended |
| Diverse workload requirements | Recommended |
| Using Spot instances | Recommended |
| Azure AKS (preview) | Consider for new clusters |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| GCP GKE | Not yet supported | Cluster Autoscaler, Autopilot |
| On-premises | Cloud-specific | Cluster Autoscaler |
| Strict node group compliance | No node group concept | Cluster Autoscaler |
| Simple, stable workloads | Over-engineered | Cluster Autoscaler |
Karpenter vs Cluster Autoscaler
| Feature | Cluster Autoscaler | Karpenter |
|---|---|---|
| Provisioning Speed | 3-5 minutes | 30-60 seconds |
| Node Groups | Required | Not needed |
| Instance Selection | Pre-defined | Dynamic/optimal |
| Consolidation | Basic | Advanced bin-packing |
| Spot Handling | Basic | Native interruption handling |
NodePool Configuration
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.large", "m5.xlarge", "m6g.large", "m6g.xlarge"]
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```
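The "optimal selection" behavior a NodePool enables can be sketched as a tiny cost-aware fitting problem: from the allowed offerings, pick the cheapest instance type that fits the pending pods. A simplified Python illustration (the offering list and hourly prices are illustrative placeholders, not a Karpenter API):

```python
def pick_instance(pending_cpu, pending_mem_gi, offerings):
    """Sketch of Karpenter-style provisioning: among allowed offerings,
    choose the cheapest instance type that fits the pending pods."""
    fitting = [o for o in offerings
               if o["cpu"] >= pending_cpu and o["mem_gi"] >= pending_mem_gi]
    if not fitting:
        return None  # nothing fits: a real provisioner would try larger types
    return min(fitting, key=lambda o: o["hourly_usd"])["type"]

offerings = [
    {"type": "m5.large",  "cpu": 2, "mem_gi": 8,  "hourly_usd": 0.096},
    {"type": "m5.xlarge", "cpu": 4, "mem_gi": 16, "hourly_usd": 0.192},
    {"type": "m6g.large", "cpu": 2, "mem_gi": 8,  "hourly_usd": 0.077},
]
print(pick_instance(pending_cpu=1.5, pending_mem_gi=6, offerings=offerings))
```

With both architectures allowed, the cheaper ARM64 (`m6g.large`) wins here, which is the mechanism behind the Graviton savings reported below.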
Real-World Results
Organizations report significant improvements with Karpenter:
- 70% cost reduction through proper rightsizing
- CPU utilization increased from 25% to 70%
- 20% additional savings through ARM64/Graviton instances
6. Event-Driven Scaling with KEDA
Minimum Version: v1.16
KEDA (Kubernetes Event-Driven Autoscaling) is a CNCF graduated project that enables fine-grained autoscaling based on event sources.
Benefits
| Benefit | Description |
|---|---|
| Scale to Zero | No running pods when idle |
| Event-Driven | React to external triggers |
| 70+ Scalers | Extensive integration ecosystem |
| Works with HPA | Extends native autoscaling |
| Lightweight | Single-purpose, low overhead |
| Cost Savings | Pay only when processing |
Ideal Workloads
| Workload Type | Why KEDA Works Well |
|---|---|
| Message Queue Consumers | Scale based on queue depth |
| Kafka Consumers | Consumer lag-based scaling |
| Webhook Handlers | Scale to zero between requests |
| Scheduled Jobs | Cron-based activation |
| Database Triggers | Scale on row count changes |
| Serverless Functions | Event-driven activation |
| IoT Data Processors | Device message volume |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Message queue processing | Highly recommended |
| Need scale-to-zero | Highly recommended |
| Event-driven architecture | Highly recommended |
| External metric sources | Recommended |
| Asynchronous processing | Recommended |
| Cost-sensitive workloads | Recommended |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Synchronous request handling | No queue to measure | HPA |
| Simple CPU-based scaling | Over-engineered | HPA |
| Stateful services | Scale-to-zero problematic | HPA, VPA |
| Real-time latency requirements | Cold start on scale-from-zero | Keep minReplicas > 0 |
Key Capabilities
- Scale to zero - No pods when there's no work
- 70+ built-in scalers - Message queues, databases, cloud services
- Works with HPA - Extends rather than replaces native autoscaling
KEDA Architecture
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Event Source   │────▶│   KEDA Scaler   │────▶│       HPA       │
│  (Kafka, SQS,   │     │                 │     │                 │
│   Prometheus)   │     │ Metrics Server  │     │   Scale Pods    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
ScaledObject Example (Kafka)
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 0   # Scale to zero!
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: my-consumer-group
      topic: my-topic
      lagThreshold: "100"
```
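The arithmetic behind `lagThreshold` is simple: target roughly one replica per threshold's worth of lag, clamped between the configured bounds, with zero lag mapping to zero replicas. A Python sketch of that calculation (names are illustrative, matching the ScaledObject above, not a KEDA API):

```python
import math

def keda_replicas(total_lag, lag_threshold=100, min_replicas=0, max_replicas=50):
    """Sketch of KEDA's queue-depth math: one replica per lagThreshold
    messages, clamped to [minReplicaCount, maxReplicaCount]."""
    if total_lag == 0:
        return min_replicas          # scale to zero when the topic is idle
    desired = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(desired, max_replicas))

print(keda_replicas(0))      # 0  (idle: scaled to zero)
print(keda_replicas(950))    # 10
print(keda_replicas(99999))  # 50 (capped at maxReplicaCount)
```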
Popular KEDA Scalers by Use Case
| Use Case | Scalers | Trigger Metric |
|---|---|---|
| Async Messaging | Kafka, RabbitMQ, AWS SQS, Azure Service Bus | Queue depth, consumer lag |
| Database Polling | PostgreSQL, MySQL, MongoDB | Row count, query results |
| Monitoring Alerts | Prometheus, Datadog, New Relic | Custom queries |
| Cloud Events | CloudWatch, Azure Monitor, GCP | Cloud metrics |
| Scheduled Tasks | Cron | Time-based activation |
| HTTP Traffic | HTTP Add-on, Prometheus | Request rate |
7. Scheduled and Cron-Based Scaling
For workloads with predictable traffic patterns, scheduled scaling provides proactive resource management.
Benefits
| Benefit | Description |
|---|---|
| Proactive | Scale before demand hits |
| Predictable | Consistent resource availability |
| Cost Control | Scale down during known quiet periods |
| Simple | Easy to understand and configure |
| Reliable | Not dependent on metrics accuracy |
Ideal Workloads
| Workload Type | Why Scheduled Scaling Works Well |
|---|---|
| E-commerce Sites | Known traffic peaks (sales, promotions) |
| Business Applications | Office hours usage patterns |
| Regional Services | Timezone-based demand |
| Media Streaming | Evening viewing peaks |
| Financial Systems | Market hours operation |
| Educational Platforms | School day patterns |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Known traffic patterns | Highly recommended |
| Business hours operation | Highly recommended |
| Recurring events (sales, launches) | Recommended |
| Complement to HPA | Recommended |
| Cost optimization during off-hours | Recommended |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Unpredictable traffic | Wrong capacity at wrong time | HPA, KEDA |
| Global 24/7 services | No off-peak period | HPA |
| Highly variable demand | Cron can't adapt | HPA + KEDA |
| Real-time responsiveness needed | Cron is pre-scheduled | HPA |
Implementation Options
| Option | Complexity | Features | Best For |
|---|---|---|---|
| CronJob + kubectl | Low | Native K8s | Simple schedules |
| KEDA Cron Scaler | Medium | Scale-to-zero | Event-driven apps |
| CronHPA | Medium | Dedicated controller | Alibaba Cloud users |
| AHPA (Advanced) | High | Predictive + scheduled | Enterprise |
Option 1: Native CronJob + kubectl
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * 1-5"  # 8 AM weekdays
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - kubectl scale deployment/web-app --replicas=10
          restartPolicy: OnFailure
```
Option 2: KEDA Cron Scaler
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-scaler
spec:
  scaleTargetRef:
    name: web-app
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * 1-5"   # Scale up at 8 AM
      end: "0 18 * * 1-5"    # Scale down at 6 PM
      desiredReplicas: "10"
```
Option 3: CronHPA (Alibaba Cloud)
```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: cronhpa-sample
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  jobs:
  - name: scale-up
    schedule: "0 8 * * *"
    targetSize: 10
  - name: scale-down
    schedule: "0 20 * * *"
    targetSize: 2
```
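All three options above reduce to the same decision: map the current time to a desired replica count. A minimal Python sketch of that mapping for the business-hours schedules used in the examples (the function, window, and replica values are illustrative; a real controller would evaluate the actual cron expressions in the configured timezone):

```python
from datetime import datetime

def scheduled_replicas(now: datetime, peak=10, off_peak=2):
    """Sketch of schedule-based scaling: weekdays 08:00-18:00 get the
    peak replica count, everything else the off-peak count."""
    if now.weekday() < 5 and 8 <= now.hour < 18:  # Mon-Fri business hours
        return peak
    return off_peak

print(scheduled_replicas(datetime(2025, 6, 2, 9, 30)))   # Monday 09:30 -> 10
print(scheduled_replicas(datetime(2025, 6, 7, 12, 0)))   # Saturday -> 2
```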
8. Serverless and Scale-to-Zero
Scale-to-zero is critical for cost optimization and resource efficiency, especially for infrequently-used services.
Benefits
| Benefit | Description |
|---|---|
| Maximum Cost Savings | Zero cost when not in use |
| Resource Efficiency | Free up cluster resources |
| Event-Driven | Activate only when needed |
| Developer Experience | Deploy without capacity planning |
| Environment Parity | Same scaling in dev/staging/prod |
Ideal Workloads
| Workload Type | Why Scale-to-Zero Works Well |
|---|---|
| Webhooks | Sporadic incoming requests |
| Scheduled Reports | Run once daily/weekly |
| Development APIs | Infrequent testing |
| Internal Tools | Occasional admin usage |
| Preview Environments | Per-PR deployments |
| Seasonal Features | Holiday-specific functionality |
| Low-traffic Microservices | Cost optimization |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Infrequent usage (<1 req/min) | Highly recommended |
| Development/staging environments | Highly recommended |
| Webhook receivers | Recommended |
| Cost-sensitive projects | Recommended |
| Event-driven architectures | Recommended |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Latency-sensitive APIs | Cold start adds 2-5s | Keep minReplicas >= 1 |
| High-traffic services | Constantly scaling | Standard HPA |
| Stateful services | State lost on scale-down | VPA |
| Database connections | Connection pool issues | Keep warm replicas |
Platform Comparison

| Platform | K8s Version | Scale-to-Zero | Cold Start | Complexity | Best For |
|---|---|---|---|---|---|
| Knative | v1.26+ | Native | 2-5s | High | HTTP services |
| OpenFaaS | v1.20+ | Supported | 1-3s | Medium | Functions |
| KEDA | v1.16+ | Native | Varies | Low | Event-driven |
| Fission | v1.19+ | Supported | <100ms | Medium | Fast cold starts |
Knative Serving Example
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-world
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "100"
        autoscaling.knative.dev/target: "70"
    spec:
      containers:
      - image: gcr.io/my-project/hello-world
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
```
Knative Pod Autoscaler (KPA) vs HPA
| Feature | KPA | HPA |
|---|---|---|
| Scale-to-zero | Native | Not supported |
| Metric source | Request concurrency | CPU/Memory |
| Scale speed | Fast (request-aware) | Slower (metric polling) |
| Use case | HTTP workloads | General workloads |
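The concurrency-based row in the table is the key difference, and the math is easy to sketch: KPA targets one replica per `target` concurrent requests and allows zero replicas when there is no traffic. A simplified Python illustration (function name and values are invented for the example, matching the annotations in the Service above):

```python
import math

def kpa_replicas(concurrent_requests: float, target: float = 70,
                 min_scale: int = 0, max_scale: int = 100) -> int:
    """Sketch of Knative KPA's concurrency math: one replica per
    `target` concurrent requests, with scale-to-zero when idle."""
    if concurrent_requests == 0:
        return min_scale               # no traffic: scale to zero
    desired = math.ceil(concurrent_requests / target)
    return max(1, min(desired, max_scale))

print(kpa_replicas(0))     # 0
print(kpa_replicas(350))   # 5
```

Because the signal is in-flight requests rather than a polled CPU metric, scaling reacts within seconds of a traffic change, which is the "Fast (request-aware)" entry above.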
9. Burst Scaling with Virtual Nodes
Virtual nodes enable "burst" scaling to serverless infrastructure, handling traffic spikes without pre-provisioning nodes.
Benefits
| Benefit | Description |
|---|---|
| Instant Capacity | No node provisioning wait |
| Unlimited Scale | Cloud provider capacity |
| Pay-per-Use | Only pay for container runtime |
| No Node Management | Zero infrastructure overhead |
| Disaster Recovery | Instant failover capacity |
Ideal Workloads
| Workload Type | Why Burst Scaling Works Well |
|---|---|
| Marketing Campaigns | Viral traffic handling |
| Flash Sales | E-commerce spikes |
| Gaming Events | Launch day capacity |
| CI/CD Runners | Parallel test execution |
| Data Processing | Large-scale batch jobs |
| Disaster Recovery | Instant backup capacity |
| Video Rendering | Parallel encoding |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Unpredictable traffic spikes | Highly recommended |
| Event-driven massive scale | Highly recommended |
| CI/CD with variable parallelism | Recommended |
| Disaster recovery capacity | Recommended |
| Seasonal business peaks | Recommended |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Sustained high traffic | Cost higher than dedicated | Reserved capacity |
| GPU workloads | Limited GPU in serverless | Dedicated GPU nodes |
| Local storage needs | No persistent volumes | Regular nodes |
| Custom kernel requirements | Standard containers only | Dedicated nodes |
| Low-latency requirements | Cold start overhead | Pre-provisioned nodes |
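The "sustained high traffic" caveat comes down to a break-even calculation: serverless capacity costs more per hour than an equivalent dedicated node, so it only wins below a certain duty cycle. A Python sketch of that trade-off (both hourly prices are illustrative placeholders, not real cloud pricing):

```python
def breakeven_hours(node_hourly_usd, serverless_hourly_usd, month_hours=730):
    """Sketch of the burst-vs-dedicated trade-off: below this many busy
    hours per month, paying per-use beats running a dedicated node."""
    return month_hours * node_hourly_usd / serverless_hourly_usd

# e.g. a $0.10/h node vs $0.25/h of equivalent serverless capacity
hours = breakeven_hours(0.10, 0.25)
print(round(hours))   # 292 busy hours/month, i.e. ~40% duty cycle
```

Above that utilization, reserved or dedicated capacity is cheaper; below it, burst-to-serverless saves money.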
Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                      Kubernetes Cluster                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐   │
│  │   Node 1    │  │   Node 2    │  │    Virtual Node     │   │
│  │  (Regular)  │  │  (Regular)  │  │   (ACI/Fargate)     │   │
│  │             │  │             │  │                     │   │
│  │ [Pod][Pod]  │  │ [Pod][Pod]  │  │  [Burst Pods...]    │   │
│  └─────────────┘  └─────────────┘  └──────────┬──────────┘   │
└─────────────────────────────────────────────────│────────────┘
                                                  ▼
                                        ┌──────────────────┐
                                        │ Cloud Serverless │
                                        │  (ACI, Fargate)  │
                                        └──────────────────┘
```
Cloud Provider Options
| Provider | Service | Provisioning | Use Case |
|---|---|---|---|
| Azure | ACI + Virtual Nodes | ~10 seconds | AKS burst |
| AWS | Fargate | ~30 seconds | EKS serverless |
| GCP | Cloud Run for Anthos | ~5 seconds | GKE burst |
Azure AKS Virtual Nodes
```yaml
# Pod with toleration for virtual node
apiVersion: v1
kind: Pod
metadata:
  name: burst-pod
spec:
  containers:
  - name: app
    image: my-app:latest
  tolerations:
  - key: virtual-kubelet.io/provider
    operator: Exists
  - key: azure.com/aci
    effect: NoSchedule
  nodeSelector:
    kubernetes.io/role: agent
    type: virtual-kubelet
```
10. Multi-Cluster and Geographic Scaling
For high availability, latency reduction, or regulatory compliance, scaling across multiple clusters is essential.
Benefits
| Benefit | Description |
|---|---|
| High Availability | Survive entire cluster failures |
| Low Latency | Serve users from nearest region |
| Compliance | Data residency requirements |
| Blast Radius | Limit impact of incidents |
| Infinite Scale | Horizontal cluster scaling |
Ideal Workloads
| Workload Type | Why Multi-Cluster Works Well |
|---|---|
| Global SaaS Applications | Users worldwide |
| Financial Services | Regional compliance |
| Healthcare Systems | Data sovereignty |
| Gaming Platforms | Low latency critical |
| E-commerce | Regional availability |
| Government Services | Jurisdictional requirements |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Global user base | Highly recommended |
| Regulatory compliance | Highly recommended |
| Zero-downtime requirements | Recommended |
| Multi-cloud strategy | Recommended |
| Disaster recovery | Recommended |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Single region users | Unnecessary complexity | Single cluster |
| Limited budget | High operational cost | Single cluster + backup |
| Simple applications | Over-engineered | Single cluster |
| Tight coupling between services | Latency issues | Single cluster |
Federation Options
| Solution | Approach | Scale | Best For |
|---|---|---|---|
| KubeAdmiral | Control plane federation | 10M+ pods | Hyperscale (ByteDance) |
| K8GB | DNS-based routing | Medium | Geographic routing |
| Istio Multi-cluster | Service mesh | Large | Service-level routing |
| ArgoCD | GitOps-based | Any | Configuration sync |
K8GB for Geographic Load Balancing
```yaml
apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: my-gslb
spec:
  ingress:
    rules:
    - host: app.example.com
      http:
        paths:
        - path: /
          backend:
            service:
              name: my-app
              port:
                number: 80
  strategy:
    type: roundRobin  # or failover, geoip
    splitBrainThresholdSeconds: 300
```
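The three strategy values map to different DNS answers for the same hostname. A Python sketch of that resolution logic (the cluster records, regions, and IPs are invented for the illustration; real GSLB controllers also weigh health-check history and TTLs):

```python
def resolve(strategy, clusters, client_region=None):
    """Sketch of GSLB-style DNS resolution: return the cluster IPs to
    answer with, depending on the configured strategy."""
    healthy = [c for c in clusters if c["healthy"]]
    if not healthy:
        return []
    if strategy == "failover":
        return [healthy[0]["ip"]]      # first healthy cluster, by priority
    if strategy == "geoip" and client_region:
        local = [c["ip"] for c in healthy if c["region"] == client_region]
        return local or [c["ip"] for c in healthy]  # fall back if no local
    return [c["ip"] for c in healthy]  # roundRobin: all healthy clusters

clusters = [
    {"ip": "10.0.0.1", "region": "eu-west", "healthy": True},
    {"ip": "10.0.1.1", "region": "us-east", "healthy": True},
    {"ip": "10.0.2.1", "region": "us-east", "healthy": False},
]
print(resolve("geoip", clusters, client_region="us-east"))  # ['10.0.1.1']
print(resolve("failover", clusters))                        # ['10.0.0.1']
```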
Multi-Cluster Patterns
| Pattern | Description | Use Case | Complexity |
|---|---|---|---|
| Active-Active | All clusters serve traffic | Global load distribution | High |
| Active-Passive | Standby cluster for failover | Disaster recovery | Medium |
| Follow-the-Sun | Route to closest region | Latency optimization | High |
| Hybrid | Mix cloud and on-prem | Regulatory compliance | Very High |
11. Stateful Workload Scaling
Databases and stateful applications require special consideration when scaling.
Benefits
| Benefit | Description |
|---|---|
| Data Integrity | Ordered scaling preserves consistency |
| Stable Identity | Predictable pod names and storage |
| Controlled Scaling | One pod at a time |
| Persistent Storage | Data survives pod restarts |
| Operator Automation | Complex logic automated |
Ideal Workloads
| Workload Type | Why Stateful Scaling Works Well |
|---|---|
| Databases (PostgreSQL, MySQL) | Read replica scaling |
| Distributed Caches (Redis Cluster) | Shard scaling |
| Message Queues (Kafka, RabbitMQ) | Broker scaling |
| Search Engines (Elasticsearch) | Data node scaling |
| Distributed Storage | Storage node scaling |
| Consensus Systems (etcd, Zookeeper) | Careful member management |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Database read replicas needed | Horizontal StatefulSet scaling |
| Consistent resource needs | VPA for databases |
| Complex topology management | Operator-based scaling |
| Data sharding required | Horizontal with operators |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Frequent scaling needed | Slow, ordered operations | Caching layer |
| Unknown data patterns | StatefulSet scaling is slow | Managed database |
| No DBA expertise | Complex failure handling | Managed services |
StatefulSet Scaling Approaches
| Approach | Method | Use Case | Complexity |
|---|---|---|---|
| Horizontal | Add replicas | Read replicas, sharding | Medium |
| Vertical (VPA) | Increase resources | Single-instance performance | Low |
| Operator-based | Automated management | Complex topologies | High |
Database Operators
| Database | Operator | Capabilities |
|---|---|---|
| PostgreSQL | Zalando, CrunchyData | HA, backup, scaling |
| MySQL | Percona, Vitess | Sharding, replication |
| MongoDB | MongoDB Enterprise | Auto-scaling, sharding |
| Cassandra | K8ssandra | Multi-DC, repair |
| Redis | Redis Enterprise | Clustering, geo-replication |
Scaling StatefulSets
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3  # Scale by changing this
  selector:
    matchLabels:
      app: postgres
  template:
    spec:
      containers:
      - name: postgres
        image: postgres:15
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
```
Custom Metrics for Database Scaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: postgres-read-replicas
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres-read
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: postgres_connections_active
      target:
        type: AverageValue
        averageValue: "50"
```
12. Batch Processing and Job Scaling
Large-scale batch processing requires specialized schedulers for efficiency.
Benefits
| Benefit | Description |
|---|---|
| Gang Scheduling | All-or-nothing pod placement |
| Queue Fairness | Resource sharing across teams |
| Priority Management | Critical jobs run first |
| Resource Efficiency | Maximize cluster utilization |
| Job Dependencies | Complex workflow support |
Ideal Workloads
| Workload Type | Why Batch Scaling Works Well |
|---|---|
| Spark Jobs | Distributed data processing |
| ML Training | GPU resource coordination |
| ETL Pipelines | Scheduled data transformations |
| Scientific Computing | HPC workloads |
| Video Encoding | Parallel rendering |
| Financial Calculations | End-of-day processing |
Best Circumstances to Use
| Circumstance | Recommendation |
|---|---|
| Distributed computing (Spark, Flink) | Gang scheduling required |
| Multi-tenant resource sharing | Queue-based scheduling |
| Large-scale ML training | YuniKorn or Volcano |
| Kubernetes-native jobs | Kueue |
| Complex workflows | Argo Workflows |
When NOT to Use
| Circumstance | Why Not | Alternative |
|---|---|---|
| Simple, single-pod jobs | Over-engineered | Native K8s Jobs |
| Long-running services | Not job-oriented | Deployments + HPA |
| Real-time processing | Batch-oriented | Streaming (Flink) |
Batch Schedulers Comparison
| Scheduler | Key Feature | Best For | Complexity |
|---|---|---|---|
| YuniKorn | Gang scheduling, queues | Spark, ML training | Medium |
| Volcano | HPC workloads | Scientific computing | Medium |
| Kueue | Native K8s integration | General batch | Low |
| Argo Workflows | DAG workflows | Data pipelines | Medium |
Gang Scheduling with YuniKorn
Gang scheduling ensures all pods of a job start together or not at all - critical for distributed computing.
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: spark-driver
    queue: root.default
  annotations:
    yunikorn.apache.org/task-group-name: spark-driver  # must match a group below
    yunikorn.apache.org/task-groups: |
      [{
        "name": "spark-driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "2Gi"}
      },
      {
        "name": "spark-executor",
        "minMember": 10,
        "minResource": {"cpu": "2", "memory": "4Gi"}
      }]
```
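The all-or-nothing admission decision behind those annotations can be sketched in a few lines: the job is admitted only if every task group's `minMember` pods fit at once, otherwise nothing is placed. A simplified Python illustration (data structure and capacity numbers are invented for the example; a real scheduler reserves placeholder capacity per node rather than checking cluster totals):

```python
def gang_admit(task_groups, free_cpu, free_mem_gi):
    """Sketch of gang scheduling admission: admit only if the cluster
    can hold every task group's minMember pods simultaneously."""
    need_cpu = sum(g["minMember"] * g["cpu"] for g in task_groups)
    need_mem = sum(g["minMember"] * g["mem_gi"] for g in task_groups)
    return need_cpu <= free_cpu and need_mem <= free_mem_gi

spark_job = [
    {"name": "spark-driver",   "minMember": 1,  "cpu": 1, "mem_gi": 2},
    {"name": "spark-executor", "minMember": 10, "cpu": 2, "mem_gi": 4},
]
print(gang_admit(spark_job, free_cpu=24, free_mem_gi=48))  # True
print(gang_admit(spark_job, free_cpu=16, free_mem_gi=48))  # False: no partial start
```

Without this check, Spark could start a driver plus a few executors, hold those resources, and deadlock waiting for the rest.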
Kueue for Job Queueing
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
spec:
  clusterQueue: cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-batch-job
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  parallelism: 10
  suspend: true  # Kueue unsuspends the Job once it is admitted
  template:
    spec:
      containers:
      - name: worker
        image: my-batch-image
      restartPolicy: Never
```
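The LocalQueue above points at a ClusterQueue named `cluster-queue`, which holds the actual quota. A minimal sketch of one, assuming a ResourceFlavor named `default-flavor` already exists; quota values are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}  # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 100
      - name: memory
        nominalQuota: 256Gi
```

Jobs submitted to `team-a-queue` stay suspended until this ClusterQueue has quota available, then Kueue admits and starts them.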
13. Streaming Workload Scaling
Real-time streaming applications have unique scaling requirements based on data throughput.
Benefits
| Benefit | Description |
|---|---|
| Throughput-Based | Scale on actual data volume |
| Lag-Aware | React to processing backlog |
| Backpressure Handling | Prevent system overload |
| Real-time Adaptation | Continuous adjustment |
| End-to-End Latency | Maintain SLA targets |

Ideal Workloads

| Workload Type | Why Streaming Scaling Works Well |
|---|---|
| Kafka Consumers | Consumer lag-based scaling |
| Flink Applications | Throughput-aware autoscaling |
| Event Processors | Event rate scaling |
| Log Aggregators | Volume-based scaling |
| Real-time Analytics | Query load scaling |
| IoT Data Ingestion | Device message volume |

Best Circumstances to Use

| Circumstance | Recommendation |
|---|---|
| Apache Kafka-based systems | KEDA Kafka scaler |
| Apache Flink workloads | Flink Operator autoscaler |
| Variable event rates | Event-driven scaling |
| SLA-bound latency | Lag-based scaling |

When NOT to Use

| Circumstance | Why Not | Alternative |
|---|---|---|
| Consistent throughput | No scaling needed | Fixed replicas |
| Request-response patterns | Not stream-oriented | HPA |
| Batch processing | Not real-time | Batch schedulers |
Flink Kubernetes Operator Autoscaler
```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-streaming
spec:
  flinkVersion: v1_17
  flinkConfiguration:
    kubernetes.operator.job.autoscaler.enabled: "true"
    kubernetes.operator.job.autoscaler.stabilization.interval: "5m"
    kubernetes.operator.job.autoscaler.metrics.window: "10m"
    kubernetes.operator.job.autoscaler.target.utilization: "0.7"
  jobManager:
    resource:
      memory: "2Gi"
      cpu: 1
  taskManager:
    resource:
      memory: "4Gi"
      cpu: 2
```
KEDA Kafka Scaler
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-stream-processor
spec:
  scaleTargetRef:
    name: stream-processor
  minReplicaCount: 0   # allow scale-to-zero so the activation threshold applies
  maxReplicaCount: 100
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: stream-processor
      topic: events
      lagThreshold: "1000"
      activationLagThreshold: "10"  # scale from 0 when lag > 10
```
14. GPU and ML Workload Scaling
AI/ML workloads require specialized scaling strategies due to GPU scarcity and cost.
Benefits
| Benefit | Description |
|---|---|
| Resource Optimization | Maximize expensive GPU usage |
| Throughput Maximization | Process more requests |
| Cost Control | Scale down when idle |
| Latency Management | Meet inference SLAs |
| Queue Management | Prevent request buildup |

Ideal Workloads

| Workload Type | Scaling Approach | Why It Works |
|---|---|---|
| LLM Inference | Queue-based | Request volume varies |
| Image Generation | Batch-based | Latency tolerance varies |
| Model Training | Gang scheduling | All GPUs needed together |
| Real-time Detection | GPU utilization | Consistent load patterns |
| Recommendation Systems | Hybrid | Traffic + processing |

Best Circumstances to Use

| Circumstance | Recommended Approach |
|---|---|
| Variable inference load | Queue-based autoscaling |
| Latency-sensitive inference | Batch-based scaling |
| Training workloads | Gang scheduling + priority |
| Mixed GPU/CPU workloads | Priority classes |
| GPU cost optimization | Pause pod pattern |

When NOT to Use

| Circumstance | Why Not | Alternative |
|---|---|---|
| Consistent GPU load | No scaling needed | Fixed allocation |
| Training with checkpoints | Restart disrupts training | Priority preemption |
| Real-time requirements | Cold start unacceptable | Keep minimum replicas |

GPU Scaling Approaches

| Approach | Metric | Best For | Complexity |
|---|---|---|---|
| Queue-based | Inference queue length | Throughput optimization | Medium |
| Batch-based | Current batch size | Latency sensitivity | Low |
| GPU Utilization | DCGM metrics | Resource efficiency | High |
| Pause Pod Pattern | Pre-provisioned nodes | Cold start reduction | Medium |
Queue-Based Autoscaling for Inference
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-scaler
spec:
  scaleTargetRef:
    name: gpu-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: inference_queue_length
      query: sum(inference_requests_pending)
      threshold: "50"
```
Pause Pod Pattern for GPU Pre-provisioning
```yaml
# Low-priority pause pod to keep GPU nodes warm
apiVersion: v1
kind: Pod
metadata:
  name: gpu-placeholder
spec:
  priorityClassName: low-priority-preemptible
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
When a real GPU workload arrives, the pause pod is preempted instantly, avoiding node provisioning delays.
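The `low-priority-preemptible` class referenced by the placeholder pod must exist. A minimal sketch; the name and value are illustrative, and a negative value ensures even default-priority pods can preempt it:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-preemptible
value: -10
globalDefault: false
description: "Placeholder pods that any real workload may preempt"
```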
GPU Metrics for Scaling (DCGM)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
```
15. Cost-Based and FinOps Scaling
FinOps integrates cost awareness into scaling decisions.
Benefits
| Benefit | Description |
|---|---|
| Cost Visibility | See where money goes |
| Right-sizing | Eliminate waste |
| Budget Control | Prevent overspending |
| ROI Optimization | Maximum value from spend |
| Showback/Chargeback | Team accountability |

Ideal Workloads

| Workload Type | FinOps Approach |
|---|---|
| All Production Workloads | Right-sizing recommendations |
| Development Environments | Aggressive scale-down |
| Non-critical Services | Spot instances |
| Over-provisioned Apps | VPA recommendations |
| Multi-tenant Platforms | Cost allocation |

Best Circumstances to Use

| Circumstance | Recommendation |
|---|---|
| High cloud bill | Immediate priority |
| Unknown resource usage | Implement visibility first |
| Multi-team clusters | Enable cost allocation |
| Startup with limited budget | Critical for survival |
| Enterprise FinOps initiative | Full tooling investment |

When NOT to Use

| Circumstance | Why Not | Alternative |
|---|---|---|
| Performance is only priority | Cost optimization may impact it | Performance-first approach |
| Very small clusters | Tool overhead not worth it | Manual monitoring |
| Proof of concept | Premature optimization | Simple monitoring |

FinOps Tooling Comparison

| Tool | Approach | Key Feature | Best For |
|---|---|---|---|
| Kubecost | Visibility | Cost allocation | Getting started |
| Cast.ai | Automation | Multi-cloud optimization | Hands-off savings |
| ScaleOps | Real-time | Continuous rightsizing | Dynamic workloads |
| Goldilocks | Recommendations | VPA-based suggestions | Manual optimization |
| Zesty | AI-powered | Up to 70% savings | Maximum automation |
Goldilocks for Resource Recommendations
Goldilocks uses VPA in recommendation-only mode:
```bash
# Add the Fairwinds chart repo, then install Goldilocks
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace

# Enable recommendations for a namespace
kubectl label namespace default goldilocks.fairwinds.com/enabled=true
```

Access the dashboard to see recommendations:

```bash
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
```
Cost-Aware Scaling Strategy
- Right-size pods using VPA recommendations
- Use Spot/Preemptible instances for fault-tolerant workloads
- Scale to zero during off-hours
- Bin-pack nodes with Karpenter consolidation
- Monitor continuously with Kubecost alerts
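Step 4 above, bin-packing with Karpenter consolidation, can be sketched as a NodePool. This is a sketch under assumptions: field names follow the Karpenter v1 API, and it assumes an `EC2NodeClass` named `default` exists; the pool name and timings are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cost-optimized
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed pre-existing EC2NodeClass
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # actively repack underused nodes
    consolidateAfter: 5m
```

With `WhenEmptyOrUnderutilized`, Karpenter continuously looks for cheaper node arrangements and drains underutilized nodes, which is what delivers the bin-packing savings.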
16. Pod Priority and Preemption-Based Scaling
Priority classes enable mixed workloads to share resources efficiently.
Benefits
| Benefit | Description |
|---|---|
| Instant Preemption | Faster than provisioning nodes |
| Resource Efficiency | Utilize spare capacity |
| Cost Optimization | Run low-priority work for free |
| Quality of Service | Critical workloads always run |
| Flexible Scheduling | Dynamic resource allocation |

Ideal Workloads

| Workload Type | Priority Level | Rationale |
|---|---|---|
| Production APIs | High | Business-critical |
| Background Jobs | Low | Can wait for resources |
| CI/CD Pipelines | Medium | Important but interruptible |
| ML Training | Low-Medium | Long-running, restartable |
| Log Processing | Low | Eventual consistency OK |
| Development Pods | Low | Non-production |

Best Circumstances to Use

| Circumstance | Recommendation |
|---|---|
| Mixed criticality workloads | Highly recommended |
| Cluster resource contention | Recommended |
| Cost optimization | Recommended (spare capacity) |
| Fast scaling needed | Preemption faster than provisioning |
| Multi-tenant clusters | Fair resource sharing |

When NOT to Use

| Circumstance | Why Not | Alternative |
|---|---|---|
| All workloads equal priority | No differentiation | Standard scheduling |
| Non-restartable workloads | Preemption causes loss | Dedicated resources |
| Single application cluster | No priority needed | Simple HPA |
Built-in Priority Classes
| Class | Priority Value | Use |
|---|---|---|
| system-node-critical | 2000001000 | Core system (etcd, kubelet) |
| system-cluster-critical | 2000000000 | Cluster services (coredns) |
Custom Priority Configuration
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-batch
value: 100
globalDefault: false
preemptionPolicy: Never  # don't preempt others
description: "Batch jobs that can wait"
```
Using Priority for Efficient Scaling
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      priorityClassName: low-priority-batch
      containers:
      - name: processor
        image: batch-processor:latest
```
Strategy: Run low-priority workloads to utilize spare capacity. When high-priority pods arrive, they preempt low-priority ones instantly - faster than provisioning new nodes.
17. Service Mesh Traffic Scaling
Service meshes provide traffic management independent of pod scaling.
Benefits
| Benefit | Description |
|---|---|
| Traffic Splitting | Route percentages to versions |
| Canary Deployments | Safe gradual rollouts |
| A/B Testing | Feature experimentation |
| Circuit Breaking | Prevent cascade failures |
| Load Balancing | Advanced algorithms |
| Observability | Deep traffic insights |

Ideal Workloads

| Workload Type | Why Service Mesh Works Well |
|---|---|
| Microservices | Service-to-service management |
| Canary Releases | Traffic splitting by percentage |
| A/B Testing | Header-based routing |
| Multi-version APIs | Version-specific routing |
| Security-sensitive | mTLS enforcement |

Best Circumstances to Use

| Circumstance | Recommendation |
|---|---|
| Gradual rollouts needed | Traffic splitting |
| Multiple API versions | Version routing |
| Complex traffic rules | Policy-based routing |
| Zero-downtime deployments | Traffic management |
| Microservices architecture | Full mesh benefits |

When NOT to Use

| Circumstance | Why Not | Alternative |
|---|---|---|
| Simple applications | Overhead not justified | Native K8s |
| High-performance requirements | Sidecar latency | Direct networking |
| Small teams | Operational complexity | Simple ingress |
| Monolithic applications | No service-to-service traffic | Standard deployment |
Istio Traffic Splitting
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
  - my-app
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: my-app
        subset: v2
  - route:
    - destination:
        host: my-app
        subset: v1
      weight: 90
    - destination:
        host: my-app
        subset: v2
      weight: 10
```
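The `v1` and `v2` subsets referenced in the VirtualService must be defined by a DestinationRule. A minimal sketch, assuming the pods carry a `version` label:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```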
Benefits for Scaling
- Canary deployments without changing replica counts
- A/B testing based on headers or user attributes
- Circuit breaking prevents cascade failures during scaling
- Load balancing algorithms (round-robin, least connections)
18. Advanced Patterns
Multidimensional Pod Autoscaler (MPA)
Minimum Version: v1.27+
MPA combines HPA and VPA in a single controller, resolving the conflicts that arise when both operate on similar metrics. It is currently a GKE-specific API (`autoscaling.gke.io`).
Benefits
| Benefit | Description |
|---|---|
| Unified Scaling | One controller for both dimensions |
| No Conflicts | Avoids HPA/VPA feedback loops |
| Optimal Resource Allocation | Best of both approaches |

Ideal Workloads

| Workload Type | Why MPA Works Well |
|---|---|
| Variable Traffic + Memory Apps | Scale horizontally on CPU, vertically on memory |
| JVM Applications | Heap + request scaling |
| Complex Resource Patterns | Multiple scaling dimensions |
```yaml
apiVersion: autoscaling.gke.io/v1
kind: MultidimPodAutoscaler
metadata:
  name: my-app-mpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  goals:
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  constraints:
    container:
    - name: "*"
      requests:
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
```
Descheduler for Rebalancing
The Descheduler finds pods that should be moved and evicts them, allowing fresh scheduling decisions.
Benefits
| Benefit |
Description |
| Node Rebalancing |
Even distribution over time |
| Policy Enforcement |
Fix scheduling violations |
| Cost Optimization |
Consolidate to fewer nodes |
Ideal Circumstances
| Circumstance |
Use Descheduler When |
| Nodes added to cluster |
Redistribute pods to new nodes |
| Taints/labels changed |
Enforce affinity rules |
| Resource imbalance |
Even out utilization |
| Topology violations |
Fix spread constraints |
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
            targetThresholds:
              cpu: 50
              memory: 50
      RemovePodsViolatingTopologySpreadConstraint:
        enabled: true
```
Cluster Proportional Autoscaler
Scales workloads based on cluster size rather than utilization:
Ideal Workloads
| Workload Type | Why Proportional Scaling Works |
|---|---|
| DNS (CoreDNS) | More nodes = more DNS queries |
| Monitoring Agents | Per-node data collection |
| Log Collectors | Volume scales with cluster |
| Network Policies | Cluster-wide enforcement |
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: coredns-autoscaler
  template:
    metadata:
      labels:
        app: coredns-autoscaler
    spec:
      containers:
      - name: autoscaler
        image: registry.k8s.io/cpa/cluster-proportional-autoscaler  # pin a release tag in production
        command:
        - /cluster-proportional-autoscaler
        - --namespace=kube-system
        - --configmap=coredns-autoscaler
        - --target=deployment/coredns
        - --logtostderr=true
        - --v=2
```
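The `coredns-autoscaler` ConfigMap referenced by `--configmap` holds the scaling parameters. A sketch using the autoscaler's linear mode; the ratios are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "preventSinglePointFailure": true
    }
```

With these values, the autoscaler sets CoreDNS replicas to the larger of `cores / 256` and `nodes / 16`, never dropping below 2.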
19. Spot Instance Scaling
Spot instances offer up to 90% cost savings but can be reclaimed by cloud providers.
Benefits
| Benefit | Description |
|---|---|
| Up to 90% Savings | Massive cost reduction |
| Same Performance | Identical to on-demand |
| High Availability | With proper diversification |
| Flexible Capacity | Access more instance types |

Ideal Workloads

| Workload Type | Why Spot Works Well |
|---|---|
| Stateless Web Apps | Easy to replace instances |
| CI/CD Pipelines | Short-lived, restartable |
| Batch Processing | Checkpointing handles interruptions |
| Development Environments | Non-critical |
| ML Training | With checkpointing |
| Data Processing | Fault-tolerant frameworks |

Best Circumstances to Use

| Circumstance | Recommendation |
|---|---|
| Fault-tolerant workloads | Highly recommended |
| Cost optimization priority | Highly recommended |
| Development/staging | Recommended |
| Batch processing | Recommended (with checkpoints) |
| CI/CD runners | Recommended |

When NOT to Use

| Circumstance | Why Not | Alternative |
|---|---|---|
| Database primaries | Data loss risk | Reserved/On-demand |
| Single-replica critical services | Availability risk | On-demand |
| Long-running stateful jobs | Interruption costly | On-demand |
| Strict SLA requirements | Unpredictable | Reserved capacity |
Karpenter Spot Configuration
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - m5.large
        - m5.xlarge
        - m5a.large
        - m5a.xlarge
        - m6i.large
        - m6i.xlarge
  limits:
    cpu: 500
  disruption:
    consolidationPolicy: WhenEmpty
```
Azure AKS Spot Node Pool
```bash
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name spotnodepool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10
```
Best Practices for Spot
- Diversify instance types - Reduce simultaneous interruption risk
- Use Pod Disruption Budgets - Ensure minimum availability
- Implement graceful shutdown - Handle SIGTERM properly
- Mix Spot and On-Demand - Critical workloads on reliable nodes
- Use multiple availability zones - Further reduce interruption correlation
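The second practice above, Pod Disruption Budgets, can be sketched as follows; the app label and replica floor are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2   # keep at least 2 pods running during node drains
  selector:
    matchLabels:
      app: web-app
```

When a spot node receives an interruption notice and is drained gracefully, the eviction API respects this budget, so replacements come up before the last healthy replicas are removed.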
20. Version Compatibility Reference
Quick Reference Table
| Feature | Minimum | Recommended | GA Version |
|---|---|---|---|
| HPA v2 (multi-metric) | v1.23 | v1.30+ | v1.23 |
| VPA | v1.9 | v1.33+ | External |
| In-Place Pod Resize | v1.27 | v1.35 | v1.35 |
| Cluster Autoscaler | v1.8 | v1.30+ | External |
| Karpenter | v1.21 | v1.30+ | External |
| KEDA | v1.16 | v1.30+ | External |
| DRA Structured | v1.32 | v1.32+ | Beta |
| HPA Configurable Tolerance | v1.33 | v1.34+ | v1.35 |
| VPA InPlaceOrRecreate | v1.33 | v1.35 | v1.35 |
Managed Kubernetes Versions (January 2025)
| Provider | Supported Versions | Default |
|---|---|---|
| AWS EKS | 1.28 - 1.32 | 1.30 |
| Azure AKS | 1.28 - 1.32 | 1.30 |
| GCP GKE | 1.28 - 1.32 | 1.30 |
21. Choosing the Right Strategy
Decision Matrix by Workload Type
| Workload Type | Primary Strategy | Secondary Strategy | Why |
|---|---|---|---|
| Stateless Web App | HPA | Karpenter/CA | Traffic-based scaling |
| REST API Service | HPA + Custom Metrics | KEDA | Request patterns |
| Message Consumer | KEDA | Scale-to-zero | Queue-based |
| Database | VPA | Operator-based | Resource optimization |
| ML Training | Gang Scheduling | GPU Autoscaling | Coordinated resources |
| ML Inference | Queue-based HPA | Spot instances | Cost + throughput |
| Batch Jobs | Kueue/YuniKorn | Priority Classes | Fair scheduling |
| Streaming | Flink Autoscaler | KEDA Kafka | Lag-based |
| Unpredictable Traffic | Virtual Nodes | Karpenter | Instant capacity |
| Global App | Multi-cluster | K8GB | Geographic routing |
Implementation Priority
- Start with HPA - Essential for any production workload
- Add VPA recommendations - Use Goldilocks to right-size
- Implement Cluster Autoscaler or Karpenter - Node-level elasticity
- Add KEDA for event-driven workloads - Queue-based scaling
- Consider cost optimization - Spot instances, scale-to-zero
- Advanced patterns - MPA, priority classes, service mesh
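For step 1 of the list above, a minimal HPA to start from; the target name and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```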
22. Comprehensive Strategy Selection Guide
By Traffic Pattern
| Traffic Pattern | Recommended Strategies | Configuration Tips |
|---|---|---|
| Steady, predictable | HPA + VPA (Off mode) | Focus on right-sizing |
| Daily cycles | Scheduled + HPA | Pre-scale before peaks |
| Spiky, unpredictable | HPA + Karpenter | Aggressive scale-up |
| Event-driven bursts | KEDA + Virtual Nodes | Scale-to-zero capable |
| Seasonal (holidays) | Scheduled + Burst | Pre-provision + overflow |

By Cost Sensitivity

| Cost Priority | Recommended Strategies | Expected Savings |
|---|---|---|
| Maximum savings | Spot + Scale-to-zero + VPA | 50-80% |
| Balanced | HPA + Karpenter + Right-sizing | 30-50% |
| Performance-first | HPA + Reserved capacity | 10-20% |
| Development environments | Scale-to-zero + Spot | 70-90% |

By Latency Requirements

| Latency Requirement | Recommended Strategies | Trade-offs |
|---|---|---|
| Ultra-low (<10ms) | Pre-scaled, no scale-to-zero | Higher cost |
| Low (<100ms) | HPA, warm pools | Moderate cost |
| Moderate (<1s) | Standard HPA + Karpenter | Good balance |
| Tolerant (>1s) | Scale-to-zero, Spot | Maximum savings |

By Team Expertise

| Team Level | Recommended Start | Next Steps |
|---|---|---|
| Beginner | HPA + Cluster Autoscaler | Add VPA recommendations |
| Intermediate | + KEDA, Karpenter | Cost optimization |
| Advanced | + MPA, Priority classes | Multi-cluster |
| Expert | Full stack optimization | Custom operators |

By Compliance Requirements

| Requirement | Recommended Strategies | Considerations |
|---|---|---|
| Data residency | Multi-cluster (regional) | Cluster per region |
| Financial services | Cluster Autoscaler (auditable) | Avoid experimental features |
| Healthcare (HIPAA) | Dedicated nodes, no Spot | Premium for compliance |
| Startup/Agile | Any | Maximize innovation speed |
Conclusion
Kubernetes scaling has evolved far beyond simple CPU-based autoscaling. Modern production environments require a multi-layered approach:
- Pod Level: HPA for horizontal, VPA for vertical scaling
- Node Level: Karpenter or Cluster Autoscaler for infrastructure
- Event Level: KEDA for event-driven workloads
- Cost Level: Spot instances, scale-to-zero, right-sizing
The key to success is understanding your workload characteristics and choosing the right combination of strategies. With recent Kubernetes releases maturing features such as In-Place Pod Resize and configurable HPA tolerance, 2025 offers more options than ever for building efficient, cost-effective, and responsive applications.
Key Takeaways
- No single strategy fits all - Combine approaches based on workload needs
- Start simple, iterate - Begin with HPA, add complexity as needed
- Measure before optimizing - Use Goldilocks/Kubecost for visibility
- Version matters - Newer K8s versions unlock powerful features
- Cost and performance balance - Define your priorities clearly
Last updated: January 2025