The Complete Guide to Kubernetes Scaling Strategies in 2025

by Roman Tsyupryk

From basic autoscaling to advanced multi-cluster orchestration - everything you need to know about scaling workloads on Kubernetes


Introduction

Kubernetes has become the de facto standard for container orchestration, but one of its most powerful yet complex capabilities remains autoscaling. As workloads grow increasingly diverse - from stateless microservices to GPU-intensive ML models - understanding the full spectrum of scaling strategies is essential for any platform engineer or DevOps practitioner.

This comprehensive guide covers 17 distinct scaling strategies, their version requirements, use cases, and practical implementation guidance for 2025.


Table of Contents

  1. Understanding Kubernetes Scaling Fundamentals
  2. Horizontal Pod Autoscaling (HPA)
  3. Vertical Pod Autoscaling (VPA)
  4. Cluster Autoscaling
  5. Karpenter: Next-Generation Node Provisioning
  6. Event-Driven Scaling with KEDA
  7. Scheduled and Cron-Based Scaling
  8. Serverless and Scale-to-Zero
  9. Burst Scaling with Virtual Nodes
  10. Multi-Cluster and Geographic Scaling
  11. Stateful Workload Scaling
  12. Batch Processing and Job Scaling
  13. Streaming Workload Scaling
  14. GPU and ML Workload Scaling
  15. Cost-Based and FinOps Scaling
  16. Pod Priority and Preemption-Based Scaling
  17. Service Mesh Traffic Scaling
  18. Advanced Patterns: MPA, Descheduler, and Proportional Scaling
  19. Spot Instance Scaling
  20. Version Compatibility Reference
  21. Choosing the Right Strategy
  22. Comprehensive Strategy Selection Guide

1. Understanding Kubernetes Scaling Fundamentals

Before diving into specific strategies, it's important to understand the three dimensions of Kubernetes scaling:

Dimension What Scales Tools
Horizontal Number of pods HPA, KEDA, Manual
Vertical Pod resources (CPU/memory) VPA, In-Place Resize
Cluster Number of nodes Cluster Autoscaler, Karpenter

Each dimension addresses different scaling challenges, and modern production environments typically combine multiple approaches.

Kubernetes Version Timeline (2023-2025)

Understanding version history is crucial for planning scaling implementations:

Version Release Key Scaling Features
v1.27 Apr 2023 In-Place Pod Resize (Alpha)
v1.30 May 2024 Enhanced autoscaling responsiveness
v1.32 Dec 2024 DRA Structured Parameters, Pod-level resources
v1.33 Apr 2025 In-Place Pod Resize (Beta), HPA Configurable Tolerance
v1.35 Dec 2025 In-Place Pod Resize (GA), VPA InPlaceOrRecreate (Beta)

2. Horizontal Pod Autoscaling (HPA)

Minimum Version: v1.1 (basic), v1.23 (recommended for v2 API)

HPA is the most widely used autoscaling mechanism in Kubernetes. It automatically adjusts the number of pod replicas based on observed metrics.

Benefits

Benefit Description
Native Kubernetes No external dependencies, built into every cluster
Proven & Stable Battle-tested in production for 10+ years
Low Overhead Minimal resource consumption
Flexible Metrics CPU, memory, custom, and external metrics
Easy to Implement Simple YAML configuration

Ideal Workloads

Workload Type Why HPA Works Well
Stateless Web Applications Easy to add/remove replicas without state concerns
REST APIs Request volume correlates with CPU usage
Microservices Independent scaling of each service
Frontend Applications Traffic-driven scaling
GraphQL Endpoints Query complexity reflected in resource usage

Best Circumstances to Use

Circumstance Recommendation
Traffic correlates with CPU/memory Highly recommended
Stateless application architecture Highly recommended
Need quick, reactive scaling Recommended
Predictable request patterns Recommended
Already using Kubernetes Default choice

When NOT to Use

Circumstance Why Not Alternative
Stateful applications Replicas need coordination VPA, Operators
Event-driven workloads Metrics don't reflect queue depth KEDA
Scale-to-zero needed HPA minimum is 1 KEDA, Knative
Long-running batch jobs Jobs complete, not scale Kueue, YuniKorn
GPU workloads CPU metrics don't reflect GPU usage Custom metrics HPA

API Evolution

v1.1  β†’ autoscaling/v1      # CPU only
v1.6  β†’ autoscaling/v2beta1 # Custom metrics
v1.12 β†’ autoscaling/v2beta2 # External metrics, behavior
v1.23 β†’ autoscaling/v2      # GA - USE THIS
v1.26 β†’ v2beta2 REMOVED

Important: If you're still using autoscaling/v2beta2, migrate immediately. It was removed in v1.26.

Basic HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

New in v1.33+: Configurable Tolerance

Historically, HPA used a fixed 10% tolerance globally. This often caused issues:

  • Sensitive workloads couldn't scale when needed
  • Other workloads oscillated unnecessarily

Starting with Kubernetes v1.33 (alpha) and v1.34 (beta), you can configure tolerance per-HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  behavior:
    scaleUp:
      tolerance: 0.05 # 5% tolerance for scaling up
    scaleDown:
      tolerance: 0.15 # 15% tolerance for scaling down

HPA Best Practices

  1. Always set resource requests - HPA cannot calculate utilization without them
  2. Use multiple metrics - Combine CPU with memory or custom metrics
  3. Configure scale-down behavior - Prevent thrashing with stabilization windows (see the sketch below)
  4. Avoid conflicts with VPA - Don't use both on the same metrics
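
A sketch of practice 3: the behavior section of autoscaling/v2 lets you slow scale-down with a stabilization window while keeping scale-up reactive. The window lengths and percentages below are illustrative, not prescriptive:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0 # react to load increases immediately
    scaleDown:
      stabilizationWindowSeconds: 300 # require 5 minutes of low load before removing pods
      policies:
        - type: Percent
          value: 50 # remove at most 50% of current replicas per minute
          periodSeconds: 60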

3. Vertical Pod Autoscaling (VPA)

Minimum Version: v1.9 (external), v1.33+ (in-place recommended)

VPA automatically adjusts the CPU and memory requests/limits of pods based on historical usage patterns.

Benefits

Benefit Description
Right-sizing Automatically finds optimal resource allocation
Cost Reduction Eliminates over-provisioning (20-50% savings typical)
Performance Prevents OOM kills and CPU throttling
Zero Manual Tuning No guesswork for resource requests
In-Place Resize (v1.33+) No pod restarts needed

Ideal Workloads

Workload Type Why VPA Works Well
Databases (PostgreSQL, MySQL) Consistent memory needs, hard to horizontally scale
Caches (Redis, Memcached) Memory-bound, fixed instance count
Message Brokers Predictable resource patterns
Monolithic Applications Single instance optimization
Java/JVM Applications Heap sizing optimization
Long-running Services Builds accurate usage profile over time

Best Circumstances to Use

Circumstance Recommendation
Stateful applications Highly recommended
Unknown resource requirements Highly recommended
Memory-intensive workloads Recommended
Applications with variable load phases Recommended
Cost optimization priority Recommended
K8s v1.33+ available Use InPlaceOrRecreate mode

When NOT to Use

Circumstance Why Not Alternative
Already using HPA on same metric Feedback loops Choose one or use MPA
Ephemeral/short-lived pods Not enough data HPA or manual
Real-time latency requirements Pod recreation causes downtime In-place resize (v1.33+)
Rapid scaling needed VPA is slow to react HPA

VPA Update Modes

Mode Description Restart Required Best For
Off Recommendations only No Analysis, planning
Initial Apply on pod creation No (new pods) Gradual rollout
Auto Apply automatically Yes (recreates) Non-critical workloads
InPlaceOrRecreate Try in-place first (v1.33+) Maybe Production databases

The In-Place Pod Resize Revolution

One of the most anticipated features in Kubernetes history, In-Place Pod Resize eliminates the need to restart pods when adjusting resources:

Version Status Feature Gate
v1.27 Alpha InPlacePodVerticalScaling=true (manual)
v1.28-v1.32 Alpha InPlacePodVerticalScaling=true (manual)
v1.33 Beta Enabled by default
v1.35 GA Always enabled

VPA Configuration (v1.33+)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "InPlaceOrRecreate" # New in v1.33+
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
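
Before enabling automatic updates, you can run the VPA with updateMode: "Off" and simply read the recommendations it computes; they appear in the object's status as lowerBound, target, and upperBound per container:

kubectl get vpa my-app-vpa -o yaml

# or, in a more readable form:
kubectl describe vpa my-app-vpa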

VPA Limitations

  • Cannot run with HPA on the same metrics - This creates feedback loops
  • Recommendations based on historical data - May not react fast enough to spikes
  • Metrics Server dependency - 15-60 second polling intervals

4. Cluster Autoscaling

Minimum Version: v1.8

The Cluster Autoscaler adjusts the number of nodes in your cluster based on pending pods and node utilization.

Benefits

Benefit Description
Infrastructure Elasticity Automatic node provisioning/termination
Cost Optimization Scale down unused nodes
Hands-off Operation No manual node management
Cloud Integration Works with all major cloud providers
Predictable Behavior Well-understood algorithms

Ideal Workloads

Workload Type Why Cluster Autoscaler Works Well
Variable Traffic Applications Nodes scale with demand
Multi-tenant Clusters Shared infrastructure elasticity
Development/Staging Environments Scale down when unused
Traditional Enterprise Apps Stable, predictable scaling
Regulated Industries Well-audited, mature solution

Best Circumstances to Use

Circumstance Recommendation
Using node groups/ASGs Required approach
Multi-cloud or on-prem Recommended (supported everywhere)
Conservative scaling approach Recommended
Compliance requirements Recommended (mature, auditable)
Using GKE Standard mode Default choice

When NOT to Use

Circumstance Why Not Alternative
Need sub-minute provisioning 3-5 minute scale-up time Karpenter
Highly diverse workloads Node groups are rigid Karpenter
Spot instance optimization Basic spot handling Karpenter
Bin-packing optimization Limited consolidation Karpenter
AWS with dynamic requirements Slower than alternatives Karpenter

How It Works

  1. Scale Up: When pods can't be scheduled due to insufficient resources
  2. Scale Down: When nodes are underutilized for a configurable period

Configuration Example (AWS)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --nodes=1:10:my-node-group
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --skip-nodes-with-local-storage=false

Key Parameters

Parameter Default Description
--scale-down-delay-after-add 10m Wait time after scale-up
--scale-down-unneeded-time 10m How long node must be underutilized
--scale-down-utilization-threshold 0.5 Utilization below which node is unneeded
--max-node-provision-time 15m Maximum time for node to become ready

5. Karpenter: Next-Generation Node Provisioning

Minimum Version: v1.21 (AWS), expanding to Azure and others

Karpenter represents a paradigm shift in Kubernetes node management. Unlike Cluster Autoscaler, it provisions nodes in seconds rather than minutes.

Benefits

Benefit Description
Speed 30-60 second node provisioning
Optimal Selection Chooses best instance type for workload
No Node Groups Eliminates rigid infrastructure
Bin-packing Advanced consolidation reduces waste
Spot Native Built-in interruption handling
Multi-arch Automatic ARM64/AMD64 selection

Ideal Workloads

Workload Type Why Karpenter Works Well
Microservices at Scale Rapid, diverse node needs
ML/AI Training GPU instance optimization
Batch Processing Quick provisioning, cost optimization
CI/CD Pipelines Fast runners, terminate when done
Spot-tolerant Workloads Native interruption handling
Multi-arch Applications Automatic Graviton selection

Best Circumstances to Use

Circumstance Recommendation
AWS EKS deployment Highly recommended
Need fast scaling (<1 min) Highly recommended
Cost optimization priority Highly recommended
Diverse workload requirements Recommended
Using Spot instances Recommended
Azure AKS (preview) Consider for new clusters

When NOT to Use

Circumstance Why Not Alternative
GCP GKE Not yet supported Cluster Autoscaler, Autopilot
On-premises Cloud-specific Cluster Autoscaler
Strict node group compliance No node group concept Cluster Autoscaler
Simple, stable workloads Over-engineered Cluster Autoscaler

Karpenter vs Cluster Autoscaler

Feature Cluster Autoscaler Karpenter
Provisioning Speed 3-5 minutes 30-60 seconds
Node Groups Required Not needed
Instance Selection Pre-defined Dynamic/optimal
Consolidation Basic Advanced bin-packing
Spot Handling Basic Native interruption handling

NodePool Configuration

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m6g.large", "m6g.xlarge"]
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
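
On AWS, a NodePool works together with an EC2NodeClass (referenced through spec.template.spec.nodeClassRef in the karpenter.sh/v1 API) that tells Karpenter which AMIs, subnets, and security groups to use. A minimal sketch - the discovery tags, IAM role name, and AMI alias are assumptions you would replace with your own:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest # latest Amazon Linux 2023 AMI
  role: "KarpenterNodeRole-my-cluster" # IAM role attached to provisioned nodes (assumption)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"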

Real-World Results

Organizations report significant improvements with Karpenter:

  • 70% cost reduction through proper rightsizing
  • CPU utilization increased from 25% to 70%
  • 20% additional savings through ARM64/Graviton instances

6. Event-Driven Scaling with KEDA

Minimum Version: v1.16

KEDA (Kubernetes Event-Driven Autoscaling) is a CNCF graduated project that enables fine-grained autoscaling based on event sources.

Benefits

Benefit Description
Scale to Zero No running pods when idle
Event-Driven React to external triggers
70+ Scalers Extensive integration ecosystem
Works with HPA Extends native autoscaling
Lightweight Single-purpose, low overhead
Cost Savings Pay only when processing

Ideal Workloads

Workload Type Why KEDA Works Well
Message Queue Consumers Scale based on queue depth
Kafka Consumers Consumer lag-based scaling
Webhook Handlers Scale to zero between requests
Scheduled Jobs Cron-based activation
Database Triggers Scale on row count changes
Serverless Functions Event-driven activation
IoT Data Processors Device message volume

Best Circumstances to Use

Circumstance Recommendation
Message queue processing Highly recommended
Need scale-to-zero Highly recommended
Event-driven architecture Highly recommended
External metric sources Recommended
Asynchronous processing Recommended
Cost-sensitive workloads Recommended

When NOT to Use

Circumstance Why Not Alternative
Synchronous request handling No queue to measure HPA
Simple CPU-based scaling Over-engineered HPA
Stateful services Scale-to-zero problematic HPA, VPA
Real-time latency requirements Cold start on scale-from-zero Keep minReplicas > 0

Key Capabilities

  • Scale to zero - No pods when there's no work
  • 70+ built-in scalers - Message queues, databases, cloud services
  • Works with HPA - Extends rather than replaces native autoscaling

KEDA Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Event Source   │────▢│  KEDA Scaler    │────▢│      HPA        β”‚
β”‚  (Kafka, SQS,   β”‚     β”‚                 β”‚     β”‚                 β”‚
β”‚   Prometheus)   β”‚     β”‚  Metrics Server β”‚     β”‚  Scale Pods     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

ScaledObject Example (Kafka)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 0 # Scale to zero!
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-consumer-group
        topic: my-topic
        lagThreshold: "100"

Common KEDA Scalers by Use Case

Use Case Scalers Trigger Metric
Async Messaging Kafka, RabbitMQ, AWS SQS, Azure Service Bus Queue depth, consumer lag
Database Polling PostgreSQL, MySQL, MongoDB Row count, query results
Monitoring Alerts Prometheus, Datadog, New Relic Custom queries
Cloud Events CloudWatch, Azure Monitor, GCP Cloud metrics
Scheduled Tasks Cron Time-based activation
HTTP Traffic HTTP Add-on, Prometheus Request rate

7. Scheduled and Cron-Based Scaling

For workloads with predictable traffic patterns, scheduled scaling provides proactive resource management.

Benefits

Benefit Description
Proactive Scale before demand hits
Predictable Consistent resource availability
Cost Control Scale down during known quiet periods
Simple Easy to understand and configure
Reliable Not dependent on metrics accuracy

Ideal Workloads

Workload Type Why Scheduled Scaling Works Well
E-commerce Sites Known traffic peaks (sales, promotions)
Business Applications Office hours usage patterns
Regional Services Timezone-based demand
Media Streaming Evening viewing peaks
Financial Systems Market hours operation
Educational Platforms School day patterns

Best Circumstances to Use

Circumstance Recommendation
Known traffic patterns Highly recommended
Business hours operation Highly recommended
Recurring events (sales, launches) Recommended
Complement to HPA Recommended
Cost optimization during off-hours Recommended

When NOT to Use

Circumstance Why Not Alternative
Unpredictable traffic Wrong capacity at wrong time HPA, KEDA
Global 24/7 services No off-peak period HPA
Highly variable demand Cron can't adapt HPA + KEDA
Real-time responsiveness needed Cron is pre-scheduled HPA

Implementation Options

Option Complexity Features Best For
CronJob + kubectl Low Native K8s Simple schedules
KEDA Cron Scaler Medium Scale-to-zero Event-driven apps
CronHPA Medium Dedicated controller Alibaba Cloud users
AHPA (Advanced) High Predictive + scheduled Enterprise

Option 1: Native CronJob + kubectl

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * 1-5" # 8 AM weekdays
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment/web-app --replicas=10
          restartPolicy: OnFailure
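
For the Job above to run kubectl scale, its pod needs a ServiceAccount bound to a Role that can modify the Deployment's scale subresource. A minimal sketch with illustrative names (assumes everything lives in the default namespace):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: scaler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
rules:
  - apiGroups: ["apps"]
    resources: ["deployments/scale"] # only the scale subresource, nothing else
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler
subjects:
  - kind: ServiceAccount
    name: scaler
    namespace: default # assumption: CronJob runs in the default namespace
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io

The CronJob's pod template would then set serviceAccountName: scaler.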

Option 2: KEDA Cron Scaler

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-scaler
spec:
  scaleTargetRef:
    name: web-app
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: "0 8 * * 1-5" # Scale up at 8 AM
        end: "0 18 * * 1-5" # Scale down at 6 PM
        desiredReplicas: "10"

Option 3: CronHPA (Alibaba Cloud)

apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: cronhpa-sample
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  jobs:
    - name: scale-up
      schedule: "0 8 * * *"
      targetSize: 10
    - name: scale-down
      schedule: "0 20 * * *"
      targetSize: 2

8. Serverless and Scale-to-Zero

Scale-to-zero is critical for cost optimization and resource efficiency, especially for infrequently-used services.

Benefits

Benefit Description
Maximum Cost Savings Zero cost when not in use
Resource Efficiency Free up cluster resources
Event-Driven Activate only when needed
Developer Experience Deploy without capacity planning
Environment Parity Same scaling in dev/staging/prod

Ideal Workloads

Workload Type Why Scale-to-Zero Works Well
Webhooks Sporadic incoming requests
Scheduled Reports Run once daily/weekly
Development APIs Infrequent testing
Internal Tools Occasional admin usage
Preview Environments Per-PR deployments
Seasonal Features Holiday-specific functionality
Low-traffic Microservices Cost optimization

Best Circumstances to Use

Circumstance Recommendation
Infrequent usage (<1 req/min) Highly recommended
Development/staging environments Highly recommended
Webhook receivers Recommended
Cost-sensitive projects Recommended
Event-driven architectures Recommended

When NOT to Use

Circumstance Why Not Alternative
Latency-sensitive APIs Cold start adds 2-5s Keep minReplicas >= 1
High-traffic services Constantly scaling Standard HPA
Stateful services State lost on scale-down VPA
Database connections Connection pool issues Keep warm replicas

Comparison of Serverless Platforms

Platform K8s Version Scale-to-Zero Cold Start Complexity Best For
Knative v1.26+ Native 2-5s High HTTP services
OpenFaaS v1.20+ Supported 1-3s Medium Functions
KEDA v1.16+ Native Varies Low Event-driven
Fission v1.19+ Supported <100ms Medium Fast cold starts

Knative Serving Example

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-world
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "100"
        autoscaling.knative.dev/target: "70"
    spec:
      containers:
        - image: gcr.io/my-project/hello-world
          resources:
            requests:
              cpu: 100m
              memory: 128Mi

Knative Pod Autoscaler (KPA) vs HPA

Feature KPA HPA
Scale-to-zero Native Not supported
Metric source Request concurrency CPU/Memory
Scale speed Fast (request-aware) Slower (metric polling)
Use case HTTP workloads General workloads

9. Burst Scaling with Virtual Nodes

Virtual nodes enable "burst" scaling to serverless infrastructure, handling traffic spikes without pre-provisioning nodes.

Benefits

Benefit Description
Instant Capacity No node provisioning wait
Unlimited Scale Cloud provider capacity
Pay-per-Use Only pay for container runtime
No Node Management Zero infrastructure overhead
Disaster Recovery Instant failover capacity

Ideal Workloads

Workload Type Why Burst Scaling Works Well
Marketing Campaigns Viral traffic handling
Flash Sales E-commerce spikes
Gaming Events Launch day capacity
CI/CD Runners Parallel test execution
Data Processing Large-scale batch jobs
Disaster Recovery Instant backup capacity
Video Rendering Parallel encoding

Best Circumstances to Use

Circumstance Recommendation
Unpredictable traffic spikes Highly recommended
Event-driven massive scale Highly recommended
CI/CD with variable parallelism Recommended
Disaster recovery capacity Recommended
Seasonal business peaks Recommended

When NOT to Use

Circumstance Why Not Alternative
Sustained high traffic Cost higher than dedicated Reserved capacity
GPU workloads Limited GPU in serverless Dedicated GPU nodes
Local storage needs No persistent volumes Regular nodes
Custom kernel requirements Standard containers only Dedicated nodes
Low-latency requirements Cold start overhead Pre-provisioned nodes

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Kubernetes Cluster                        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   Node 1    β”‚  β”‚   Node 2    β”‚  β”‚   Virtual Node      β”‚  β”‚
β”‚  β”‚  (Regular)  β”‚  β”‚  (Regular)  β”‚  β”‚   (ACI/Fargate)     β”‚  β”‚
β”‚  β”‚             β”‚  β”‚             β”‚  β”‚                     β”‚  β”‚
β”‚  β”‚  [Pod][Pod] β”‚  β”‚  [Pod][Pod] β”‚  β”‚  [Burst Pods...]    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                              β”‚
                                              β–Ό
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    β”‚  Cloud Serverlessβ”‚
                                    β”‚  (ACI, Fargate)  β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Cloud Provider Options

Provider Service Provisioning Use Case
Azure ACI + Virtual Nodes ~10 seconds AKS burst
AWS Fargate ~30 seconds EKS serverless
GCP Cloud Run for Anthos ~5 seconds GKE burst

Azure AKS Virtual Nodes

# Pod with toleration for virtual node
apiVersion: v1
kind: Pod
metadata:
  name: burst-pod
spec:
  containers:
    - name: app
      image: my-app:latest
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
    - key: azure.com/aci
      effect: NoSchedule
  nodeSelector:
    kubernetes.io/role: agent
    type: virtual-kubelet
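
AWS EKS Fargate Profile

On EKS, the equivalent burst target is Fargate: pods are scheduled onto Fargate when they match a Fargate profile's namespace (and optional labels). A sketch with illustrative cluster, profile, and namespace names:

eksctl create fargateprofile \
  --cluster my-cluster \
  --name burst-profile \
  --namespace burst-workloads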

10. Multi-Cluster and Geographic Scaling

For high availability, latency reduction, or regulatory compliance, scaling across multiple clusters is essential.

Benefits

Benefit Description
High Availability Survive entire cluster failures
Low Latency Serve users from nearest region
Compliance Data residency requirements
Blast Radius Limit impact of incidents
Infinite Scale Horizontal cluster scaling

Ideal Workloads

Workload Type Why Multi-Cluster Works Well
Global SaaS Applications Users worldwide
Financial Services Regional compliance
Healthcare Systems Data sovereignty
Gaming Platforms Low latency critical
E-commerce Regional availability
Government Services Jurisdictional requirements

Best Circumstances to Use

Circumstance Recommendation
Global user base Highly recommended
Regulatory compliance Highly recommended
Zero-downtime requirements Recommended
Multi-cloud strategy Recommended
Disaster recovery Recommended

When NOT to Use

Circumstance Why Not Alternative
Single region users Unnecessary complexity Single cluster
Limited budget High operational cost Single cluster + backup
Simple applications Over-engineered Single cluster
Tight coupling between services Latency issues Single cluster

Federation Options

Solution Approach Scale Best For
KubeAdmiral Control plane federation 10M+ pods Hyperscale (ByteDance)
K8GB DNS-based routing Medium Geographic routing
Istio Multi-cluster Service mesh Large Service-level routing
ArgoCD GitOps-based Any Configuration sync

K8GB for Geographic Load Balancing

apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: my-gslb
spec:
  ingress:
    rules:
      - host: app.example.com
        http:
          paths:
            - path: /
              backend:
                service:
                  name: my-app
                  port:
                    number: 80
  strategy:
    type: roundRobin # or failover, geoip
    splitBrainThresholdSeconds: 300

Multi-Cluster Patterns

Pattern Description Use Case Complexity
Active-Active All clusters serve traffic Global load distribution High
Active-Passive Standby cluster for failover Disaster recovery Medium
Follow-the-Sun Route to closest region Latency optimization High
Hybrid Mix cloud and on-prem Regulatory compliance Very High

11. Stateful Workload Scaling

Databases and stateful applications require special consideration when scaling.

Benefits

Benefit Description
Data Integrity Ordered scaling preserves consistency
Stable Identity Predictable pod names and storage
Controlled Scaling One pod at a time
Persistent Storage Data survives pod restarts
Operator Automation Complex logic automated

Ideal Workloads

Workload Type Why Stateful Scaling Works Well
Databases (PostgreSQL, MySQL) Read replica scaling
Distributed Caches (Redis Cluster) Shard scaling
Message Queues (Kafka, RabbitMQ) Broker scaling
Search Engines (Elasticsearch) Data node scaling
Distributed Storage Storage node scaling
Consensus Systems (etcd, Zookeeper) Careful member management

Best Circumstances to Use

Circumstance Recommendation
Database read replicas needed Horizontal StatefulSet scaling
Consistent resource needs VPA for databases
Complex topology management Operator-based scaling
Data sharding required Horizontal with operators

When NOT to Use

Circumstance Why Not Alternative
Frequent scaling needed Slow, ordered operations Caching layer
Unknown data patterns StatefulSet scaling is slow Managed database
No DBA expertise Complex failure handling Managed services

StatefulSet Scaling Approaches

Approach Method Use Case Complexity
Horizontal Add replicas Read replicas, sharding Medium
Vertical (VPA) Increase resources Single-instance performance Low
Operator-based Automated management Complex topologies High

Database Operators

Database Operator Capabilities
PostgreSQL Zalando, CrunchyData HA, backup, scaling
MySQL Percona, Vitess Sharding, replication
MongoDB MongoDB Enterprise Auto-scaling, sharding
Cassandra K8ssandra Multi-DC, repair
Redis Redis Enterprise Clustering, geo-replication

Scaling StatefulSets

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3 # Scale by changing this
  selector:
    matchLabels:
      app: postgres
  template:
    spec:
      containers:
        - name: postgres
          image: postgres:15
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi

Custom Metrics for Database Scaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: postgres-read-replicas
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres-read
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: postgres_connections_active
        target:
          type: AverageValue
          averageValue: "50"

12. Batch Processing and Job Scaling

Large-scale batch processing requires specialized schedulers for efficiency.

Benefits

Benefit Description
Gang Scheduling All-or-nothing pod placement
Queue Fairness Resource sharing across teams
Priority Management Critical jobs run first
Resource Efficiency Maximize cluster utilization
Job Dependencies Complex workflow support

Ideal Workloads

Workload Type Why Batch Scaling Works Well
Spark Jobs Distributed data processing
ML Training GPU resource coordination
ETL Pipelines Scheduled data transformations
Scientific Computing HPC workloads
Video Encoding Parallel rendering
Financial Calculations End-of-day processing

Best Circumstances to Use

Circumstance Recommendation
Distributed computing (Spark, Flink) Gang scheduling required
Multi-tenant resource sharing Queue-based scheduling
Large-scale ML training YuniKorn or Volcano
Kubernetes-native jobs Kueue
Complex workflows Argo Workflows

When NOT to Use

Circumstance Why Not Alternative
Simple, single-pod jobs Over-engineered Native K8s Jobs
Long-running services Not job-oriented Deployments + HPA
Real-time processing Batch-oriented Streaming (Flink)

Batch Schedulers Comparison

Scheduler Key Feature Best For Complexity
YuniKorn Gang scheduling, queues Spark, ML training Medium
Volcano HPC workloads Scientific computing Medium
Kueue Native K8s integration General batch Low
Argo Workflows DAG workflows Data pipelines Medium

Gang Scheduling with YuniKorn

Gang scheduling ensures all pods of a job start together or not at all - critical for distributed computing.

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: spark-driver
    queue: root.default
  annotations:
    yunikorn.apache.org/task-group-name: spark-job
    yunikorn.apache.org/task-groups: |
      [{
        "name": "spark-driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "2Gi"}
      },
      {
        "name": "spark-executor",
        "minMember": 10,
        "minResource": {"cpu": "2", "memory": "4Gi"}
      }]

Kueue for Job Queueing

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
spec:
  clusterQueue: cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-batch-job
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  parallelism: 10
  template:
    spec:
      containers:
        - name: worker
          image: my-batch-image
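
The LocalQueue above points at a ClusterQueue named cluster-queue, which must exist and define admission quota. A minimal sketch - the flavor name and quota numbers are illustrative:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {} # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 100
            - name: memory
              nominalQuota: 400Gi

Jobs in team-a-queue are held until the ClusterQueue has free quota, then admitted and unsuspended.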

13. Streaming Workload Scaling

Real-time streaming applications have unique scaling requirements based on data throughput.

Benefits

Benefit Description
Throughput-Based Scale on actual data volume
Lag-Aware React to processing backlog
Backpressure Handling Prevent system overload
Real-time Adaptation Continuous adjustment
End-to-End Latency Maintain SLA targets

Ideal Workloads

Workload Type Why Streaming Scaling Works Well
Kafka Consumers Consumer lag-based scaling
Flink Applications Throughput-aware autoscaling
Event Processors Event rate scaling
Log Aggregators Volume-based scaling
Real-time Analytics Query load scaling
IoT Data Ingestion Device message volume

Best Circumstances to Use

Circumstance Recommendation
Apache Kafka-based systems KEDA Kafka scaler
Apache Flink workloads Flink Operator autoscaler
Variable event rates Event-driven scaling
SLA-bound latency Lag-based scaling

When NOT to Use

Circumstance Why Not Alternative
Consistent throughput No scaling needed Fixed replicas
Request-response patterns Not stream-oriented HPA
Batch processing Not real-time Batch schedulers

Flink Operator Autoscaler Configuration

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-streaming
spec:
  flinkVersion: v1_17
  flinkConfiguration:
    kubernetes.operator.job.autoscaler.enabled: "true"
    kubernetes.operator.job.autoscaler.stabilization.interval: "5m"
    kubernetes.operator.job.autoscaler.metrics.window: "10m"
    kubernetes.operator.job.autoscaler.target.utilization: "0.7"
  jobManager:
    resource:
      memory: "2Gi"
      cpu: 1
  taskManager:
    resource:
      memory: "4Gi"
      cpu: 2

KEDA Kafka Scaler

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-stream-processor
spec:
  scaleTargetRef:
    name: stream-processor
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: stream-processor
        topic: events
        lagThreshold: "1000"
        activationLagThreshold: "10" # Scale from 0 when lag > 10

14. GPU and ML Workload Scaling

AI/ML workloads require specialized scaling strategies due to GPU scarcity and cost.

Benefits

Benefit Description
Resource Optimization Maximize expensive GPU usage
Throughput Maximization Process more requests
Cost Control Scale down when idle
Latency Management Meet inference SLAs
Queue Management Prevent request buildup

Ideal Workloads

Workload Type Scaling Approach Why It Works
LLM Inference Queue-based Request volume varies
Image Generation Batch-based Latency tolerance varies
Model Training Gang scheduling All GPUs needed together
Real-time Detection GPU utilization Consistent load patterns
Recommendation Systems Hybrid Traffic + processing

Best Circumstances to Use

Circumstance Recommended Approach
Variable inference load Queue-based autoscaling
Latency-sensitive inference Batch-based scaling
Training workloads Gang scheduling + priority
Mixed GPU/CPU workloads Priority classes
GPU cost optimization Pause pod pattern

When NOT to Use

Circumstance Why Not Alternative
Consistent GPU load No scaling needed Fixed allocation
Training with checkpoints Restart disrupts training Priority preemption
Real-time requirements Cold start unacceptable Keep minimum replicas

GPU Scaling Approaches

Approach Metric Best For Complexity
Queue-based Inference queue length Throughput optimization Medium
Batch-based Current batch size Latency sensitivity Low
GPU Utilization DCGM metrics Resource efficiency High
Pause Pod Pattern Pre-provisioned nodes Cold start reduction Medium

Queue-Based Autoscaling for Inference

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-scaler
spec:
  scaleTargetRef:
    name: gpu-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_queue_length
        query: sum(inference_requests_pending)
        threshold: "50"

Pause Pod Pattern for GPU Pre-provisioning

# Low-priority pause pod to keep GPU nodes warm
apiVersion: v1
kind: Pod
metadata:
  name: gpu-placeholder
spec:
  priorityClassName: low-priority-preemptible
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists

When a real GPU workload arrives, the pause pod is preempted instantly, avoiding node provisioning delays.
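
The priorityClassName referenced above must exist. A minimal sketch of such a class - the negative value is an assumption; anything lower than your real workloads works:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-preemptible
value: -10 # lower than any real workload, so placeholders are evicted first
globalDefault: false
preemptionPolicy: Never # placeholder pods never preempt anything themselves
description: "Placeholder pods that keep GPU nodes warm and yield to real workloads"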

GPU Metrics for Scaling (DCGM)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"

15. Cost-Based and FinOps Scaling

FinOps integrates cost awareness into scaling decisions.

Benefits

Benefit Description
Cost Visibility See where money goes
Right-sizing Eliminate waste
Budget Control Prevent overspending
ROI Optimization Maximum value from spend
Showback/Chargeback Team accountability

Ideal Workloads

Workload Type FinOps Approach
All Production Workloads Right-sizing recommendations
Development Environments Aggressive scale-down
Non-critical Services Spot instances
Over-provisioned Apps VPA recommendations
Multi-tenant Platforms Cost allocation

Best Circumstances to Use

Circumstance Recommendation
High cloud bill Immediate priority
Unknown resource usage Implement visibility first
Multi-team clusters Enable cost allocation
Startup with limited budget Critical for survival
Enterprise FinOps initiative Full tooling investment

When NOT to Use

Circumstance Why Not Alternative
Performance is only priority Cost optimization may impact Performance-first approach
Very small clusters Tool overhead not worth it Manual monitoring
Proof of concept Premature optimization Simple monitoring

FinOps Tools Comparison

Tool Approach Key Feature Best For
Kubecost Visibility Cost allocation Getting started
Cast.ai Automation Multi-cloud optimization Hands-off savings
ScaleOps Real-time Continuous rightsizing Dynamic workloads
Goldilocks Recommendations VPA-based suggestions Manual optimization
Zesty AI-powered Up to 70% savings Maximum automation

Goldilocks for Resource Recommendations

Goldilocks uses VPA in recommendation-only mode:

# Install Goldilocks
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks

# Enable for a namespace
kubectl label namespace default goldilocks.fairwinds.com/enabled=true

Access the dashboard to see recommendations:

kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80

Cost-Aware Scaling Strategy

  1. Right-size pods using VPA recommendations
  2. Use Spot/Preemptible instances for fault-tolerant workloads
  3. Scale to zero during off-hours
  4. Bin-pack nodes with Karpenter consolidation
  5. Monitor continuously with Kubecost alerts (install sketch below)
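
For step 5, Kubecost can be installed with Helm. A sketch using the chart coordinates from Kubecost's documentation - treat the repo URL and chart name as assumptions to verify against the current docs:

helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace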

16. Pod Priority and Preemption-Based Scaling

Priority classes enable mixed workloads to share resources efficiently.

Benefits

Benefit Description
Instant Preemption Faster than provisioning nodes
Resource Efficiency Utilize spare capacity
Cost Optimization Run low-priority work for free
Quality of Service Critical workloads always run
Flexible Scheduling Dynamic resource allocation

Ideal Workloads

Workload Type Priority Level Rationale
Production APIs High Business-critical
Background Jobs Low Can wait for resources
CI/CD Pipelines Medium Important but interruptible
ML Training Low-Medium Long-running, restartable
Log Processing Low Eventual consistency OK
Development Pods Low Non-production

Best Circumstances to Use

Circumstance Recommendation
Mixed criticality workloads Highly recommended
Cluster resource contention Recommended
Cost optimization Recommended (spare capacity)
Fast scaling needed Preemption faster than provisioning
Multi-tenant clusters Fair resource sharing

When NOT to Use

Circumstance Why Not Alternative
All workloads equal priority No differentiation Standard scheduling
Non-restartable workloads Preemption causes loss Dedicated resources
Single application cluster No priority needed Simple HPA

Built-in Priority Classes

Class Priority Value Use
system-node-critical 2000001000 Core system (etcd, kubelet)
system-cluster-critical 2000000000 Cluster services (coredns)

Custom Priority Configuration

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-batch
value: 100
globalDefault: false
preemptionPolicy: Never # Don't preempt others
description: "Batch jobs that can wait"

Using Priority for Efficient Scaling

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  template:
    spec:
      priorityClassName: low-priority-batch
      containers:
        - name: processor
          image: batch-processor:latest

Strategy: Run low-priority workloads to utilize spare capacity. When high-priority pods arrive, they preempt low-priority ones instantly - faster than provisioning new nodes.


17. Service Mesh Traffic Scaling

Service meshes provide traffic management independent of pod scaling.

Benefits

Benefit Description
Traffic Splitting Route percentages to versions
Canary Deployments Safe gradual rollouts
A/B Testing Feature experimentation
Circuit Breaking Prevent cascade failures
Load Balancing Advanced algorithms
Observability Deep traffic insights

Ideal Workloads

Workload Type Why Service Mesh Works Well
Microservices Service-to-service management
Canary Releases Traffic splitting by percentage
A/B Testing Header-based routing
Multi-version APIs Version-specific routing
Security-sensitive mTLS enforcement

Best Circumstances to Use

Circumstance Recommendation
Gradual rollouts needed Traffic splitting
Multiple API versions Version routing
Complex traffic rules Policy-based routing
Zero-downtime deployments Traffic management
Microservices architecture Full mesh benefits

When NOT to Use

Circumstance Why Not Alternative
Simple applications Overhead not justified Native K8s
High-performance requirements Sidecar latency Direct networking
Small teams Operational complexity Simple ingress
Monolithic applications No service-to-service Standard deployment

Istio Traffic Splitting

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: my-app
            subset: v2
    - route:
        - destination:
            host: my-app
            subset: v1
          weight: 90
        - destination:
            host: my-app
            subset: v2
          weight: 10
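
The VirtualService above routes to subsets v1 and v2, which must be declared in a DestinationRule that maps each subset to pod labels. A minimal sketch - the version labels are assumptions matching a typical two-version deployment:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
    - name: v1
      labels:
        version: v1 # pods of the stable release carry this label
    - name: v2
      labels:
        version: v2 # pods of the canary release carry this label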

Benefits for Scaling

  • Canary deployments without changing replica counts
  • A/B testing based on headers or user attributes
  • Circuit breaking prevents cascade failures during scaling
  • Load balancing algorithms (round-robin, least connections)

18. Advanced Patterns: MPA, Descheduler, and Proportional Scaling

Multidimensional Pod Autoscaler (MPA)

Minimum Version: v1.27+

MPA combines HPA and VPA in a single controller, solving conflicts when both operate on similar metrics.

Benefits

Benefit Description
Unified Scaling One controller for both dimensions
No Conflicts Avoids HPA/VPA feedback loops
Optimal Resource Allocation Best of both approaches

Ideal Workloads

Workload Type Why MPA Works Well
Variable Traffic + Memory Apps Scale horizontally on CPU, vertically on memory
JVM Applications Heap + request scaling
Complex Resource Patterns Multiple scaling dimensions

MPA Configuration (GKE)

apiVersion: autoscaling.gke.io/v1
kind: MultidimPodAutoscaler
metadata:
  name: my-app-mpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  goals:
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
  constraints:
    container:
      - name: "*"
        requests:
          minAllowed:
            cpu: 100m
            memory: 128Mi
          maxAllowed:
            cpu: 4
            memory: 8Gi

Descheduler for Rebalancing

The Descheduler finds pods that should be moved and evicts them, allowing fresh scheduling decisions.

Benefits

Benefit Description
Node Rebalancing Even distribution over time
Policy Enforcement Fix scheduling violations
Cost Optimization Consolidate to fewer nodes

Ideal Circumstances

Circumstance Use Descheduler When
Nodes added to cluster Redistribute pods to new nodes
Taints/labels changed Enforce affinity rules
Resource imbalance Even out utilization
Topology violations Fix spread constraints

Descheduler Policy Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
data:
  policy.yaml: |
    apiVersion: descheduler/v1alpha1
    kind: DeschedulerPolicy
    strategies:
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
            targetThresholds:
              cpu: 50
              memory: 50
      RemovePodsViolatingTopologySpreadConstraint:
        enabled: true

Cluster Proportional Autoscaler

Scales workloads based on cluster size rather than utilization:

Ideal Workloads

Workload Type Why Proportional Scaling Works
DNS (CoreDNS) More nodes = more DNS queries
Monitoring Agents Per-node data collection
Log Collectors Volume scales with cluster
Network Policies Cluster-wide enforcement

Cluster Proportional Autoscaler Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns-autoscaler
spec:
  template:
    spec:
      containers:
        - name: autoscaler
          image: registry.k8s.io/cpa/cluster-proportional-autoscaler
          command:
            - /cluster-proportional-autoscaler
            - --namespace=kube-system
            - --configmap=coredns-autoscaler
            - --target=deployment/coredns
            - --logtostderr=true
            - --v=2
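
The autoscaler reads its scaling parameters from the ConfigMap named by --configmap. A sketch using the linear mode - the ratios are illustrative defaults, not a recommendation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "preventSinglePointFailure": true
    }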

19. Spot Instance Scaling

Spot instances offer up to 90% cost savings but can be reclaimed by cloud providers.

Benefits

Benefit Description
Up to 90% Savings Massive cost reduction
Same Performance Identical to on-demand
High Availability With proper diversification
Flexible Capacity Access more instance types

Ideal Workloads

Workload Type Why Spot Works Well
Stateless Web Apps Easy to replace instances
CI/CD Pipelines Short-lived, restartable
Batch Processing Checkpointing handles interruptions
Development Environments Non-critical
ML Training With checkpointing
Data Processing Fault-tolerant frameworks

Best Circumstances to Use

Circumstance Recommendation
Fault-tolerant workloads Highly recommended
Cost optimization priority Highly recommended
Development/staging Recommended
Batch processing Recommended (with checkpoints)
CI/CD runners Recommended

When NOT to Use

Circumstance Why Not Alternative
Database primaries Data loss risk Reserved/On-demand
Single-replica critical services Availability risk On-demand
Long-running stateful jobs Interruption costly On-demand
Strict SLA requirements Unpredictable Reserved capacity

Karpenter Spot Configuration

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m5.large
            - m5.xlarge
            - m5a.large
            - m5a.xlarge
            - m6i.large
            - m6i.xlarge
  limits:
    cpu: 500
  disruption:
    consolidationPolicy: WhenEmpty

Azure AKS Spot Node Pool

az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name spotnodepool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10

Best Practices for Spot

  1. Diversify instance types - Reduce simultaneous interruption risk
  2. Use Pod Disruption Budgets - Ensure minimum availability (see the sketch after this list)
  3. Implement graceful shutdown - Handle SIGTERM properly
  4. Mix Spot and On-Demand - Critical workloads on reliable nodes
  5. Use multiple availability zones - Further reduce interruption correlation
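
A minimal sketch of practices 2 and 3: a PodDisruptionBudget that keeps a floor of replicas during node drains, plus a graceful-shutdown window on the pod. The app name and numbers are illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2 # never voluntarily evict below 2 ready replicas
  selector:
    matchLabels:
      app: web-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      terminationGracePeriodSeconds: 60 # time to drain in-flight work after SIGTERM
      containers:
        - name: app
          image: my-app:latest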

20. Version Compatibility Reference

Quick Reference Table

Feature Minimum Recommended GA Version
HPA v2 (multi-metric) v1.23 v1.30+ v1.23
VPA v1.9 v1.33+ External
In-Place Pod Resize v1.27 v1.35 v1.35
Cluster Autoscaler v1.8 v1.30+ External
Karpenter v1.21 v1.30+ External
KEDA v1.16 v1.30+ External
DRA Structured v1.32 v1.32+ Beta
HPA Configurable Tolerance v1.33 v1.34+ v1.35
VPA InPlaceOrRecreate v1.33 v1.35 v1.35

Managed Kubernetes Versions (January 2025)

Provider Supported Versions Default
AWS EKS 1.28 - 1.32 1.30
Azure AKS 1.28 - 1.32 1.30
GCP GKE 1.28 - 1.32 1.30

21. Choosing the Right Strategy

Decision Matrix by Workload Type

Workload Type Primary Strategy Secondary Strategy Why
Stateless Web App HPA Karpenter/CA Traffic-based scaling
REST API Service HPA + Custom Metrics KEDA Request patterns
Message Consumer KEDA Scale-to-zero Queue-based
Database VPA Operator-based Resource optimization
ML Training Gang Scheduling GPU Autoscaling Coordinated resources
ML Inference Queue-based HPA Spot instances Cost + throughput
Batch Jobs Kueue/YuniKorn Priority Classes Fair scheduling
Streaming Flink Autoscaler KEDA Kafka Lag-based
Unpredictable Traffic Virtual Nodes Karpenter Instant capacity
Global App Multi-cluster K8GB Geographic routing

Implementation Priority

  1. Start with HPA - Essential for any production workload
  2. Add VPA recommendations - Use Goldilocks to right-size
  3. Implement Cluster Autoscaler or Karpenter - Node-level elasticity
  4. Add KEDA for event-driven workloads - Queue-based scaling
  5. Consider cost optimization - Spot instances, scale-to-zero
  6. Advanced patterns - MPA, priority classes, service mesh

22. Comprehensive Strategy Selection Guide

By Traffic Pattern

Traffic Pattern Recommended Strategies Configuration Tips
Steady, predictable HPA + VPA (Off mode) Focus on right-sizing
Daily cycles Scheduled + HPA Pre-scale before peaks
Spiky, unpredictable HPA + Karpenter Aggressive scale-up
Event-driven bursts KEDA + Virtual Nodes Scale-to-zero capable
Seasonal (holidays) Scheduled + Burst Pre-provision + overflow

By Cost Sensitivity

Cost Priority Recommended Strategies Expected Savings
Maximum savings Spot + Scale-to-zero + VPA 50-80%
Balanced HPA + Karpenter + Right-sizing 30-50%
Performance-first HPA + Reserved capacity 10-20%
Development environments Scale-to-zero + Spot 70-90%

By Latency Requirements

Latency Requirement Recommended Strategies Trade-offs
Ultra-low (<10ms) Pre-scaled, no scale-to-zero Higher cost
Low (<100ms) HPA, warm pools Moderate cost
Moderate (<1s) Standard HPA + Karpenter Good balance
Tolerant (>1s) Scale-to-zero, Spot Maximum savings

By Team Expertise

Team Level Recommended Start Next Steps
Beginner HPA + Cluster Autoscaler Add VPA recommendations
Intermediate + KEDA, Karpenter Cost optimization
Advanced + MPA, Priority classes Multi-cluster
Expert Full stack optimization Custom operators

By Compliance Requirements

Requirement Recommended Strategies Considerations
Data residency Multi-cluster (regional) Cluster per region
Financial services Cluster Autoscaler (auditable) Avoid experimental features
Healthcare (HIPAA) Dedicated nodes, no Spot Premium for compliance
Startup/Agile Any Maximize innovation speed

Conclusion

Kubernetes scaling has evolved far beyond simple CPU-based autoscaling. Modern production environments require a multi-layered approach:

  • Pod Level: HPA for horizontal, VPA for vertical scaling
  • Node Level: Karpenter or Cluster Autoscaler for infrastructure
  • Event Level: KEDA for event-driven workloads
  • Cost Level: Spot instances, scale-to-zero, right-sizing

The key to success is understanding your workload characteristics and choosing the right combination of strategies. With Kubernetes v1.33 and later moving features like In-Place Pod Resize and configurable HPA tolerance from beta toward GA, 2025 offers more options than ever for building efficient, cost-effective, and responsive applications.

Key Takeaways

  1. No single strategy fits all - Combine approaches based on workload needs
  2. Start simple, iterate - Begin with HPA, add complexity as needed
  3. Measure before optimizing - Use Goldilocks/Kubecost for visibility
  4. Version matters - Newer K8s versions unlock powerful features
  5. Cost and performance balance - Define your priorities clearly

Last updated: January 2025
