
The Complete Kubernetes Autoscaling Guide: HPA, VPA, and KEDA in Practice

Kubernetes autoscaling is the key to maintaining performance while optimizing costs. However, with multiple tools available — HPA, VPA, KEDA, and Cluster Autoscaler — knowing when and how to combine them remains a challenge for many teams.

According to CNCF 2025 benchmarks, properly combining these tools can achieve 30-50% cost reduction compared to statically provisioned clusters. This guide deep-dives into each tool's strengths and provides practical combination patterns.

Kubo is a managed Kubernetes platform from ¥48,000/month (~$320/month) that simplifies autoscaling configuration and operations.

HPA (Horizontal Pod Autoscaler): The Scaling Foundation

How HPA Works

According to the Kubernetes official documentation, HPA automatically adjusts the number of pod replicas based on observed metrics. CloudPilot AI's comparison rates HPA as "best for stateless workloads where CPU or memory is the bottleneck."
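
The replica calculation itself is compact. The formula below is the one given in the Kubernetes documentation; the numbers in the worked line are illustrative only (4 replicas averaging 90% CPU against a 70% target). When several metrics are configured, HPA evaluates each one and uses the largest result, and by default it skips scaling when the ratio is within 10% of 1.0.

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
\[
\left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
\]
```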

Basic Configuration

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute stabilization
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60             # Max 10% reduction per minute
    scaleUp:
      stabilizationWindowSeconds: 0    # Immediate scale-up
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15             # Up to 100% increase in 15 seconds

Custom Metrics Scaling

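Beyond CPU and memory, HPA can scale on custom metrics sourced from Prometheus. This requires a custom metrics API adapter (for example, Prometheus Adapter) so that the metric is visible to the HPA: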

yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "1000"
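
For the HPA to see http_requests_per_second, something has to publish it through the custom metrics API. A minimal sketch of a Prometheus Adapter rule follows. It assumes Prometheus Adapter is installed and that the application exports a counter named http_requests_total with namespace and pod labels; both are assumptions, not details from this guide.

```yaml
# Fragment of the Prometheus Adapter configuration (config.yaml).
# Assumes a counter http_requests_total labelled with namespace and pod.
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"        # exposed as http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The 2-minute rate window is an arbitrary choice here; a shorter window reacts faster but produces a noisier signal.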

HPA Best Practices

  • Set minReplicas to at least 2 based on availability requirements
  • Configure behavior with scale-down stabilization to prevent flapping
  • CPU utilization threshold of 70-80% is standard (see Sedai's guide)
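
Utilization targets are percentages of each container's resource requests, so the thresholds above only mean something if requests are set. If a container has no CPU request, the HPA cannot compute CPU utilization for that pod. The fragment below is a sketch with illustrative values; the container name is hypothetical.

```yaml
# Fragment of the web-app Deployment pod template (values are illustrative).
spec:
  template:
    spec:
      containers:
      - name: web-app            # hypothetical container name
        resources:
          requests:
            cpu: "500m"          # 70% target: scale out above ~350m average usage
            memory: "512Mi"      # 80% target: scale out above ~410Mi average usage
```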

VPA (Vertical Pod Autoscaler): Right-Sizing Resources

How VPA Works

According to Kubeify's analysis, VPA automatically adjusts pod CPU and memory requests and limits based on actual usage. It is best suited for stateful workloads, monoliths, and batch jobs with stable, predictable usage patterns.

Three Operating Modes

| Mode | Behavior | Recommended Use |
| --- | --- | --- |
| Off | Generates recommendations without applying them | Analysis and planning |
| Initial | Applies recommendations only at pod creation | Gradual adoption |
| Auto | Automatically applies recommendations (pod restart) | Stable workloads |

Visualization with Goldilocks

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"     # Recommendation only
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "4Gi"

Use Goldilocks alongside VPA to view namespace-wide resource recommendations in a dashboard.
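
Goldilocks works per namespace: it watches namespaces that carry an opt-in label and creates recommendation-only VPA objects for the workloads inside them. A minimal sketch, assuming Goldilocks and the VPA recommender are already installed and that the namespace is the production namespace used earlier:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    goldilocks.fairwinds.com/enabled: "true"   # opt this namespace in to Goldilocks
```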

Critical VPA Warning

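VPA and HPA conflict: Do not use VPA in Auto mode alongside an HPA that scales on CPU or memory. VPA changes the resource requests that HPA's utilization percentages are computed against, so the two controllers end up working against each other. The correct combination is "VPA (Off/Initial mode) + HPA (custom metrics)" or "replace HPA with KEDA entirely."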
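
One way the "VPA (Initial) + HPA (custom metrics)" combination could look is sketched below. It reuses the api-server Deployment and the http_requests_per_second metric from earlier examples; the HPA name, replica bounds, and target value are illustrative assumptions.

```yaml
# VPA sizes pods only when they are created; it never evicts running pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Initial"
---
# HPA scales on request rate only, so it does not react to the
# CPU and memory requests that VPA adjusts.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa           # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
```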

For Captain.AI workloads on Kubo, AI worker resource requirements fluctuate dynamically, making continuous right-sizing with VPA recommendations essential.

KEDA: Event-Driven Autoscaling

What Makes KEDA Different

KEDA is a Kubernetes-based event-driven autoscaler that enables scaling based on external event sources beyond what HPA alone can handle. As of version 2.19, it provides 70+ built-in scalers.

The Killer Feature: Scale to Zero

According to Spectro Cloud's analysis, when no events are present, KEDA can scale deployments to zero pods — a massive cost saver for development environments and sporadic batch processing.

Message Queue-Based Scaling

Scaling based on RabbitMQ queue depth:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: queue-worker
  pollingInterval: 10           # Check every 10 seconds
  cooldownPeriod: 60            # 60-second cooldown
  minReplicaCount: 0            # Scale to Zero!
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq
    metadata:
      host: "amqp://rabbitmq.production.svc.cluster.local:5672"
      queueName: tasks
      mode: QueueLength         # Scale on backlog size (replaces the deprecated queueLength key)
      value: "5"                # Target of 5 queued messages per replica
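
The host above carries no credentials. In practice the connection string usually belongs in a Secret that the trigger references through a TriggerAuthentication. The sketch below assumes hypothetical names (rabbitmq-conn, rabbitmq-auth) and placeholder credentials.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-conn            # hypothetical name
  namespace: production
stringData:
  # Placeholder credentials; replace with real values.
  host: "amqp://user:password@rabbitmq.production.svc.cluster.local:5672/"
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth            # hypothetical name
  namespace: production
spec:
  secretTargetRef:
  - parameter: host              # fills the trigger's host parameter
    name: rabbitmq-conn
    key: host
# The trigger then drops host from its metadata and adds:
#   authenticationRef:
#     name: rabbitmq-auth
```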

Prometheus Metrics Scaling

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
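  - type: prometheus
    metricType: Value            # Compare the query result directly with the threshold instead of dividing it by the replica count
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      # metricName is omitted: recent KEDA releases generate the metric name automatically
      threshold: "0.5"           # Target P99 latency of 500 ms; replicas are added while the result stays above it
      query: |
        histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le))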
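
Latency is a noisy signal, so the scale-down stabilization recommended for HPA applies here as well. KEDA passes HPA behavior settings through the ScaledObject's advanced section. The fragment below is a sketch that reuses the values from the HPA example earlier in this guide; they are starting points, not tuned recommendations.

```yaml
# Fragment to merge into the api-scaler ScaledObject spec.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
          policies:
          - type: Percent
            value: 10
            periodSeconds: 60               # remove at most 10% of replicas per minute
```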

Cron-Based Scaling

For predictable traffic patterns:

yaml
triggers:
- type: cron
  metadata:
    timezone: Asia/Tokyo
    start: "0 8 * * 1-5"        # Scale up at 8 AM weekdays
    end: "0 20 * * 1-5"         # Scale down at 8 PM weekdays
    desiredReplicas: "10"
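
Placed in a full ScaledObject with minReplicaCount set to 0, the same trigger gives a development environment that runs only during office hours. This is a minimal sketch: the object name, namespace, and target Deployment are hypothetical, and it assumes that outside the window the cron trigger is inactive, so KEDA scales the workload down to minReplicaCount once cooldownPeriod has passed.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-office-hours         # hypothetical name
  namespace: development         # hypothetical namespace
spec:
  scaleTargetRef:
    name: web-app                # hypothetical Deployment in this namespace
  minReplicaCount: 0             # nothing runs outside the window
  maxReplicaCount: 10
  cooldownPeriod: 300            # wait 5 minutes after the window ends before scaling to zero
  triggers:
  - type: cron
    metadata:
      timezone: Asia/Tokyo
      start: "0 8 * * 1-5"       # window opens at 8 AM on weekdays
      end: "0 20 * * 1-5"        # window closes at 8 PM on weekdays
      desiredReplicas: "10"
```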

Combination Patterns and Best Practices

Based on DEV Community comparisons and Tasrie IT benchmarks:

| Workload Type | Recommended Setup |
| --- | --- |
| Web API (stateless) | HPA (CPU/custom metrics) + Cluster Autoscaler |
| Batch processing | KEDA (queue depth) + Scale to Zero |
| Database (stateful) | VPA (Auto) + manual node management |
| AI inference | KEDA (Prometheus metrics) + Karpenter |
| Development environments | KEDA (Cron) + Scale to Zero |

Karpenter Integration

Karpenter demonstrates 40% faster node provisioning than Cluster Autoscaler in CNCF 2025 benchmarks. Combining pod-level scaling (HPA/KEDA) with node-level scaling (Karpenter) delivers maximum efficiency.
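
On the node side, Karpenter is configured through NodePool resources. The sketch below assumes the Karpenter v1 API on AWS and an existing EC2NodeClass named default. This guide does not name a cloud provider, so treat the nodeClassRef, the capacity types, and the CPU limit as placeholders.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws        # AWS provider (assumption)
        kind: EC2NodeClass
        name: default                   # assumed to exist already
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
  limits:
    cpu: "200"                          # cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m                # repack nodes soon after pods scale in
```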

Quantifying Cost Impact

| Optimization Approach | Expected Reduction |
| --- | --- |
| HPA only | 10-20% |
| HPA + VPA (Off mode analysis) | 20-30% |
| KEDA (Scale to Zero) | 30-50% (non-production) |
| All tools integrated + Karpenter | 30-50% (including production) |
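
As a rough sanity check on the scale-to-zero row, consider a single workload that runs only in the weekday 8 AM to 8 PM window from the cron example. The arithmetic below counts pod-hours for that one workload. It is an upper bound for that workload, not a cluster-wide cost figure, since nodes, storage, and always-on services are not included.

```latex
\[
12\ \text{h/day} \times 5\ \text{days} = 60\ \text{h per week},
\qquad
1 - \frac{60}{168} \approx 0.64
\]
```

That is, the workload consumes about 64% fewer pod-hours than it would running around the clock.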

With Kubo, you get transparent pricing from ¥48,000/month to fully leverage these autoscaling capabilities.

Conclusion: Choose the Right Tools for Cost and Performance

Kubernetes autoscaling achieves maximum impact not through a single tool, but through the right combination of multiple tools:

  • HPA: The standard for CPU- and memory-based horizontal scaling. Best for stateless workloads
  • VPA: Resource right-sizing. Start with Off mode for analysis
  • KEDA: 70+ event source scalers and Scale to Zero capability
  • Karpenter / Cluster Autoscaler: Node-level auto-provisioning

Kubo provides a managed Kubernetes environment from ¥48,000/month where all these autoscaling tools are ready to use. With Captain.AI integration, auto-scaling AI workloads becomes seamless.

For autoscaling design and implementation consulting, contact us.
