Autoscaling: HPA and VPA Internals
Autoscaling in Kubernetes is a reactive control loop designed to reconcile the current workload demand with the desired performance state. It shifts the burden of resource management from manual operational intervention to automated policy enforcement.
1. HORIZONTAL POD AUTOSCALER (HPA)
HPA scales the number of replicas in a Deployment, StatefulSet, or ReplicaSet. It is the primary tool for handling variable traffic in stateless microservices.
1.1 The HPA Algorithm (Internal Logic)
The HPA controller operates on a simple ratio-based formula to calculate the desired number of replicas:
$$ \text{desiredReplicas} = \lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \rceil $$
Example:
- Current Replicas: 2
- Current CPU Usage: 160m
- Target CPU Usage: 100m
- Calculation: $\lceil 2 \times (160 / 100) \rceil = \lceil 3.2 \rceil = 4$ replicas.
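The formula can be sketched in a few lines of Python. This is an illustrative model of the controller's arithmetic, not its actual code; the function name is ours:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Core HPA ratio formula: ceil(currentReplicas * current / target)."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# Worked example from above: 2 replicas at 160m average CPU, 100m target.
print(desired_replicas(2, 160, 100))  # -> 4
```

Note that the ceiling function biases the controller toward over-provisioning: any fractional demand rounds up to a whole extra replica.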
1.2 Data Flow Architecture
The HPA doesn't "poll" pods directly. It relies on an aggregated metrics pipeline.
[ Pods ] --(cAdvisor)--> [ Kubelet ] --(scrape)--> [ Metrics Server ]
                                                           |
        [ HPA Controller ] <--(query every 15s)------------+
                 |
        [ Deployment/RS ] <--(update .spec.replicas)-------+
1.3 Production Manifest (autoscaling/v2)
Standardize on autoscaling/v2 (GA since Kubernetes 1.23), as it supports multiple metrics and scaling behaviors (stabilization windows).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi
  # ADVANCED: Behavior Control (Stabilization Windows)
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5m before scaling down (prevents flapping)
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
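The scale-down stabilization window can be modeled roughly as follows: the controller keeps the replica recommendations it computed over the window and acts on the highest one, so a transient dip cannot shrink the fleet. This is a simplified sketch; the function name and history format are ours, and the real controller also applies the policies list:

```python
def stabilized_scale_down(history, window_seconds, now):
    """Scale-down stabilization: act on the HIGHEST replica recommendation
    seen within the window. `history` is a list of
    (timestamp_seconds, recommended_replicas) pairs."""
    recent = [r for t, r in history if now - t <= window_seconds]
    return max(recent)

# Recommendations over the last ~3 minutes: load dipped briefly to 4 replicas.
history = [(0, 10), (60, 8), (120, 4), (180, 9)]
print(stabilized_scale_down(history, window_seconds=300, now=200))  # -> 10
```

With `stabilizationWindowSeconds: 300` as in the manifest above, a drop in demand must persist for a full five minutes before any replica is actually removed.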
2. VERTICAL POD AUTOSCALER (VPA)
VPA adjusts the Resource Requests and Limits of individual containers. It is essential for workloads that cannot be scaled horizontally (e.g., legacy databases, specific batch jobs) or for "right-sizing" microservices.
2.1 The Three Components of VPA
Unlike HPA, VPA is composed of three distinct binaries:
- Recommender: Analyzes historical metrics from the Metrics Server and suggests optimal CPU/RAM values.
- Updater: Watches the recommendations and "evicts" (kills) Pods whose resources differ significantly from the recommendation.
- Admission Controller: Intercepts Pod creation requests. If a VPA exists for that Pod, it mutates the YAML to inject the recommended requests/limits before the Pod is scheduled.
2.2 VPA Modes of Operation
| Mode | Description | Production Use Case |
|---|---|---|
| Off | Recommendations only. No changes made. | Standard for Prod. Use this to gather data for weeks before trusting VPA. |
| Initial | Sets resources only during Pod creation. | Useful for CI/CD pipelines to set baseline requests. |
| Recreate | Evicts and recreates Pods to apply changes. | Non-critical background workers. |
| Auto | Identical to Recreate. (Standard mode). | Development/Testing environments only. |
2.3 Production Manifest (Safe Mode)
Avoid using Auto in production unless the workload is highly available and can tolerate pod restarts. Use Off first to gather trustworthy recommendations before letting VPA act on them.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: postgres-db
  updatePolicy:
    updateMode: "Off"  # Recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 250m
        memory: 512Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
3. ARCHITECT'S WARNING: HPA vs VPA CONFLICTS
Never use HPA and VPA together on the same metric (CPU or Memory).
The "Flapping" Loop:
- Load increases. HPA sees high CPU and creates more Pods.
- VPA sees high CPU on individual pods and increases the CPU Requests per pod.
- HPA now sees lower CPU utilization (because utilization is usage divided by the request, and the request just increased) and deletes pods.
- VPA sees the total load is still high and increases requests further.
- Result: Your cluster enters an unstable state of constant restarts and scaling oscillations.
The Solution:
- Use HPA for CPU/Memory scaling.
- Use VPA only in Off mode for recommendations, OR use VPA for Memory and HPA for custom metrics (e.g., requests per second).
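The conflict is easiest to see numerically: HPA's Utilization metric is relative to the pod's request, and the request is exactly the value VPA mutates. A toy illustration:

```python
def utilization_pct(usage_m: float, request_m: float) -> float:
    """HPA's Utilization target: actual usage divided by the pod's request."""
    return 100 * usage_m / request_m

# Pod uses 160m CPU against a 100m request: HPA sees 160% and scales OUT.
print(utilization_pct(160, 100))  # -> 160.0
# VPA then raises the request to 400m. The same 160m of real usage now
# reads as 40%, so HPA sees "idle" pods and scales back IN: the flapping loop.
print(utilization_pct(160, 400))  # -> 40.0
```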
4. TROUBLESHOOTING & NINJA COMMANDS
4.1 Auditing HPA Decisions
When HPA fails to scale, it is usually a metric aggregation issue.
kubectl describe hpa billing-api-hpa
Look for:
- Conditions: AbleToScale (True/False), ScalingActive (True/False).
- If ScalingActive is False, check whether the Pods have resources.requests defined. HPA cannot scale pods without requests.
4.2 Interpreting VPA Recommendations
kubectl get vpa database-vpa -o yaml
- Target: The value VPA wants to set.
- Lower Bound: If the current request falls below this, the Updater considers a resize.
- Upper Bound: If the current request exceeds this, the Updater considers a resize.
- Uncapped Target: What VPA would recommend if maxAllowed weren't set. (Use this to spot pods being starved by your own max limits.)
4.3 Monitoring the Scaling Events
To see the history of when and why scaling happened:
kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
4.4 The HPA "Tolerance"
HPA has a default tolerance of 0.1 (10%). If the ratio of current/desired is between 0.9 and 1.1, the HPA controller will do nothing. This prevents "micro-scaling" for negligible metric fluctuations.
# Internal HPA controller flag (usually not changeable on managed k8s)
--horizontal-pod-autoscaler-tolerance=0.1
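The tolerance band can be modeled as a guard in front of the ratio formula. This is an illustrative sketch of the controller's logic, not a real API:

```python
import math

def hpa_decision(current_replicas: int,
                 current_metric: float,
                 target_metric: float,
                 tolerance: float = 0.1) -> int:
    """Apply the tolerance band before the ratio formula:
    if the current/target ratio is within 10% of 1.0, do nothing."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # inside the dead band: skip scaling
    return math.ceil(current_replicas * ratio)

print(hpa_decision(4, 105, 100))  # 1.05 -> within tolerance -> stays at 4
print(hpa_decision(4, 115, 100))  # 1.15 -> outside -> ceil(4.6) = 5
```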