Autoscaling: HPA and VPA Internals
Autoscaling in Kubernetes is a reactive control loop designed to reconcile the current workload demand with the desired performance state. It shifts the burden of resource management from manual operational intervention to automated policy enforcement.
1. HORIZONTAL POD AUTOSCALER (HPA)
HPA scales the number of replicas in a Deployment, StatefulSet, or ReplicaSet. It is the primary tool for handling variable traffic in stateless microservices.
1.1 The HPA Algorithm (Internal Logic)
The HPA controller operates on a simple ratio-based formula to calculate the desired number of replicas:
$$ \text{desiredReplicas} = \lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \rceil $$
Example:
- Current Replicas: 2
- Current CPU Usage: 160m
- Target CPU Usage: 100m
- Calculation: $\lceil 2 \times (160 / 100) \rceil = \lceil 3.2 \rceil = 4$ replicas.
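The formula can be sketched in a few lines of Python. This is an illustrative model of the controller's arithmetic, not its actual code; the function name is ours:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Core HPA ratio formula: ceil(currentReplicas * current / target)."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# Worked example from above: 2 replicas at 160m average CPU, 100m target.
print(desired_replicas(2, 160, 100))  # -> 4
```

Note that the ceiling function biases the controller toward over-provisioning: any fractional demand rounds up to a whole extra replica.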
1.2 Data Flow Architecture
The HPA doesn't "poll" pods directly. It relies on an aggregated metrics pipeline.
[ Pods ] --(cAdvisor)--> [ Kubelet ] --(scrape)--> [ Metrics Server ]
                                                           |
        [ HPA Controller ] <--(query every 15s)------------+
                 |
        [ Deployment/RS ] <--(update .spec.replicas)-------+
1.3 Production Manifest (autoscaling/v2)
Standardize on autoscaling/v2 (GA since Kubernetes 1.23), as it supports multiple metrics and scaling behaviors (stabilization windows).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi
  # ADVANCED: Behavior Control (Stabilization Windows)
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5m before scaling down (prevents flapping)
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
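The scale-down stabilization window can be modeled roughly as follows: the controller keeps the replica recommendations it computed over the window and acts on the highest one, so a transient dip cannot shrink the fleet. This is a simplified sketch; the function name and history format are ours, and the real controller also applies the policies list:

```python
def stabilized_scale_down(history, window_seconds, now):
    """Scale-down stabilization: act on the HIGHEST replica recommendation
    seen within the window. `history` is a list of
    (timestamp_seconds, recommended_replicas) pairs."""
    recent = [r for t, r in history if now - t <= window_seconds]
    return max(recent)

# Recommendations over the last ~3 minutes: load dipped briefly to 4 replicas.
history = [(0, 10), (60, 8), (120, 4), (180, 9)]
print(stabilized_scale_down(history, window_seconds=300, now=200))  # -> 10
```

With `stabilizationWindowSeconds: 300` as in the manifest above, a drop in demand must persist for a full five minutes before any replica is actually removed.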
2. VERTICAL POD AUTOSCALER (VPA)
VPA adjusts the Resource Requests and Limits of individual containers. It is essential for workloads that cannot be scaled horizontally (e.g., legacy databases, specific batch jobs) or for "right-sizing" microservices.
2.1 The Three Components of VPA
Unlike HPA, VPA is composed of three distinct binaries:
- Recommender: Analyzes historical metrics from the Metrics Server and suggests optimal CPU/RAM values.
- Updater: Watches the recommendations and "evicts" (kills) Pods whose resources differ significantly from the recommendation.
- Admission Controller: Intercepts Pod creation requests. If a VPA exists for that Pod, it mutates the YAML to inject the recommended requests/limits before the Pod is scheduled.
2.2 VPA Modes of Operation
| Mode | Description | Production Use Case |
|---|---|---|
| Off | Recommendations only. No changes made. | Standard for Prod. Use this to gather data for weeks before trusting VPA. |
| Initial | Sets resources only during Pod creation. | Useful for CI/CD pipelines to set baseline requests. |
| Recreate | Evicts and recreates Pods to apply changes. | Non-critical background workers. |
| Auto | Identical to Recreate. (Standard mode). | Development/Testing environments only. |
2.3 Production Manifest (Safe Mode)
Avoid using Auto in production unless the workload is highly available and can tolerate pod restarts. Use Off first to gather trustworthy recommendations before letting VPA act on them.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: postgres-db
  updatePolicy:
    updateMode: "Off"  # Recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 250m
        memory: 512Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
3. ARCHITECT'S WARNING: HPA vs VPA CONFLICTS
Never use HPA and VPA together on the same metric (CPU or Memory).
The "Flapping" Loop:
- Load increases. HPA sees high CPU and creates more Pods.
- VPA sees high CPU on individual pods and increases the CPU Requests per pod.
- HPA now sees lower CPU utilization (because utilization is usage divided by the request, and the request just increased) and deletes pods.
- VPA sees the total load is still high and increases requests further.
- Result: Your cluster enters an unstable state of constant restarts and scaling oscillations.
The Solution:
- Use HPA for CPU/Memory scaling.
- Use VPA only in Off mode for recommendations, OR use VPA for Memory and HPA for custom metrics (e.g., requests per second).
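The conflict is easiest to see numerically: HPA's Utilization metric is relative to the pod's request, and the request is exactly the value VPA mutates. A toy illustration:

```python
def utilization_pct(usage_m: float, request_m: float) -> float:
    """HPA's Utilization target: actual usage divided by the pod's request."""
    return 100 * usage_m / request_m

# Pod uses 160m CPU against a 100m request: HPA sees 160% and scales OUT.
print(utilization_pct(160, 100))  # -> 160.0
# VPA then raises the request to 400m. The same 160m of real usage now
# reads as 40%, so HPA sees "idle" pods and scales back IN: the flapping loop.
print(utilization_pct(160, 400))  # -> 40.0
```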
4. TROUBLESHOOTING & NINJA COMMANDS
4.1 Auditing HPA Decisions
When HPA fails to scale, it is usually a metric aggregation issue.
kubectl describe hpa billing-api-hpa
Look for:
- Conditions: AbleToScale (True/False), ScalingActive (True/False).
- If ScalingActive is False, check whether the Pods have resources.requests defined. HPA cannot scale pods without requests.
4.2 Interpreting VPA Recommendations
kubectl get vpa database-vpa -o yaml
- Target: The value VPA wants to set.
- Lower Bound: If the current request falls below this, the Updater considers a resize.
- Upper Bound: If the current request exceeds this, the Updater considers a resize.
- Uncapped Target: What VPA would recommend if maxAllowed weren't set. (Use this to spot pods being starved by your own max limits.)
4.3 Monitoring the Scaling Events
To see the history of when and why scaling happened:
kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
4.4 The HPA "Tolerance"
HPA has a default tolerance of 0.1 (10%). If the ratio of current/desired is between 0.9 and 1.1, the HPA controller will do nothing. This prevents "micro-scaling" for negligible metric fluctuations.
# Internal HPA controller flag (usually not changeable on managed k8s)
--horizontal-pod-autoscaler-tolerance=0.1
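The tolerance band can be modeled as a guard in front of the ratio formula. This is an illustrative sketch of the controller's logic, not a real API:

```python
import math

def hpa_decision(current_replicas: int,
                 current_metric: float,
                 target_metric: float,
                 tolerance: float = 0.1) -> int:
    """Apply the tolerance band before the ratio formula:
    if the current/target ratio is within 10% of 1.0, do nothing."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # inside the dead band: skip scaling
    return math.ceil(current_replicas * ratio)

print(hpa_decision(4, 105, 100))  # 1.05 -> within tolerance -> stays at 4
print(hpa_decision(4, 115, 100))  # 1.15 -> outside -> ceil(4.6) = 5
```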