Project Lab 15: Full-Stack Observability & Incident Response
Monitoring is the difference between knowing a system is down and knowing why it is down. This capstone lab demonstrates how to architect a telemetry pipeline that provides 360-degree visibility into your cluster, and how to use that data to resolve high-pressure production incidents.
Reference Material:
- docs/14-monitoring-logging-troubleshooting/1-observability-stack.md
- docs/14-monitoring-logging-troubleshooting/2-control-plane-troubleshooting.md
- docs/14-monitoring-logging-troubleshooting/3-data-plane-troubleshooting.md
1. OBJECTIVE: THE OBSERVABLE PLATFORM
The goal is to move beyond "Reactive" firefighting to "Proactive" observability:
- Metric Orchestration: Deploy a Prometheus ServiceMonitor to dynamically discover and scrape app metrics.
- Log Consolidation: Configure Fluent-Bit to tail container logs and enrich them with Kubernetes metadata (Namespace/Labels).
- The Incident: Diagnose a simulated "Microservice Latency Spike" using logs-to-metrics correlation.
- Forensics: Use nsenter and crictl to prove a kernel-level bottleneck.
2. PHASE 1: METRICS ORCHESTRATION (PROMETHEUS)
We assume the Prometheus Operator is installed. We need to tell Prometheus to scrape our "Payment API."
2.1 The ServiceMonitor (api-monitor.yaml)
This object instructs Prometheus to find any Service with the label tier: backend and scrape the /metrics endpoint every 15 seconds.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend-telemetry
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # Link to the Prometheus instance
spec:
  selector:
    matchLabels:
      tier: backend
  namespaceSelector:
    any: true # Scrape across all namespaces
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics
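For the selector above to match, the target Service must carry the tier: backend label and expose a port literally named http-metrics (the endpoint matches by port name, not number). A minimal sketch of such a Service — the name, namespace, and port number here are illustrative, not prescribed by the lab:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-api          # illustrative name
  namespace: finance         # illustrative namespace
  labels:
    tier: backend            # must match spec.selector.matchLabels above
spec:
  selector:
    app: payment-api
  ports:
    - name: http-metrics     # matched by NAME in the ServiceMonitor endpoint
      port: 8080
      targetPort: 8080
```

If the port is unnamed, the ServiceMonitor silently discovers nothing — a common reason targets never appear in Prometheus.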
3. PHASE 2: LOG ENRICHMENT (FLUENT-BIT)
We will configure the Fluent-Bit DaemonSet to ensure every log entry in the cluster is tagged with the Pod's real identity.
3.1 The Enrichment Pipeline (fluent-bit-config.yaml)
This configuration snippet ensures that when you see an error in Loki, you know exactly which Pod and Node it came from.
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Keep_Log            Off
    # Attach Pod labels and annotations to every record; pod_name,
    # pod_id, and namespace_name are added automatically by this filter
    Labels              On
    Annotations         On
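The filter above only fires for records tagged kube.*; those tags are produced by a tail input reading the container log files on each node. A minimal sketch of the companion [INPUT] section — the path and parser follow common Fluent-Bit defaults for containerd clusters and may differ in your DaemonSet:

```ini
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            cri
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
```

The kubernetes filter parses the Pod name and namespace out of the tailed file path embedded in the tag, then enriches the record via the API server using the mounted ServiceAccount credentials referenced above.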
4. PHASE 3: INCIDENT RESPONSE (THE SIMULATION)
Scenario: Users report that the checkout-api is intermittently slow. kubectl get pods shows all pods are Running.
4.1 Step 1: Triage via Grafana/Prometheus
You check your dashboard and notice:
- Metric: container_cpu_usage_seconds_total is flat.
- Metric: container_memory_working_set_bytes is creeping toward the Limit.
4.2 Step 2: Log Forensics (Loki)
You query Loki for logs in the finance namespace:
{namespace="finance", app="checkout"} |= "error"
Observation: You find java.lang.OutOfMemoryError: Java heap space.
The Discovery: The process is crashing, but the Kubelet restarts it so quickly that a point-in-time kubectl get pods still reads Running — only the climbing restart count betrays the crash loop.
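The logs-to-metrics correlation from the objective can be expressed directly in LogQL, turning this log stream into a rate you can overlay against the memory metric in Grafana (a sketch; labels mirror the query above):

```
sum(count_over_time({namespace="finance", app="checkout"} |= "error" [1m]))
```

A spike in this series that lines up with the working-set climb is strong evidence the errors and the memory pressure share a cause.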
4.3 Step 3: JSONPath Root Cause Analysis
Use the "Ninja" commands to see the restart counts and last termination reason across all replicas.
kubectl get pods -n finance -o jsonpath='{range .items[*]}{.metadata.name}{"\t restarts: "}{.status.containerStatuses[0].restartCount}{"\t reason: "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
Output:
checkout-api-v1-abc restarts: 45 reason: OOMKilled
checkout-api-v1-xyz restarts: 42 reason: OOMKilled
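If you prefer jq over JSONPath, the same report can be produced from kubectl get pods -n finance -o json. Here is a sketch, dry-run against a hand-written sample (shape matching kubectl's JSON) so the filter can be verified without a cluster:

```shell
# jq equivalent of the JSONPath report above, fed a sample document
# instead of live cluster output.
cat <<'EOF' | jq -r '.items[] | "\(.metadata.name)\trestarts: \(.status.containerStatuses[0].restartCount)\treason: \(.status.containerStatuses[0].lastState.terminated.reason)"'
{"items":[{"metadata":{"name":"checkout-api-v1-abc"},
  "status":{"containerStatuses":[{"restartCount":45,
    "lastState":{"terminated":{"reason":"OOMKilled"}}}]}}]}
EOF
# → checkout-api-v1-abc	restarts: 45	reason: OOMKilled
```

Swap the heredoc for `kubectl get pods -n finance -o json` to run it for real.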
5. PHASE 4: KERNEL-LEVEL VALIDATION (NSENTER)
To prove it isn't a CNI issue, we will enter the Pod's network namespace from the worker node.
- Find the Node and PID:
kubectl get pod checkout-api-v1-abc -o wide # Identify the node, e.g. worker-1
# On worker-1:
CONTAINER_ID=$(crictl ps --name checkout-api -q)
PID=$(crictl inspect $CONTAINER_ID | jq .info.pid)
- Execute Forensics:
# Enter the Pod's network and IPC namespaces from the host
nsenter -t $PID -n -i -- ip addr
nsenter -t $PID -n -i -- netstat -plnt
Architect's Proof: If netstat shows the application is not listening on port 8080 inside the namespace, the process has died before reaching the "Ready" state, confirming the OOM failure.
6. THE RECOVERY (THE FIX)
The fix requires a Vertical Scaling adjustment (Ref: Chapter 04).
- Update Manifest: Increase limits.memory from 512Mi to 1Gi.
- Apply: kubectl apply -f checkout-deploy.yaml
- Verify: Watch the restart count stop incrementing and the Ready status stay 1/1.
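The relevant portion of the Deployment manifest after the change might look like this (the container name and request values are illustrative; keep requests at or below limits):

```yaml
# checkout-deploy.yaml (excerpt)
containers:
  - name: checkout-api       # illustrative container name
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"        # raised from 512Mi
```

Raising only the limit buys headroom immediately; if the heap genuinely needs more, the request should eventually be revisited too so the scheduler reserves realistic capacity.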
7. TROUBLESHOOTING & NINJA COMMANDS
7.1 Auditing the Metrics Pipeline
If Prometheus isn't seeing your ServiceMonitor:
# Check for 'Configuration Error' in Prometheus Operator
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator
7.2 The "Silent Outage" Check
Find any Pod that has been OOMKilled in the last hour:
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[].lastState.terminated.reason == "OOMKilled") | .metadata.name'
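To sanity-check the jq filter itself without touching a cluster, feed it a hand-written sample; only the Pod whose last terminated state is OOMKilled should survive the select:

```shell
# Two sample pods: one OOMKilled, one healthy. The filter mirrors
# the cluster-wide command above.
cat <<'EOF' | jq -r '.items[] | select(.status.containerStatuses[].lastState.terminated.reason == "OOMKilled") | .metadata.name'
{"items":[
  {"metadata":{"name":"checkout-api-v1-abc"},
   "status":{"containerStatuses":[{"lastState":{"terminated":{"reason":"OOMKilled"}}}]}},
  {"metadata":{"name":"healthy-pod"},
   "status":{"containerStatuses":[{"lastState":{}}]}}
]}
EOF
# → checkout-api-v1-abc
```

Note that a Pod with multiple OOMKilled containers would be printed once per matching container; add `| unique` over a collected array if that matters for your report.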
8. ARCHITECT'S FINAL SUMMARY
- Metrics tell you IF, Logs tell you WHY: A dashboard shows the spike; Loki shows the stack trace.
- Enrichment is Key: Without the Fluent-Bit Kubernetes filter, your logs are just raw text strings without context.
- JSONPath is the CLI Powerhouse: Use it to find patterns across hundreds of pods that grep cannot easily catch.
- The Host is the Ground Truth: When Kubernetes abstractions fail, nsenter and crictl are the only tools that reveal the reality of the Linux Kernel.