Project Lab 15: Full-Stack Observability & Incident Response
Monitoring is the difference between knowing a system is down and knowing why it is down. This capstone lab demonstrates how to architect a telemetry pipeline that provides 360-degree visibility into your cluster, and how to use that data to resolve high-pressure production incidents.
Reference Material:
- docs/14-monitoring-logging-troubleshooting/1-observability-stack.md
- docs/14-monitoring-logging-troubleshooting/2-control-plane-troubleshooting.md
- docs/14-monitoring-logging-troubleshooting/3-data-plane-troubleshooting.md
1. OBJECTIVE: THE OBSERVABLE PLATFORM
The goal is to move beyond "Reactive" firefighting to "Proactive" observability:
- Metric Orchestration: Deploy a Prometheus ServiceMonitor to dynamically discover and scrape app metrics.
- Log Consolidation: Configure Fluent-Bit to tail container logs and enrich them with Kubernetes metadata (Namespace/Labels).
- The Incident: Diagnose a simulated "Microservice Latency Spike" using logs-to-metrics correlation.
- Forensics: Use nsenter and crictl to prove a kernel-level bottleneck.
2. PHASE 1: METRICS ORCHESTRATION (PROMETHEUS)
We assume the Prometheus Operator is installed. We need to tell Prometheus to scrape our "Payment API."
2.1 The ServiceMonitor (api-monitor.yaml)
This object instructs Prometheus to find any Service with the label tier: backend and scrape the /metrics endpoint every 15 seconds.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend-telemetry
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # Link to the Prometheus instance
spec:
  selector:
    matchLabels:
      tier: backend
  namespaceSelector:
    any: true # Scrape across all namespaces
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics
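For the selector above to match, the target Service must carry the tier: backend label and expose a port literally named http-metrics (the endpoint matches by port name, not number). A minimal sketch of such a Service — the name, namespace, and port number here are illustrative, not prescribed by the lab:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-api          # illustrative name
  namespace: finance         # illustrative namespace
  labels:
    tier: backend            # must match spec.selector.matchLabels above
spec:
  selector:
    app: payment-api
  ports:
    - name: http-metrics     # matched by NAME in the ServiceMonitor endpoint
      port: 8080
      targetPort: 8080
```

If the port is unnamed, the ServiceMonitor silently discovers nothing — a common reason targets never appear in Prometheus.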
3. PHASE 2: LOG ENRICHMENT (FLUENT-BIT)
We will configure the Fluent-Bit DaemonSet to ensure every log entry in the cluster is tagged with the Pod's real identity.
3.1 The Enrichment Pipeline (fluent-bit-config.yaml)
This configuration snippet ensures that when you see an error in Loki, you know exactly which Pod and Node it came from.
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Keep_Log            Off
    # Attach Pod labels and annotations to every record; pod_name,
    # pod_id, and namespace_name are added automatically by this filter
    Labels              On
    Annotations         On
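The filter above only fires for records tagged kube.*; those tags are produced by a tail input reading the container log files on each node. A minimal sketch of the companion [INPUT] section — the path and parser follow common Fluent-Bit defaults for containerd clusters and may differ in your DaemonSet:

```ini
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            cri
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
```

The kubernetes filter parses the Pod name and namespace out of the tailed file path embedded in the tag, then enriches the record via the API server using the mounted ServiceAccount credentials referenced above.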
4. PHASE 3: INCIDENT RESPONSE (THE SIMULATION)
Scenario: Users report that the checkout-api is intermittently slow. kubectl get pods shows all pods are Running.
4.1 Step 1: Triage via Grafana/Prometheus
You check your dashboard and notice:
- Metric: container_cpu_usage_seconds_total is flat.
- Metric: container_memory_working_set_bytes is creeping toward the Limit.
4.2 Step 2: Log Forensics (Loki)
You query Loki for logs in the finance namespace:
{namespace="finance", app="checkout"} |= "error"
Observation: You find java.lang.OutOfMemoryError: Java heap space.
The Discovery: The process is crashing, but the Kubelet restarts it so quickly that a point-in-time kubectl get pods still reads Running — only the climbing restart count betrays the crash loop.
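The logs-to-metrics correlation from the objective can be expressed directly in LogQL, turning this log stream into a rate you can overlay against the memory metric in Grafana (a sketch; labels mirror the query above):

```
sum(count_over_time({namespace="finance", app="checkout"} |= "error" [1m]))
```

A spike in this series that lines up with the working-set climb is strong evidence the errors and the memory pressure share a cause.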
4.3 Step 3: JSONPath Root Cause Analysis
Use the "Ninja" commands to see the restart counts and last termination reason across all replicas.
kubectl get pods -n finance -o jsonpath='{range .items[*]}{.metadata.name}{"\t restarts: "}{.status.containerStatuses[0].restartCount}{"\t reason: "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
Output:
checkout-api-v1-abc restarts: 45 reason: OOMKilled
checkout-api-v1-xyz restarts: 42 reason: OOMKilled
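If you prefer jq over JSONPath, the same report can be produced from kubectl get pods -n finance -o json. Here is a sketch, dry-run against a hand-written sample (shape matching kubectl's JSON) so the filter can be verified without a cluster:

```shell
# jq equivalent of the JSONPath report above, fed a sample document
# instead of live cluster output.
cat <<'EOF' | jq -r '.items[] | "\(.metadata.name)\trestarts: \(.status.containerStatuses[0].restartCount)\treason: \(.status.containerStatuses[0].lastState.terminated.reason)"'
{"items":[{"metadata":{"name":"checkout-api-v1-abc"},
  "status":{"containerStatuses":[{"restartCount":45,
    "lastState":{"terminated":{"reason":"OOMKilled"}}}]}}]}
EOF
# → checkout-api-v1-abc	restarts: 45	reason: OOMKilled
```

Swap the heredoc for `kubectl get pods -n finance -o json` to run it for real.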
5. PHASE 4: KERNEL-LEVEL VALIDATION (NSENTER)
To prove it isn't a CNI issue, we will enter the Pod's network namespace from the worker node.
- Find the Node and PID:
kubectl get pod checkout-api-v1-abc -o wide # Identify the node, e.g. worker-1
# On worker-1:
CONTAINER_ID=$(crictl ps --name checkout-api -q)
PID=$(crictl inspect $CONTAINER_ID | jq .info.pid)
- Execute Forensics:
# Enter the Pod's network and IPC namespaces from the host
nsenter -t $PID -n -i -- ip addr
nsenter -t $PID -n -i -- netstat -plnt
Architect's Proof: If netstat shows the application is not listening on port 8080 inside the namespace, the process has died before reaching the "Ready" state, confirming the OOM failure.
6. THE RECOVERY (THE FIX)
The fix requires a Vertical Scaling adjustment (Ref: Chapter 04).
- Update Manifest: Increase limits.memory from 512Mi to 1Gi.
- Apply: kubectl apply -f checkout-deploy.yaml
- Verify: Watch the restart count stop incrementing and the Ready status stay 1/1.
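The relevant portion of the Deployment manifest after the change might look like this (the container name and request values are illustrative; keep requests at or below limits):

```yaml
# checkout-deploy.yaml (excerpt)
containers:
  - name: checkout-api       # illustrative container name
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"        # raised from 512Mi
```

Raising only the limit buys headroom immediately; if the heap genuinely needs more, the request should eventually be revisited too so the scheduler reserves realistic capacity.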
7. TROUBLESHOOTING & NINJA COMMANDS
7.1 Auditing the Metrics Pipeline
If Prometheus isn't seeing your ServiceMonitor:
# Check for 'Configuration Error' in Prometheus Operator
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator
7.2 The "Silent Outage" Check
Find any Pod that has been OOMKilled in the last hour:
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[].lastState.terminated.reason == "OOMKilled") | .metadata.name'
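To sanity-check the jq filter itself without touching a cluster, feed it a hand-written sample; only the Pod whose last terminated state is OOMKilled should survive the select:

```shell
# Two sample pods: one OOMKilled, one healthy. The filter mirrors
# the cluster-wide command above.
cat <<'EOF' | jq -r '.items[] | select(.status.containerStatuses[].lastState.terminated.reason == "OOMKilled") | .metadata.name'
{"items":[
  {"metadata":{"name":"checkout-api-v1-abc"},
   "status":{"containerStatuses":[{"lastState":{"terminated":{"reason":"OOMKilled"}}}]}},
  {"metadata":{"name":"healthy-pod"},
   "status":{"containerStatuses":[{"lastState":{}}]}}
]}
EOF
# → checkout-api-v1-abc
```

Note that a Pod with multiple OOMKilled containers would be printed once per matching container; add `| unique` over a collected array if that matters for your report.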
8. ARCHITECT'S FINAL SUMMARY
- Metrics tell you IF, Logs tell you WHY: A dashboard shows the spike; Loki shows the stack trace.
- Enrichment is Key: Without the Fluent-Bit Kubernetes filter, your logs are just raw text strings without context.
- JSONPath is the CLI Powerhouse: Use it to find patterns across hundreds of pods that grep cannot easily catch.
- The Host is the Ground Truth: When Kubernetes abstractions fail, nsenter and crictl are the only tools that reveal the reality of the Linux Kernel.