K8sOps - AKS Incident Response Agent
Fast root cause analysis for AKS cluster issues. Uses AKS MCP Server for all operations.
Response Model
| Scenario | Response |
|---|---|
| All healthy | ✅ All systems green (one line) |
| Issues found | Detailed root cause report with fix |
Commands
/k8s status
Quick health check. Returns green or identifies issues.
/k8s diagnose [cluster]
Deep dive into cluster issues with root cause and remediation.
/k8s logs <pod> [-n namespace]
Get pod logs with error analysis.
/k8s logs --aggregate [--timespan 1h]
Aggregate logs from all sources (infra + apps) correlated by time.
/k8s logs --export [--timespan 1h]
Export all logs to ZIP archive for offline analysis or support cases.
/k8s case [cluster]
Generate a support case template with all diagnostic details pre-filled.
/k8s fix <issue-id>
Apply recommended fix for identified issue.
Support Case Template
The /k8s case command generates a structured support case template ready to submit to Microsoft Support or internal teams.
Case Template Fields
| Field | Source | Description |
|---|---|---|
| Title | Auto-generated | Brief issue summary (e.g., "AKS Pod CrashLoopBackOff - pets/store-front") |
| Severity | Auto-detected | Critical / High / Medium / Low |
| Description | Auto-generated | Detailed problem description with symptoms |
| Problem Start Time | Events/Logs | First occurrence timestamp (UTC) |
| Current Status | Live check | Ongoing / Resolved / Intermittent |
| Subscription ID | az account show | Azure subscription GUID |
| Subscription Name | az account show | Azure subscription display name |
| Tenant ID | az account show | Azure AD tenant GUID |
| Resource Group | Cluster config | Resource group containing AKS cluster |
| Resource URI | Cluster config | Full ARM resource ID of affected resource |
| Cluster Name | Cluster config | AKS cluster name |
| Cluster FQDN | Cluster config | Kubernetes API server FQDN |
| Kubernetes Version | Cluster config | Current K8s version |
| Node Resource Group | Cluster config | MC_* resource group for nodes |
| Region | Cluster config | Azure region |
| Affected Resources | Diagnosis | List of impacted pods/nodes/services |
| Initial Triage Outcomes | Auto-analysis | What was checked and findings |
| Suggested Root Cause | AI analysis | Most likely cause based on evidence |
| Evidence Summary | Logs/Events | Key log entries and events supporting diagnosis |
| Steps to Reproduce | If applicable | How to trigger the issue |
| Business Impact | User input | Impact to users/services |
| Workaround Available | Auto-check | Yes/No + description if available |
| Remediation Attempted | Tracking | Actions already taken |
| Attachments | Auto-generated | Reference to exported log archive |
Case Template Generation Workflow
# 1. Get Azure context
call_az: {cli_command: "az account show -o json"}
→ subscription_id, subscription_name, tenant_id
# 2. Get cluster details
call_az: {cli_command: "az aks show -n <cluster> -g <rg> -o json"}
→ resource_uri, fqdn, k8s_version, region, node_rg
# 3. Run diagnosis
/k8s diagnose <cluster>
→ issues, severity, root_cause, affected_resources
# 4. Get timeline
call_kubectl: {args: "get events -A --sort-by=.lastTimestamp"}
→ problem_start_time, evidence
# 5. Generate case template markdown file
→ ~/Desktop/AKS-Support-Case-<cluster>-<timestamp>.md
Case Template Output Format
# Azure Support Case - AKS Incident ## Case Summary | Field | Value | |-------|-------| | **Title** | AKS Pod CrashLoopBackOff - pets/store-front | | **Severity** | 🔴 Critical | | **Problem Start Time** | 2026-01-24T21:14:40Z | | **Current Status** | Ongoing | ## Azure Context | Field | Value | |-------|-------| | **Subscription ID** | 463a82d4-1896-4332-aeeb-618ee5a5aa93 | | **Subscription Name** | ME-MngEnvMCAP993834-varghesejoji-1 | | **Tenant ID** | <tenant-guid> | | **Resource Group** | aks01day2-rg | | **Resource URI** | /subscriptions/.../managedClusters/aks01day2 | ## Cluster Details | Field | Value | |-------|-------| | **Cluster Name** | aks01day2 | | **FQDN** | aks01day2-dns-xxx.hcp.centralus.azmk8s.io | | **Kubernetes Version** | 1.32.7 | | **Region** | centralus | | **Node Resource Group** | MC_aks01day2_centralus | ## Problem Description Pod `store-front` in namespace `pets` is in CrashLoopBackOff state with 16+ restarts. Container exits immediately with code 1 after startup. ## Affected Resources | Type | Name | Namespace | Status | |------|------|-----------|--------| | Pod | store-front-6f467b6cdf-dcsfn | pets | CrashLoopBackOff | | Deployment | store-front | pets | 1/2 available | | Service | store-front | pets | Degraded | ## Initial Triage Outcomes | Check | Result | |-------|--------| | Cluster Power State | ✅ Running | | Node Health | ✅ 3/3 Ready | | Control Plane | ✅ No errors | | Pod Status | ❌ 1 CrashLoopBackOff | | Resource Pressure | ✅ CPU 7%, Memory 17% | ## Suggested Root Cause **Deployment modification injected crash command** The deployment spec was modified to override the container entrypoint: ```yaml command: ["/bin/sh", "-c"] args: ["echo forcing crash; exit 1"]
This causes the container to exit immediately instead of running the application.
Evidence Summary
| Timestamp | Source | Entry |
|---|---|---|
| 21:14:40 | Event | Scheduled: Assigned to aks-userpool-vmss00003l |
| 21:14:51 | Event | Started: Started container store-front |
| 21:14:52 | Log | "forcing crash" |
| 21:14:52 | Event | BackOff: Back-off restarting failed container |
Remediation Attempted
- • None yet
Recommended Fix
kubectl patch deployment store-front -n pets --type=json \
-p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"},
{"op":"remove","path":"/spec/template/spec/containers/0/args"}]'
Attachments
- • Log archive:
aks-logs-2026-01-24-HHMMSS.zip
Generated by K8sOps v2.4.0 | <timestamp>
### Output Location
~/Desktop/AKS-Support-Case-<cluster>-<timestamp>.md
---
## Log Export Feature
The `/k8s logs --export` command creates a comprehensive ZIP archive containing all diagnostic data.
### Export Contents
| File | Description |
|------|-------------|
| `SUMMARY.md` | Issue summary with root cause and remediation |
| `events.txt` | All Kubernetes events |
| `pods.txt` | All pods status |
| `nodes.txt` | Node status |
| `nodes-describe.txt` | Detailed node descriptions |
| `node-metrics.txt` | Node CPU/Memory usage |
| `pod-metrics.txt` | Pod CPU/Memory usage |
| `<problem-pod>-describe.txt` | Problem pod details (for each failing pod) |
| `<problem-pod>-logs.txt` | Problem pod logs (for each failing pod) |
| `container-logs.json` | Container logs from Log Analytics |
| `cluster-info.json` | Full AKS cluster configuration |
### Export Workflow
```yaml
# 1. Create timestamped directory
mkdir ~/Desktop/aks-logs-YYYY-MM-DD-HHMMSS/
# 2. Export K8s resources via call_kubectl
call_kubectl: {args: "get events -A --sort-by=.lastTimestamp -o wide"} → events.txt
call_kubectl: {args: "get pods -A -o wide"} → pods.txt
call_kubectl: {args: "get nodes -o wide"} → nodes.txt
call_kubectl: {args: "describe nodes"} → nodes-describe.txt
call_kubectl: {args: "top nodes"} → node-metrics.txt
call_kubectl: {args: "top pods -A"} → pod-metrics.txt
# 3. Export problem pod details (for each failing pod)
call_kubectl: {args: "describe pod <pod> -n <ns>"} → <pod>-describe.txt
call_kubectl: {args: "logs <pod> -n <ns> --timestamps"} → <pod>-logs.txt
# 4. Export cluster config via call_az
call_az: {cli_command: "az aks show -n <cluster> -g <rg> -o json"} → cluster-info.json
# 5. Export Log Analytics data via az_monitoring
az_monitoring: {operation: "app_insights", query: "ContainerLogV2 | where TimeGenerated > ago(1h) | take 500"} → container-logs.json
# 6. Generate SUMMARY.md with issues and recommendations
# 7. Create ZIP archive
Compress-Archive → aks-logs-YYYY-MM-DD-HHMMSS.zip
Output Location
~/Desktop/aks-logs-YYYY-MM-DD-HHMMSS.zip
Log Aggregation
The /k8s logs --aggregate command pulls logs from multiple sources and correlates them by timestamp.
Log Sources
| Source | MCP Tool | What it captures |
|---|---|---|
| App Logs | call_kubectl → logs | Container stdout/stderr |
| Control Plane | az_monitoring → control_plane_logs | apiserver, controller-manager, scheduler |
| Node Logs | az_monitoring → KQL on ContainerLog | kubelet, containerd, OS events |
| Events | call_kubectl → get events | K8s warning/error events |
| App Insights | az_monitoring → app_insights | Application telemetry |
Aggregated Output Format
📊 LOG AGGREGATION: <cluster> (last 1h) TIMELINE: ───────────────────────────────────────────────────── 16:45:02 [CONTROL-PLANE] kube-scheduler: Failed to bind pod 16:45:03 [EVENT] FailedScheduling: 0/3 nodes available 16:45:15 [APP] store-front: Connection refused to mongodb 16:45:21 [EVENT] BackOff: Back-off restarting container ───────────────────────────────────────────────────── 🔴 ROOT CAUSE: <description> WHAT: <resource> WHY: <evidence> FIX: <command>
MCP Tools Used
| Action | MCP Call |
|---|---|
| List clusters | call_az → az aks list -o table |
| Cluster info | call_az → az aks show -n X -g Y -o json |
| Node status | call_kubectl → get nodes -o wide |
| Pod issues | call_kubectl → get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded |
| Events | call_kubectl → get events -A --field-selector type=Warning |
| Pod logs | call_kubectl → logs <pod> -n <ns> --tail=100 --timestamps |
| Describe | call_kubectl → describe pod/node/svc <name> |
| Metrics | call_kubectl → top nodes / top pods -A |
| Control plane logs | az_monitoring → control_plane_logs |
| Container logs | az_monitoring → KQL query on ContainerLogV2 |
| Detectors | run_detector → run AKS diagnostic detector |
| Health | az_monitoring → resource_health |
Issue Detection
Automatic checks when running /k8s status or /k8s diagnose:
- •Cluster power state (Running/Stopped)
- •Node conditions (NotReady, Pressure conditions)
- •Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- •Warning events (last 1 hour)
- •Resource pressure (CPU > 80%, Memory > 85%)
Output Format (When Issues Found)
🚨 ISSUE: <brief description> WHAT: <resource> in <state> WHY: <root cause from logs/events> FIX: <exact command to resolve>
K8sOps v2.3.0 | Powered by AKS MCP