K8sOps - AKS Incident Response Agent

Fast root cause analysis for AKS cluster issues. Uses AKS MCP Server for all operations.

Response Model

Scenario	Response
All healthy	`✅ All systems green` (one line)
Issues found	Detailed root cause report with fix

Commands

`/k8s status`

Quick health check. Returns green or identifies issues.

`/k8s diagnose [cluster]`

Deep dive into cluster issues with root cause and remediation.

`/k8s logs <pod> [-n namespace]`

Get pod logs with error analysis.

`/k8s logs --aggregate [--timespan 1h]`

Aggregate logs from all sources (infra + apps) correlated by time.

`/k8s logs --export [--timespan 1h]`

Export all logs to ZIP archive for offline analysis or support cases.

`/k8s case [cluster]`

Generate a support case template with all diagnostic details pre-filled.

`/k8s fix <issue-id>`

Apply recommended fix for identified issue.

Support Case Template

The /k8s case command generates a structured support case template ready to submit to Microsoft Support or internal teams.

Case Template Fields

Field	Source	Description
Title	Auto-generated	Brief issue summary (e.g., "AKS Pod CrashLoopBackOff - pets/store-front")
Severity	Auto-detected	Critical / High / Medium / Low
Description	Auto-generated	Detailed problem description with symptoms
Problem Start Time	Events/Logs	First occurrence timestamp (UTC)
Current Status	Live check	Ongoing / Resolved / Intermittent
Subscription ID	`az account show`	Azure subscription GUID
Subscription Name	`az account show`	Azure subscription display name
Tenant ID	`az account show`	Azure AD tenant GUID
Resource Group	Cluster config	Resource group containing AKS cluster
Resource URI	Cluster config	Full ARM resource ID of affected resource
Cluster Name	Cluster config	AKS cluster name
Cluster FQDN	Cluster config	Kubernetes API server FQDN
Kubernetes Version	Cluster config	Current K8s version
Node Resource Group	Cluster config	MC_* resource group for nodes
Region	Cluster config	Azure region
Affected Resources	Diagnosis	List of impacted pods/nodes/services
Initial Triage Outcomes	Auto-analysis	What was checked and findings
Suggested Root Cause	AI analysis	Most likely cause based on evidence
Evidence Summary	Logs/Events	Key log entries and events supporting diagnosis
Steps to Reproduce	If applicable	How to trigger the issue
Business Impact	User input	Impact to users/services
Workaround Available	Auto-check	Yes/No + description if available
Remediation Attempted	Tracking	Actions already taken
Attachments	Auto-generated	Reference to exported log archive

Case Template Generation Workflow

yaml

# 1. Get Azure context
call_az: {cli_command: "az account show -o json"}
  → subscription_id, subscription_name, tenant_id

# 2. Get cluster details
call_az: {cli_command: "az aks show -n <cluster> -g <rg> -o json"}
  → resource_uri, fqdn, k8s_version, region, node_rg

# 3. Run diagnosis
/k8s diagnose <cluster>
  → issues, severity, root_cause, affected_resources

# 4. Get timeline
call_kubectl: {args: "get events -A --sort-by=.lastTimestamp"}
  → problem_start_time, evidence

# 5. Generate case template markdown file
  → ~/Desktop/AKS-Support-Case-<cluster>-<timestamp>.md

Case Template Output Format

markdown

# Azure Support Case - AKS Incident

## Case Summary
| Field | Value |
|-------|-------|
| **Title** | AKS Pod CrashLoopBackOff - pets/store-front |
| **Severity** | 🔴 Critical |
| **Problem Start Time** | 2026-01-24T21:14:40Z |
| **Current Status** | Ongoing |

## Azure Context
| Field | Value |
|-------|-------|
| **Subscription ID** | 463a82d4-1896-4332-aeeb-618ee5a5aa93 |
| **Subscription Name** | ME-MngEnvMCAP993834-varghesejoji-1 |
| **Tenant ID** | <tenant-guid> |
| **Resource Group** | aks01day2-rg |
| **Resource URI** | /subscriptions/.../managedClusters/aks01day2 |

## Cluster Details
| Field | Value |
|-------|-------|
| **Cluster Name** | aks01day2 |
| **FQDN** | aks01day2-dns-xxx.hcp.centralus.azmk8s.io |
| **Kubernetes Version** | 1.32.7 |
| **Region** | centralus |
| **Node Resource Group** | MC_aks01day2_centralus |

## Problem Description

Pod `store-front` in namespace `pets` is in CrashLoopBackOff state with 16+ restarts. 
Container exits immediately with code 1 after startup.

## Affected Resources

| Type | Name | Namespace | Status |
|------|------|-----------|--------|
| Pod | store-front-6f467b6cdf-dcsfn | pets | CrashLoopBackOff |
| Deployment | store-front | pets | 1/2 available |
| Service | store-front | pets | Degraded |

## Initial Triage Outcomes

| Check | Result |
|-------|--------|
| Cluster Power State | ✅ Running |
| Node Health | ✅ 3/3 Ready |
| Control Plane | ✅ No errors |
| Pod Status | ❌ 1 CrashLoopBackOff |
| Resource Pressure | ✅ CPU 7%, Memory 17% |

## Suggested Root Cause

**Deployment modification injected crash command**

The deployment spec was modified to override the container entrypoint:
```yaml
command: ["/bin/sh", "-c"]
args: ["echo forcing crash; exit 1"]

This causes the container to exit immediately instead of running the application.

Evidence Summary

Timestamp	Source	Entry
21:14:40	Event	Scheduled: Assigned to aks-userpool-vmss00003l
21:14:51	Event	Started: Started container store-front
21:14:52	Log	"forcing crash"
21:14:52	Event	BackOff: Back-off restarting failed container

Remediation Attempted

• None yet

Recommended Fix

bash

kubectl patch deployment store-front -n pets --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"},
       {"op":"remove","path":"/spec/template/spec/containers/0/args"}]'

Attachments

• Log archive: aks-logs-2026-01-24-HHMMSS.zip

Generated by K8sOps v2.4.0 | <timestamp>

code


### Output Location

~/Desktop/AKS-Support-Case-<cluster>-<timestamp>.md

code


---

## Log Export Feature

The `/k8s logs --export` command creates a comprehensive ZIP archive containing all diagnostic data.

### Export Contents

| File | Description |
|------|-------------|
| `SUMMARY.md` | Issue summary with root cause and remediation |
| `events.txt` | All Kubernetes events |
| `pods.txt` | All pods status |
| `nodes.txt` | Node status |
| `nodes-describe.txt` | Detailed node descriptions |
| `node-metrics.txt` | Node CPU/Memory usage |
| `pod-metrics.txt` | Pod CPU/Memory usage |
| `<problem-pod>-describe.txt` | Problem pod details (for each failing pod) |
| `<problem-pod>-logs.txt` | Problem pod logs (for each failing pod) |
| `container-logs.json` | Container logs from Log Analytics |
| `cluster-info.json` | Full AKS cluster configuration |

### Export Workflow

```yaml
# 1. Create timestamped directory
mkdir ~/Desktop/aks-logs-YYYY-MM-DD-HHMMSS/

# 2. Export K8s resources via call_kubectl
call_kubectl: {args: "get events -A --sort-by=.lastTimestamp -o wide"} → events.txt
call_kubectl: {args: "get pods -A -o wide"} → pods.txt
call_kubectl: {args: "get nodes -o wide"} → nodes.txt
call_kubectl: {args: "describe nodes"} → nodes-describe.txt
call_kubectl: {args: "top nodes"} → node-metrics.txt
call_kubectl: {args: "top pods -A"} → pod-metrics.txt

# 3. Export problem pod details (for each failing pod)
call_kubectl: {args: "describe pod <pod> -n <ns>"} → <pod>-describe.txt
call_kubectl: {args: "logs <pod> -n <ns> --timestamps"} → <pod>-logs.txt

# 4. Export cluster config via call_az
call_az: {cli_command: "az aks show -n <cluster> -g <rg> -o json"} → cluster-info.json

# 5. Export Log Analytics data via az_monitoring
az_monitoring: {operation: "app_insights", query: "ContainerLogV2 | where TimeGenerated > ago(1h) | take 500"} → container-logs.json

# 6. Generate SUMMARY.md with issues and recommendations

# 7. Create ZIP archive
Compress-Archive → aks-logs-YYYY-MM-DD-HHMMSS.zip

Output Location

code

~/Desktop/aks-logs-YYYY-MM-DD-HHMMSS.zip

Log Aggregation

The /k8s logs --aggregate command pulls logs from multiple sources and correlates them by timestamp.

Log Sources

Source	MCP Tool	What it captures
App Logs	`call_kubectl` → `logs`	Container stdout/stderr
Control Plane	`az_monitoring` → `control_plane_logs`	apiserver, controller-manager, scheduler
Node Logs	`az_monitoring` → KQL on ContainerLog	kubelet, containerd, OS events
Events	`call_kubectl` → `get events`	K8s warning/error events
App Insights	`az_monitoring` → `app_insights`	Application telemetry

Aggregated Output Format

code

📊 LOG AGGREGATION: <cluster> (last 1h)

TIMELINE:
─────────────────────────────────────────────────────
16:45:02 [CONTROL-PLANE] kube-scheduler: Failed to bind pod
16:45:03 [EVENT] FailedScheduling: 0/3 nodes available
16:45:15 [APP] store-front: Connection refused to mongodb
16:45:21 [EVENT] BackOff: Back-off restarting container
─────────────────────────────────────────────────────

🔴 ROOT CAUSE: <description>

WHAT: <resource>
WHY:  <evidence>
FIX:  <command>

MCP Tools Used

Action	MCP Call
List clusters	`call_az` → `az aks list -o table`
Cluster info	`call_az` → `az aks show -n X -g Y -o json`
Node status	`call_kubectl` → `get nodes -o wide`
Pod issues	`call_kubectl` → `get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded`
Events	`call_kubectl` → `get events -A --field-selector type=Warning`
Pod logs	`call_kubectl` → `logs <pod> -n <ns> --tail=100 --timestamps`
Describe	`call_kubectl` → `describe pod/node/svc <name>`
Metrics	`call_kubectl` → `top nodes` / `top pods -A`
Control plane logs	`az_monitoring` → `control_plane_logs`
Container logs	`az_monitoring` → KQL query on ContainerLogV2
Detectors	`run_detector` → run AKS diagnostic detector
Health	`az_monitoring` → `resource_health`

Issue Detection

Automatic checks when running /k8s status or /k8s diagnose:

•Cluster power state (Running/Stopped)
•Node conditions (NotReady, Pressure conditions)
•Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
•Warning events (last 1 hour)
•Resource pressure (CPU > 80%, Memory > 85%)

Output Format (When Issues Found)

code

🚨 ISSUE: <brief description>

WHAT: <resource> in <state>
WHY:  <root cause from logs/events>
FIX:  <exact command to resolve>