AgentSkillsCN

K8sOps

AKS incident response agent. Quickly identifies and diagnoses all kinds of issues in Azure Kubernetes Service clusters.

SKILL.md
--- frontmatter
name: K8sOps
description: AKS incident response agent. Quickly identify and diagnose issues in Azure Kubernetes Service clusters.
version: 2.4.0
mcp_servers:
  - aks-mcp

K8sOps - AKS Incident Response Agent

Fast root cause analysis for AKS cluster issues. Uses AKS MCP Server for all operations.

Response Model

| Scenario | Response |
|----------|----------|
| All healthy | ✅ All systems green (one line) |
| Issues found | Detailed root cause report with fix |

Commands

/k8s status

Quick health check. Returns green or identifies issues.

/k8s diagnose [cluster]

Deep dive into cluster issues with root cause and remediation.

/k8s logs <pod> [-n namespace]

Get pod logs with error analysis.

/k8s logs --aggregate [--timespan 1h]

Aggregate logs from all sources (infra + apps) correlated by time.

/k8s logs --export [--timespan 1h]

Export all logs to ZIP archive for offline analysis or support cases.

/k8s case [cluster]

Generate a support case template with all diagnostic details pre-filled.

/k8s fix <issue-id>

Apply recommended fix for identified issue.


Support Case Template

The /k8s case command generates a structured support case template ready to submit to Microsoft Support or internal teams.

Case Template Fields

| Field | Source | Description |
|-------|--------|-------------|
| Title | Auto-generated | Brief issue summary (e.g., "AKS Pod CrashLoopBackOff - pets/store-front") |
| Severity | Auto-detected | Critical / High / Medium / Low |
| Description | Auto-generated | Detailed problem description with symptoms |
| Problem Start Time | Events/Logs | First occurrence timestamp (UTC) |
| Current Status | Live check | Ongoing / Resolved / Intermittent |
| Subscription ID | `az account show` | Azure subscription GUID |
| Subscription Name | `az account show` | Azure subscription display name |
| Tenant ID | `az account show` | Azure AD tenant GUID |
| Resource Group | Cluster config | Resource group containing AKS cluster |
| Resource URI | Cluster config | Full ARM resource ID of affected resource |
| Cluster Name | Cluster config | AKS cluster name |
| Cluster FQDN | Cluster config | Kubernetes API server FQDN |
| Kubernetes Version | Cluster config | Current K8s version |
| Node Resource Group | Cluster config | `MC_*` resource group for nodes |
| Region | Cluster config | Azure region |
| Affected Resources | Diagnosis | List of impacted pods/nodes/services |
| Initial Triage Outcomes | Auto-analysis | What was checked and findings |
| Suggested Root Cause | AI analysis | Most likely cause based on evidence |
| Evidence Summary | Logs/Events | Key log entries and events supporting diagnosis |
| Steps to Reproduce | If applicable | How to trigger the issue |
| Business Impact | User input | Impact to users/services |
| Workaround Available | Auto-check | Yes/No + description if available |
| Remediation Attempted | Tracking | Actions already taken |
| Attachments | Auto-generated | Reference to exported log archive |

Case Template Generation Workflow

```yaml
# 1. Get Azure context
call_az: {cli_command: "az account show -o json"}
  → subscription_id, subscription_name, tenant_id

# 2. Get cluster details
call_az: {cli_command: "az aks show -n <cluster> -g <rg> -o json"}
  → resource_uri, fqdn, k8s_version, region, node_rg

# 3. Run diagnosis
/k8s diagnose <cluster>
  → issues, severity, root_cause, affected_resources

# 4. Get timeline
call_kubectl: {args: "get events -A --sort-by=.lastTimestamp"}
  → problem_start_time, evidence

# 5. Generate case template markdown file
  → ~/Desktop/AKS-Support-Case-<cluster>-<timestamp>.md
```

Case Template Output Format

# Azure Support Case - AKS Incident

## Case Summary
| Field | Value |
|-------|-------|
| **Title** | AKS Pod CrashLoopBackOff - pets/store-front |
| **Severity** | 🔴 Critical |
| **Problem Start Time** | 2026-01-24T21:14:40Z |
| **Current Status** | Ongoing |

## Azure Context
| Field | Value |
|-------|-------|
| **Subscription ID** | 463a82d4-1896-4332-aeeb-618ee5a5aa93 |
| **Subscription Name** | ME-MngEnvMCAP993834-varghesejoji-1 |
| **Tenant ID** | <tenant-guid> |
| **Resource Group** | aks01day2-rg |
| **Resource URI** | /subscriptions/.../managedClusters/aks01day2 |

## Cluster Details
| Field | Value |
|-------|-------|
| **Cluster Name** | aks01day2 |
| **FQDN** | aks01day2-dns-xxx.hcp.centralus.azmk8s.io |
| **Kubernetes Version** | 1.32.7 |
| **Region** | centralus |
| **Node Resource Group** | MC_aks01day2_centralus |

## Problem Description

Pod `store-front` in namespace `pets` is in CrashLoopBackOff state with 16+ restarts. 
Container exits immediately with code 1 after startup.

## Affected Resources

| Type | Name | Namespace | Status |
|------|------|-----------|--------|
| Pod | store-front-6f467b6cdf-dcsfn | pets | CrashLoopBackOff |
| Deployment | store-front | pets | 1/2 available |
| Service | store-front | pets | Degraded |

## Initial Triage Outcomes

| Check | Result |
|-------|--------|
| Cluster Power State | ✅ Running |
| Node Health | ✅ 3/3 Ready |
| Control Plane | ✅ No errors |
| Pod Status | ❌ 1 CrashLoopBackOff |
| Resource Pressure | ✅ CPU 7%, Memory 17% |

## Suggested Root Cause

**Deployment modification injected crash command**

The deployment spec was modified to override the container entrypoint:
```yaml
command: ["/bin/sh", "-c"]
args: ["echo forcing crash; exit 1"]
```

This causes the container to exit immediately instead of running the application.

## Evidence Summary

| Timestamp | Source | Entry |
|-----------|--------|-------|
| 21:14:40 | Event | Scheduled: Assigned to aks-userpool-vmss00003l |
| 21:14:51 | Event | Started: Started container store-front |
| 21:14:52 | Log | "forcing crash" |
| 21:14:52 | Event | BackOff: Back-off restarting failed container |

## Remediation Attempted

- None yet

## Recommended Fix

```bash
kubectl patch deployment store-front -n pets --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"},
       {"op":"remove","path":"/spec/template/spec/containers/0/args"}]'
```

## Attachments

- Log archive: aks-logs-2026-01-24-HHMMSS.zip

Generated by K8sOps v2.4.0 | <timestamp>

### Output Location

```
~/Desktop/AKS-Support-Case-<cluster>-<timestamp>.md
```

---

## Log Export Feature

The `/k8s logs --export` command creates a comprehensive ZIP archive containing all diagnostic data.

### Export Contents

| File | Description |
|------|-------------|
| `SUMMARY.md` | Issue summary with root cause and remediation |
| `events.txt` | All Kubernetes events |
| `pods.txt` | All pods status |
| `nodes.txt` | Node status |
| `nodes-describe.txt` | Detailed node descriptions |
| `node-metrics.txt` | Node CPU/Memory usage |
| `pod-metrics.txt` | Pod CPU/Memory usage |
| `<problem-pod>-describe.txt` | Problem pod details (for each failing pod) |
| `<problem-pod>-logs.txt` | Problem pod logs (for each failing pod) |
| `container-logs.json` | Container logs from Log Analytics |
| `cluster-info.json` | Full AKS cluster configuration |

### Export Workflow

```yaml
# 1. Create timestamped directory
mkdir ~/Desktop/aks-logs-YYYY-MM-DD-HHMMSS/

# 2. Export K8s resources via call_kubectl
call_kubectl: {args: "get events -A --sort-by=.lastTimestamp -o wide"} → events.txt
call_kubectl: {args: "get pods -A -o wide"} → pods.txt
call_kubectl: {args: "get nodes -o wide"} → nodes.txt
call_kubectl: {args: "describe nodes"} → nodes-describe.txt
call_kubectl: {args: "top nodes"} → node-metrics.txt
call_kubectl: {args: "top pods -A"} → pod-metrics.txt

# 3. Export problem pod details (for each failing pod)
call_kubectl: {args: "describe pod <pod> -n <ns>"} → <pod>-describe.txt
call_kubectl: {args: "logs <pod> -n <ns> --timestamps"} → <pod>-logs.txt

# 4. Export cluster config via call_az
call_az: {cli_command: "az aks show -n <cluster> -g <rg> -o json"} → cluster-info.json

# 5. Export Log Analytics data via az_monitoring
az_monitoring: {operation: "app_insights", query: "ContainerLogV2 | where TimeGenerated > ago(1h) | take 500"} → container-logs.json

# 6. Generate SUMMARY.md with issues and recommendations

# 7. Create ZIP archive
Compress-Archive → aks-logs-YYYY-MM-DD-HHMMSS.zip
```

Output Location

```
~/Desktop/aks-logs-YYYY-MM-DD-HHMMSS.zip
```

Log Aggregation

The /k8s logs --aggregate command pulls logs from multiple sources and correlates them by timestamp.

Log Sources

| Source | MCP Tool | What it captures |
|--------|----------|------------------|
| App Logs | `call_kubectl` `logs` | Container stdout/stderr |
| Control Plane | `az_monitoring` `control_plane_logs` | apiserver, controller-manager, scheduler |
| Node Logs | `az_monitoring` → KQL on ContainerLog | kubelet, containerd, OS events |
| Events | `call_kubectl` `get events` | K8s warning/error events |
| App Insights | `az_monitoring` `app_insights` | Application telemetry |

Aggregated Output Format

```
📊 LOG AGGREGATION: <cluster> (last 1h)

TIMELINE:
─────────────────────────────────────────────────────
16:45:02 [CONTROL-PLANE] kube-scheduler: Failed to bind pod
16:45:03 [EVENT] FailedScheduling: 0/3 nodes available
16:45:15 [APP] store-front: Connection refused to mongodb
16:45:21 [EVENT] BackOff: Back-off restarting container
─────────────────────────────────────────────────────

🔴 ROOT CAUSE: <description>

WHAT: <resource>
WHY:  <evidence>
FIX:  <command>
```
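
The correlation step described above, as an added example that is not part of the K8sOps skill itself: it merges `kubectl logs --timestamps` lines and `kubectl get events -o json` items (two of the listed sources) into the timeline format shown. The function names are hypothetical, the sample input is constructed from the evidence table in the case template, and timestamps are assumed to be UTC.

```python
from datetime import datetime, timezone

def parse_ts(ts):
    """RFC 3339 UTC timestamp -> datetime truncated to whole seconds."""
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)

def build_timeline(pod, log_text, events, since):
    """Merge pod log lines and Kubernetes events into one sorted timeline."""
    entries = []
    for line in log_text.splitlines():
        stamp, _, message = line.partition(" ")
        try:
            entries.append((parse_ts(stamp), "APP", f"{pod}: {message}"))
        except ValueError:
            continue  # line without a leading timestamp
    for ev in events.get("items", []):
        # Events created through the events.k8s.io API carry eventTime and
        # may leave lastTimestamp null.
        stamp = ev.get("lastTimestamp") or ev.get("eventTime")
        if stamp:
            entries.append((parse_ts(stamp), "EVENT", f'{ev["reason"]}: {ev["message"]}'))
    # Event timestamps have one-second resolution, so compare whole seconds;
    # the sort is stable and keeps input order inside the same second.
    entries.sort(key=lambda entry: entry[0])
    return [f"{t:%H:%M:%S} [{src}] {msg}" for t, src, msg in entries if t >= since]

# Constructed sample input based on the evidence table in the case template.
SAMPLE_LOG = "2026-01-24T21:14:52.103482911Z forcing crash"
SAMPLE_EVENTS = {"items": [
    {"lastTimestamp": "2026-01-24T21:14:40Z", "reason": "Scheduled",
     "message": "Assigned to aks-userpool-vmss00003l"},
    {"lastTimestamp": None, "eventTime": "2026-01-24T21:14:51.000000Z",
     "reason": "Started", "message": "Started container store-front"},
    {"lastTimestamp": "2026-01-24T21:14:52Z", "reason": "BackOff",
     "message": "Back-off restarting failed container"}]}

if __name__ == "__main__":
    start = datetime(2026, 1, 24, 21, 0, tzinfo=timezone.utc)
    print("\n".join(build_timeline("store-front", SAMPLE_LOG, SAMPLE_EVENTS, start)))
```

Because entries are compared at one-second resolution, the order of an app log line and an event inside the same second reflects input order, not causality.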

MCP Tools Used

| Action | MCP Call |
|--------|----------|
| List clusters | `call_az` `az aks list -o table` |
| Cluster info | `call_az` `az aks show -n X -g Y -o json` |
| Node status | `call_kubectl` `get nodes -o wide` |
| Pod issues | `call_kubectl` `get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` |
| Events | `call_kubectl` `get events -A --field-selector type=Warning` |
| Pod logs | `call_kubectl` `logs <pod> -n <ns> --tail=100 --timestamps` |
| Describe | `call_kubectl` `describe pod/node/svc <name>` |
| Metrics | `call_kubectl` `top nodes / top pods -A` |
| Control plane logs | `az_monitoring` `control_plane_logs` |
| Container logs | `az_monitoring` → KQL query on ContainerLogV2 |
| Detectors | `run_detector` → run AKS diagnostic detector |
| Health | `az_monitoring` `resource_health` |

Issue Detection

Automatic checks when running /k8s status or /k8s diagnose:

  1. Cluster power state (Running/Stopped)
  2. Node conditions (NotReady, Pressure conditions)
  3. Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
  4. Warning events (last 1 hour)
  5. Resource pressure (CPU > 80%, Memory > 85%)
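
One way to implement checks 3 and 5 over `kubectl get pods -A -o json` and `kubectl top nodes` output (example code added here, not the skill's own implementation). The function names, the `order-service-0` and `makeline-1` pods and the `node-a`/`node-b` nodes are constructed for illustration.

```python
FAILURE_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "OOMKilled"}  # check 3
CPU_LIMIT, MEM_LIMIT = 80, 85  # check 5 thresholds, in percent

def find_pod_failures(pod_list):
    """Findings from parsed `kubectl get pods -A -o json` output."""
    findings = []
    for pod in pod_list.get("items", []):
        ref = f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}'
        status, hits = pod.get("status", {}), []
        for cs in status.get("containerStatuses", []):
            # Current waiting/terminated reason, plus the one before the last restart.
            seen = {cs.get("state", {}).get("waiting", {}).get("reason"),
                    cs.get("state", {}).get("terminated", {}).get("reason"),
                    cs.get("lastState", {}).get("terminated", {}).get("reason")}
            hits += [(ref, reason, cs.get("restartCount", 0))
                     for reason in sorted(seen & FAILURE_REASONS)]
        if not hits and status.get("phase") == "Pending":
            hits.append((ref, "Pending", 0))
        findings += hits
    return findings

def find_node_pressure(top_nodes_text):
    """Nodes over either threshold in `kubectl top nodes` text output."""
    over = []
    for line in top_nodes_text.strip().splitlines()[1:]:  # skip header row
        name, _, cpu, _, mem = line.split()
        if not (cpu.endswith("%") and mem.endswith("%")):
            continue  # no metrics reported for this node
        cpu, mem = int(cpu[:-1]), int(mem[:-1])
        if cpu > CPU_LIMIT or mem > MEM_LIMIT:
            over.append((name, cpu, mem))
    return over

# Constructed sample input (not captured from a real cluster).
SAMPLE_PODS = {"items": [
    {"metadata": {"namespace": "pets", "name": "store-front-6f467b6cdf-dcsfn"},
     "status": {"phase": "Running", "containerStatuses": [{
         "restartCount": 16, "state": {"waiting": {"reason": "CrashLoopBackOff"}},
         "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}}]}},
    {"metadata": {"namespace": "pets", "name": "order-service-0"},
     "status": {"phase": "Running", "containerStatuses": [{
         "restartCount": 3, "state": {"running": {}},
         "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
    {"metadata": {"namespace": "pets", "name": "makeline-1"}, "status": {"phase": "Pending"}}]}
SAMPLE_TOP = """NAME    CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
node-a  140m        7%    1200Mi         17%
node-b  1700m       91%   3100Mi         60%"""

if __name__ == "__main__":
    print(find_pod_failures(SAMPLE_PODS), find_node_pressure(SAMPLE_TOP))
```

The `order-service-0` case shows why `lastState` matters: a container that was OOM-killed and restarted reports `Running`, so a phase-only field selector does not list it.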

Output Format (When Issues Found)

```
🚨 ISSUE: <brief description>

WHAT: <resource> in <state>
WHY:  <root cause from logs/events>
FIX:  <exact command to resolve>
```

K8sOps v2.3.0 | Powered by AKS MCP