OA Skill

Observability Agent: read-only data gateway for logs, events, and metrics. Supports both K8s clusters and bare metal/VM servers (standalone mode). This document is served by OA at GET /skill.md.

Operating Modes

OA runs in one of two modes, auto-detected by the presence of KUBERNETES_SERVICE_HOST.

| Mode | Detection | Targets | Log Source | Events | Metrics Source |
| --- | --- | --- | --- | --- | --- |
| K8s | KUBERNETES_SERVICE_HOST present | Pods (namespace/selector) | K8s container logs API | K8s Events | Pod annotation-based scrape |
| Standalone | KUBERNETES_SERVICE_HOST absent | Services (OA_SERVICES env) | File tail + journalctl | None | Direct URL scrape |
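
A client that needs to predict which mode a co-located OA instance will pick can mirror the detection rule. The following is a minimal sketch, not OA's own implementation: the function name and the "k8s" / "standalone" labels are hypothetical, and it assumes "present" means the variable exists in the environment regardless of its value.

```python
import os
from typing import Mapping


def detect_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return "k8s" if KUBERNETES_SERVICE_HOST is set, otherwise "standalone".

    Assumption: presence of the key is enough; the value is not inspected.
    """
    return "k8s" if "KUBERNETES_SERVICE_HOST" in env else "standalone"
```

For example, detect_mode({}) returns "standalone", so a client would call GET /v1/services rather than GET /v1/pods.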

Base

  • Auth: Authorization: Bearer <JWT> (required on all requests)

Auth/JWT

OA verifies JWTs using an HS256 shared secret.

  • OA_JWT_SECRET (required, HS256 shared secret, min 32 chars)

JWT rules:

  • Algorithm: HS256
  • exp claim required (recommended 5–15 min)
  • Missing or invalid JWT → 401

The client (AI Agent) signs an HS256 JWT using OA_JWT_SECRET (env) and sends it with each request. The secret is used only in runtime memory — never expose it in logs, files, or output.
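
The signing step described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions: the function names are hypothetical, the iat claim is added for convenience (only exp is stated as required), and any further claims OA may check are not covered here.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Dict, Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without trailing "=" padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256_jwt(secret: str, ttl_seconds: int = 300,
                   now: Optional[int] = None) -> str:
    """Build a short-lived HS256 JWT carrying iat and exp claims."""
    if len(secret) < 32:
        raise ValueError("OA_JWT_SECRET must be at least 32 characters")
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret.encode("utf-8"),
                         signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def auth_headers(secret: str) -> Dict[str, str]:
    # Sign a fresh token per request so the exp window stays short.
    return {"Authorization": "Bearer " + sign_hs256_jwt(secret)}
```

A caller would typically read the secret with os.environ["OA_JWT_SECRET"] and pass it straight to auth_headers, without printing or persisting it.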


Primary Workflow (bundle-first)

  1. Create bundle: POST /v1/bundles
  2. Poll status: GET /v1/bundles/{bundleId} — every 1–2 s, up to 30 s until done
  3. Download: GET /v1/bundles/{bundleId}/download → ndjson.gz
  4. Analyze: stream-parse NDJSON, then AI analyzes
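
Steps 1 and 2 can be sketched as a create-then-poll loop. The response shapes are not specified in this document, so the sketch assumes the create call returns a JSON object with a bundleId field and the status call returns one with status equal to "done"; both are assumptions, as is the hypothetical request callable that stands in for an authenticated HTTP client.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport: request(method, path, json_body) -> parsed JSON reply.
Request = Callable[[str, str, Any], Dict[str, Any]]


def create_and_wait(request: Request, bundle_request: Dict[str, Any],
                    interval_s: float = 1.5, timeout_s: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Create a bundle, poll until it is done, and return its id."""
    created = request("POST", "/v1/bundles", bundle_request)
    bundle_id = created["bundleId"]  # assumed response field
    waited = 0.0
    while True:
        status = request("GET", "/v1/bundles/" + bundle_id, None)
        if status.get("status") == "done":  # assumed field name and value
            return bundle_id
        if waited + interval_s > timeout_s:
            raise TimeoutError("bundle %s not done after %.0f s" % (bundle_id, timeout_s))
        sleep(interval_s)
        waited += interval_s
```

Injecting request and sleep keeps the loop independent of any particular HTTP library and makes the timeout behaviour easy to exercise without a running OA instance.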

Target Discovery

K8s Mode: Pod Search

GET /v1/pods?ns=*&q=<substring>

  • ns: namespace (* = all)
  • selector: label selector
  • q: pod name substring search

Response: namespace, name, podIP, labels, annotations, containers[], status
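
Label selectors contain characters such as "=" and "," that need percent-encoding in a query string. A small helper for building the search path might look like the following; the function name is hypothetical, and whether selector and q may be combined in one request is not stated here.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def pod_search_path(ns: str = "*", selector: Optional[str] = None,
                    q: Optional[str] = None) -> str:
    """Build the GET /v1/pods path with a percent-encoded query string."""
    params: Dict[str, str] = {"ns": ns}
    if selector:
        params["selector"] = selector
    if q:
        params["q"] = q
    # urlencode escapes "*", "=" and "," (as %2A, %3D and %2C).
    return "/v1/pods?" + urlencode(params)
```

Note that "*" is sent as %2A, which decodes to the same value on the server side.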

Standalone Mode: Service List

GET /v1/services

Returns registered services configured via OA_SERVICES env.

Response example:

json
{
  "items": [
    { "name": "solana-validator", "logs": ["/var/log/solana/validator.log"], "journal": null, "metrics": "http://localhost:9090/metrics" },
    { "name": "rpc-node", "logs": ["/var/log/solana/rpc.log"], "journal": null, "metrics": null }
  ]
}

Bundle Request

timeWindow (relative / absolute)

OA supports two time window modes. Use only one at a time.

  1. Relative:
json
{ "timeWindow": { "sinceSeconds": 600 } }
  2. Absolute (UTC, ISO 8601 with a Z suffix):
json
{
  "timeWindow": {
    "start": "2026-02-09T00:00:00Z",
    "end": "2026-02-09T00:10:00Z"
  }
}

Rules:

  • Using both sinceSeconds and start/end → 400
  • In absolute mode, OA parses the timestamp of each log line and drops lines that fall outside the range from start to end
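
A client can catch the conflicting-mode case before sending a request and avoid a round trip that ends in 400. The check below is a sketch with a hypothetical function name. It treats an empty timeWindow as relative (sinceSeconds has a default), and the requirements that start and end both be present and that start precede end are assumptions rather than documented rules.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def _parse_utc(value: str) -> datetime:
    # Accepts whole-second timestamps such as 2026-02-09T00:00:00Z.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def time_window_mode(tw: Dict[str, Any]) -> str:
    """Return "relative" or "absolute", raising ValueError on invalid input."""
    has_relative = "sinceSeconds" in tw
    has_absolute = "start" in tw or "end" in tw
    if has_relative and has_absolute:
        raise ValueError("use either sinceSeconds or start/end, not both")
    if not has_absolute:
        return "relative"
    if "start" not in tw or "end" not in tw:
        raise ValueError("absolute mode needs both start and end")  # assumption
    if _parse_utc(tw["start"]) >= _parse_utc(tw["end"]):
        raise ValueError("start must be earlier than end")  # assumption
    return "absolute"
```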

K8s Mode: selector-based (multiple Pods)

json
{
  "timeWindow": { "sinceSeconds": 600 },
  "target": {
    "namespace": "*",
    "selector": "app=web,tier=backend"
  },
  "include": {
    "logs": { "enabled": true, "tailLines": 2000, "previous": true, "timestamps": true },
    "events": { "enabled": true },
    "metrics": { "enabled": true }
  },
  "limits": {
    "maxPods": 20,
    "maxTotalLogLines": 50000,
    "metricsTimeoutMs": 2000
  }
}

K8s Mode: direct Pod targeting (single/specific Pods)

json
{
  "timeWindow": { "sinceSeconds": 600 },
  "target": {
    "pods": [
      { "namespace": "default", "pod": "my-app-pod-0" }
    ]
  },
  "include": {
    "logs": { "enabled": true, "tailLines": 2000, "previous": true, "timestamps": true },
    "events": { "enabled": true },
    "metrics": { "enabled": true }
  }
}

selector and pods[] are mutually exclusive. Providing both → 400.
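
One way to keep the two K8s targeting styles from being mixed is to build the target object through a single helper. This is a sketch only; the function name and the tuple-based pods argument are hypothetical conveniences.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


def build_k8s_target(namespace: str = "*", selector: Optional[str] = None,
                     pods: Optional[Sequence[Tuple[str, str]]] = None) -> Dict[str, Any]:
    """Return a target object using either selector or pods[], never both."""
    if selector and pods:
        raise ValueError("selector and pods[] are mutually exclusive")
    if pods:
        # Each entry is a (namespace, pod name) pair.
        return {"pods": [{"namespace": ns, "pod": name} for ns, name in pods]}
    target: Dict[str, Any] = {"namespace": namespace}
    if selector:
        target["selector"] = selector
    return target
```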

Standalone Mode: service-based

json
{
  "timeWindow": { "sinceSeconds": 600 },
  "target": {
    "kind": "services",
    "services": ["solana-validator", "rpc-node"]
  },
  "include": {
    "logs": { "enabled": true, "excludePatterns": ["healthcheck"] },
    "metrics": { "enabled": true }
  },
  "limits": {
    "maxTotalLogLines": 50000,
    "metricsTimeoutMs": 2000
  }
}

Standalone rules:

  • target.services is a required array of service names registered in OA_SERVICES
  • kind is "services" (auto-inferred when a services array is present)
  • Events are not supported in standalone mode
  • The previous and timestamps options are not available (file tail always reads the latest N lines)
  • Logs are collected via tail from file paths configured per service

Log Line Exclude Filter (excludePatterns)

include.logs.excludePatterns (string[]) removes every log line that contains any of the listed substrings (like grep -v). It is applied as a post-filter step alongside timeWindow filtering and works the same way in both K8s and standalone modes.

Example:

json
{
  "include": {
    "logs": {
      "enabled": true,
      "excludePatterns": ["GET /healthz", "healthcheck"]
    }
  }
}
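
The substring semantics can be illustrated with a few lines of Python. This mirrors the described behaviour and is not OA's code; the function name is hypothetical, and case-sensitive matching is an assumption.

```python
from typing import Iterable, Iterator, Sequence


def exclude_lines(lines: Iterable[str], patterns: Sequence[str]) -> Iterator[str]:
    """Yield only the lines that contain none of the given substrings."""
    for line in lines:
        if not any(pattern in line for pattern in patterns):
            yield line
```

With the patterns from the example above, a line such as "GET /healthz 200" is dropped while "panic: boom" passes through.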

NDJSON Record Types

Common Records

| type | Description | Key Fields |
| --- | --- | --- |
| meta | Bundle metadata | bundleId, createdAt, params |

K8s Mode Records

| type | Description | Key Fields |
| --- | --- | --- |
| log | Container log | namespace, pod, container, ts, line, previous?, skipped?, reason? |
| event | K8s event | namespace, reason, message, ts, involvedObject |
| metrics_text | Pod metrics | namespace, pod, port, path, ts, ok/skipped/error, content |

Standalone Mode Records

| type | Description | Key Fields |
| --- | --- | --- |
| log | File log | service, file, ts, line, skipped?, reason? |
| log | Journal log | service, journal, ts, line, skipped?, reason? |
| metrics_text | Service metrics | service, url, ts, ok/skipped/error, content |
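
Because a bundle can hold up to tens of thousands of lines, it is worth reading the gzip stream record by record instead of loading it whole. The sketch below does that with the standard library; the function names are hypothetical, and it assumes one JSON object per line with a type field, as the record tables describe.

```python
import gzip
import json
from collections import Counter
from typing import Any, BinaryIO, Dict, Iterator


def iter_records(stream: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty NDJSON line of a gzip stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as text:
        for line in text:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_type(stream: BinaryIO) -> Counter:
    """Tally records by their type field (meta, log, event, metrics_text)."""
    return Counter(rec.get("type", "unknown") for rec in iter_records(stream))
```

iter_records accepts any binary file-like object, so it works on an open file or on an HTTP response body that supports read().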

Standalone log skip reasons:

  • file_not_found: log file does not exist
  • read_error: file read failed (permissions, etc.)
  • journalctl_not_found: journalctl binary not found
  • journal_read_error: journalctl execution failed (permissions, etc.)

Standalone metrics status:

| Status | Meaning | Fields |
| --- | --- | --- |
| Success | Scrape OK | ok: true, content: "# HELP ..." |
| Normal skip | No metrics URL configured | skipped: true, reason: "no_metrics_url" |
| Timeout | Response timed out | ok: false, error: "timeout" |
| Failure | Connection failed | ok: false, error: "fetch_failed" |

K8s Previous Logs

If a pod has not restarted, previous=true logs may not exist and K8s may return 400/404. This is normal and must not fail the bundle. OA writes a skip record in this case:

json
{"type":"log","namespace":"ns","pod":"p","container":"c","ts":"...","previous":true,"skipped":true,"reason":"no_previous_container"}

K8s Metrics — 3 States

| Status | Meaning | Fields |
| --- | --- | --- |
| Success | Scrape OK | ok: true, content: "# HELP ..." |
| Normal skip | No annotation (pod does not expose metrics) | skipped: true, reason: "annotation_missing" |
| Failure | Annotation present but connection failed (anomaly signal) | ok: false, error: "timeout after 2000ms" |
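
When triaging a bundle, the three states can be collapsed into a single label per metrics_text record. The following is a minimal sketch; the function name and the returned labels are hypothetical. It treats any record that is neither skipped nor ok: true as a failure, which also covers the standalone timeout and fetch_failed cases.

```python
from typing import Any, Dict


def classify_metrics(record: Dict[str, Any]) -> str:
    """Map a metrics_text record to "success", "normal_skip" or "failure"."""
    if record.get("skipped"):
        return "normal_skip"  # target does not expose metrics
    if record.get("ok") is True:
        return "success"
    return "failure"  # anomaly signal worth reporting
```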

Analysis Guide (for AI Agents)

Priority

  1. Events (K8s only): OOMKilled, CrashLoopBackOff, FailedScheduling
  2. Logs: panic, fatal, segfault, timeout, connection refused
  3. Metrics: ok:false is an anomaly signal (service down / network issue), skipped:true is normal

Analysis Method

  • Group recurring errors by signature + count occurrences
  • Record first/last occurrence timestamps
  • Drill down: in K8s use narrower selector / single pod; in standalone use single service, shorter time window for follow-up bundles
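
Grouping by signature can be sketched as follows. The normalization rule (lowercasing and replacing numbers and hex values with "#") is a heuristic chosen for this illustration, not something OA prescribes, and all names are hypothetical. Timestamps are compared as strings, which assumes every ts value is a UTC ISO 8601 string in one consistent format.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, Iterable

# Keywords taken from the log priority list above.
KEYWORDS = ("panic", "fatal", "segfault", "timeout", "connection refused")
_VARIABLE_PARTS = re.compile(r"0x[0-9a-f]+|\d+")


@dataclass
class ErrorGroup:
    count: int
    first_ts: str
    last_ts: str


def signature(line: str) -> str:
    """Collapse numbers and hex values so similar lines share a signature."""
    return _VARIABLE_PARTS.sub("#", line.lower()).strip()


def group_errors(records: Iterable[Dict[str, Any]]) -> Dict[str, ErrorGroup]:
    """Count keyword-bearing log lines per signature with first/last ts."""
    groups: Dict[str, ErrorGroup] = {}
    for rec in records:
        if rec.get("type") != "log" or rec.get("skipped"):
            continue
        line, ts = rec.get("line", ""), rec.get("ts", "")
        if not any(keyword in line.lower() for keyword in KEYWORDS):
            continue
        sig = signature(line)
        group = groups.get(sig)
        if group is None:
            groups[sig] = ErrorGroup(1, ts, ts)
        else:
            group.count += 1
            group.first_ts = min(group.first_ts, ts)
            group.last_ts = max(group.last_ts, ts)
    return groups
```

Sorting the resulting groups by count, then by first_ts, gives a compact summary to report before requesting a narrower follow-up bundle.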

Target Interpretation UX

K8s Mode

| User Input | Action |
| --- | --- |
| "Analyze backend logs" | GET /v1/pods?q=backend → bundle all matching pods |
| "Only my-app pod 0" | target.pods: [{pod: "my-app-pod-0"}] |
| "All cluster error logs" | namespace: "*", logs only, cluster ERROR/WARN |

Standalone Mode

| User Input | Action |
| --- | --- |
| "Analyze solana validator logs" | GET /v1/services → target.services: ["solana-validator"] |
| "Check all service status" | GET /v1/services → bundle all service names |
| "Only rpc-node metrics" | target.services: ["rpc-node"], logs disabled, metrics only |

Defaults

Common

| Field | Default |
| --- | --- |
| sinceSeconds | 600 (10 min) |

K8s Mode

| Field | Default |
| --- | --- |
| tailLines | 2000 |
| namespace | * (all) |
| containers | all |
| previous | true |
| timestamps | true (forced true in absolute time mode) |

Limits

Common

| Field | Value |
| --- | --- |
| maxTotalLogLines | 50,000 |
| sinceSeconds | max 3,600 (1 hour) |
| metricsTimeoutMs | 2,000 |
| bundle TTL | 60 min auto-delete |

K8s Mode

| Field | Value |
| --- | --- |
| maxPods | 20 |
| maxMetricsPods | 20 |

Standalone Configuration

Standalone mode defines services via the OA_SERVICES env:

bash
export OA_JWT_SECRET="..."
export OA_SERVICES='[
  {"name":"solana-validator","logs":["/var/log/solana/validator.log"],"metrics":"http://localhost:9090/metrics"},
  {"name":"rpc-node","logs":["/var/log/solana/rpc.log"]}
]'
node dist/index.js

Service definition fields:

| Field | Required | Description |
| --- | --- | --- |
| name | Yes | Unique service identifier |
| logs | No | Array of log file paths to collect |
| journal | No | systemd unit name (journalctl log collection) |
| metrics | No | Prometheus metrics URL |
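
Before starting OA it can help to sanity-check the OA_SERVICES value against the field table above. The validator below is a sketch with a hypothetical function name. How OA itself validates the variable is not described here, so the duplicate-name check and the normalization of missing fields to empty or null values are assumptions modelled on the GET /v1/services response example.

```python
import json
from typing import Any, Dict, List


def parse_services(raw: str) -> List[Dict[str, Any]]:
    """Parse an OA_SERVICES JSON string into normalized service definitions."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("OA_SERVICES must be a JSON array")
    seen = set()
    services = []
    for item in items:
        name = item.get("name") if isinstance(item, dict) else None
        if not name:
            raise ValueError("every service needs a name")
        if name in seen:
            raise ValueError("duplicate service name: " + name)
        seen.add(name)
        services.append({
            "name": name,
            "logs": list(item.get("logs") or []),  # optional file paths
            "journal": item.get("journal"),        # optional systemd unit
            "metrics": item.get("metrics"),        # optional metrics URL
        })
    return services
```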

Notes

  • Always prefer the bundle API (raw endpoints are for small-scale debugging)
  • Use multiple smaller bundles to drill down rather than one large time range
  • metrics_text with ok:false is an anomaly signal by itself
  • skipped:true is normal (the service/pod does not expose metrics)