splunk-analysis

Splunk log analysis using SPL (Search Processing Language). Use when investigating issues via Splunk logs, saved searches, or alerts.

SKILL.md
--- frontmatter
name: splunk-analysis
description: Splunk log analysis using SPL (Search Processing Language). Use when investigating issues via Splunk logs, saved searches, or alerts.
allowed-tools: Bash(python *)

Splunk Analysis

Authentication

IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for SPLUNK_HOST, SPLUNK_TOKEN, or other credentials in environment variables - they won't be visible to you. Just run the scripts directly; authentication is handled transparently.


MANDATORY: Statistics-First Investigation

NEVER dump raw logs. Always follow this pattern:

code
STATISTICS → SAMPLE → PATTERNS → CORRELATE
  1. Statistics First - Know volume, error rate, and top patterns before sampling
  2. Strategic Sampling - Choose the right strategy based on statistics
  3. Pattern Extraction - Cluster similar errors to find root causes
  4. Context Correlation - Investigate around anomaly timestamps
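
The pattern-extraction step can be sketched as follows. This is a minimal illustration, not the logic inside get_statistics.py: the helper names `normalize` and `top_patterns` are hypothetical, and it assumes that replacing runs of digits with a placeholder is enough to group similar messages.

```python
import re
from collections import Counter

# Runs of digits (ports, durations, IP octets, ids) are what usually
# make otherwise identical error messages look distinct.
_NUMBER = re.compile(r"\d+")


def normalize(message):
    """Replace every run of digits with a placeholder (hypothetical helper)."""
    return _NUMBER.sub("<N>", message)


def top_patterns(messages, n=5):
    """Return the n most common normalized messages with their counts."""
    return Counter(normalize(m) for m in messages).most_common(n)


if __name__ == "__main__":
    sample = [
        "timeout after 30s on host 10.0.0.1",
        "timeout after 45s on host 10.0.0.2",
        "disk full",
    ]
    for pattern, count in top_patterns(sample):
        print(count, pattern)
```

Grouping on the normalized text means two timeouts against different hosts count as one pattern, which is what makes the top patterns useful for triage.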

Available Scripts

All scripts are in .claude/skills/observability-splunk/scripts/

PRIMARY INVESTIGATION SCRIPTS

get_statistics.py - ALWAYS START HERE

Comprehensive statistics with pattern extraction.

bash
python .claude/skills/observability-splunk/scripts/get_statistics.py [--index INDEX] [--sourcetype SOURCETYPE] [--time-range MINUTES]

# Examples:
python .claude/skills/observability-splunk/scripts/get_statistics.py --time-range 60
python .claude/skills/observability-splunk/scripts/get_statistics.py --index main
python .claude/skills/observability-splunk/scripts/get_statistics.py --sourcetype access_combined

Output includes:

  • Total count, error count, error rate percentage
  • Status distribution (info, warn, error)
  • Top sourcetypes and hosts by log volume
  • Top error patterns (crucial for quick triage)
  • Actionable recommendation
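
To make the first two output items concrete, here is a small sketch of how total count, error count, error rate, and level distribution could be derived from a list of events. It is an illustration only: the `summarize` function and the `level` field name are assumptions, not the script's actual implementation or output schema.

```python
from collections import Counter


def summarize(events, level_field="level"):
    """Compute volume, error count, error rate (%), and per-level counts.

    `events` is a list of dicts; the field name `level` is an assumption.
    """
    levels = Counter(str(e.get(level_field, "unknown")).lower() for e in events)
    total = sum(levels.values())
    errors = levels.get("error", 0)
    # Guard against an empty result set to avoid dividing by zero.
    error_rate = (errors / total * 100) if total else 0.0
    return {
        "total": total,
        "errors": errors,
        "error_rate_pct": round(error_rate, 2),
        "by_level": dict(levels),
    }


if __name__ == "__main__":
    events = [{"level": "ERROR"}, {"level": "info"}, {"level": "info"}, {"level": "warn"}]
    print(summarize(events))
```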

sample_logs.py - Strategic Sampling

Choose the right sampling strategy based on statistics.

bash
python .claude/skills/observability-splunk/scripts/sample_logs.py --strategy STRATEGY [--index INDEX] [--sourcetype SOURCETYPE] [--limit N] [--timestamp TIMESTAMP] [--window N]

# Strategies:
#   errors_only   - Only error logs (default for incidents)
#   warnings_up   - Warning and error logs
#   around_time   - Logs around a specific timestamp
#   all           - All log levels

# Examples:
python .claude/skills/observability-splunk/scripts/sample_logs.py --strategy errors_only --index main
python .claude/skills/observability-splunk/scripts/sample_logs.py --strategy around_time --timestamp "2026-01-27T05:00:00" --window 5
python .claude/skills/observability-splunk/scripts/sample_logs.py --strategy all --sourcetype access_combined --limit 20
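
One way to think about the strategies is as a mapping from strategy name to an SPL filter. The sketch below is hypothetical: `build_sample_query`, the `level` field, and the exact filters are assumptions for illustration and may differ from what sample_logs.py actually sends to Splunk.

```python
# Hypothetical mapping from sampling strategy to an SPL filter clause.
# around_time and all add no level filter; around_time is expected to be
# narrowed by the time bounds passed alongside the query.
STRATEGY_FILTERS = {
    "errors_only": "level=ERROR",
    "warnings_up": "(level=WARN OR level=ERROR)",
    "around_time": "",
    "all": "",
}


def build_sample_query(strategy, index="main", sourcetype=None, limit=20):
    """Assemble an SPL search string for the given sampling strategy."""
    if strategy not in STRATEGY_FILTERS:
        raise ValueError(f"unknown strategy: {strategy}")
    parts = [f"index={index}"]
    if sourcetype:
        parts.append(f"sourcetype={sourcetype}")
    if STRATEGY_FILTERS[strategy]:
        parts.append(STRATEGY_FILTERS[strategy])
    # head caps the number of returned events so raw logs are never dumped.
    return " ".join(parts) + f" | head {limit}"


if __name__ == "__main__":
    print(build_sample_query("errors_only"))
    print(build_sample_query("all", sourcetype="access_combined", limit=20))
```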

SPL (Search Processing Language)

Basic Search

spl
# Simple keyword search
error

# Index-specific search (ALWAYS specify index for performance)
index=main error

# Multiple keywords (implicit AND)
index=main error connection

# Exact phrase
index=main "connection refused"

Field Searches

spl
# Exact field match
index=main host=web-01

# Wildcard
index=main host=web-*

# Numeric comparison
index=main status>=400

# NOT operator
index=main NOT status=200

# OR operator
index=main (status=500 OR status=503)

Time Range

spl
# Relative time (in tool call)
earliest=-15m latest=now

# Absolute time
earliest="01/15/2024:10:00:00" latest="01/15/2024:11:00:00"

# Snap-to time modifiers (@ rounds down to the start of the unit)
earliest=-1h@h  # 1 hour ago, snapped down to the start of the hour
earliest=-1d@d  # 1 day ago, snapped down to the start of the day
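
For an investigation around a known timestamp, the absolute bounds can be computed rather than typed by hand. A minimal sketch, assuming the window is given in minutes on each side of the timestamp (the source does not state the unit) and that the timestamp carries no timezone; `window_around` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Matches the absolute time format used above: MM/DD/YYYY:HH:MM:SS
SPLUNK_TIME_FORMAT = "%m/%d/%Y:%H:%M:%S"


def window_around(timestamp, window_minutes):
    """Return earliest/latest strings centered on an ISO 8601 timestamp.

    Assumes window_minutes applies on both sides of the timestamp.
    """
    center = datetime.fromisoformat(timestamp)
    delta = timedelta(minutes=window_minutes)
    return {
        "earliest": (center - delta).strftime(SPLUNK_TIME_FORMAT),
        "latest": (center + delta).strftime(SPLUNK_TIME_FORMAT),
    }


if __name__ == "__main__":
    bounds = window_around("2026-01-27T05:00:00", 5)
    print(f'earliest="{bounds["earliest"]}" latest="{bounds["latest"]}"')
```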

Investigation Workflow

Standard Incident Investigation

code
┌─────────────────────────────────────────────────────────────┐
│ 1. STATISTICS FIRST (mandatory)                              │
│    python get_statistics.py --index <index>                  │
│    → Know volume, error rate, top patterns                   │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
                     High Error Rate?
               ┌─────────────┴─────────────┐
               │                           │
       YES (>5%)                           NO
               │                           │
               ▼                           ▼
┌─────────────────────────────┐  ┌───────────────────────────────────────────┐
│ 2. FAST PATH                │  │ 2. TARGETED INVESTIGATION                 │
│    Sample errors directly   │  │    Filter by specific criteria            │
│    python sample_logs.py    │  │    python sample_logs.py --strategy all   │
│    --strategy errors_only   │  │    → Look for anomalies                   │
└─────────────────────────────┘  └───────────────────────────────────────────┘
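
The branch in the diagram can be expressed as a small helper. This is a sketch under stated assumptions: `next_command` is a hypothetical function, and the error rate is assumed to be a percentage taken from the statistics step. Only flags shown in this document are used.

```python
SAMPLE_SCRIPT = ".claude/skills/observability-splunk/scripts/sample_logs.py"


def next_command(error_rate_pct, index, threshold_pct=5.0):
    """Pick the sampling strategy from the error rate, as in the diagram.

    Above the threshold: fast path (errors_only).
    Otherwise: targeted investigation (all).
    """
    strategy = "errors_only" if error_rate_pct > threshold_pct else "all"
    return ["python", SAMPLE_SCRIPT, "--strategy", strategy, "--index", index]


if __name__ == "__main__":
    print(" ".join(next_command(7.2, "main")))
    print(" ".join(next_command(0.4, "main")))
```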

Quick Commands Reference

| Goal | Command |
|------|---------|
| Start investigation | get_statistics.py --index X |
| Sample errors only | sample_logs.py --strategy errors_only --index X |
| Investigate spike | sample_logs.py --strategy around_time --timestamp T |
| All logs | sample_logs.py --strategy all --index X --limit 20 |

SPL Commands Reference

Filtering Commands

| Command | Purpose | Example |
|---------|---------|---------|
| search | Filter events | search error |
| where | Filter with expressions | where status > 400 |
| dedup | Remove duplicates | dedup host |
| head | First N results | head 10 |
| tail | Last N results | tail 10 |

Transformation Commands

| Command | Purpose | Example |
|---------|---------|---------|
| stats | Aggregate statistics | stats count by host |
| timechart | Time-based aggregation | timechart span=5m count |
| chart | Pivot table | chart count by status, host |
| top | Top values | top 10 host |
| rare | Rare values | rare message |
| table | Select fields | table _time, host, message |

Field Operations

| Command | Purpose | Example |
|---------|---------|---------|
| eval | Calculate fields | eval duration_sec=duration/1000 |
| rex | Regex extraction | rex field=message "error: (?<error_type>\w+)" |
| rename | Rename fields | rename src_ip as source_ip |
| fields | Include/exclude fields | fields host, message |
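
When post-processing sampled events outside Splunk, the rex example above has a direct Python counterpart. Note that Python's `re` module spells named groups as `(?P<name>...)` rather than `(?<name>...)`. The function name `extract_error_type` is hypothetical.

```python
import re

# Same pattern as the rex example, using Python's named-group syntax.
ERROR_TYPE = re.compile(r"error: (?P<error_type>\w+)")


def extract_error_type(message):
    """Return the word following 'error: ', or None if there is no match."""
    match = ERROR_TYPE.search(message)
    return match.group("error_type") if match else None


if __name__ == "__main__":
    print(extract_error_type("request failed error: Timeout after 30s"))
    print(extract_error_type("request completed"))
```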

Common Query Patterns

Error Rate Analysis

spl
# Error count per 5 minutes
index=main | timechart span=5m count(eval(level="ERROR")) as errors, count as total

# Error percentage over time
index=main
| timechart span=5m count(eval(level="ERROR")) as errors, count as total
| eval error_rate=errors/total*100

Top Errors by Service

spl
index=main level=ERROR
| stats count by service, message
| sort -count
| head 20

Response Time Analysis

spl
index=main sourcetype=access_combined
| stats avg(response_time) as avg_rt,
        p95(response_time) as p95_rt,
        max(response_time) as max_rt
    by uri_path
| sort -avg_rt

Anomaly Detection

spl
# Sudden spike detection
index=main
| timechart span=5m count as events
| eventstats avg(events) as avg_events, stdev(events) as stdev_events
| eval anomaly=if(events > avg_events + 2*stdev_events, 1, 0)
| where anomaly=1
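
The same threshold rule can be checked offline on per-bucket counts. A minimal sketch, assuming the counts are already exported as a list of integers. `find_spikes` is a hypothetical helper; it uses the sample standard deviation, which is also what SPL's stdev computes.

```python
from statistics import mean, stdev


def find_spikes(counts, k=2.0):
    """Return indices of buckets whose count exceeds mean + k * stdev."""
    if len(counts) < 2:
        # The sample standard deviation needs at least two data points.
        return []
    threshold = mean(counts) + k * stdev(counts)
    return [i for i, count in enumerate(counts) if count > threshold]


if __name__ == "__main__":
    per_5m = [10, 12, 11, 9, 10, 11, 10, 100]
    print(find_spikes(per_5m))
```

Because the spike itself inflates both the mean and the standard deviation, a short series can hide a moderate spike. Use a window with enough buckets for the baseline to dominate.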

Anti-Patterns to Avoid

  1. Skipping statistics - get_statistics.py is the MANDATORY first step
  2. Omitting the index - Always use index=X for performance
  3. Unbounded time range - Always specify earliest and latest
  4. Fetching all logs - Use sampling strategies, not unbounded searches
  5. Ignoring error rate - A high error rate (>5%) calls for immediate investigation
  6. Running complex rex on all events - Filter first, then extract