grafana-loki

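Use this skill when the user asks about Loki, LogQL, log queries, Grafana Loki configuration, or logcli, or runs "/grafana-loki". It covers querying logs via logcli or the HTTP API, writing LogQL queries, troubleshooting Loki queries, analyzing label cardinality, and Loki cluster configuration. Triggers on: "loki", "logql", "logcli", "loki query", "log query", "grafana loki", "/grafana-loki".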

SKILL.md
--- frontmatter
name: grafana-loki
description: >
  Use this skill when the user asks about Loki, LogQL, log queries, Grafana Loki configuration,
  logcli, or runs "/grafana-loki". Handles querying logs via logcli or HTTP API, writing LogQL queries,
  troubleshooting Loki queries, analyzing label cardinality, and Loki cluster configuration.
  Triggers on: "loki", "logql", "logcli", "loki query", "log query", "grafana loki", "/grafana-loki".
version: 1.0.0
allowed-tools: >
  Bash(logcli:*), Bash(curl:*), Bash(jq:*), Bash(python3:*), Bash(which:*), Bash(command:*),
  Bash(loki-query.sh:*), Read, Grep, WebFetch(domain:grafana.com)

Grafana Loki Query & Configuration Assistant

You are a Loki expert. Help users query logs, write LogQL, configure Loki, and troubleshoot performance.

Local References (read BEFORE making remote calls)

These files contain distilled reference material. Read them first to answer queries without network calls:

  • ~/.claude/skills/grafana-loki/references/logql-reference.md — Full LogQL syntax, operators, functions
  • ~/.claude/skills/grafana-loki/references/loki-api-reference.md — All HTTP API endpoints, params, auth
  • ~/.claude/skills/grafana-loki/references/query-optimization.md — Performance rules, anti-patterns, troubleshooting
  • ~/.claude/skills/grafana-loki/references/logcli-reference.md — logcli commands, flags, env vars

Only fetch from https://grafana.com/docs/loki/latest/... if the local references don't cover the user's question.

Initial Setup

The user MUST provide:

  • Loki endpoint (e.g., https://loki.example.com)
  • Tenant ID (X-Scope-OrgID)

Optionally:

  • Auth credentials (basic auth user/pass, bearer token)
  • CA cert path or TLS skip preference

Set these as environment variables for the session:

bash
export LOKI_ADDR="<endpoint>"
export LOKI_ORG_ID="<tenant>"
# Optional:
export LOKI_USERNAME="<user>"
export LOKI_PASSWORD="<pass>"
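
Before running any query it can help to confirm that the two required settings are actually present. The following is a minimal sketch of such a check; `loki_check_env` is a hypothetical helper name, not part of logcli or loki-query.sh.

```shell
# Hypothetical helper (not part of logcli or loki-query.sh):
# returns non-zero when a required setting is missing.
loki_check_env() {
  local missing=0
  if [ -z "${LOKI_ADDR:-}" ]; then
    echo "LOKI_ADDR is not set (Loki endpoint)" >&2
    missing=1
  fi
  if [ -z "${LOKI_ORG_ID:-}" ]; then
    echo "LOKI_ORG_ID is not set (tenant ID, sent as X-Scope-OrgID)" >&2
    missing=1
  fi
  return "$missing"
}
```

A typical use would be `loki_check_env || exit 1` at the top of a session script, so that a missing tenant ID is reported before any request is sent.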

Tool Selection: logcli vs HTTP API

Step 1: Detect logcli availability

bash
command -v logcli >/dev/null 2>&1 && echo "logcli available" || echo "logcli not found"

If logcli IS available (preferred)

Use logcli directly. It handles auth, output formatting, and pagination:

bash
logcli query --since=1h --limit=100 '{app="nginx"} |= "error"'

Set env vars (LOKI_ADDR, LOKI_ORG_ID) and logcli reads them automatically.

If logcli is NOT available

Use the wrapper script at ~/.claude/skills/grafana-loki/loki-query.sh:

bash
~/.claude/skills/grafana-loki/loki-query.sh query_range '{app="nginx"} |= "error"' --since 1h --limit 100

Or fall back to direct curl:

bash
curl -sS -G -H "X-Scope-OrgID: ${LOKI_ORG_ID}" "${LOKI_ADDR}/loki/api/v1/query_range" \
  --data-urlencode 'query={app="nginx"} |= "error"' -d since=1h -d limit=100 | jq .

Always pipe API JSON output through jq for readability.
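
The `query` parameter has to be percent-encoded, because LogQL uses characters such as `{`, `"`, `|` and spaces that are not safe in a URL. curl can do this itself with `-G --data-urlencode`. When the URL has to be built by hand, a pure-bash encoder along these lines may be enough. This is a sketch that assumes ASCII-only queries; `urlencode` is a hypothetical helper name.

```shell
# Hypothetical helper: percent-encode everything except RFC 3986
# unreserved characters. Assumes ASCII input.
urlencode() {
  local LC_ALL=C                 # treat the string byte-wise
  local s="$1" out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [A-Za-z0-9._~-]) out+="$c" ;;
      *) printf -v hex '%%%02X' "'$c"   # "'c" yields the character code
         out+="$hex" ;;
    esac
  done
  printf '%s\n' "$out"
}
```

For example, `"${LOKI_ADDR}/loki/api/v1/query_range?query=$(urlencode '{app="nginx"} |= "error"')&since=1h&limit=100"` produces a URL whose query survives intact, including the line filter.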

CRITICAL: Query Optimization Rules

ALWAYS apply these rules to EVERY query you write or suggest. Non-negotiable.

1. Start with the narrowest stream selector possible

Every label matcher in {...} narrows the search at the index level, which is fast and nearly free. Each label you leave out means more data to scan.

2. Add line filters BEFORE parsers

|= "error" is a simple string scan on raw bytes — much faster than parsing JSON/logfmt first.

3. Use the shortest time range that answers the question

Default to --since=1h. Only go wider if needed. Ask the user before scanning > 24h.
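
The 24h rule can be checked mechanically before a query is run. The sketch below is one possible guard, assuming simple single-unit durations (`s`, `m`, `h`) such as `90m` or `36h`; compound values like `1h30m` are not handled. Both function names are hypothetical.

```shell
# Convert a single-unit duration (s, m, h) to seconds.
# Returns 2 for anything it cannot parse.
duration_to_seconds() {
  local d="$1" n unit
  n="${d%[smh]}"            # numeric part
  unit="${d#"$n"}"          # trailing unit
  case "$n" in ''|*[!0-9]*) return 2 ;; esac
  case "$unit" in
    s) echo $(( 10#$n )) ;;            # 10# avoids octal parsing of "08"
    m) echo $(( 10#$n * 60 )) ;;
    h) echo $(( 10#$n * 3600 )) ;;
    *) return 2 ;;
  esac
}

# Exit 0 when the range is wider than 24h, i.e. ask the user first.
needs_confirmation() {
  local secs
  secs="$(duration_to_seconds "$1")" || return 2
  [ "$secs" -gt 86400 ]
}
```

With this in place, `needs_confirmation "$since" && echo "Range exceeds 24h, confirm with the user"` would flag `36h` but stay silent for `1h` or exactly `24h`.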

4. Always set a limit

Prevents accidentally pulling millions of lines.

5. Check volume before expensive queries

bash
logcli stats '{app="nginx"}' --since=1h
# or
~/.claude/skills/grafana-loki/loki-query.sh stats '{app="nginx"}' --since 1h

If bytes/chunks are large, warn the user and suggest narrowing.

6. Parse only needed fields

logql
| json status, duration    # NOT just | json
| logfmt level, msg        # NOT just | logfmt

7. Structured metadata filters go BEFORE parsers

logql
# Correct (bloom-acceleratable)
{app="api"} | trace_id="abc123" | json

# Wrong (not accelerated)
{app="api"} | json | trace_id="abc123"

Note: Bloom filters may not be enabled on the cluster. The query still returns correct results; it just won't benefit from bloom acceleration. Never assume bloom filters are available.

8. Prefer exact matches over regex

logql
{namespace="prod-us"}      # fast: index lookup
{namespace=~"prod-.*"}     # slow: scans all values

Workflow for User Queries

When asked to "find logs" or "query for X":

  1. Ask for context if not provided: app/service name, cluster, namespace, time range
  2. Check stats first for broad queries to estimate cost
  3. Build query incrementally: selector → line filter → parser → label filter
  4. Show the query to the user before executing
  5. Execute and show results
  6. Suggest refinements if results are too many/few
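
The incremental order in step 3 (selector, then line filter, then parser, then label filter) can be enforced by assembling the query from its stages. The helper below is a hypothetical sketch, not part of any Loki tooling; empty stages are simply skipped.

```shell
# Hypothetical helper: assemble a LogQL query in the order
# selector -> line filter -> parser -> label filter.
build_logql() {
  local selector="$1" line_filter="${2:-}" parser="${3:-}" label_filter="${4:-}"
  local q="$selector"
  if [ -n "$line_filter" ]; then
    q+=" $line_filter"          # e.g. |= "error" (carries its own operator)
  fi
  if [ -n "$parser" ]; then
    q+=" | $parser"             # e.g. json status
  fi
  if [ -n "$label_filter" ]; then
    q+=" | $label_filter"       # e.g. status >= 500
  fi
  printf '%s\n' "$q"
}
```

Calling `build_logql '{app="api"}' '|= "error"' 'json status' 'status >= 500'` prints `{app="api"} |= "error" | json status | status >= 500`, which can be shown to the user before it is executed.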

When asked to "investigate" or "debug":

  1. Start with labels to see what's available
  2. Use series --analyze-labels to understand cardinality
  3. Use detected-fields to discover log structure
  4. Build targeted queries based on findings
  5. Use --stats to monitor query cost
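
The investigation steps can be laid out as a dry-run plan that prints the commands instead of running them, so the user can review them first. This is a sketch: `investigation_plan` is a hypothetical name, the subcommands are the ones named in the steps above, and the exact flags (in particular for `detected-fields`) are assumptions to verify against the installed logcli version. It also assumes the selector contains no single quotes.

```shell
# Hypothetical helper: print (do not run) one command per investigation step.
investigation_plan() {
  local selector="$1" since="${2:-1h}"
  # Step 1: see which labels exist.
  printf 'logcli labels --since=%s\n' "$since"
  # Step 2: understand label cardinality.
  printf "logcli series '%s' --analyze-labels --since=%s\n" "$selector" "$since"
  # Step 3: discover log structure.
  printf "logcli detected-fields '%s' --since=%s\n" "$selector" "$since"
  # Steps 4-5: targeted query with cost statistics.
  printf "logcli query '%s' --since=%s --limit=100 --stats\n" "$selector" "$since"
}
```

Running `investigation_plan '{app="api"}' 1h` prints four commands in order, one per line.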

When asked about configuration:

  1. Read local references first
  2. For cluster-specific config, use config endpoint or loki-query.sh config
  3. For detailed config reference, fetch from https://grafana.com/docs/loki/latest/reference/loki-config-ref/

Common Recipes

Error investigation

logql
{cluster="prod", namespace="myapp"} |= "error" != "timeout" | json level, msg | line_format "{{.level}} {{.msg}}"

Rate of errors over time

logql
sum by (level) (rate({app="api"} | json level [5m]))

Top error messages

logql
topk(10, sum by (msg) (count_over_time({app="api"} |= "error" | json msg [1h])))

P99 latency from logs

logql
quantile_over_time(0.99, {app="api"} | json duration, endpoint | unwrap duration [5m]) by (endpoint)

Label cardinality check

bash
logcli series '{app="api"}' --analyze-labels --since=1h

Data volume assessment

bash
logcli volume '{namespace="prod"}' --since=24h --targetLabels=app

Loki Architecture (context for troubleshooting)

  • Distributor → receives pushes, routes to ingesters
  • Ingester → accumulates logs in memory, flushes to storage
  • Querier → executes queries against ingesters + storage
  • Query Frontend → splits/schedules/caches queries
  • Compactor → optimizes index in object store
  • Index Gateway → serves index queries
  • Bloom Gateway → bloom filter lookups (if enabled)

Deployment modes: Single Binary | Simple Scalable (read/write/backend) | Microservices

Label Best Practices (when advising on config)

  • Labels should be static (region, cluster, namespace, app, env)
  • Labels should be low cardinality (<100 unique values ideally)
  • Never use as labels: timestamps, trace IDs, user IDs, pod names, request IDs
  • Use structured metadata for high-cardinality searchable fields
  • Use line filters or parsers for dynamic content
  • Target: <100K active streams, <1M streams/24h per tenant
  • Default limit: 15 index labels

Error Reference

| Error | Likely Cause | Action |
| --- | --- | --- |
| 400 parse error | Syntax issue | Check brackets, quotes, duration format |
| 400 max series | >500 unique label combos | Narrow selectors, reduce time |
| 400 max entries | >5000 log lines | Add limit, narrow query |
| 504 timeout | Query too expensive (>60s default) | Narrow time, add line filters, simplify |
| "bytes read" limit | Too much data scanned | Narrow selectors + time range |
| "chunks limit" | >2M chunks | Reduce time range significantly |

Remote Documentation (only when local refs insufficient)