
SKILL.md
---
name: analyze
description: Observability analysis — query error tracking + analytics APIs, produce structured health report with action items
argument-hint: "[week|month|sprint]"
---

# Observability Analysis

You are the platform engineer for {{PROJECT_NAME}}. Generate a structured observability report by querying error tracking and analytics APIs, then cross-referencing findings with the codebase to identify instrumentation gaps.

## Prerequisites

Validate required environment variables before proceeding. If any are missing, print the setup instructions and stop.

Configure your error tracking tool (e.g., Sentry) and analytics tool (e.g., PostHog) environment variables as needed for your project.

## Period Argument

Parse the optional argument `$ARGUMENTS` (default: `week`):

| Argument | Query Period | Label |
|---|---|---|
| `week` (default) | 7 days | Last 7 days |
| `month` | 30 days | Last 30 days |
| `sprint` | 7 days | Current sprint (same as `week`, labeled differently) |
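The mapping above can be sketched as a small helper. This is a hypothetical sketch — the `parse_period` name and the `days|label` output format are illustrative, not part of the skill:

```shell
# Hypothetical helper: map the period argument to a query window (in days)
# and a human-readable label. Defaults to "week" when no argument is given.
parse_period() {
  case "${1:-week}" in
    week)   echo "7|Last 7 days" ;;
    month)  echo "30|Last 30 days" ;;
    sprint) echo "7|Current sprint" ;;
    *)      echo "unknown period: $1" >&2; return 1 ;;
  esac
}
```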

## Step 0: Codebase Instrumentation Audit

Before querying external APIs, scan the codebase to build a current inventory of instrumentation. This ensures queries match reality.

### Span/Trace Inventory

ALWAYS grep first. Do not rely on any static list — the codebase is the source of truth.

Search for performance measurement calls, custom spans, and trace instrumentation in your codebase. Build a table from the grep results:

| Span Name | File | Purpose |
|---|---|---|

Also check for:

- Error capture calls (e.g., `captureException`)
- Custom tags/context
- Breadcrumbs (navigation/state changes)
- Performance measurements
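A minimal grep sketch for building this inventory. The call names (`startSpan`, `startTransaction`, `captureException`, `addBreadcrumb`) assume a Sentry-style SDK and TypeScript sources — substitute the identifiers your codebase actually uses:

```shell
# Print file:line:match for every instrumentation call site under a directory.
# Patterns assume a Sentry-style SDK; adjust to your actual tooling.
inventory_instrumentation() {
  grep -rnE "startSpan|startTransaction|captureException|addBreadcrumb" \
    --include='*.ts' --include='*.tsx' "$1" 2>/dev/null || true
}
```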

### Analytics Events Inventory

ALWAYS grep first. Do not rely on any static list — the codebase is the source of truth.

Search for analytics event definitions and their usage in non-test code. Mark each event as:

- **Active** = capture call found in non-test code
- **Unwired** = defined but no capture call found in non-test code
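The Active/Unwired check can be sketched as follows. This is a hypothetical helper assuming a PostHog-style `capture("event_name")` call convention and `*.test.ts` test naming — adapt the pattern to your SDK:

```shell
# Read event names (one per line) on stdin; mark each Active or Unwired by
# searching for a capture("<event>") call in non-test sources under $1.
classify_events() {
  dir="$1"
  while IFS= read -r event; do
    if grep -rqF "capture(\"$event\"" --include='*.ts' \
         --exclude='*.test.ts' "$dir" 2>/dev/null; then
      echo "$event Active"
    else
      echo "$event Unwired"
    fi
  done
}
```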

### Gap Analysis

Flag any of these as observability gaps:

  1. Unwired events — defined but never called. These represent intended instrumentation that was never connected.
  2. Missing error capture — code paths with try/catch that swallow errors without reporting.
  3. Missing spans — significant async operations (database queries, network calls, file I/O) without performance tracking.
  4. Missing breadcrumbs — navigation transitions or state changes without breadcrumb logging.

## Step 1: Query Error Tracking

Query your error tracking tool (Sentry, Bugsnag, etc.) for the configured period.

Collect:

  1. Crash-free rate — session health over the period
  2. Top unresolved issues — top 10 by event count
  3. Error trends — new vs resolved issues over the period
  4. Performance p50/p95 — for all instrumented spans found in Step 0
  5. Release health — recent releases, crash-free users, new issues per release
  6. Screen/page performance — Time to Initial Display or equivalent per screen/route
  7. App start performance — cold start and warm start times
  8. Frame rates / UI performance — slow and frozen frame rates per screen (if applicable)
  9. All transactions overview — query ALL transactions sorted by volume

If any query returns a rate limit error, wait 10 seconds and retry once. If it fails again, note "Rate limited" in that section.
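The retry rule above can be expressed as a small wrapper. A sketch only — the `query_with_retry` name is illustrative, and real rate-limit handling should ideally respect the API's `Retry-After` header if one is returned:

```shell
# Run the query command once; on failure wait 10 seconds and retry exactly
# once, as described above, before reporting "Rate limited".
query_with_retry() {
  "$@" && return 0
  sleep 10
  "$@" && return 0
  echo "Rate limited" >&2
  return 1
}
```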

## Step 2: Query Analytics

Query your analytics tool (PostHog, Amplitude, Mixpanel, etc.) for the configured period.

Collect:

  1. Retention — D1, D7, D30 cohorts
  2. Engagement — weekly active usage metrics relevant to your product
  3. Funnel completion — key user journey completion rates
  4. Onboarding funnel — first-time user conversion
  5. Feature adoption — event counts for key features
  6. DAU/WAU — daily and weekly active users trend
  7. Error events — application-level error event counts and trends

If any query times out, fall back to simpler event count queries and note the limitation.

## Step 3: Correlate and Analyze

Cross-reference error tracking and analytics data:

  1. Crash impact: Do top errors correlate with drop-offs in analytics funnels?
  2. Performance vs engagement: Do performance regressions (p95 increases) correlate with lower retention?
  3. Feature adoption vs errors: Are newly adopted features generating disproportionate errors?
  4. Instrumentation coverage: Compare the codebase inventory (Step 0) against what was actually received. If a span is instrumented but has zero events, something is broken.

If data is insufficient for correlation (pre-beta, low traffic), explicitly state: "Insufficient production data for cross-platform correlation. Analysis based on available development/testing traffic."

## Step 4: Generate Report

Save the report to:

```text
docs/reports/observability-report-YYYY-MM-DD.md
```

Where YYYY-MM-DD is today's date.
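A sketch of computing that path (the `report_path` helper name is illustrative):

```shell
# Build today's report path using the ISO date (YYYY-MM-DD).
report_path() {
  echo "docs/reports/observability-report-$(date +%F).md"
}
```

Run `mkdir -p docs/reports` before writing, since the directory may not exist yet.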

## Step 5: Action Items

For each identified issue, create an action item with:

- **Title**: Short description (suitable for a GitHub issue title)
- **Severity**: P0 (critical), P1 (high), P2 (medium)
- **Evidence**: Data points from error tracking/analytics supporting the finding
- **Recommendation**: Specific next step with file paths
- **Backlog command**: Ready-to-use `/bug` or `/feature` command

Severity guide:

- **P0**: Crash-free rate <99%, data loss, security-related errors
- **P1**: p95 >2x target, retention below target, error count trending up >50%
- **P2**: Missing instrumentation, unwired events, minor performance degradation
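The numeric thresholds above can be sketched as a triage helper. Hypothetical: the `severity` function and its three inputs (crash-free %, p95-to-target ratio, error-count trend %) are illustrative, and non-numeric P0 conditions (data loss, security) still require judgment:

```shell
# Map the numeric severity-guide thresholds to a P0/P1/P2 label.
# $1 = crash-free rate (%), $2 = p95/target ratio, $3 = error trend (%).
severity() {
  awk -v cf="$1" -v ratio="$2" -v trend="$3" 'BEGIN {
    if (cf < 99)                      print "P0"
    else if (ratio > 2 || trend > 50) print "P1"
    else                              print "P2"
  }'
}
```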

Example:

```markdown
### P1: Data processing p95 above target — 3.8s vs <2.5s target
**Evidence:** Error tracking span `data.process` p95 = 3800ms (target: <2500ms), 234 samples over 7 days
**File:** `{{APP_DIR}}services/pipeline/DataPipeline.ts:62`
**Recommendation:** Profile data processing — check if large inputs cause tail latency. Consider adding input size as a tag for segmentation.
**Backlog:** `/bug Data processing p95 is 3.8s vs 2.5s target — profile for tail latency.`
```

## Step 6: Present Summary

Print a text summary to the user:

```text
Observability Report — [period label]
======================================

App Health:
  Crash-free rate: XX.X%
  Sessions: N total, N crashed, N errored
  Top issue: [issue title] (N events)

Performance — Custom Spans:
  [span name] p95: X.Xs (target: <Xs)
  ...

Performance — Screen/Page Load:
  [screen] avg: Xs, p95: Xs (target: <2s)
  ...

Performance — App Start:
  Cold start avg: Xms, p95: Xms (target: <2s)
  Warm start avg: Xms, p95: Xms (target: <1s)

Engagement:
  DAU: N | WAU: N
  D7 retention: XX% (target: >40%)
  Key engagement metric: XX (target: >XX)

Instrumentation Gaps: N unwired events, N missing spans
Error Events: N total application-level errors

Action Items: N total (P0: N, P1: N, P2: N)
  [list top 3 action items]

Full report: docs/reports/observability-report-YYYY-MM-DD.md
```

## Integration Points

### With /sprint-close

When called from sprint-close, include a condensed "Observability Highlights" section:

- Crash-free rate + trend
- Top performance metric change
- Key engagement metric change
- Critical action items only

### With /status

When called from status, provide these metrics for the dashboard:

- `crashFreeRate`: percentage
- `p95ColdStart`: seconds
- `dau`: number
- `d7Retention`: percentage
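These four fields could be emitted as a JSON object for the dashboard. A sketch only — the `status_metrics` helper and positional-argument convention are illustrative; the field names match the keys listed above:

```shell
# Emit the four /status dashboard metrics as a single JSON object.
# Arguments: crash-free rate, p95 cold start, DAU, D7 retention.
status_metrics() {
  printf '{"crashFreeRate": %s, "p95ColdStart": %s, "dau": %s, "d7Retention": %s}\n' \
    "$1" "$2" "$3" "$4"
}
```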