profiling-with-cpu-sampler

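A time-based sampling profiler that shows where execution time is spent (measured in wall-clock time, not execution frequency). It provides a histogram of self and total time, annotates compilation tiers (T0/T1/T2), and generates flame graphs. Use it as the first step in identifying the hot functions that consume the most time; it has low overhead and suits longer runs. Pair it with tracing-execution-counts to understand how time per execution relates to execution frequency.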

SKILL.md
--- frontmatter
name: profiling-with-cpu-sampler
description: Time-based sampling profiler showing WHERE execution time is spent (wall-clock time, not frequency). Provides histogram with self/total time, compilation tiers (T0/T1/T2), and flame graphs. Use as FIRST step to identify hot functions consuming most time. Low overhead, suitable for longer runs. Pair with tracing-execution-counts to understand time-per-execution vs execution frequency.

Profiling with CPU Sampler

Time-based sampling profiler that identifies WHERE your program spends execution time. Shows wall-clock time distribution across functions with compilation tier breakdown.

When to Use This Skill

Use FIRST when investigating any performance issue:

  • Identify hot functions consuming most time
  • Verify code is compiling (tier distribution)
  • Measure time spent in interpreter vs compiled code
  • Generate flame graphs for visualization

Quick Start

bash
# Basic profiling
<launcher> --cpusampler <program>

# With tier breakdown (RECOMMENDED)
<launcher> --cpusampler --cpusampler.ShowTiers=true <program>

# Skip warmup (recommended for steady-state analysis)
<launcher> --cpusampler --cpusampler.Delay=5000 --cpusampler.ShowTiers=true <program>

⚠️ REQUIRED: Fermi Verification (Every Tool Invocation)

Before running:

  • Pre-calculate: Expected hot functions (1-5 names), T0/T1/T2 split
  • Smoke test: <launcher> --cpusampler -c 'print 1;' → Verify output format

After running:

  • Validate: Actual vs estimate within 1 order of magnitude? YES / NO
  • If NO: STOP - Debug tool before proceeding (run --help:cpusampler, test on known-good input)
  • Save output: tool-outputs/cpu-sampler-[benchmark].txt

Gate: All boxes checked? → Proceed to analysis
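
The "within 1 order of magnitude" check above can be made mechanical. The following is a minimal sketch, not part of the tool: `within_order_of_magnitude` is a hypothetical helper name, and it assumes both values are positive numbers in the same unit (for example, the percentage of time you predicted for a function versus the percentage the sampler reported).

```shell
# Hypothetical helper (not part of cpusampler): succeeds (exit 0) when
# actual and estimate are within one order of magnitude of each other,
# i.e. 0.1 <= actual / estimate <= 10.
within_order_of_magnitude() {
  awk -v a="$1" -v e="$2" 'BEGIN {
    # The ratio is meaningless for zero or negative inputs: treat as a failure.
    if (a <= 0 || e <= 0) exit 1
    r = a / e
    if (r >= 0.1 && r <= 10) exit 0
    exit 1
  }'
}
```

For example, `within_order_of_magnitude 88 70 && echo YES || echo NO` prints YES, while an estimate of 900 against an actual value of 5 prints NO and should trigger the STOP step.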

Key Options

| Option | Description | Recommended Value |
| --- | --- | --- |
| --cpusampler.ShowTiers=true | Show T0/T1/T2 breakdown | Always use |
| --cpusampler.Delay=<ms> | Skip warmup | 2000-10000 |
| --cpusampler.Period=<ms> | Sample interval | 10 (default) |
| --cpusampler.Output=calltree | Output format | Default: histogram |
| --cpusampler.OutputFile=<file> | Save to file | For later analysis |
| --cpusampler.SampleInternal=true | Include internal frames | For Truffle debugging |
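
To keep invocations consistent across runs, the recommended options can be assembled by a small helper. This is a sketch under assumptions: `cpusampler_flags` is a hypothetical function name, and it only emits flags listed in this section (tier breakdown always on, a warmup delay defaulting to 5000 ms, and an optional output file).

```shell
# Hypothetical helper: prints the recommended cpusampler flags, one per line.
#   $1 = warmup delay in milliseconds (defaults to 5000)
#   $2 = optional output file path (omitted when empty)
cpusampler_flags() {
  delay_ms="${1:-5000}"
  out_file="${2:-}"
  printf '%s\n' --cpusampler --cpusampler.ShowTiers=true "--cpusampler.Delay=${delay_ms}"
  if [ -n "$out_file" ]; then
    printf '%s\n' "--cpusampler.OutputFile=${out_file}"
  fi
}
```

Assuming the output path contains no spaces, the result can be spliced into a command as `<launcher> $(cpusampler_flags 5000 tool-outputs/cpu-sampler-queens.txt) <program>`; the file name here is only an illustration of the naming pattern used in the verification step.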

Understanding Output

Tier Columns

| Tier | Meaning | Target |
| --- | --- | --- |
| T0 | Interpreter | <10% for hot functions |
| T1 | First-tier compiled | Transitional |
| T2 | Fully optimized | >80% for hot functions |

Sample Output

code
Sampling Histogram. Recorded 210 samples with period 10ms.
  Self Time: Time spent in function (excluding callees)
  Total Time: Time in function including callees

Name          || Total Time   || Self Time    || T0    | T1   | T2
queens        || 1850ms 88.0% || 1850ms 88.0% || 5.2%  | 3.1% | 91.7%
hasConflict   || 250ms 11.9%  || 250ms 11.9%  || 8.8%  | 4.2% | 87.0%
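
For scripted analysis, rows shaped like the sample above can be reduced to the fields that matter most. The sketch below is an assumption-laden illustration: `parse_histogram` is a hypothetical name, and it relies on the exact column layout shown in the sample (name, total, self, then T0, T1, T2 separated by `|` characters). Real output may differ between versions, so check it against your own smoke-test output first.

```shell
# Hypothetical parser: reads histogram text on stdin and prints
# "name self_ms t0_pct t2_pct" for each function row.
# Assumes the column layout of the sample output in this section.
parse_histogram() {
  awk -F'[|]+' '
    NF >= 6 {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
      if (name == "Name") next            # skip the header row
      split($3, self, " ")                # self[1] is e.g. "1850ms"
      self_ms = self[1]
      sub(/ms$/, "", self_ms)             # drop the unit suffix
      t0 = $4; gsub(/[ \t%]/, "", t0)
      t2 = $6; gsub(/[ \t%]/, "", t2)
      print name, self_ms, t0, t2
    }'
}
```

Usage: `parse_histogram < tool-outputs/cpu-sampler-[benchmark].txt`. Lines without the six `|`-separated columns, such as the legend lines, are ignored.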

Interpretation Guidelines

Good Performance

  • ✅ Hot functions show >80% T2 time
  • ✅ <10% T0 (interpreter) time
  • ✅ Time concentrated in expected hot functions

Performance Problems

  • ⚠️ >30% T0 time → Compilation issues
  • ⚠️ High T1 but low T2 → Optimization barriers
  • ⚠️ Time in unexpected functions → Algorithm issues
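
The thresholds above translate into a simple triage rule. This is a minimal sketch with one loud assumption: the guidelines give no number for "high T1 but low T2", so the hypothetical `classify_tiers` helper below treats T1 exceeding T2 as that case. Adjust the rule to your own judgment.

```shell
# Hypothetical helper: classifies one function from its tier percentages.
#   $1 = T0 %, $2 = T1 %, $3 = T2 %
# Thresholds for "good" and "compilation-issue" come from the guidelines;
# "T1 > T2" for optimization barriers is an assumption of this sketch.
classify_tiers() {
  awk -v t0="$1" -v t1="$2" -v t2="$3" 'BEGIN {
    if (t0 > 30) {
      print "compilation-issue"
    } else if (t2 > 80 && t0 < 10) {
      print "good"
    } else if (t1 > t2) {
      print "optimization-barrier"
    } else {
      print "inconclusive"
    }
  }'
}
```

For the `queens` row in the sample output, `classify_tiers 5.2 3.1 91.7` prints `good`. Values that fall between the documented thresholds are reported as `inconclusive` rather than forced into a category.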

Integration with Other Skills

Next steps based on findings:

| Finding | Next Skill |
| --- | --- |
| High T0 time | tracing-compilation-events |
| Optimization barriers | detecting-performance-warnings |
| Memory issues suspected | profiling-memory-allocations |
| Inlining problems | tracing-inlining-decisions |
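
When the follow-up is chosen by a script, the same mapping can be written as a lookup. In this sketch the function name `next_skill` and the short finding keys (`high-t0`, `optimization-barrier`, `memory`, `inlining`) are hypothetical labels; only the skill names come from the table.

```shell
# Hypothetical lookup: maps a finding key to the next skill to run.
# The keys are illustrative labels; the skill names are from the table.
next_skill() {
  case "$1" in
    high-t0)              echo "tracing-compilation-events" ;;
    optimization-barrier) echo "detecting-performance-warnings" ;;
    memory)               echo "profiling-memory-allocations" ;;
    inlining)             echo "tracing-inlining-decisions" ;;
    *)
      echo "unknown finding: $1" >&2
      return 1
      ;;
  esac
}
```

An unrecognized key returns a non-zero status so that a calling script stops instead of guessing.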

Related Skills

  • tracing-execution-counts - Execution frequency (not time)
  • detecting-performance-warnings - Find optimization barriers
  • tracing-compilation-events - Compilation behavior
  • establishing-benchmark-baseline - Set up benchmarks first

Reference

bash
# Full help
<launcher> --help:cpusampler

See PATTERNS.md for common problem patterns and solutions.