perf-profiler

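Automatically analyzes perf profiling data and produces detailed reports that pinpoint CPU hotspots, function-level annotations, and memory access bottlenecks. The skill triggers when a user asks to "analyze perf data", "find performance bottlenecks", "what are the hot functions", "analyze memory access patterns", or "profile index/query performance".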

SKILL.md
--- frontmatter
name: perf-profiler
description: Automate performance analysis of perf profiling data with detailed reports on CPU hotspots, function-level annotations, and memory access bottlenecks. Use this skill when users request performance analysis of perf data, identifying hotspots, analyzing memory access patterns, or investigating performance bottlenecks in test/profile_output directory (specifically index_latest_profile_perf and query_latest_profile_perf files). Triggers on requests like "analyze perf data", "find performance bottlenecks", "what are the hot functions", "analyze memory access patterns", or "profile index/query performance".

Perf Profiler

Overview

Automate comprehensive performance analysis of Linux perf profiling data. This skill provides:

  • Hotspot identification: Find the top CPU-consuming functions
  • Source-level annotation: Map performance costs to specific code lines
  • Assembly-level analysis: Identify expensive instructions, especially memory operations
  • Memory access profiling: Pinpoint which memory accesses cause the most overhead
  • Optimization recommendations: Get actionable suggestions based on analysis

This skill is specifically configured to analyze profiling data from the PolarDB competition 2025 project, focusing on index_latest_profile_perf and query_latest_profile_perf in the test/profile_output directory.

Workflow

Step 1: Locate Perf Data Files

The project maintains two perf data files:

code
test/profile_output/
├── index_latest_profile_perf   # Index phase profiling data
└── query_latest_profile_perf   # Query phase profiling data

When to analyze each:

  • index_latest_profile_perf: For optimizing index building, insertion, or data structure operations
  • query_latest_profile_perf: For optimizing search, retrieval, or query execution

Step 2: Run Automated Analysis

Use the provided script for comprehensive analysis:

bash
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf

Key parameters:

  • -n <number>: Analyze top N hotspot functions (default: 10)
  • -o <file>: Save report to file instead of stdout

Examples:

bash
# Analyze index phase with detailed report
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf -n 15 -o index_report.txt

# Quick query phase analysis
python3 scripts/analyze_perf.py test/profile_output/query_latest_profile_perf
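
When the analysis is driven from another tool rather than typed by hand, the command line above can be assembled programmatically. The following is a minimal sketch, assuming only the script path, file names, and the -n and -o flags documented in this section; the helper name build_analyze_cmd and the PHASES mapping are hypothetical and not part of the project.

```python
from pathlib import Path
from typing import List, Optional

# Paths and flags as documented in this skill; the helper itself is hypothetical.
SCRIPT = "scripts/analyze_perf.py"
PROFILE_DIR = Path("test/profile_output")
PHASES = {
    "index": PROFILE_DIR / "index_latest_profile_perf",
    "query": PROFILE_DIR / "query_latest_profile_perf",
}


def build_analyze_cmd(phase: str, top_n: int = 10,
                      output: Optional[str] = None) -> List[str]:
    """Return the argv list that analyzes one phase ("index" or "query")."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    cmd = ["python3", SCRIPT, PHASES[phase].as_posix(), "-n", str(top_n)]
    if output is not None:
        cmd += ["-o", output]  # save the report to a file instead of stdout
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True). Building an argv list instead of a shell string avoids quoting problems if a report path contains spaces.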

Step 3: Interpret the Report

The generated report contains three sections:

3.1 Hotspot Functions List

Shows top CPU-consuming functions sorted by overhead percentage:

code
1.  23.45%  hnswlib::HierarchicalNSW::searchBaseLayer
2.  15.23%  std::vector::operator[]
3.   8.91%  __memcpy_avx_unaligned
...

What to look for:

  • Functions with > 10% overhead are critical optimization targets
  • Surprisingly high overhead in utility functions (memcpy, vector access) suggests data structure issues
  • Multiple related functions in top 10 indicate a subsystem bottleneck
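
The 10% rule of thumb above can be applied mechanically to a saved report. This is a sketch that assumes the hotspot list is formatted exactly like the sample shown in this section (rank, percentage, function name); the function names parse_hotspots and critical_targets are illustrative and do not come from analyze_perf.py.

```python
import re
from typing import List, Tuple

# Matches lines such as "1.  23.45%  hnswlib::HierarchicalNSW::searchBaseLayer".
HOTSPOT_RE = re.compile(r"^\s*(\d+)\.\s+(\d+(?:\.\d+)?)%\s+(.+?)\s*$")


def parse_hotspots(report_text: str) -> List[Tuple[str, float]]:
    """Extract (function name, overhead percent) pairs from the hotspot list."""
    hotspots = []
    for line in report_text.splitlines():
        match = HOTSPOT_RE.match(line)
        if match:
            hotspots.append((match.group(3), float(match.group(2))))
    return hotspots


def critical_targets(hotspots: List[Tuple[str, float]],
                     threshold: float = 10.0) -> List[str]:
    """Return the functions whose overhead exceeds the threshold."""
    return [name for name, pct in hotspots if pct > threshold]
```

Lines that do not match the pattern, such as the trailing "...", are skipped rather than treated as errors.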

3.2 Function-Level Detailed Analysis

For each top hotspot function, see:

Source-level hotspots: Code lines with significant overhead

code
  5.23%  int dist = distance_func(query[i], data[i]);
  3.45%  if (dist < min_dist) {

Assembly-level hotspots: Expensive instructions

code
  8.34%  mov    0x10(%rax,%rcx,8), %rdx
  4.56%  vmovups  (%rsi), %ymm0

Memory access hotspots: Most expensive load/store operations

code
  8.34% (35.5% of function)  mov    0x10(%rax,%rcx,8), %rdx
         └─ Address: rax+rcx*8+0x10

  💡 Total overhead of memory access instructions in this function: 18.45% (78.6% of function)

What to look for:

  • High memory access percentage (>50% of function) → Cache/data layout issue
  • High overhead on unaligned vector loads/stores (vmovups) → Check data alignment; accesses that straddle a cache line are slower
  • Pointer chasing patterns (sequential dependent loads) → Data structure redesign needed
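
To illustrate how a "memory access share" figure like the one above could be derived, here is a hedged sketch. It assumes annotate output has already been reduced to (percent, instruction) pairs in AT&T syntax, and it treats any operand of the form disp(base,index,scale) as a memory access. How analyze_perf.py actually classifies instructions is not shown in this document, so the heuristic and the names below are assumptions.

```python
import re
from typing import Iterable, Tuple

# AT&T-syntax memory operand: a parenthesized group containing a register,
# e.g. (%rax) or 0x10(%rax,%rcx,8).
MEM_OPERAND_RE = re.compile(r"\([^()]*%\w+[^()]*\)")


def is_memory_access(instruction: str) -> bool:
    """Heuristically decide whether an instruction reads or writes memory."""
    parts = instruction.split()
    if not parts:
        return False
    mnemonic = parts[0].lower()
    # lea and nop use memory-operand syntax but do not touch memory.
    if mnemonic.startswith(("lea", "nop")):
        return False
    return MEM_OPERAND_RE.search(instruction) is not None


def memory_share(rows: Iterable[Tuple[float, str]]) -> float:
    """Return memory-access overhead as a percentage of the rows' total."""
    rows = list(rows)
    total = sum(pct for pct, _ in rows)
    memory = sum(pct for pct, insn in rows if is_memory_access(insn))
    return 100.0 * memory / total if total else 0.0
```

A share above 50% would correspond to the first bullet above. Implicit memory traffic (push, pop, call, ret) is not counted by this heuristic.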

3.3 Optimization Recommendations

Automated suggestions based on detected patterns:

code
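**Memory-intensive functions (consider optimizing memory access patterns):**
  • searchBaseLayer: 23.45% CPU, of which 78.6% is memory access

💡 Optimization directions:
   - Consider a cache-friendly data layout
   - Reduce indirect accesses and pointer chasing
   - Use SIMD instructions to process data in batches
   - Consider data prefetching (prefetch)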

Step 4: Manual Deep Dive (Optional)

For cases requiring deeper investigation, use perf commands directly. See references/perf_guide.md for comprehensive command reference.

Common manual analysis scenarios:

  1. Investigate specific function behavior:

    bash
    perf annotate -i test/profile_output/index_latest_profile_perf --stdio -l -s <function_name>
    
  2. Understand call chains:

    bash
    perf report -i test/profile_output/index_latest_profile_perf -g --stdio
    
  3. Compare different runs:

    bash
    perf diff perf.data.old perf.data.new
    

Common Analysis Patterns

Pattern 1: Identifying Cache Misses

Symptoms in report:

  • High overhead on simple load instructions: mov (%rax), %rbx with >5% overhead
  • Memory operations dominating (>70% of function overhead)
  • Random access patterns: (%rax,%rcx,8) with varying rcx

Next steps:

  1. Check data structure layout (struct member ordering)
  2. Consider access patterns (sequential vs. random)
  3. Evaluate prefetching opportunities
  4. Review cache line utilization
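
"Cache line utilization" in step 4 can be made concrete with a little arithmetic. The sketch below assumes 64-byte cache lines (typical for x86-64, but an assumption here) and an array whose base address is line-aligned. It simulates a scan that reads one hot field per record and reports what fraction of the fetched bytes is actually used. The function name and parameters are illustrative.

```python
def cache_line_utilization(record_size: int, hot_offset: int, hot_bytes: int,
                           n_records: int = 1024, line_size: int = 64) -> float:
    """Fraction of fetched cache-line bytes that a field scan actually reads.

    Assumes the array starts on a cache-line boundary and that the scan
    reads hot_bytes at hot_offset inside every record.
    """
    touched = set()
    for i in range(n_records):
        start = i * record_size + hot_offset
        end = start + hot_bytes - 1  # last byte read in this record
        touched.update(range(start // line_size, end // line_size + 1))
    return (n_records * hot_bytes) / (len(touched) * line_size)


# Array-of-structs: 64-byte records, only a 4-byte field is read.
aos = cache_line_utilization(record_size=64, hot_offset=0, hot_bytes=4)
# Struct-of-arrays: the same 4-byte field packed contiguously.
soa = cache_line_utilization(record_size=4, hot_offset=0, hot_bytes=4)
```

Under these assumptions the array-of-structs scan uses 1/16 of every line it fetches, while the packed layout uses all of it, which is the kind of gap that steps 1 and 4 are meant to uncover.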

Pattern 2: SIMD Optimization Opportunities

Symptoms in report:

  • Scalar operations in loops: addss, movss instead of addps, movaps
  • High overhead on element-by-element operations
  • No vector instructions in hot loops

Next steps:

  1. Verify data alignment (required for aligned SIMD)
  2. Use compiler intrinsics or auto-vectorization
  3. Check loop trip counts (SIMD efficiency improves with more iterations)
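
The first symptom above (scalar addss/movss where packed addps/movaps would be expected) can be checked roughly from the mnemonics alone. This is a heuristic sketch, not a full decoder: it assumes SSE/AVX floating-point mnemonics end in ss/sd for scalar and ps/pd for packed forms, and it expects (percent, instruction) pairs as input. The function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

SCALAR_SUFFIXES = ("ss", "sd")  # e.g. addss, movsd, vfmadd231ss
PACKED_SUFFIXES = ("ps", "pd")  # e.g. addps, movaps, vmovups


def classify_fp(mnemonic: str) -> str:
    """Classify a mnemonic as "scalar", "packed", or "other" by its suffix."""
    m = mnemonic.lower()
    if m.endswith(PACKED_SUFFIXES):
        return "packed"
    if m.endswith(SCALAR_SUFFIXES):
        return "scalar"
    return "other"


def scalar_fp_share(rows: Iterable[Tuple[float, str]]) -> Optional[float]:
    """Share of floating-point overhead spent in scalar instructions.

    Returns None when the rows contain no floating-point instructions.
    """
    scalar = packed = 0.0
    for pct, instruction in rows:
        parts = instruction.split()
        kind = classify_fp(parts[0]) if parts else "other"
        if kind == "scalar":
            scalar += pct
        elif kind == "packed":
            packed += pct
    total = scalar + packed
    return scalar / total if total else None
```

A share close to 1.0 in a hot loop suggests the loop was not vectorized. Integer SIMD instructions (for example vpaddd) fall into "other" and are ignored by this sketch.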

Pattern 3: Branch Misprediction

Symptoms in report:

  • High overhead on conditional jumps: je, jne, jg
  • Code with many if statements showing high cost
  • Function overhead doesn't match instruction complexity

Next steps:

  1. Profile with branch-misses event: perf record -e branch-misses
  2. Reorganize code to reduce branches
  3. Use branchless techniques (cmov, select)
  4. Apply profile-guided optimization (PGO)

Tips for Effective Analysis

Tip 1: Analyze Both Index and Query

Different operations have different performance characteristics:

bash
# Generate both reports
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf -o index_analysis.txt
python3 scripts/analyze_perf.py test/profile_output/query_latest_profile_perf -o query_analysis.txt

Then compare:

  • Are the bottlenecks in shared code or phase-specific code?
  • Does optimization for index affect query performance?
  • Which phase has more optimization headroom?

Tip 2: Correlate with Source Code

Always examine the actual source code when reviewing hotspots:

  1. Run analysis to get function names and line numbers
  2. Open source files identified in the report
  3. Understand why those lines are expensive
  4. Consider algorithmic changes, not just micro-optimizations

Tip 3: Iterate and Verify

Performance optimization is iterative:

  1. Analyze → Identify bottleneck
  2. Optimize → Make targeted change
  3. Profile → Collect new perf data
  4. Analyze → Verify improvement
  5. Repeat

Verification checklist:

  • Did hotspot overhead decrease?
  • Did new hotspots emerge?
  • What's the overall performance impact?
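
The first two checklist items can be answered by comparing the hotspot lists from two runs. A minimal sketch, assuming each run has been reduced to a mapping from function name to overhead percent; the function name compare_hotspots and the min_new_pct cutoff are hypothetical choices.

```python
from typing import Dict, Tuple


def compare_hotspots(before: Dict[str, float], after: Dict[str, float],
                     min_new_pct: float = 1.0) -> Dict[str, dict]:
    """Report which hotspots shrank, which grew, and which newly appeared."""
    decreased: Dict[str, Tuple[float, float]] = {}
    increased: Dict[str, Tuple[float, float]] = {}
    for name, old in before.items():
        new = after.get(name, 0.0)  # absent from the new report counts as 0%
        if new < old:
            decreased[name] = (old, new)
        elif new > old:
            increased[name] = (old, new)
    emerged = {name: pct for name, pct in after.items()
               if name not in before and pct >= min_new_pct}
    return {"decreased": decreased, "increased": increased, "emerged": emerged}
```

Overhead percentages are relative to each run's total samples, so a smaller share does not by itself prove the program got faster. The third checklist item still needs an absolute measurement such as wall-clock time or throughput.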

Tip 4: Watch for Optimization Side Effects

When optimizing based on perf data:

  • Cache optimization may increase code size → more instruction-cache misses
  • Vectorization may require data reorganization → memory overhead
  • Prefetching can pollute cache → harm other operations
  • Loop unrolling increases code size → instruction cache pressure

Always measure the overall impact, not just local improvements.

References

For detailed information on perf commands, assembly instructions, and optimization techniques:

  • references/perf_guide.md: Complete guide to perf report, perf annotate, assembly instruction reference, memory access patterns, and optimization strategies

Consult this reference when:

  • Running manual perf commands for deeper investigation
  • Interpreting specific assembly instructions
  • Understanding x86-64 memory addressing modes
  • Learning about performance optimization patterns

Troubleshooting

Issue: "No symbols available"

Cause: Binary was stripped or compiled without debug info

Solution:

bash
# Recompile with debug symbols
gcc -g -O3 source.c -o program

# Or with more detailed debug info
gcc -g3 -O3 source.c -o program

Issue: "Permission denied" running perf

Cause: The kernel's perf_event_paranoid setting is too restrictive for the current user

Solution:

bash
# Check current setting
cat /proc/sys/kernel/perf_event_paranoid

# Allow perf for normal users (requires root)
sudo sysctl -w kernel.perf_event_paranoid=1

Issue: Script shows no hotspots

Possible causes:

  1. Perf data file is empty or corrupt
  2. All overhead below threshold (--percent-limit)
  3. No debug symbols available

Debug steps:

bash
# Check file size
ls -lh test/profile_output/index_latest_profile_perf

# Try manual perf report
perf report -i test/profile_output/index_latest_profile_perf --stdio

# Lower the threshold passed to perf in the script (edit scripts/analyze_perf.py):
#   '--percent-limit', '0.01'  # instead of '0.1'
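
The first two debug steps can be folded into a quick preflight check before running the script. This sketch rests on one assumption worth verifying locally: that perf.data files written by current perf versions begin with the 8-byte magic PERFILE2. The function name preflight and its messages are hypothetical.

```python
import os
import shutil
from typing import List

# Assumption: current perf versions write this magic at the start of the file.
PERF_MAGIC = b"PERFILE2"


def preflight(path: str) -> List[str]:
    """Return a list of problems that would explain an empty hotspot report."""
    problems = []
    if shutil.which("perf") is None:
        problems.append("perf executable is not on PATH")
    if not os.path.isfile(path):
        problems.append(f"no such file: {path}")
        return problems
    if os.path.getsize(path) == 0:
        problems.append(f"perf data file is empty: {path}")
    else:
        with open(path, "rb") as handle:
            if handle.read(len(PERF_MAGIC)) != PERF_MAGIC:
                problems.append(f"unexpected header magic (corrupt file?): {path}")
    return problems
```

An empty list means none of these checks failed. Missing debug symbols and the percent-limit threshold still have to be checked separately, as described above.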

Issue: Assembly instructions unclear

Solution: Refer to references/perf_guide.md for:

  • Common x86-64 instruction explanations
  • Memory addressing mode examples
  • SIMD instruction reference
  • Performance characteristics of different instruction types