Perf Profiler

Overview

Automate comprehensive performance analysis of Linux perf profiling data. This skill provides:

•Hotspot identification: Find the top CPU-consuming functions
•Source-level annotation: Map performance costs to specific code lines
•Assembly-level analysis: Identify expensive instructions, especially memory operations
•Memory access profiling: Pinpoint which memory accesses cause the most overhead
•Optimization recommendations: Get actionable suggestions based on analysis

This skill is specifically configured to analyze profiling data from the PolarDB competition 2025 project, focusing on index_latest_profile_perf and query_latest_profile_perf in the test/profile_output directory.

Workflow

Step 1: Locate Perf Data Files

The project maintains two perf data files:

code

test/profile_output/
├── index_latest_profile_perf   # Index phase profiling data
└── query_latest_profile_perf   # Query phase profiling data

When to analyze each:

•index_latest_profile_perf: For optimizing index building, insertion, or data structure operations
•query_latest_profile_perf: For optimizing search, retrieval, or query execution

Step 2: Run Automated Analysis

Use the provided script for comprehensive analysis:

bash

python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf

Key parameters:

•-n <number>: Analyze top N hotspot functions (default: 10)
•-o <file>: Save report to file instead of stdout

Examples:

bash

# Analyze index phase with detailed report
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf -n 15 -o index_report.txt

# Quick query phase analysis
python3 scripts/analyze_perf.py test/profile_output/query_latest_profile_perf
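
When the analysis is driven from another tool, the invocations above can be assembled programmatically. The following is a minimal sketch, assuming only the `-n` and `-o` flags and the two file paths documented here; the helper names `build_command` and `run_analysis` are hypothetical and not part of the project.

```python
import subprocess
import sys
from pathlib import Path

PROFILE_DIR = Path("test/profile_output")
SCRIPT = Path("scripts/analyze_perf.py")


def build_command(phase, top_n=10, report=None):
    """Return the argv list for analyzing one phase ('index' or 'query')."""
    if phase not in ("index", "query"):
        raise ValueError(f"unknown phase: {phase!r}")
    cmd = [
        sys.executable,
        str(SCRIPT),
        str(PROFILE_DIR / f"{phase}_latest_profile_perf"),
        "-n", str(top_n),
    ]
    if report is not None:
        cmd += ["-o", str(report)]  # write the report to a file instead of stdout
    return cmd


def run_analysis(phase, top_n=10, report=None):
    """Run the analysis script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(phase, top_n, report), check=True)
```

Calling `run_analysis("index", top_n=15, report="index_report.txt")` mirrors the first example command.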

Step 3: Interpret the Report

The generated report contains three sections:

3.1 Hotspot Functions List

Shows top CPU-consuming functions sorted by overhead percentage:

code

1.  23.45%  hnswlib::HierarchicalNSW::searchBaseLayer
2.  15.23%  std::vector::operator[]
3.   8.91%  __memcpy_avx_unaligned
...

What to look for:

•Functions with > 10% overhead are critical optimization targets
•Surprisingly high overhead in utility functions (memcpy, vector access) suggests data structure issues
•Multiple related functions in top 10 indicate a subsystem bottleneck
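
The 10% rule of thumb can be applied mechanically to a saved report. This sketch assumes the hotspot list uses exactly the `rank. overhead% function` layout shown above; the regular expression and the helper names are illustrative, not taken from `analyze_perf.py`.

```python
import re

# Matches lines such as "1.  23.45%  hnswlib::HierarchicalNSW::searchBaseLayer".
HOTSPOT_LINE = re.compile(r"^\s*(\d+)\.\s+(\d+(?:\.\d+)?)%\s+(.+?)\s*$")


def parse_hotspots(report_text):
    """Return (overhead_percent, function_name) pairs in report order."""
    hotspots = []
    for line in report_text.splitlines():
        match = HOTSPOT_LINE.match(line)
        if match:
            hotspots.append((float(match.group(2)), match.group(3)))
    return hotspots


def critical_targets(hotspots, threshold=10.0):
    """Keep only the functions whose overhead exceeds the threshold."""
    return [name for overhead, name in hotspots if overhead > threshold]


sample = """\
1.  23.45%  hnswlib::HierarchicalNSW::searchBaseLayer
2.  15.23%  std::vector::operator[]
3.   8.91%  __memcpy_avx_unaligned
"""
print(critical_targets(parse_hotspots(sample)))
```

For the sample list, only the first two functions are reported as critical targets.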

3.2 Function-Level Detailed Analysis

For each top hotspot function, see:

Source-level hotspots: Code lines with significant overhead

code

  5.23%  int dist = distance_func(query[i], data[i]);
  3.45%  if (dist < min_dist) {

Assembly-level hotspots: Expensive instructions

code

  8.34%  mov    0x10(%rax,%rcx,8), %rdx
  4.56%  vmovups  (%rsi), %ymm0

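Memory access hotspots: Most expensive load/store operations (labels translated from the script's Chinese output)

code

  8.34% (35.5% of function)  mov    0x10(%rax,%rcx,8), %rdx
         └─ Address: rax+rcx*8+0x10

  💡 Total overhead of memory access instructions in this function: 18.45% (78.6% of function)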

What to look for:

•High memory access percentage (>50% of function) → Cache/data layout issue
•High overhead on unaligned vector loads/stores (vmovups) → Check data alignment; accesses that straddle a cache line cost more
•Pointer chasing patterns (sequential dependent loads) → Data structure redesign needed
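
The "memory access percentage of a function" figure can be approximated from the assembly-level lines alone. The sketch below is a heuristic under stated assumptions: it expects the `overhead%  mnemonic  operands` layout shown above in AT&T syntax, treats any parenthesised register operand as a memory access, and excludes `lea`, which only computes an address. It is not the logic used by `analyze_perf.py`.

```python
import re

# Matches annotated lines such as "  8.34%  mov    0x10(%rax,%rcx,8), %rdx".
ANNOTATED = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s*(.*)$")
# An AT&T memory operand has a parenthesised base/index, e.g. 0x10(%rax,%rcx,8).
MEMORY_OPERAND = re.compile(r"\([^)]*%\w+[^)]*\)")


def memory_share(annotated_text):
    """Return (memory_overhead, total_overhead, memory_fraction) for one function."""
    memory = total = 0.0
    for line in annotated_text.splitlines():
        match = ANNOTATED.match(line)
        if not match:
            continue
        overhead = float(match.group(1))
        mnemonic, operands = match.group(2), match.group(3)
        total += overhead
        # lea uses memory-operand syntax but never touches memory.
        if MEMORY_OPERAND.search(operands) and not mnemonic.startswith("lea"):
            memory += overhead
    fraction = memory / total if total else 0.0
    return memory, total, fraction


sample = """\
  8.34%  mov    0x10(%rax,%rcx,8), %rdx
  4.56%  vmovups  (%rsi), %ymm0
  2.10%  add    %rdx, %rbx
  1.00%  lea    0x8(%rax), %rcx
"""
memory, total, fraction = memory_share(sample)
print(f"memory {memory:.2f}% of {total:.2f}% ({fraction:.1%} of function)")
```

A fraction above 0.5 corresponds to the ">50% of function" signal in the list above.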

3.3 Optimization Recommendations

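Automated suggestions based on detected patterns (the script prints these in Chinese; translated here):

code

**Memory-intensive functions (consider optimizing memory access patterns):**
  • searchBaseLayer: 23.45% CPU, of which 78.6% is memory access

💡 Optimization directions:
   - Consider a cache-friendly data layout
   - Reduce indirect accesses and pointer chasing
   - Use SIMD instructions to process data in batches
   - Consider data prefetching (prefetch)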

Step 4: Manual Deep Dive (Optional)

For cases requiring deeper investigation, use perf commands directly. See references/perf_guide.md for comprehensive command reference.

Common manual analysis scenarios:

•Investigate specific function behavior:

bash

perf annotate -i test/profile_output/index_latest_profile_perf -l <function_name> --stdio

•Understand call chains:

bash

perf report -i test/profile_output/index_latest_profile_perf -g --stdio

•Compare different runs:

bash

perf diff perf.data.old perf.data.new

Common Analysis Patterns

Pattern 1: Identifying Cache Misses

Symptoms in report:

•High overhead on simple load instructions: mov (%rax), %rbx with >5% overhead
•Memory operations dominating (>70% of function overhead)
•Random access patterns: (%rax,%rcx,8) with varying rcx

Next steps:

•Check data structure layout (struct member ordering)
•Consider access patterns (sequential vs. random)
•Evaluate prefetching opportunities
•Review cache line utilization
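
Member ordering and cache line utilization can be checked without compiling anything. The following sketch uses `ctypes` to mirror C struct layout rules; the two structs and their field names are hypothetical and do not come from the project, and the 64-byte cache line is an assumption that holds for most x86-64 CPUs.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumption: typical x86-64 cache line size


class Padded(ctypes.Structure):
    # A 1-byte flag on each side of a double forces padding around the double.
    _fields_ = [
        ("visited", ctypes.c_uint8),
        ("distance", ctypes.c_double),
        ("deleted", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Largest member first lets both flags share the tail of the struct.
    _fields_ = [
        ("distance", ctypes.c_double),
        ("visited", ctypes.c_uint8),
        ("deleted", ctypes.c_uint8),
    ]


def per_cache_line(struct_type):
    """Number of whole elements of this struct that fit in one cache line."""
    return CACHE_LINE_BYTES // ctypes.sizeof(struct_type)


for struct_type in (Padded, Reordered):
    print(struct_type.__name__, ctypes.sizeof(struct_type), per_cache_line(struct_type))
```

On a typical 64-bit Linux build the reordered struct is smaller, so more elements fit in each cache line that a hot loop touches.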

Pattern 2: SIMD Optimization Opportunities

Symptoms in report:

•Scalar operations in loops: addss, movss instead of addps, movaps
•High overhead on element-by-element operations
•No vector instructions in hot loops

Next steps:

•Verify data alignment (required for aligned loads/stores such as movaps)
•Use compiler intrinsics or auto-vectorization
•Check loop trip counts (SIMD efficiency improves with more iterations)
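
The scalar-versus-packed symptom can be quantified from annotated assembly. This is a rough sketch, assuming the same `overhead%  mnemonic  operands` layout as the report excerpts above; the mnemonic patterns cover only a handful of common SSE/AVX arithmetic and move instructions and are deliberately not exhaustive (FMA, for example, is not matched).

```python
import re

ANNOTATED = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)")

# Suffix conventions for SSE/AVX floating-point mnemonics:
#   ...ss / ...sd -> scalar single / double (one element per instruction)
#   ...ps / ...pd -> packed single / double (a whole vector per instruction)
SCALAR_FP = re.compile(r"^v?(add|sub|mul|div|mov|sqrt|min|max)s[sd]$")
PACKED_FP = re.compile(r"^v?(add|sub|mul|div|mova|movu|sqrt|min|max)p[sd]$")


def scalar_vs_packed(annotated_text):
    """Sum the overhead spent in scalar and in packed floating-point instructions."""
    totals = {"scalar": 0.0, "packed": 0.0}
    for line in annotated_text.splitlines():
        match = ANNOTATED.match(line)
        if not match:
            continue
        overhead, mnemonic = float(match.group(1)), match.group(2)
        if SCALAR_FP.match(mnemonic):
            totals["scalar"] += overhead
        elif PACKED_FP.match(mnemonic):
            totals["packed"] += overhead
    return totals


sample = """\
  6.20%  movss  (%rdi,%rax,4), %xmm0
  5.10%  subss  (%rsi,%rax,4), %xmm0
  4.80%  mulss  %xmm0, %xmm0
  0.90%  vmovups (%rsi), %ymm0
"""
print(scalar_vs_packed(sample))
```

A hot loop whose overhead is almost entirely in the scalar bucket is a candidate for vectorization.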

Pattern 3: Branch Misprediction

Symptoms in report:

•High overhead on conditional jumps: je, jne, jg
•Code with many if statements showing high cost
•Function overhead doesn't match instruction complexity

Next steps:

•Profile with branch-misses event: perf record -e branch-misses
•Reorganize code to reduce branches
•Use branchless techniques (cmov, select)
•Apply profile-guided optimization (PGO)

Tips for Effective Analysis

Tip 1: Analyze Both Index and Query

Different operations have different performance characteristics:

bash

# Generate both reports
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf -o index_analysis.txt
python3 scripts/analyze_perf.py test/profile_output/query_latest_profile_perf -o query_analysis.txt

Then compare:

•Are the bottlenecks in shared code or phase-specific code?
•Does optimization for index affect query performance?
•Which phase has more optimization headroom?
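
The first comparison question, shared versus phase-specific bottlenecks, reduces to set operations once each report's hotspot list is available as a mapping from function name to overhead. A minimal sketch follows; the overhead numbers and the `addPoint` entry are made up for illustration.

```python
def compare_phases(index_hotspots, query_hotspots):
    """Split hotspot functions into shared and phase-specific groups.

    Each argument maps a function name to its overhead percentage.
    """
    index_names, query_names = set(index_hotspots), set(query_hotspots)
    return {
        "shared": sorted(index_names & query_names),
        "index_only": sorted(index_names - query_names),
        "query_only": sorted(query_names - index_names),
    }


# Illustrative values only, not measurements from the project.
index_hotspots = {"searchBaseLayer": 18.0, "addPoint": 12.5}
query_hotspots = {"searchBaseLayer": 23.45, "__memcpy_avx_unaligned": 8.91}
print(compare_phases(index_hotspots, query_hotspots))
```

Functions in the shared group are the ones where an index-side optimization is most likely to change query performance as well.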

Tip 2: Correlate with Source Code

Always examine the actual source code when reviewing hotspots:

•Run analysis to get function names and line numbers
•Open source files identified in the report
•Understand why those lines are expensive
•Consider algorithmic changes, not just micro-optimizations

Tip 3: Iterate and Verify

Performance optimization is iterative:

•Analyze → Identify bottleneck
•Optimize → Make targeted change
•Profile → Collect new perf data
•Analyze → Verify improvement
•Repeat

Verification checklist:

•Did hotspot overhead decrease?
•Did new hotspots emerge?
•What's the overall performance impact?
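
The first two checklist items can be automated by comparing per-function overhead before and after a change. This is a sketch under assumptions: the inputs are mappings from function name to overhead percentage, the 1% cutoff for a "new hotspot" is arbitrary, and the function names in the example are illustrative.

```python
def verify_change(before, after, new_hotspot_threshold=1.0):
    """Compare per-function overhead (percent) before and after an optimization."""
    deltas = {
        name: after.get(name, 0.0) - overhead
        for name, overhead in before.items()
    }
    return {
        "improved": sorted(name for name, delta in deltas.items() if delta < 0),
        "regressed": sorted(name for name, delta in deltas.items() if delta > 0),
        "emerged": sorted(
            name
            for name, overhead in after.items()
            if name not in before and overhead >= new_hotspot_threshold
        ),
    }


# Illustrative values only.
before = {"searchBaseLayer": 23.45, "__memcpy_avx_unaligned": 8.91}
after = {"searchBaseLayer": 17.0, "__memcpy_avx_unaligned": 9.5, "prefetch_helper": 2.0}
print(verify_change(before, after))
```

Overhead percentages are relative to the total samples of each run, so a function's share can rise simply because another function got cheaper. The third checklist item therefore still needs an absolute measurement such as wall-clock time or throughput.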

Tip 4: Watch for Optimization Side Effects

When optimizing based on perf data:

•Data-cache optimizations may increase code size → more instruction cache misses
•Vectorization may require data reorganization → memory overhead
•Prefetching can pollute cache → harm other operations
•Loop unrolling increases code size → instruction cache pressure

Always measure the overall impact, not just local improvements.

References

For detailed information on perf commands, assembly instructions, and optimization techniques:

•references/perf_guide.md: Complete guide to perf report, perf annotate, assembly instruction reference, memory access patterns, and optimization strategies

Consult this reference when you need to:

•Run manual perf commands for deeper investigation
•Interpret specific assembly instructions
•Understand x86-64 memory addressing modes
•Learn about performance optimization patterns

Troubleshooting

Issue: "No symbols available"

Cause: Binary was stripped or compiled without debug info

Solution:

bash

# Recompile with debug symbols
gcc -g -O3 source.c -o program

# Or with more detailed debug info
gcc -g3 -O3 source.c -o program

Issue: "Permission denied" running perf

Cause: The kernel.perf_event_paranoid setting restricts perf access for unprivileged users

Solution:

bash

# Check current setting
cat /proc/sys/kernel/perf_event_paranoid

# Allow perf for normal users (requires root)
sudo sysctl -w kernel.perf_event_paranoid=1
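
The same check can be done from a script before attempting to record. The sketch below reads the sysctl file and summarizes the level according to the upstream kernel documentation; values above 2 are distribution-specific and typically stricter, so they are reported as such. The helper names are hypothetical.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def describe_paranoid(level):
    """Summarize what unprivileged users may do at a given level."""
    if level <= -1:
        return "no restrictions"
    if level == 0:
        return "CPU-wide events allowed; raw tracepoint access restricted"
    if level == 1:
        return "per-process user and kernel profiling allowed"
    if level == 2:
        return "per-process user-space profiling only"
    return "distribution-specific, typically stricter than level 2"


def read_paranoid(path=PARANOID_PATH):
    """Return the current level, or None if the file cannot be read or parsed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


level = read_paranoid()
print("unreadable" if level is None else f"{level}: {describe_paranoid(level)}")
```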

Issue: Script shows no hotspots

Possible causes:

•Perf data file is empty or corrupt
•All overhead below threshold (--percent-limit)
•No debug symbols available

Debug steps:

bash

# Check file size
ls -lh test/profile_output/index_latest_profile_perf

# Try manual perf report
perf report -i test/profile_output/index_latest_profile_perf --stdio

# Lower the threshold passed to perf report inside analyze_perf.py:
# change '--percent-limit', '0.1' to '--percent-limit', '0.01'
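
The "empty or corrupt" cause can also be ruled out programmatically before running the script. This is a minimal sketch, assuming the file is a regular perf.data file, which begins with the 8-byte magic `PERFILE2`; it checks only the header, not the integrity of the recorded samples, and the function name is hypothetical.

```python
from pathlib import Path

PERF_MAGIC = b"PERFILE2"  # first 8 bytes of a perf.data file


def check_perf_data(path):
    """Return 'ok', 'missing', 'empty', or 'unrecognized header' for a perf data file."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    with path.open("rb") as handle:
        if handle.read(len(PERF_MAGIC)) != PERF_MAGIC:
            return "unrecognized header"
    return "ok"


print(check_perf_data("test/profile_output/index_latest_profile_perf"))
```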

Issue: Assembly instructions unclear

Solution: Refer to references/perf_guide.md for:

•Common x86-64 instruction explanations
•Memory addressing mode examples
•SIMD instruction reference
•Performance characteristics of different instruction types