Perf Profiler
Overview
Automate comprehensive performance analysis of Linux perf profiling data. This skill provides:
- •Hotspot identification: Find the top CPU-consuming functions
- •Source-level annotation: Map performance costs to specific code lines
- •Assembly-level analysis: Identify expensive instructions, especially memory operations
- •Memory access profiling: Pinpoint which memory accesses cause the most overhead
- •Optimization recommendations: Get actionable suggestions based on analysis
This skill is specifically configured to analyze profiling data from the PolarDB competition 2025 project, focusing on index_latest_profile_perf and query_latest_profile_perf in the test/profile_output directory.
Workflow
Step 1: Locate Perf Data Files
The project maintains two perf data files:
test/profile_output/
├── index_latest_profile_perf    # Index phase profiling data
└── query_latest_profile_perf    # Query phase profiling data
When to analyze each:
- •index_latest_profile_perf: For optimizing index building, insertion, or data structure operations
- •query_latest_profile_perf: For optimizing search, retrieval, or query execution
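The phase-to-file mapping above can be captured in a few lines of Python. This is only a sketch: `perf_data_path` is a hypothetical helper and is not part of `scripts/analyze_perf.py`; the directory and file names are the ones listed above.

```python
from pathlib import Path

# Directory and file names come from the layout shown above.
PROFILE_DIR = Path("test/profile_output")
PHASE_FILES = {
    "index": "index_latest_profile_perf",
    "query": "query_latest_profile_perf",
}


def perf_data_path(phase, root=Path(".")):
    """Return the perf data path for a phase ('index' or 'query').

    Hypothetical helper for illustration only.
    """
    if phase not in PHASE_FILES:
        raise ValueError(
            f"unknown phase {phase!r}; expected one of {sorted(PHASE_FILES)}"
        )
    return Path(root) / PROFILE_DIR / PHASE_FILES[phase]
```

For example, `perf_data_path("query")` yields the path to pass to the analysis script when investigating search latency.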
Step 2: Run Automated Analysis
Use the provided script for comprehensive analysis:
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf
Key parameters:
- •-n <number>: Analyze top N hotspot functions (default: 10)
- •-o <file>: Save report to file instead of stdout
Examples:
# Analyze index phase with detailed report
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf -n 15 -o index_report.txt
# Quick query phase analysis
python3 scripts/analyze_perf.py test/profile_output/query_latest_profile_perf
Step 3: Interpret the Report
The generated report contains three sections:
3.1 Hotspot Functions List
Shows top CPU-consuming functions sorted by overhead percentage:
1. 23.45% hnswlib::HierarchicalNSW::searchBaseLayer
2. 15.23% std::vector::operator[]
3. 8.91% __memcpy_avx_unaligned
...
What to look for:
- •Functions with > 10% overhead are critical optimization targets
- •Surprisingly high overhead in utility functions (memcpy, vector access) suggests data structure issues
- •Multiple related functions in top 10 indicate a subsystem bottleneck
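The 10% rule of thumb can be applied mechanically to a saved report. The sketch below assumes the hotspot list has the `rank. percent% name` shape of the sample above; the real report format may differ, and `parse_hotspots` and `critical_targets` are hypothetical names.

```python
import re

# Matches lines such as "1. 23.45% hnswlib::HierarchicalNSW::searchBaseLayer".
# The leading rank ("1.") is optional. The format is assumed from the sample.
HOTSPOT_RE = re.compile(r"^\s*(?:\d+\.\s+)?(\d+(?:\.\d+)?)%\s+(\S.*?)\s*$")


def parse_hotspots(report_text):
    """Return a list of (function_name, overhead_percent) tuples."""
    hotspots = []
    for line in report_text.splitlines():
        match = HOTSPOT_RE.match(line)
        if match:
            hotspots.append((match.group(2), float(match.group(1))))
    return hotspots


def critical_targets(hotspots, threshold=10.0):
    """Return names of functions whose overhead exceeds the threshold."""
    return [name for name, pct in hotspots if pct > threshold]
```

Lines that do not start with a percentage, such as the trailing `...`, are skipped.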
3.2 Function-Level Detailed Analysis
For each top hotspot function, see:
Source-level hotspots: Code lines with significant overhead
5.23% int dist = distance_func(query[i], data[i]);
3.45% if (dist < min_dist) {
Assembly-level hotspots: Expensive instructions
8.34% mov 0x10(%rax,%rcx,8), %rdx
4.56% vmovups (%rsi), %ymm0
Memory access hotspots: Most expensive load/store operations
8.34% (35.5% of function) mov 0x10(%rax,%rcx,8), %rdx
└─ Address: rax+rcx*8+0x10
💡 Total overhead of memory access instructions in this function: 18.45% (78.6% of function)
What to look for:
- •High memory access percentage (>50% of function) → Cache/data layout issue
- •Unaligned vector operations (vmovups) with high overhead → Possible alignment problem (loads may be crossing cache lines)
- •Pointer chasing patterns (sequential dependent loads) → Data structure redesign needed
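To make the ">50% of function" check concrete, here is one way the memory-access share could be estimated from annotated instructions. It is a sketch, not the script's actual logic: it assumes AT&T syntax (as in the samples above) and treats any operand of the form `disp(base,index,scale)` as a memory access, excluding `lea` and `nop`, which use that syntax without touching memory.

```python
import re

# AT&T-syntax memory operand: optional displacement, then (base,index,scale),
# e.g. 0x10(%rax,%rcx,8) or (%rsi).
MEM_OPERAND_RE = re.compile(r"\([^)]*%[a-z0-9]+[^)]*\)")


def is_memory_access(instruction):
    """Heuristically decide whether an instruction reads or writes memory."""
    parts = instruction.split()
    if not parts:
        return False
    mnemonic = parts[0]
    # lea only computes an address; multi-byte nops carry a dummy operand.
    if mnemonic.startswith(("lea", "nop")):
        return False
    return bool(MEM_OPERAND_RE.search(instruction))


def memory_share(samples):
    """samples: list of (overhead_percent, instruction).

    Returns the fraction of the listed overhead spent on memory accesses.
    """
    total = sum(pct for pct, _ in samples)
    if total == 0:
        return 0.0
    mem = sum(pct for pct, ins in samples if is_memory_access(ins))
    return mem / total
```

A result above 0.5 for a hot function corresponds to the cache/data layout signal described above.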
3.3 Optimization Recommendations
Automated suggestions based on detected patterns:
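**Memory-intensive functions (consider optimizing memory access patterns):**
• searchBaseLayer: 23.45% CPU, of which 78.6% is memory access
💡 Optimization directions:
- Consider a cache-friendly data layout
- Reduce indirect accesses and pointer chasing
- Use SIMD instructions to process data in batches
- Consider data prefetching (prefetch)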
Step 4: Manual Deep Dive (Optional)
For cases requiring deeper investigation, use perf commands directly. See references/perf_guide.md for comprehensive command reference.
Common manual analysis scenarios:
- •Investigate specific function behavior:
perf annotate -i test/profile_output/index_latest_profile_perf -l <function_name> --stdio
- •Understand call chains:
perf report -i test/profile_output/index_latest_profile_perf -g --stdio
- •Compare different runs:
perf diff perf.data.old perf.data.new
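If these manual commands are run often, they can be assembled programmatically. The sketch below only builds argument lists mirroring the three commands above (it does not execute perf); the function names are hypothetical, and the resulting lists can be passed to `subprocess.run` on a machine where perf is installed.

```python
import shlex


def annotate_cmd(perf_data, symbol):
    """perf annotate for one symbol, with source line info (-l)."""
    return ["perf", "annotate", "-i", perf_data, "-l", "--stdio", symbol]


def callgraph_cmd(perf_data):
    """perf report with call chains (-g) in text mode."""
    return ["perf", "report", "-i", perf_data, "-g", "--stdio"]


def diff_cmd(old_data, new_data):
    """perf diff between two recordings."""
    return ["perf", "diff", old_data, new_data]


if __name__ == "__main__":
    data = "test/profile_output/index_latest_profile_perf"
    # Print the command lines instead of running them.
    for cmd in (annotate_cmd(data, "searchBaseLayer"), callgraph_cmd(data)):
        print(shlex.join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems with C++ symbol names that contain characters such as `<`, `>` or `::`.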
Common Analysis Patterns
Pattern 1: Identifying Cache Misses
Symptoms in report:
- •High overhead on simple load instructions: mov (%rax), %rbx with >5% overhead
- •Memory operations dominating (>70% of function overhead)
- •Random access patterns: (%rax,%rcx,8) with varying rcx
Next steps:
- •Check data structure layout (struct member ordering)
- •Consider access patterns (sequential vs. random)
- •Evaluate prefetching opportunities
- •Review cache line utilization
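Cache line utilization can be estimated with simple arithmetic before touching any code. The sketch below is a rough model under stated assumptions: a sequential scan over an array of fixed-size elements, elements that do not straddle cache lines, and a 64-byte line (typical for x86-64, but worth checking for the target CPU). `cache_line_utilization` is a hypothetical helper.

```python
import math

CACHE_LINE_BYTES = 64  # assumption: typical x86-64 line size


def cache_line_utilization(stride_bytes, used_bytes, line_bytes=CACHE_LINE_BYTES):
    """Fraction of fetched bytes actually used when scanning an array.

    stride_bytes: distance between consecutive elements (e.g. sizeof(struct))
    used_bytes:   bytes read from each element (must not exceed the stride)
    """
    if not 0 < used_bytes <= stride_bytes:
        raise ValueError("require 0 < used_bytes <= stride_bytes")
    # Lines that must be fetched to cover the used part of one element.
    lines_touched = math.ceil(used_bytes / line_bytes)
    # With small strides several elements share a line, so the cost per
    # element is bounded by the stride itself.
    fetched_per_element = min(stride_bytes, lines_touched * line_bytes)
    return used_bytes / fetched_per_element
```

For instance, reading one 4-byte field from each 16-byte struct uses a quarter of the fetched data, which is the kind of case where reordering struct members or splitting hot and cold fields tends to help.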
Pattern 2: SIMD Optimization Opportunities
Symptoms in report:
- •Scalar operations in loops: addss, movss instead of addps, movaps
- •High overhead on element-by-element operations
- •No vector instructions in hot loops
Next steps:
- •Verify data alignment (required for aligned loads and stores such as movaps)
- •Use compiler intrinsics or auto-vectorization
- •Check loop trip counts (SIMD efficiency improves with more iterations)
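Spotting scalar versus packed floating-point instructions in annotate output can be partly automated. The following sketch classifies a mnemonic by its SSE/AVX suffix (`ss`/`sd` for scalar, `ps`/`pd` for packed). The table of base operations is deliberately small and is an assumption of this example, so anything outside it is reported as unknown.

```python
# Small, illustrative table of floating-point operation stems.
FP_BASES = {"add", "sub", "mul", "div", "sqrt", "min", "max",
            "mov", "mova", "movu"}


def classify_fp(mnemonic):
    """Return 'scalar', 'packed', or None if the mnemonic is not recognized.

    Note: 'movsd' is also the name of a string-move instruction; telling
    the two apart requires looking at the operands, which is not done here.
    """
    m = mnemonic.lower()
    if m.startswith("v"):  # AVX forms carry a 'v' prefix, e.g. vmovups
        m = m[1:]
    base, suffix = m[:-2], m[-2:]
    if base not in FP_BASES:
        return None
    if suffix in ("ss", "sd"):
        return "scalar"
    if suffix in ("ps", "pd"):
        return "packed"
    return None


def scalar_fraction(mnemonics):
    """Fraction of recognized FP instructions that are scalar."""
    kinds = [k for k in map(classify_fp, mnemonics) if k is not None]
    if not kinds:
        return 0.0
    return kinds.count("scalar") / len(kinds)
```

A hot loop whose recognized floating-point instructions are mostly scalar is a candidate for the next steps listed above.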
Pattern 3: Branch Misprediction
Symptoms in report:
- •High overhead on conditional jumps: je, jne, jg
- •Code with many if statements showing high cost
- •Function overhead doesn't match instruction complexity
Next steps:
- •Profile with branch-misses event: perf record -e branch-misses
- •Reorganize code to reduce branches
- •Use branchless techniques (cmov, select)
- •Apply profile-guided optimization (PGO)
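When branch counters are collected (for example with the `branches` and `branch-misses` events in `perf stat`), the misprediction rate is a simple ratio. The helper below is a hypothetical illustration; what counts as a "high" rate depends on the workload and is not defined by this document.

```python
def misprediction_rate(branches, branch_misses):
    """Return branch_misses / branches, or 0.0 when no branches were counted."""
    if branches < 0 or branch_misses < 0:
        raise ValueError("counts must be non-negative")
    if branches == 0:
        return 0.0
    return branch_misses / branches
```

Comparing this ratio before and after a change shows whether a branchless rewrite or PGO actually reduced mispredictions, rather than merely moving overhead elsewhere.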
Tips for Effective Analysis
Tip 1: Analyze Both Index and Query
Different operations have different performance characteristics:
# Generate both reports
python3 scripts/analyze_perf.py test/profile_output/index_latest_profile_perf -o index_analysis.txt
python3 scripts/analyze_perf.py test/profile_output/query_latest_profile_perf -o query_analysis.txt
Then compare:
- •Are the bottlenecks in shared code or phase-specific code?
- •Does optimization for index affect query performance?
- •Which phase has more optimization headroom?
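The first comparison question can be answered with set operations once each report's hotspots are available as a name-to-percentage mapping. This is a sketch with a hypothetical function name; the input dictionaries are assumed to have been extracted from the two reports by some other means.

```python
def split_by_phase(index_hotspots, query_hotspots):
    """Group hotspot function names into shared and phase-specific sets.

    Both arguments map function name -> overhead percent.
    """
    index_names = set(index_hotspots)
    query_names = set(query_hotspots)
    return {
        "shared": sorted(index_names & query_names),
        "index_only": sorted(index_names - query_names),
        "query_only": sorted(query_names - index_names),
    }
```

Functions in the `shared` group deserve extra care, since a change that helps one phase may hurt the other.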
Tip 2: Correlate with Source Code
Always examine the actual source code when reviewing hotspots:
- •Run analysis to get function names and line numbers
- •Open source files identified in the report
- •Understand why those lines are expensive
- •Consider algorithmic changes, not just micro-optimizations
Tip 3: Iterate and Verify
Performance optimization is iterative:
- •Analyze → Identify bottleneck
- •Optimize → Make targeted change
- •Profile → Collect new perf data
- •Analyze → Verify improvement
- •Repeat
Verification checklist:
- •Did hotspot overhead decrease?
- •Did new hotspots emerge?
- •What's the overall performance impact?
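The first two checklist items can be checked mechanically by comparing hotspot percentages from two runs. The sketch below uses hypothetical names and an arbitrary 1% cutoff for reporting new hotspots. Note that perf percentages are relative to each run's own total, so the third item still needs an independent wall-clock or throughput measurement.

```python
def compare_runs(before, after, new_hotspot_threshold=1.0):
    """Compare two name -> overhead percent mappings.

    Returns (decreased, emerged):
      decreased: name -> percentage-point drop for functions that got cheaper
      emerged:   names absent before that now meet the threshold
    """
    decreased = {}
    for name, old_pct in before.items():
        new_pct = after.get(name, 0.0)
        if new_pct < old_pct:
            decreased[name] = round(old_pct - new_pct, 2)
    emerged = sorted(
        name for name, pct in after.items()
        if name not in before and pct >= new_hotspot_threshold
    )
    return decreased, emerged
```

An empty `decreased` result after an optimization attempt suggests the change did not affect the profile's hot functions.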
Tip 4: Watch for Optimization Side Effects
When optimizing based on perf data:
- •Cache optimization may increase code size → more instruction-cache misses
- •Vectorization may require data reorganization → memory overhead
- •Prefetching can pollute cache → harm other operations
- •Loop unrolling increases code size → instruction cache pressure
Always measure the overall impact, not just local improvements.
References
For detailed information on perf commands, assembly instructions, and optimization techniques:
- •references/perf_guide.md: Complete guide to perf report, perf annotate, assembly instruction reference, memory access patterns, and optimization strategies
Consult this reference when:
- •You need to run manual perf commands for deeper investigation
- •Interpreting specific assembly instructions
- •Understanding x86-64 memory addressing modes
- •Learning about performance optimization patterns
Troubleshooting
Issue: "No symbols available"
Cause: Binary was stripped or compiled without debug info
Solution:
# Recompile with debug symbols
gcc -g -O3 source.c -o program
# Or with more detailed debug info
gcc -g3 -O3 source.c -o program
Issue: "Permission denied" running perf
Cause: The kernel.perf_event_paranoid setting is too restrictive for the current user
Solution:
# Check current setting
cat /proc/sys/kernel/perf_event_paranoid
# Allow perf for normal users (requires root)
sudo sysctl -w kernel.perf_event_paranoid=1
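To see what a given setting implies, the value can be mapped to the restrictions described in the mainline kernel's sysctl documentation. This is a sketch: the wording of each restriction is paraphrased, some distributions define additional higher levels that are not modeled here, and the function names are hypothetical.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")

# (minimum level, restriction on unprivileged users), paraphrased from the
# mainline kernel documentation.
RESTRICTIONS = [
    (0, "no raw or ftrace function tracepoint access"),
    (1, "no CPU-wide event access"),
    (2, "no kernel profiling"),
]


def restrictions_for(level):
    """Return the restrictions that apply to unprivileged users at a level."""
    return [text for minimum, text in RESTRICTIONS if level >= minimum]


def read_paranoid_level(path=PARANOID_PATH):
    """Read the current setting (Linux only)."""
    return int(Path(path).read_text().strip())
```

With the value 1 suggested above, `restrictions_for(1)` no longer includes the kernel-profiling restriction, which is why kernel symbols become visible to normal users.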
Issue: Script shows no hotspots
Possible causes:
- •Perf data file is empty or corrupt
- •All overhead below threshold (--percent-limit)
- •No debug symbols available
Debug steps:
# Check file size
ls -lh test/profile_output/index_latest_profile_perf
# Try manual perf report
perf report -i test/profile_output/index_latest_profile_perf --stdio
# Lower threshold in script (edit analyze_perf.py)
'--percent-limit', '0.01'  # Instead of '0.1'
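The "empty or corrupt" case can be screened for before running perf at all. The sketch below checks that the file exists, is non-empty, and begins with the 8-byte magic `PERFILE2` that perf writes at the start of its data files. Treat the magic check as an assumption to verify against your perf version; a matching header does not guarantee that the rest of the file is intact. `check_perf_data` is a hypothetical helper.

```python
from pathlib import Path

PERF_MAGIC = b"PERFILE2"  # assumed header magic of perf data files


def check_perf_data(path):
    """Return 'missing', 'empty', 'unrecognized header', or 'ok'."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    with p.open("rb") as f:
        head = f.read(len(PERF_MAGIC))
    if head != PERF_MAGIC:
        return "unrecognized header"
    return "ok"
```

If this reports `ok` but the script still shows no hotspots, the threshold and debug-symbol causes listed above are the more likely explanations.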
Issue: Assembly instructions unclear
Solution: Refer to references/perf_guide.md for:
- •Common x86-64 instruction explanations
- •Memory addressing mode examples
- •SIMD instruction reference
- •Performance characteristics of different instruction types