CPU Profiling for Valkyria

Profile CPU usage to find performance bottlenecks, hot functions, and optimization opportunities.

Linux: perf

perf is the standard Linux profiler with low overhead.

Basic Profiling

bash

# Record CPU samples (99 Hz to avoid lockstep sampling)
perf record -F 99 -g build/valk src/prelude.valk test/test_prelude.valk

# View report
perf report

# Text report
perf report --stdio

# View annotated source (requires debug symbols)
perf annotate valk_lval_eval

Profiling Options

bash

# System-wide profiling
perf record -F 99 -a -g -- sleep 30

# Profile specific process
perf record -F 99 -p $(pgrep valk) -g -- sleep 30

# Profile with call graph (frame pointer)
perf record -F 99 -g --call-graph fp build/valk src/prelude.valk

# Profile with DWARF unwinding (works without frame pointers)
perf record -F 99 -g --call-graph dwarf build/valk src/prelude.valk

# Profile with LBR (Last Branch Record, Intel)
perf record -F 99 -g --call-graph lbr build/valk src/prelude.valk

perf stat (Counters)

bash

# Basic CPU counters
perf stat build/valk src/prelude.valk test/test_prelude.valk

# Detailed counters
perf stat -d build/valk src/prelude.valk

# Specific events
perf stat -e cycles,instructions,cache-misses,branch-misses build/valk src/prelude.valk

# IPC (Instructions Per Cycle)
perf stat -e cycles,instructions build/valk src/prelude.valk
# IPC = instructions / cycles (higher is better, max ~4-6 on modern CPUs)
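
The IPC division can be scripted rather than done by hand. A minimal sketch, assuming perf stat's CSV mode (-x,) puts the counter value in field 1 and the event name in field 3, and that events are named cycles and instructions (optionally with a suffix such as :u). ipc_from_csv is a hypothetical helper name, not part of perf.

```shell
# ipc_from_csv: read `perf stat -x,` output on stdin, print instructions / cycles.
# Assumption: CSV field 1 is the count and field 3 is the event name.
ipc_from_csv() {
  awk -F, '
    $3 ~ /^cycles/       { cycles = $1 }
    $3 ~ /^instructions/ { insns  = $1 }
    END {
      # "+ 0" forces a numeric comparison, so "<not counted>" is treated as 0
      if (cycles + 0 > 0) printf "%.2f\n", insns / cycles
      else { print "n/a"; exit 1 }
    }'
}

# Usage (perf stat writes the counters to the file given with -o):
#   perf stat -x, -o stat.csv -e cycles,instructions build/valk src/prelude.valk
#   ipc_from_csv < stat.csv
```

On hybrid CPUs, where perf prefixes event names (for example cpu_core/cycles/), the patterns would need adjusting.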

perf top (Live View)

bash

# Live profiling (system-wide)
perf top

# For specific process
perf top -p $(pgrep valk)

# With call graph
perf top -g

Advanced Events

bash

# Cache analysis
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
  build/valk src/prelude.valk

# Branch prediction
perf stat -e branches,branch-misses build/valk src/prelude.valk

# L3 load misses (Intel-specific event name; confirm it exists with perf list)
perf stat -e mem_load_retired.l3_miss build/valk src/prelude.valk

# List available events
perf list

Specific Analysis

bash

# Find cache-miss hotspots
perf record -e cache-misses -g build/valk src/prelude.valk
perf report

# Find branch misprediction hotspots
perf record -e branch-misses -g build/valk src/prelude.valk
perf report

# Memory access patterns
perf mem record build/valk src/prelude.valk
perf mem report

macOS: Instruments

Instruments is Apple's profiling tool with a rich GUI.

Time Profiler

bash

# GUI
open -a Instruments

# Or launch from Xcode: Product -> Profile (Cmd+I)

In Instruments:

•Choose "Time Profiler" template
•Click the target dropdown, choose "Other..."
•Navigate to build/valk
•Add arguments: src/prelude.valk test/test_prelude.valk
•Click Record

Command-Line Profiling

bash

# sample (ships with macOS): wait for a process named valk to start, then sample it
sample valk -wait -file profile.txt

# Attach to running process
sample $(pgrep valk) 10 -file profile.txt  # 10 seconds

# View profile
cat profile.txt

xctrace (Modern CLI)

bash

# Record with Time Profiler
xcrun xctrace record --template "Time Profiler" \
  --output profile.trace \
  --launch -- build/valk src/prelude.valk

# View in Instruments
open profile.trace

# Export the trace's table of contents as XML (export needs --toc or --xpath)
xcrun xctrace export --input profile.trace --toc --output profile.xml

Activity Monitor

For quick checks, Activity Monitor (Applications/Utilities) shows:

•CPU usage per process
•Threads
•Energy impact
•Sample process (double-click process -> Sample)

Interpreting Results

perf report Output

code

Overhead  Command  Shared Object     Symbol
  25.00%  valk     valk              [.] valk_lval_eval
  15.00%  valk     valk              [.] valk_gc_mark
  10.00%  valk     libc.so.6         [.] malloc
   5.00%  valk     [kernel.vmlinux]  [k] copy_page

•Overhead: Percentage of samples in this function
•Command: Process name
•Shared Object: Binary/library
•Symbol: Function name ([.] = user, [k] = kernel)

Hot Spots

•Look for functions with high overhead (>5%)
•Check if kernel time is significant (may indicate I/O or syscall overhead)
•Look at call graph to understand context
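
The >5% rule can be applied mechanically to a text report. A minimal sketch, assuming the column layout shown under "perf report Output" (one overhead column, then a [.] or [k] marker right before the symbol) and C symbols without spaces. hot_symbols is a hypothetical helper name.

```shell
# hot_symbols: read `perf report --stdio` text on stdin and print
# "<overhead> <symbol>" for entries strictly above a threshold (default 5).
hot_symbols() {
  threshold="${1:-5}"
  awk -v min="$threshold" '
    # Entry lines start with a percentage; headers and comments do not.
    $1 ~ /^[0-9.]+%$/ {
      pct = $1
      sub(/%/, "", pct)
      # The symbol is the field after the one-character marker such as [.] or [k]
      for (i = 2; i < NF; i++)
        if ($i ~ /^\[.\]$/) {
          if (pct + 0 > min + 0) print pct, $(i + 1)
          break
        }
    }'
}

# Usage:
#   perf report --stdio --no-children -g none | hot_symbols 5
```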

Instruments Analysis

•Call Tree: Shows time spent in each function including callees
•Heaviest Stack Trace: Path taking most time
•Self Weight: Time in function itself (not callees)
•Total Weight: Time including callees

Profiling Scenarios

Find CPU Hotspots

bash

# Linux
perf record -F 99 -g build/valk src/prelude.valk test/test_prelude.valk
perf report --stdio --no-children -g none --sort symbol | head -30

# macOS
sample valk 10 -file profile.txt
grep -A20 "Sort by top of stack" profile.txt

Check IPC (Instruction Efficiency)

bash

# Linux
perf stat -e cycles,instructions,cache-references,cache-misses build/valk src/prelude.valk

# Interpret:
# IPC < 1.0 = likely memory-bound
# IPC > 2.0 = compute-efficient
# High cache-miss ratio = memory access pattern issues
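
The thresholds above can be wrapped in a small helper that takes the four counts from perf stat. A minimal sketch: classify_counters is a hypothetical name, the "in between" label for IPC from 1.0 to 2.0 is an assumption, and the cut-offs are rules of thumb rather than hard limits.

```shell
# classify_counters CYCLES INSTRUCTIONS CACHE_REFERENCES CACHE_MISSES
# Prints IPC, a rough label, and the cache miss ratio.
classify_counters() {
  awk -v cyc="$1" -v ins="$2" -v refs="$3" -v miss="$4" 'BEGIN {
    if (cyc + 0 <= 0 || refs + 0 <= 0) {
      print "need non-zero cycles and cache-references"
      exit 1
    }
    ipc = ins / cyc
    if (ipc < 1.0)      hint = "likely memory-bound"
    else if (ipc > 2.0) hint = "compute-efficient"
    else                hint = "in between"
    printf "IPC %.2f (%s), cache miss ratio %.1f%%\n", ipc, hint, miss / refs * 100
  }'
}

# Usage, with counts copied from perf stat output:
#   classify_counters 1000 800 200 25
```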

Profile Memory Allocation

bash

# Linux - trace malloc/free (perf probe usually needs root; the libc path varies by distribution)
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --add malloc
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --add free
perf record -e probe_libc:malloc,probe_libc:free -g build/valk src/prelude.valk
perf report

# Or use BPF (if available)
# sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @[ustack] = count(); }'

Compare Before/After

bash

# Baseline
perf stat -r 5 build/valk src/prelude.valk test/test_prelude.valk 2>&1 | tee baseline.txt

# After optimization
perf stat -r 5 build/valk src/prelude.valk test/test_prelude.valk 2>&1 | tee optimized.txt

# Compare
diff baseline.txt optimized.txt
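
A plain diff shows every changed number, so it can help to pull out just the mean elapsed time. A minimal sketch, assuming perf stat -r prints a line of the form "<mean> +- <stddev> seconds time elapsed" and uses a dot as the decimal separator. elapsed_mean and percent_change are hypothetical helper names.

```shell
# elapsed_mean FILE: print the mean wall-clock seconds from saved perf stat output.
elapsed_mean() {
  awk '/seconds time elapsed/ { print $1; exit }' "$1"
}

# percent_change BASELINE OPTIMIZED: print the relative change in elapsed time.
# Negative values mean the optimized run is faster.
percent_change() {
  before=$(elapsed_mean "$1")
  after=$(elapsed_mean "$2")
  awk -v b="$before" -v a="$after" 'BEGIN {
    if (b + 0 <= 0) { print "no baseline time found"; exit 1 }
    printf "%+.1f%%\n", (a - b) / b * 100
  }'
}

# Usage:
#   percent_change baseline.txt optimized.txt
```

Compare the change against the +- spread perf stat reports before treating a small difference as real.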

Project-Specific Profiling

Profile GC

bash

# Focus on GC functions
perf record -g build/valk src/prelude.valk test/test_prelude.valk
perf report --symbol-filter='gc'

Profile Parser

bash

# Use perf with source annotation
perf record -g build/valk src/prelude.valk
perf annotate valk_lval_read
perf annotate valk_lval_eval

Profile Event Loop

bash

# Focus on libuv/event loop
perf record -g build/valk src/prelude.valk
perf report --symbol-filter='uv_'

Building for Profiling

bash

# Preserve frame pointers so perf can walk call stacks (for call graphs).
# Add this flag to CFLAGS or to the compile options in CMakeLists.txt:
#   -fno-omit-frame-pointer

# Or use DWARF unwinding with perf:
perf record --call-graph dwarf build/valk src/prelude.valk
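
One way to pass the flag at configure time without editing CMakeLists.txt. A minimal sketch: profiling_cflags is a hypothetical helper, and the -O2 -g part is an assumption (optimized code with debug symbols), not something the project prescribes.

```shell
# profiling_cflags: print compiler flags suited to perf call graphs.
# Assumption: -O2 -g is an acceptable optimization/debug level for profiling builds.
profiling_cflags() {
  printf '%s\n' "-O2 -g -fno-omit-frame-pointer"
}

# Usage with CMake (assumes the build directory is build/):
#   cmake -S . -B build -DCMAKE_C_FLAGS="$(profiling_cflags)"
#   cmake --build build
```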

Tips

•Use 99Hz or 997Hz sampling to avoid lockstep artifacts
•Profile realistic workloads, not micro-benchmarks
•Check kernel vs user time - high kernel time may point to syscall or I/O overhead
•Use -g for call graphs - essential for understanding context
•Profile multiple runs for statistical significance (perf stat -r 5)
•Look at IPC first - low IPC suggests memory/cache issues
•Use perf annotate to see hot instructions
•Generate flame graphs for visualization (see flamegraph skill)
•On macOS, use Instruments for detailed analysis
•Compare profiles before/after changes quantitatively