AgentSkillsCN

performance-analysis

利用tgs_tagger、TraceLens与IRLens分析MaxText训练任务的性能。当用户希望分析一次训练运行、剖析追踪信息、查看HLO IR、TGS指标、GPU利用率,或提及tag_tgs、TraceLens、IRLens、xplane,或进行性能分析时,可使用此功能。

SKILL.md
--- frontmatter
name: performance-analysis
description: Analyze MaxText training job performance using tgs_tagger, TraceLens, and IRLens. Use when the user asks to analyze a training run, profile traces, HLO IR, TGS metrics, GPU utilization, or mentions tag_tgs, TraceLens, IRLens, xplane, or performance analysis.

MaxText Performance Analysis

Post-training (or mid-training) analysis pipeline. Follow the workflow below from top to bottom.

Multi-job comparisons: If comparing two or more jobs (e.g., "why is job B slower than job A?"), start with skills/tsdb-diagnosis/SKILL.md (Multi-Job Comparison workflow) before running TraceLens. The TSDB reveals system-level root causes — CPU contention from RCCL resource leaks, network errors, I/O pressure, thermal throttling — that TraceLens cannot observe (it only sees GPU-side kernel timings). Only proceed to TraceLens here if the TSDB comparison is inconclusive.

Workflow

Step 1: Run the dispatcher

bash
python3 utils/analyze_job.py /outputs/<job>.log
python3 utils/analyze_job.py /outputs/<job_dir>/
python3 utils/analyze_job.py /outputs/local_2026*

For running jobs, pass -f to force re-analysis (bypasses staleness check):

bash
python3 utils/analyze_job.py -f /outputs/<job>.log

The dispatcher auto-detects available artifacts and runs only the relevant tools:

  • Log with TGS datatgs_tagger.py
  • *.xplane.pbTraceLens_generate_perf_report_jax
  • xla_dump/*.gpu_after_optimizations.txtIRLens_analyze_hlo_ir.py

Step 2: Handle TraceLens if needed

If the dispatcher output says "TraceLens not installed" and xplane traces exist:

  1. Check if TraceLens is already installed and patched before doing anything:

    bash
    python3 -c "
    import TraceLens.util, inspect
    src = inspect.getsource(TraceLens.util.DataLoader.load_data)
    assert 'xprof' in src, 'not patched'
    print('TraceLens: installed and patched')
    "
    
    • Succeeds → TraceLens is ready. Just re-run: python3 utils/analyze_job.py -f /outputs/<job>.log
    • ImportError → not installed. Install then patch (see below).
    • AssertionError → installed but unpatched. Patch only (see below).
  2. Install (only if import failed):

    bash
    pip install git+https://github.com/AMD-AGI/TraceLens.git
    
  3. Patch (only if the xprof assertion failed). Apply all patches from tracelens-patches.md — 6 files, ~13 patches. Key fixes:

    • protobuf/xprof import errors (TF 2.19+ renamed tensorboard_plugin_profile to xprof)
    • GPU PID remapping (xprof remaps device PIDs to 1001+; code filtering pid < 100 misses all GPU events)
    • metadata_events not passed to build_tree()
    • KeyError on gpu_kernel_op_cat and missing parent events for launch latency
  4. Re-run the dispatcher with -f:

    bash
    python3 utils/analyze_job.py -f /outputs/<job>.log
    

This is one-time per environment. Always check before patching to avoid redundant work.

Step 3: Read results

Read the generated analysis.json — but do NOT try to read the raw file (it can be 40K+ lines due to per-step arrays). Extract key metrics programmatically:

bash
python3 -c "
import json, sys
with open('<job_dir>/analysis.json') as f:
    d = json.load(f)
print(f'Job: {d[\"job_id\"]} | Model: {d[\"model\"]} | Nodes: {d[\"num_nodes\"]} | Status: {d[\"job_status\"][\"status\"]}')
tgs = d['tgs']
print(f'Steady TGS: {tgs[\"steady\"][\"mean\"]:.1f} (std={tgs[\"steady\"][\"std\"]:.1f}, steps {tgs[\"steady\"][\"range\"]})')
print(f'Tail   TGS: {tgs[\"tail\"][\"mean\"]:.1f} (std={tgs[\"tail\"][\"std\"]:.1f}, steps {tgs[\"tail\"][\"range\"]})')
tl = d.get('tracelens_summary', {})
if tl:
    print(f'Compute: {tl[\"computation_time\"]:.1f}% | Exposed comm: {tl[\"exposed_comm_time\"]:.1f}% | Idle: {tl[\"idle_time\"]:.2f}% | Total comm: {tl[\"total_comm_time\"]:.1f}%')
"

For deeper TraceLens analysis, read the CSVs in <job_dir>/tracelens/<timestamp>/csvs/:

  • gpu_events_averages.csv — per-GPU compute/comm/idle breakdown (averages)
  • gpu_timeline.csv — per-GPU breakdown with pid
  • kernel_launchers_summary_by_category.csv — time by kernel category (GEMM, NCCL, XLA fusions, etc.)
  • kernel_launchers_summary.csv — time by individual kernel name

Step 4: Summarize findings

Present results using this structure:

MetricSourceWhat to look for
TGS (steady-state)analysis.jsontgs.steadyPrimary throughput metric
MFUanalysis.jsonmfu_per_stepModel FLOPS utilization (if available)
GPU compute %tracelens_summary.computation_timeTime on actual compute kernels
Exposed comm %tracelens_summary.exposed_comm_timeCommunication NOT overlapped with compute (lower is better)
Idle %tracelens_summary.idle_timeGPU doing nothing (should be near 0)
Kernel breakdownkernel_launchers_summary_by_category.csvGEMM vs NCCL vs fusion time
Comm ops per stepdispatcher IRLens outputCount of all-reduce, all-gather, all-to-all, reduce-scatter

Interpretation guidelines:

  • High exposed comm % → opportunities for better comm/compute overlap
  • Large per-GPU variance in compute % → load imbalance
  • High idle % → scheduling or synchronization issues
  • Tail TGS std much larger than steady std → periodic overhead (checkpointing, profiling)

Step 5: Ensure dashboard is running

Check the dispatcher output first — it prints a Dashboard: line at the end. If it shows a URL with (running), use that URL.

If the dashboard is not running, start it:

bash
pip install fastapi uvicorn   # one-time
utils/perf_server.py --host 0.0.0.0 &

Always tell the user the dashboard URL: http://<host>:<PORT>

The server auto-detects a free port starting from 8080 and auto-reloads analysis.json on each request.

Reference

Job output layout

code
/outputs/<JOB_ID>-<JOB_NAME>[-TGS_<VALUE>]/
  log -> ../<log_file>                          # symlink to log file
  analysis.json                                 # structured metrics
  xla_dump/                                     # if _env_ENABLE_XLA_DUMP=1
    module_NNNN.jit_train_step.*_gpu_after_optimizations.txt
  <run_name>/tensorboard/plugins/profile/<ts>/
    <hostname>.xplane.pb                        # if profiler=xplane
  tracelens/<ts>/csvs/*.csv                     # created by TraceLens

The .log file sits alongside the directory in /outputs/.

When enable_checkpointing=true, profiler traces may end up in a shared directory outside the job dir. analyze_job.py parses Config param tensorboard_dir from the log to locate these. The dispatcher and perf_server.py filter profiles by job execution time window and node-0 hostname to disambiguate.

Running individual tools directly

These are rarely needed — analyze_job.py orchestrates them. Use only for targeted re-runs.

bash
# TGS tagging
utils/tag_tgs.sh <log_file_or_glob>
utils/tag_tgs.sh -f <log_file>       # force on running job

# IRLens
utils/IRLens_analyze_hlo_ir.py <hlo_file>
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op communication
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op computation

# TraceLens
TraceLens_generate_perf_report_jax \
    --profile_path <xplane.pb> \
    --output_csvs_dir <output_dir>/csvs

Running jobs

  • The dispatcher detects running jobs via the JOB SUMMARY log marker and file modification time (15 min threshold).
  • analyze_job.py -f bypasses the staleness check but never renames files for running jobs. Renames happen automatically on the next analysis after the job finishes.
  • TraceLens needs a completed profiler trace; skipped if *.xplane.pb doesn't exist yet.
  • IRLens works on running jobs if xla_dump/ is already populated.