MaxText Performance Analysis
Post-training (or mid-training) analysis pipeline. Follow the workflow below from top to bottom.
Multi-job comparisons: If comparing two or more jobs (e.g., "why is job B slower than job A?"), start with skills/tsdb-diagnosis/SKILL.md (Multi-Job Comparison workflow) before running TraceLens. The TSDB reveals system-level root causes — CPU contention from RCCL resource leaks, network errors, I/O pressure, thermal throttling — that TraceLens cannot observe (it only sees GPU-side kernel timings). Only proceed to TraceLens here if the TSDB comparison is inconclusive.
Workflow
Step 1: Run the dispatcher
python3 utils/analyze_job.py /outputs/<job>.log python3 utils/analyze_job.py /outputs/<job_dir>/ python3 utils/analyze_job.py /outputs/local_2026*
For running jobs, pass -f to force re-analysis (bypasses staleness check):
python3 utils/analyze_job.py -f /outputs/<job>.log
The dispatcher auto-detects available artifacts and runs only the relevant tools:
- •Log with TGS data →
tgs_tagger.py - •
*.xplane.pb→TraceLens_generate_perf_report_jax - •
xla_dump/*.gpu_after_optimizations.txt→IRLens_analyze_hlo_ir.py
Step 2: Handle TraceLens if needed
If the dispatcher output says "TraceLens not installed" and xplane traces exist:
- •
Check if TraceLens is already installed and patched before doing anything:
bashpython3 -c " import TraceLens.util, inspect src = inspect.getsource(TraceLens.util.DataLoader.load_data) assert 'xprof' in src, 'not patched' print('TraceLens: installed and patched') "- •Succeeds → TraceLens is ready. Just re-run:
python3 utils/analyze_job.py -f /outputs/<job>.log - •ImportError → not installed. Install then patch (see below).
- •AssertionError → installed but unpatched. Patch only (see below).
- •Succeeds → TraceLens is ready. Just re-run:
- •
Install (only if import failed):
bashpip install git+https://github.com/AMD-AGI/TraceLens.git
- •
Patch (only if the
xprofassertion failed). Apply all patches from tracelens-patches.md — 6 files, ~13 patches. Key fixes:- •protobuf/xprof import errors (TF 2.19+ renamed
tensorboard_plugin_profiletoxprof) - •GPU PID remapping (
xprofremaps device PIDs to 1001+; code filteringpid < 100misses all GPU events) - •
metadata_eventsnot passed tobuild_tree() - •
KeyErrorongpu_kernel_op_catand missing parent events for launch latency
- •protobuf/xprof import errors (TF 2.19+ renamed
- •
Re-run the dispatcher with
-f:bashpython3 utils/analyze_job.py -f /outputs/<job>.log
This is one-time per environment. Always check before patching to avoid redundant work.
Step 3: Read results
Read the generated analysis.json — but do NOT try to read the raw file (it can be 40K+ lines due to per-step arrays). Extract key metrics programmatically:
python3 -c "
import json, sys
with open('<job_dir>/analysis.json') as f:
d = json.load(f)
print(f'Job: {d[\"job_id\"]} | Model: {d[\"model\"]} | Nodes: {d[\"num_nodes\"]} | Status: {d[\"job_status\"][\"status\"]}')
tgs = d['tgs']
print(f'Steady TGS: {tgs[\"steady\"][\"mean\"]:.1f} (std={tgs[\"steady\"][\"std\"]:.1f}, steps {tgs[\"steady\"][\"range\"]})')
print(f'Tail TGS: {tgs[\"tail\"][\"mean\"]:.1f} (std={tgs[\"tail\"][\"std\"]:.1f}, steps {tgs[\"tail\"][\"range\"]})')
tl = d.get('tracelens_summary', {})
if tl:
print(f'Compute: {tl[\"computation_time\"]:.1f}% | Exposed comm: {tl[\"exposed_comm_time\"]:.1f}% | Idle: {tl[\"idle_time\"]:.2f}% | Total comm: {tl[\"total_comm_time\"]:.1f}%')
"
For deeper TraceLens analysis, read the CSVs in <job_dir>/tracelens/<timestamp>/csvs/:
- •
gpu_events_averages.csv— per-GPU compute/comm/idle breakdown (averages) - •
gpu_timeline.csv— per-GPU breakdown with pid - •
kernel_launchers_summary_by_category.csv— time by kernel category (GEMM, NCCL, XLA fusions, etc.) - •
kernel_launchers_summary.csv— time by individual kernel name
Step 4: Summarize findings
Present results using this structure:
| Metric | Source | What to look for |
|---|---|---|
| TGS (steady-state) | analysis.json → tgs.steady | Primary throughput metric |
| MFU | analysis.json → mfu_per_step | Model FLOPS utilization (if available) |
| GPU compute % | tracelens_summary.computation_time | Time on actual compute kernels |
| Exposed comm % | tracelens_summary.exposed_comm_time | Communication NOT overlapped with compute (lower is better) |
| Idle % | tracelens_summary.idle_time | GPU doing nothing (should be near 0) |
| Kernel breakdown | kernel_launchers_summary_by_category.csv | GEMM vs NCCL vs fusion time |
| Comm ops per step | dispatcher IRLens output | Count of all-reduce, all-gather, all-to-all, reduce-scatter |
Interpretation guidelines:
- •High exposed comm % → opportunities for better comm/compute overlap
- •Large per-GPU variance in compute % → load imbalance
- •High idle % → scheduling or synchronization issues
- •Tail TGS std much larger than steady std → periodic overhead (checkpointing, profiling)
Step 5: Ensure dashboard is running
Check the dispatcher output first — it prints a Dashboard: line at the end. If it shows a URL with (running), use that URL.
If the dashboard is not running, start it:
pip install fastapi uvicorn # one-time utils/perf_server.py --host 0.0.0.0 &
Always tell the user the dashboard URL: http://<host>:<PORT>
The server auto-detects a free port starting from 8080 and auto-reloads analysis.json on each request.
Reference
Job output layout
/outputs/<JOB_ID>-<JOB_NAME>[-TGS_<VALUE>]/
log -> ../<log_file> # symlink to log file
analysis.json # structured metrics
xla_dump/ # if _env_ENABLE_XLA_DUMP=1
module_NNNN.jit_train_step.*_gpu_after_optimizations.txt
<run_name>/tensorboard/plugins/profile/<ts>/
<hostname>.xplane.pb # if profiler=xplane
tracelens/<ts>/csvs/*.csv # created by TraceLens
The .log file sits alongside the directory in /outputs/.
When enable_checkpointing=true, profiler traces may end up in a shared directory outside the job dir. analyze_job.py parses Config param tensorboard_dir from the log to locate these. The dispatcher and perf_server.py filter profiles by job execution time window and node-0 hostname to disambiguate.
Running individual tools directly
These are rarely needed — analyze_job.py orchestrates them. Use only for targeted re-runs.
# TGS tagging
utils/tag_tgs.sh <log_file_or_glob>
utils/tag_tgs.sh -f <log_file> # force on running job
# IRLens
utils/IRLens_analyze_hlo_ir.py <hlo_file>
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op communication
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op computation
# TraceLens
TraceLens_generate_perf_report_jax \
--profile_path <xplane.pb> \
--output_csvs_dir <output_dir>/csvs
Running jobs
- •The dispatcher detects running jobs via the
JOB SUMMARYlog marker and file modification time (15 min threshold). - •
analyze_job.py -fbypasses the staleness check but never renames files for running jobs. Renames happen automatically on the next analysis after the job finishes. - •TraceLens needs a completed profiler trace; skipped if
*.xplane.pbdoesn't exist yet. - •IRLens works on running jobs if
xla_dump/is already populated.