MaxText Performance Analysis

Post-training (or mid-training) analysis pipeline. Follow the workflow below from top to bottom.

Multi-job comparisons: If comparing two or more jobs (e.g., "why is job B slower than job A?"), start with skills/tsdb-diagnosis/SKILL.md (Multi-Job Comparison workflow) before running TraceLens. The TSDB reveals system-level root causes — CPU contention from RCCL resource leaks, network errors, I/O pressure, thermal throttling — that TraceLens cannot observe (it only sees GPU-side kernel timings). Only proceed to TraceLens here if the TSDB comparison is inconclusive.

Workflow

Step 1: Run the dispatcher

bash

python3 utils/analyze_job.py /outputs/<job>.log
python3 utils/analyze_job.py /outputs/<job_dir>/
python3 utils/analyze_job.py /outputs/local_2026*

For running jobs, pass -f to force re-analysis (bypasses staleness check):

bash

python3 utils/analyze_job.py -f /outputs/<job>.log

The dispatcher auto-detects available artifacts and runs only the relevant tools:

•Log with TGS data → tgs_tagger.py
•*.xplane.pb → TraceLens_generate_perf_report_jax
•xla_dump/*.gpu_after_optimizations.txt → IRLens_analyze_hlo_ir.py

Step 2: Handle TraceLens if needed

If the dispatcher output says "TraceLens not installed" and xplane traces exist:

•
Check if TraceLens is already installed and patched before doing anything:
bash
```
python3 -c "
import TraceLens.util, inspect
src = inspect.getsource(TraceLens.util.DataLoader.load_data)
assert 'xprof' in src, 'not patched'
print('TraceLens: installed and patched')
"
```
- •Succeeds → TraceLens is ready. Just re-run: python3 utils/analyze_job.py -f /outputs/<job>.log
- •ImportError → not installed. Install then patch (see below).
- •AssertionError → installed but unpatched. Patch only (see below).

•

Install (only if import failed):

bash

pip install git+https://github.com/AMD-AGI/TraceLens.git

•
Patch (only if the xprof assertion failed). Apply all patches from tracelens-patches.md — 6 files, ~13 patches. Key fixes:
- •protobuf/xprof import errors (TF 2.19+ renamed tensorboard_plugin_profile to xprof)
- •GPU PID remapping (xprof remaps device PIDs to 1001+; code filtering pid < 100 misses all GPU events)
- •metadata_events not passed to build_tree()
- •KeyError on gpu_kernel_op_cat and missing parent events for launch latency

•

Re-run the dispatcher with -f:

bash

python3 utils/analyze_job.py -f /outputs/<job>.log

This is one-time per environment. Always check before patching to avoid redundant work.

Step 3: Read results

Read the generated analysis.json — but do NOT try to read the raw file (it can be 40K+ lines due to per-step arrays). Extract key metrics programmatically:

bash

python3 -c "
import json, sys
with open('<job_dir>/analysis.json') as f:
    d = json.load(f)
print(f'Job: {d[\"job_id\"]} | Model: {d[\"model\"]} | Nodes: {d[\"num_nodes\"]} | Status: {d[\"job_status\"][\"status\"]}')
tgs = d['tgs']
print(f'Steady TGS: {tgs[\"steady\"][\"mean\"]:.1f} (std={tgs[\"steady\"][\"std\"]:.1f}, steps {tgs[\"steady\"][\"range\"]})')
print(f'Tail   TGS: {tgs[\"tail\"][\"mean\"]:.1f} (std={tgs[\"tail\"][\"std\"]:.1f}, steps {tgs[\"tail\"][\"range\"]})')
tl = d.get('tracelens_summary', {})
if tl:
    print(f'Compute: {tl[\"computation_time\"]:.1f}% | Exposed comm: {tl[\"exposed_comm_time\"]:.1f}% | Idle: {tl[\"idle_time\"]:.2f}% | Total comm: {tl[\"total_comm_time\"]:.1f}%')
"

For deeper TraceLens analysis, read the CSVs in <job_dir>/tracelens/<timestamp>/csvs/:

•gpu_events_averages.csv — per-GPU compute/comm/idle breakdown (averages)
•gpu_timeline.csv — per-GPU breakdown with pid
•kernel_launchers_summary_by_category.csv — time by kernel category (GEMM, NCCL, XLA fusions, etc.)
•kernel_launchers_summary.csv — time by individual kernel name

Step 4: Summarize findings

Present results using this structure:

Metric	Source	What to look for
TGS (steady-state)	`analysis.json` → `tgs.steady`	Primary throughput metric
MFU	`analysis.json` → `mfu_per_step`	Model FLOPS utilization (if available)
GPU compute %	`tracelens_summary.computation_time`	Time on actual compute kernels
Exposed comm %	`tracelens_summary.exposed_comm_time`	Communication NOT overlapped with compute (lower is better)
Idle %	`tracelens_summary.idle_time`	GPU doing nothing (should be near 0)
Kernel breakdown	`kernel_launchers_summary_by_category.csv`	GEMM vs NCCL vs fusion time
Comm ops per step	dispatcher IRLens output	Count of all-reduce, all-gather, all-to-all, reduce-scatter

Interpretation guidelines:

•High exposed comm % → opportunities for better comm/compute overlap
•Large per-GPU variance in compute % → load imbalance
•High idle % → scheduling or synchronization issues
•Tail TGS std much larger than steady std → periodic overhead (checkpointing, profiling)

Step 5: Ensure dashboard is running

Check the dispatcher output first — it prints a Dashboard: line at the end. If it shows a URL with (running), use that URL.

If the dashboard is not running, start it:

bash

pip install fastapi uvicorn   # one-time
utils/perf_server.py --host 0.0.0.0 &

Always tell the user the dashboard URL: http://<host>:<PORT>

The server auto-detects a free port starting from 8080 and auto-reloads analysis.json on each request.

Reference

Job output layout

code

/outputs/<JOB_ID>-<JOB_NAME>[-TGS_<VALUE>]/
  log -> ../<log_file>                          # symlink to log file
  analysis.json                                 # structured metrics
  xla_dump/                                     # if _env_ENABLE_XLA_DUMP=1
    module_NNNN.jit_train_step.*_gpu_after_optimizations.txt
  <run_name>/tensorboard/plugins/profile/<ts>/
    <hostname>.xplane.pb                        # if profiler=xplane
  tracelens/<ts>/csvs/*.csv                     # created by TraceLens

The .log file sits alongside the directory in /outputs/.

When enable_checkpointing=true, profiler traces may end up in a shared directory outside the job dir. analyze_job.py parses Config param tensorboard_dir from the log to locate these. The dispatcher and perf_server.py filter profiles by job execution time window and node-0 hostname to disambiguate.

Running individual tools directly

These are rarely needed — analyze_job.py orchestrates them. Use only for targeted re-runs.

bash

# TGS tagging
utils/tag_tgs.sh <log_file_or_glob>
utils/tag_tgs.sh -f <log_file>       # force on running job

# IRLens
utils/IRLens_analyze_hlo_ir.py <hlo_file>
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op communication
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op computation

# TraceLens
TraceLens_generate_perf_report_jax \
    --profile_path <xplane.pb> \
    --output_csvs_dir <output_dir>/csvs

Running jobs

•The dispatcher detects running jobs via the JOB SUMMARY log marker and file modification time (15 min threshold).
•analyze_job.py -f bypasses the staleness check but never renames files for running jobs. Renames happen automatically on the next analysis after the job finishes.
•TraceLens needs a completed profiler trace; skipped if *.xplane.pb doesn't exist yet.
•IRLens works on running jobs if xla_dump/ is already populated.