Remote Server Management with tm
Use tm CLI (/Users/chenyl/project/tm/tm) for all remote server operations. Each command is a stateless SSH call — no persistent connections, no tmux. Jobs survive SSH disconnection and local shutdown.
Rules
- •Always sync before remote execution — code is developed locally, never edited on the server.
- •Always use
uvfor Python —uv sync,uv run python. Neverpip installorconda. - •Check GPUs before experiments —
tm gpu <server> --free. Reserve at least 2 GPUs for others when >2 available. - •Use
execfor short tasks,run+waitfor long tasks — anything over a few minutes should be a background job. - •Subagents execute, main agent fixes — subagents (haiku) never modify code; they only run and report errors back.
Task Decision
code
Is the task quick (<2 min)?
├── Yes → tm exec server "command"
└── No → tm run server job-name "command"
tm wait server job-name (with run_in_background=True)
Workflows
Debug cycle
bash
tm sync ./project server23:~/project tm exec server23 "cd ~/project && uv sync" tm exec server23 "cd ~/project && uv run python test.py" # error → fix locally → sync → re-run
Single training run
bash
tm sync ./project server23:~/project GPUS=$(tm gpu server23 --free) # e.g. "0,1" tm run server23 train-v1 "cd ~/project && uv run python train.py" -g $GPUS # then use Bash(run_in_background=True): tm wait server23 train-v1 # Claude Code continues other work; TaskOutput retrieves result when done
Parallel experiments (subagents)
bash
tm sync ./project server23:~/project tm run server23 exp-lr1 "cd ~/project && uv run python train.py --lr 1e-3" -g 0,1 tm run server23 exp-lr2 "cd ~/project && uv run python train.py --lr 1e-4" -g 2,3
Spawn subagents (haiku, run_in_background=True) each running tm wait server23 exp-*.
If a subagent reports an error: fix locally → tm sync → new tm run.
Error handling flow
- •Subagent reports error with log output
- •Main agent analyzes traceback, fixes code locally
- •
tm sync ./project server23:~/project - •
tm run server23 train-v2 "..."(new job name) - •Spawn new subagent to wait
Command Quick Reference
| Command | Purpose | Blocks? |
|---|---|---|
tm exec <server> <cmd> | Run short command | Yes |
tm run <server> <name> <cmd> | Start background job | No |
tm wait <server> <name> | Wait for job completion | Yes (use run_in_background) |
tm status <server> <name> | Check job state | No |
tm logs <server> <name> | View job output | No |
tm stop <server> <name> | Kill a job | No |
tm jobs <server> | List jobs | No |
tm sync <src> <server>:<dst> | Rsync (respects .gitignore) | Yes |
tm gpu <server> | GPU status | No |
For full command options and output formats, read references/commands.md.
tm wait Failure Reasons
| reason | exit_code | meaning |
|---|---|---|
process_exited | non-zero | Command failed — check logs for traceback |
oom_killed | 137 | Out of memory — reduce batch size or use more GPUs |
user_stopped | 9/143 | Stopped by tm stop |
process_disappeared | -1 | Killed externally or server rebooted |
server_unreachable | - | Cannot SSH after max retries |
Notes
- •Servers configured via
~/.ssh/config— use SSH Host names directly - •Job metadata stored on server at
~/.remote-jobs/<job-name>/ - •
tm syncauto-applies.gitignorerules via--filter=':- .gitignore' - •Job names: letters, numbers, underscore, dash, dot only