wandb-monitor
Monitor and analyze Weights & Biases training runs.
Installation
npx clawhub@latest install wandb-monitor
Documentation
Weights & Biases
Monitor, analyze, and compare W&B training runs.
Setup
wandb login
# Or set WANDB_API_KEY in environment
Scripts
Characterize a Run (Full Health Analysis)
~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/characterize_run.py ENTITY/PROJECT/RUN_ID
Analyzes:
- Loss curve trend (start → current, % change, direction)
- Gradient norm health (exploding/vanishing detection)
- Eval metrics (if present)
- Stall detection (heartbeat age)
- Progress & ETA estimate
- Config highlights
- Overall health verdict
Options:
--json for machine-readable output.
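As an illustration of the loss-trend item above (start → current, % change, direction), here is a minimal sketch. It is not the code in characterize_run.py; the function name summarize_loss_trend and the 1% "flat" tolerance are assumptions made for this example.

```python
def summarize_loss_trend(losses, flat_tolerance_pct=1.0):
    """Summarize a loss curve as start, current, percent change, and direction.

    `losses` is a list of floats in step order. `flat_tolerance_pct` is a
    hypothetical cutoff: changes smaller than this are reported as "flat".
    """
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    start, current = losses[0], losses[-1]
    if start == 0:
        raise ValueError("starting loss is zero; percent change is undefined")

    pct_change = (current - start) / abs(start) * 100.0
    if abs(pct_change) <= flat_tolerance_pct:
        direction = "flat"
    elif pct_change < 0:
        direction = "decreasing"
    else:
        direction = "increasing"

    return {
        "start": start,
        "current": current,
        "pct_change": pct_change,
        "direction": direction,
    }


trend = summarize_loss_trend([2.0, 1.5, 1.0])
print(trend["direction"], trend["pct_change"])
```

Comparing only the first and last values is sensitive to noise; averaging a small window at each end would likely give a steadier reading.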
Watch All Running Jobs
~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/watch_runs.py ENTITY [--projects p1,p2]
Quick health summary of all running jobs plus recent failures/completions. Ideal for morning briefings.
Options:
- --projects p1,p2: Specific projects to check
- --all-projects: Check all projects
- --hours N: Hours to look back for finished runs (default: 24)
- --json: Machine-readable output
Compare Two Runs
~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/compare_runs.py ENTITY/PROJECT/RUN_A ENTITY/PROJECT/RUN_B
Side-by-side comparison:
- Config differences (highlights important params)
- Loss curves at same steps
- Gradient norm comparison
- Eval metrics
- Performance (tokens/sec, steps/hour)
- Winner verdict
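Comparing "loss curves at same steps" means lining up the two histories on the steps they share. The sketch below shows one way this could be done; it is an assumption about the approach, not the logic in compare_runs.py, and losses_at_common_steps is a hypothetical helper name.

```python
def losses_at_common_steps(history_a, history_b):
    """Align two loss histories on the steps both runs logged.

    Each history is a dict mapping step -> loss. Returns a list of
    (step, loss_a, loss_b, delta) tuples, where delta = loss_b - loss_a,
    so a negative delta means run B has the lower loss at that step.
    """
    common_steps = sorted(set(history_a) & set(history_b))
    return [
        (step, history_a[step], history_b[step], history_b[step] - history_a[step])
        for step in common_steps
    ]


run_a = {0: 2.0, 100: 1.5, 200: 1.0}
run_b = {0: 2.0, 100: 1.25, 300: 0.5}
for step, loss_a, loss_b, delta in losses_at_common_steps(run_a, run_b):
    print(step, loss_a, loss_b, delta)
```

Steps logged by only one run (200 and 300 here) are dropped rather than interpolated, which keeps the comparison honest when the two runs log at different intervals.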
Python API Quick Reference
import wandb
api = wandb.Api()
# Get runs
runs = api.runs("entity/project", {"state": "running"})
# Run properties (run is one item from runs, or api.run("entity/project/run_id"))
run.state # running | finished | failed | crashed | canceled
run.name # display name
run.id # unique identifier
run.summary # final/current metrics
run.config # hyperparameters
run.heartbeat_at # stall detection
# Get history
history = list(run.scan_history(keys=["train/loss", "train/grad_norm"]))
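To show how run.heartbeat_at can drive stall detection, here is a small sketch that converts the timestamp into an age in minutes. It assumes the value is an ISO 8601 string in UTC, possibly with a trailing "Z"; check what your wandb version returns before relying on it. The helper name heartbeat_age_minutes is hypothetical.

```python
from datetime import datetime, timezone


def heartbeat_age_minutes(heartbeat_at, now=None):
    """Return minutes elapsed since the given heartbeat timestamp.

    Assumes `heartbeat_at` is an ISO 8601 string in UTC, such as
    "2024-05-01T12:00:00" or "2024-05-01T12:00:00Z".
    """
    text = heartbeat_at[:-1] if heartbeat_at.endswith("Z") else heartbeat_at
    last_beat = datetime.fromisoformat(text)
    if last_beat.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        last_beat = last_beat.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_beat).total_seconds() / 60.0


fixed_now = datetime(2024, 5, 1, 12, 45, tzinfo=timezone.utc)
print(heartbeat_age_minutes("2024-05-01T12:00:00Z", now=fixed_now))
```

With a live run, the call would look like heartbeat_age_minutes(run.heartbeat_at); the now parameter exists only to make the function testable.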
Metric Key Variations
Scripts handle these automatically:
- Loss: train/loss, loss, train_loss, training_loss
- Gradients: train/grad_norm, grad_norm, gradient_norm
- Steps: train/global_step, global_step, step, _step
- Eval: eval/loss, eval_loss, eval/accuracy, eval_acc
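If you write your own analysis code, the same key variations can be resolved with a simple first-match lookup. This is a sketch of one possible approach, not the scripts' actual implementation; first_present and the constant names are made up for this example.

```python
# Candidate keys in priority order, taken from the list above.
LOSS_KEYS = ("train/loss", "loss", "train_loss", "training_loss")
GRAD_KEYS = ("train/grad_norm", "grad_norm", "gradient_norm")
STEP_KEYS = ("train/global_step", "global_step", "step", "_step")


def first_present(row, candidates):
    """Return the first candidate key with a non-None value in `row`, else None.

    `row` is a dict such as one history record or a run summary.
    """
    for key in candidates:
        if row.get(key) is not None:
            return key
    return None


row = {"loss": 0.42, "grad_norm": 1.3, "_step": 100}
print(first_present(row, LOSS_KEYS), first_present(row, STEP_KEYS))
```

Resolving the key once per run, rather than per record, avoids mixing two differently named metrics within one curve.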
Health Thresholds
- Gradients > 10: Exploding (critical)
- Gradients > 5: Spiky (warning)
- Gradients < 0.0001: Vanishing (warning)
- Heartbeat > 30min: Stalled (critical)
- Heartbeat > 10min: Slow (warning)
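The thresholds above translate directly into two small classifiers. The sketch below encodes them as listed; the function names and the (severity, label) return shape are assumptions for illustration, not the scripts' real interface.

```python
def classify_grad_norm(grad_norm):
    """Map a gradient norm to (severity, label) using the thresholds above."""
    if grad_norm > 10:
        return ("critical", "exploding")
    if grad_norm > 5:
        return ("warning", "spiky")
    if grad_norm < 0.0001:
        return ("warning", "vanishing")
    return ("ok", "healthy")


def classify_heartbeat(age_minutes):
    """Map heartbeat age in minutes to (severity, label) using the thresholds above."""
    if age_minutes > 30:
        return ("critical", "stalled")
    if age_minutes > 10:
        return ("warning", "slow")
    return ("ok", "alive")


print(classify_grad_norm(12.0))
print(classify_heartbeat(15))
```

The critical checks run first so that a value exceeding both the warning and critical limits is reported at the higher severity.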
Integration Notes
For morning briefings, use watch_runs.py --json and parse the output.
For detailed analysis of a specific run, use characterize_run.py.
For A/B testing or hyperparameter comparisons, use compare_runs.py.