wandb-monitor

Monitor and analyze Weights & Biases training runs.

Installation

npx clawhub@latest install wandb-monitor

Documentation

Weights & Biases

Monitor, analyze, and compare W&B training runs.

Setup

wandb login
# Or set WANDB_API_KEY in environment

Scripts

Characterize a Run (Full Health Analysis)

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/characterize_run.py ENTITY/PROJECT/RUN_ID

Analyzes:

  • Loss curve trend (start → current, % change, direction)

  • Gradient norm health (exploding/vanishing detection)

  • Eval metrics (if present)

  • Stall detection (heartbeat age)

  • Progress & ETA estimate

  • Config highlights

  • Overall health verdict


Options: --json for machine-readable output.
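The loss-trend item above (start → current, % change, direction) can be sketched as follows. This is a minimal illustration and not the logic of characterize_run.py itself; the function name `loss_trend` and the `flat_tolerance_pct` cutoff are assumptions made for the example.

```python
from typing import Dict, Sequence, Union


def loss_trend(
    losses: Sequence[float], flat_tolerance_pct: float = 1.0
) -> Dict[str, Union[float, str]]:
    """Summarize a loss curve as start, current, percent change, and direction.

    flat_tolerance_pct is a hypothetical cutoff: changes smaller than this
    (in percent) are reported as "flat".
    """
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    start, current = losses[0], losses[-1]
    if start == 0:
        raise ValueError("starting loss is zero; percent change is undefined")

    pct_change = (current - start) / abs(start) * 100.0
    if abs(pct_change) < flat_tolerance_pct:
        direction = "flat"
    elif pct_change < 0:
        direction = "decreasing"
    else:
        direction = "increasing"

    return {
        "start": start,
        "current": current,
        "pct_change": pct_change,
        "direction": direction,
    }
```

In practice the `losses` sequence would come from a run's logged history, for example the values returned by `run.scan_history`.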

Watch All Running Jobs

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/watch_runs.py ENTITY [--projects p1,p2]

Prints a quick health summary of all running jobs, plus recent failures and completions. Ideal for morning briefings.

Options:

  • --projects p1,p2 — Specific projects to check

  • --all-projects — Check all projects

  • --hours N — Hours to look back for finished runs (default: 24)

  • --json — Machine-readable output
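To illustrate what the --hours look-back window means, here is a small sketch that selects non-running runs that ended within the last N hours. The record layout (plain dicts with `state` and `ended_at` fields) and the function name `recent_finished` are assumptions for the example; watch_runs.py may represent runs differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def recent_finished(
    runs: List[Dict[str, Any]],
    hours: float = 24,
    now: Optional[datetime] = None,
) -> List[Dict[str, Any]]:
    """Return runs that are no longer running and ended within `hours`.

    Each run is a hypothetical dict with a "state" string and a
    timezone-aware "ended_at" datetime (or None if unknown).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [
        r
        for r in runs
        if r["state"] != "running"
        and r["ended_at"] is not None
        and r["ended_at"] >= cutoff
    ]
```

The default of 24 hours mirrors the documented default for --hours.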


Compare Two Runs

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/compare_runs.py ENTITY/PROJECT/RUN_A ENTITY/PROJECT/RUN_B

Side-by-side comparison:

  • Config differences (highlights important params)

  • Loss curves at same steps

  • Gradient norm comparison

  • Eval metrics

  • Performance (tokens/sec, steps/hour)

  • Winner verdict
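Two of the comparisons above, config differences and loss at the same steps, can be sketched with plain dictionaries. This is an illustrative outline under stated assumptions, not the implementation in compare_runs.py: the names `config_diff`, `losses_at_shared_steps`, and the `MISSING` marker are hypothetical.

```python
from typing import Any, Dict, List, Tuple

MISSING = "<missing>"  # hypothetical marker for a key absent from one config


def config_diff(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key whose value differs between the configs to an (a, b) pair."""
    diff = {}
    for key in sorted(set(a) | set(b)):
        value_a, value_b = a.get(key, MISSING), b.get(key, MISSING)
        if value_a != value_b:
            diff[key] = (value_a, value_b)
    return diff


def losses_at_shared_steps(
    a: Dict[int, float], b: Dict[int, float]
) -> List[Tuple[int, float, float]]:
    """Pair up losses at steps that both runs logged, in step order."""
    return [(step, a[step], b[step]) for step in sorted(set(a) & set(b))]
```

Comparing only shared steps avoids interpolating between logged points, at the cost of ignoring steps that only one run recorded.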


Python API Quick Reference

import wandb
api = wandb.Api()

# Get runs (filters use MongoDB-style queries)
runs = api.runs("entity/project", filters={"state": "running"})

# Or fetch a single run by its full path
run = api.run("entity/project/run_id")

# Run properties
run.state      # running | finished | failed | crashed | canceled
run.name       # display name
run.id         # unique identifier
run.summary    # final/current metrics
run.config     # hyperparameters
run.heartbeat_at # last heartbeat timestamp, used for stall detection

# Get history (scan_history yields one dict per logged row)
history = list(run.scan_history(keys=["train/loss", "train/grad_norm"]))
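
Since `run.heartbeat_at` is used for stall detection, a heartbeat age can be derived from it roughly as follows. This is a sketch that assumes the value is an ISO 8601 string and that a timestamp without an offset is in UTC; the helper name `heartbeat_age_minutes` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def heartbeat_age_minutes(heartbeat_at: str, now: Optional[datetime] = None) -> float:
    """Minutes elapsed since the given heartbeat timestamp.

    Assumption: `heartbeat_at` is an ISO 8601 string; a trailing "Z" or a
    missing offset is treated as UTC.
    """
    ts = datetime.fromisoformat(heartbeat_at.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - ts).total_seconds() / 60.0
```

With a live run, the call would look like `heartbeat_age_minutes(run.heartbeat_at)`.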

Metric Key Variations

Scripts handle these automatically:

  • Loss: train/loss, loss, train_loss, training_loss

  • Gradients: train/grad_norm, grad_norm, gradient_norm

  • Steps: train/global_step, global_step, step, _step

  • Eval: eval/loss, eval_loss, eval/accuracy, eval_acc
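One way to handle these key variations is to try each alias in order and take the first one present in a logged row. The sketch below uses the loss and gradient aliases listed above; the helper name `first_present` and the choice to skip `None` values are assumptions, and the actual scripts may resolve keys differently.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

# Alias lists taken from the documentation above, in priority order.
LOSS_KEYS = ("train/loss", "loss", "train_loss", "training_loss")
GRAD_KEYS = ("train/grad_norm", "grad_norm", "gradient_norm")


def first_present(
    row: Dict[str, Any], candidates: Sequence[str]
) -> Optional[Tuple[str, Any]]:
    """Return (key, value) for the first candidate key with a non-None value."""
    for key in candidates:
        value = row.get(key)
        if value is not None:
            return key, value
    return None
```

A row here is a single dict of logged metrics, such as one element of the `history` list from the quick reference.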


Health Thresholds

  • Gradient norm > 10: Exploding (critical)
  • Gradient norm > 5 (up to 10): Spiky (warning)
  • Gradient norm < 0.0001: Vanishing (warning)
  • Heartbeat age > 30 min: Stalled (critical)
  • Heartbeat age > 10 min (up to 30 min): Slow (warning)
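
The thresholds above translate into two small classifiers. This is a sketch: the function names and the "ok" label for values inside the healthy range are assumptions, and the stricter threshold is checked first so a value above 10 is never reported as merely spiky.

```python
def grad_status(grad_norm: float) -> str:
    """Classify a gradient norm using the documented thresholds."""
    if grad_norm > 10:
        return "exploding"  # critical
    if grad_norm > 5:
        return "spiky"  # warning
    if grad_norm < 1e-4:
        return "vanishing"  # warning
    return "ok"  # hypothetical label for a healthy value


def heartbeat_status(age_minutes: float) -> str:
    """Classify heartbeat age (in minutes) using the documented thresholds."""
    if age_minutes > 30:
        return "stalled"  # critical
    if age_minutes > 10:
        return "slow"  # warning
    return "ok"  # hypothetical label for a healthy value
```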

Integration Notes

For morning briefings, use watch_runs.py --json and parse the output.

For detailed analysis of a specific run, use characterize_run.py.

For A/B testing or hyperparameter comparisons, use compare_runs.py.
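
For the morning-briefing case, calling watch_runs.py from Python and parsing its output might look like the sketch below. It assumes the script prints a single JSON document to stdout when given --json; the structure of that document is not specified here, so the result is returned as-is. The helper names `build_watch_command` and `watch_runs` are hypothetical.

```python
import json
import os
import subprocess
from typing import Any, List, Optional, Sequence

# Interpreter and script paths as given in the usage examples above.
PYTHON = os.path.expanduser("~/clawd/venv/bin/python3")
SCRIPT = os.path.expanduser("~/clawd/skills/wandb/scripts/watch_runs.py")


def build_watch_command(
    entity: str,
    projects: Optional[Sequence[str]] = None,
    hours: Optional[int] = None,
) -> List[str]:
    """Assemble the watch_runs.py command line with --json enabled."""
    cmd = [PYTHON, SCRIPT, entity]
    if projects:
        cmd += ["--projects", ",".join(projects)]
    if hours is not None:
        cmd += ["--hours", str(hours)]
    cmd.append("--json")
    return cmd


def watch_runs(entity: str, **options: Any) -> Any:
    """Run the script and parse stdout as JSON (assumed to be one document)."""
    result = subprocess.run(
        build_watch_command(entity, **options),
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return json.loads(result.stdout)
```

Keeping command construction separate from execution makes the argument handling easy to check without a W&B account or network access.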