
wandb-monitor

Monitor and analyze Weights & Biases training runs.

Installation

npx clawhub@latest install wandb-monitor


Documentation

Weights & Biases

Monitor, analyze, and compare W&B training runs.

Setup

wandb login
# Or set WANDB_API_KEY in environment

Scripts

Characterize a Run (Full Health Analysis)

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/characterize_run.py ENTITY/PROJECT/RUN_ID

Analyzes:

Loss curve trend (start → current, % change, direction)

Gradient norm health (exploding/vanishing detection)

Eval metrics (if present)

Stall detection (heartbeat age)

Progress & ETA estimate

Config highlights

Overall health verdict

Options: --json for machine-readable output.
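The loss-curve trend listed above (start → current, % change, direction) can be sketched as follows. This is an illustration only, not the code in characterize_run.py; the function name loss_trend and the flat_tolerance_pct cutoff are hypothetical.

```python
def loss_trend(losses, flat_tolerance_pct=1.0):
    """Summarize a loss curve as start, current, % change, and direction.

    flat_tolerance_pct is an assumed cutoff for calling a curve "flat";
    it is not a value taken from the skill.
    """
    if not losses:
        raise ValueError("need at least one loss value")
    start, current = losses[0], losses[-1]
    # Percent change relative to the starting loss; guard against division by zero.
    pct_change = 0.0 if start == 0 else (current - start) / abs(start) * 100.0
    if pct_change < -flat_tolerance_pct:
        direction = "decreasing"
    elif pct_change > flat_tolerance_pct:
        direction = "increasing"
    else:
        direction = "flat"
    return {
        "start": start,
        "current": current,
        "pct_change": pct_change,
        "direction": direction,
    }


trend = loss_trend([2.0, 1.5, 1.0])
print(trend["direction"], trend["pct_change"])
```

Comparing only the first and last points is sensitive to noise; smoothing the curve before calling the function may give a steadier verdict.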

Watch All Running Jobs

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/watch_runs.py ENTITY [--projects p1,p2]

Quick health summary of all running jobs plus recent failures/completions. Ideal for morning briefings.

Options:

--projects p1,p2 — Specific projects to check

--all-projects — Check all projects

--hours N — Hours to look back for finished runs (default: 24)

--json — Machine-readable output
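One way to call the script from Python is sketched below, using only the flags listed above. The helper names build_watch_command and watch_runs are hypothetical, the entity "my-team" is a placeholder, and the sketch assumes --all-projects and --projects are not meant to be combined. The JSON schema is not documented here, so the parsed result is returned as-is.

```python
import json
import os
import subprocess

# Paths taken from the command shown above.
PYTHON = os.path.expanduser("~/clawd/venv/bin/python3")
SCRIPT = os.path.expanduser("~/clawd/skills/wandb/scripts/watch_runs.py")


def build_watch_command(entity, projects=None, all_projects=False, hours=24):
    """Build the argument list for watch_runs.py with --json enabled."""
    cmd = [PYTHON, SCRIPT, entity, "--json", "--hours", str(hours)]
    if all_projects:
        # Assumption: --all-projects takes precedence over an explicit list.
        cmd.append("--all-projects")
    elif projects:
        cmd += ["--projects", ",".join(projects)]
    return cmd


def watch_runs(entity, **kwargs):
    """Run the script and return its parsed JSON output (schema unspecified)."""
    result = subprocess.run(
        build_watch_command(entity, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


# Show only the arguments after the interpreter and script paths.
print(build_watch_command("my-team", projects=["p1", "p2"], hours=12)[2:])
```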

Compare Two Runs

~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/compare_runs.py ENTITY/PROJECT/RUN_A ENTITY/PROJECT/RUN_B

Side-by-side comparison:

Config differences (highlights important params)

Loss curves at same steps

Gradient norm comparison

Eval metrics

Performance (tokens/sec, steps/hour)

Winner verdict
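Comparing loss curves "at same steps" means pairing up only the steps that both runs logged. A minimal sketch, assuming each history is a list of dict rows such as those yielded by run.scan_history; the function name compare_at_common_steps and the default key names are illustrative, and compare_runs.py may do this differently.

```python
def compare_at_common_steps(history_a, history_b,
                            step_key="_step", metric_key="train/loss"):
    """Return (step, value_a, value_b) tuples for steps logged by both runs."""

    def index(history):
        # Map step -> metric value, skipping rows that lack either key.
        return {
            row[step_key]: row[metric_key]
            for row in history
            if step_key in row and row.get(metric_key) is not None
        }

    a, b = index(history_a), index(history_b)
    common = sorted(a.keys() & b.keys())
    return [(step, a[step], b[step]) for step in common]


# Made-up rows for illustration only.
run_a = [
    {"_step": 0, "train/loss": 2.0},
    {"_step": 100, "train/loss": 1.5},
    {"_step": 200, "train/loss": 1.2},
]
run_b = [
    {"_step": 0, "train/loss": 2.1},
    {"_step": 100, "train/loss": 1.4},
    {"_step": 150, "train/loss": 1.3},
]
for step, loss_a, loss_b in compare_at_common_steps(run_a, run_b):
    print(step, loss_a, loss_b)
```

Runs that log at different intervals may share few exact steps; in that case interpolating to a common grid is an alternative to the exact-match intersection used here.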

Python API Quick Reference

import wandb
api = wandb.Api()

# Get runs matching a filter
runs = api.runs("entity/project", filters={"state": "running"})

# Get a single run by its path
run = api.run("entity/project/run_id")

# Run properties
run.state      # running | finished | failed | crashed | canceled
run.name       # display name
run.id         # unique identifier
run.summary    # final/current metrics
run.config     # hyperparameters
run.heartbeat_at # last heartbeat timestamp, used for stall detection

# Get history
history = list(run.scan_history(keys=["train/loss", "train/grad_norm"]))

Metric Key Variations

Scripts handle these automatically:

Loss: train/loss, loss, train_loss, training_loss

Gradients: train/grad_norm, grad_norm, gradient_norm

Steps: train/global_step, global_step, step, _step

Eval: eval/loss, eval_loss, eval/accuracy, eval_acc
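Handling these variations amounts to trying each candidate key in order and using the first one present. The sketch below illustrates the idea; METRIC_ALIASES and resolve_key are hypothetical names, and splitting the eval keys into loss and accuracy groups is an assumption rather than something the skill documents.

```python
# Candidate keys per metric, in the order listed above.
METRIC_ALIASES = {
    "loss": ["train/loss", "loss", "train_loss", "training_loss"],
    "grad_norm": ["train/grad_norm", "grad_norm", "gradient_norm"],
    "step": ["train/global_step", "global_step", "step", "_step"],
    # Assumed grouping of the eval keys.
    "eval_loss": ["eval/loss", "eval_loss"],
    "eval_accuracy": ["eval/accuracy", "eval_acc"],
}


def resolve_key(available, kind):
    """Return the first alias for `kind` found in `available`, else None.

    `available` can be any container supporting `in`, for example a dict
    of summary metrics or a single history row.
    """
    for key in METRIC_ALIASES[kind]:
        if key in available:
            return key
    return None


# Made-up summary for illustration only.
summary = {"loss": 0.42, "grad_norm": 1.3, "_step": 900}
print(resolve_key(summary, "loss"), resolve_key(summary, "step"))
```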

Health Thresholds

Gradient norm > 10: Exploding (critical)
Gradient norm > 5: Spiky (warning)
Gradient norm < 0.0001: Vanishing (warning)
Heartbeat age > 30 min: Stalled (critical)
Heartbeat age > 10 min: Slow (warning)
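The thresholds above translate directly into two small classifiers. This is a sketch rather than the skill's own implementation: the function names are hypothetical, and heartbeat_status expects a timezone-aware datetime, so converting run.heartbeat_at into one is left to the caller.

```python
from datetime import datetime, timedelta, timezone


def grad_status(grad_norm):
    """Classify a gradient norm as (label, severity) using the thresholds above."""
    if grad_norm > 10:
        return "exploding", "critical"
    if grad_norm > 5:
        return "spiky", "warning"
    if grad_norm < 0.0001:
        return "vanishing", "warning"
    return "ok", None


def heartbeat_status(heartbeat_at, now=None):
    """Classify heartbeat age as (label, severity) using the thresholds above.

    Both datetimes must be timezone-aware so the subtraction is well defined.
    """
    now = now or datetime.now(timezone.utc)
    age = now - heartbeat_at
    if age > timedelta(minutes=30):
        return "stalled", "critical"
    if age > timedelta(minutes=10):
        return "slow", "warning"
    return "ok", None


# Made-up values for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(grad_status(7.5))
print(heartbeat_status(now - timedelta(minutes=45), now=now))
```

Checking the critical thresholds first matters: a gradient norm of 12 also exceeds 5, and it should be reported as exploding rather than spiky.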

Integration Notes

For morning briefings, use watch_runs.py --json and parse the output.

For detailed analysis of a specific run, use characterize_run.py.

For A/B testing or hyperparameter comparisons, use compare_runs.py.
