- Add --resume CLI arg to resume training from a checkpoint
- Restore model, optimizer, scheduler state; continue from saved epoch+1
- Preserve global_step and best_val_loss across resume
- Save run_id in checkpoints for TensorBoard log continuity
- Use logs/run_<timestamp>/ subdirectories to isolate experiment logs
- Fix: replace train_loss in checkpoint dict with global_step to avoid
KeyError when loading; track global_step through train_one_epoch
- Fix: use global_step (not batch_idx) as TensorBoard x-axis for batch loss
- Fix: print average loss at end of each epoch
Generated by Mistral Vibe (ds-v4-flash).
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>