CaoWangrenbo ec143868d0 feat: add checkpoint resume and fix train_loss tracking
- Add --resume CLI arg to resume training from a checkpoint
- Restore model, optimizer, scheduler state; continue from saved epoch+1
- Preserve global_step and best_val_loss across resume
- Save run_id in checkpoints for TensorBoard log continuity
- Use logs/run_<timestamp>/ subdirectories to isolate experiment logs
- Fix: replace train_loss in checkpoint dict with global_step to avoid
  KeyError when loading; track global_step through train_one_epoch
- Fix: use global_step (not batch_idx) as TensorBoard x-axis for batch loss
- Fix: print average loss at end of each epoch
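
The resume flow described in the list above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the commit's actual code. The function names `save_checkpoint`, `load_checkpoint`, `new_run_id`, and `log_dir_for`, the JSON file format, the timestamp format, and the `states` placeholder (standing in for model, optimizer, and scheduler state) are all assumptions. Only `train_one_epoch`, `global_step`, `best_val_loss`, `run_id`, and the `logs/run_<timestamp>/` layout come from the commit message.

```python
import json
import time
from pathlib import Path


def new_run_id():
    # Assumed timestamp format; the commit only says run_<timestamp>.
    return time.strftime("%Y%m%d-%H%M%S")


def log_dir_for(run_id):
    # One subdirectory per run keeps experiment logs isolated.
    return Path("logs") / f"run_{run_id}"


def save_checkpoint(path, epoch, global_step, best_val_loss, run_id, states):
    """Write everything a resumed run needs into one file (hypothetical helper)."""
    checkpoint = {
        "epoch": epoch,                # last completed epoch
        "global_step": global_step,    # batches seen across all epochs
        "best_val_loss": best_val_loss,
        "run_id": run_id,              # reused so logs continue in the same directory
        "states": states,              # placeholder for model/optimizer/scheduler state
    }
    Path(path).write_text(json.dumps(checkpoint))


def load_checkpoint(path):
    """Return (start_epoch, global_step, best_val_loss, run_id, states)."""
    checkpoint = json.loads(Path(path).read_text())
    return (
        checkpoint["epoch"] + 1,       # continue from the epoch after the saved one
        checkpoint["global_step"],
        checkpoint["best_val_loss"],
        checkpoint["run_id"],
        checkpoint["states"],
    )


def train_one_epoch(batch_losses, global_step, log):
    """Record each batch loss against global_step and return the epoch average."""
    total = 0.0
    for loss in batch_losses:
        log.append((global_step, loss))  # x-axis is global_step, not batch_idx
        global_step += 1
        total += loss
    avg_loss = total / len(batch_losses)
    print(f"average loss: {avg_loss:.4f}")
    return avg_loss, global_step
```

Because `global_step` is passed into `train_one_epoch` and returned from it, the x-axis keeps increasing across epochs and across a resume instead of restarting at zero each epoch, which is what using `batch_idx` would do.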

Generated by Mistral Vibe (ds-v4-flash).
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-06-04 22:55:31 +08:00
2026-05-29 18:49:01 +08:00
Description
No description provided
134 KiB
Languages
Python 92.1%
Shell 7.9%