uzh-fpv-sv-test

hexone2086/uzh-fpv-sv-test


Commit Graph


Author

SHA1

Message

Date

CaoWangrenbo

ec143868d0

feat: add checkpoint resume and fix train_loss tracking

- Add --resume CLI arg to resume training from a checkpoint
- Restore model, optimizer, scheduler state; continue from saved epoch+1
- Preserve global_step and best_val_loss across resume
- Save run_id in checkpoints for TensorBoard log continuity
- Use logs/run_<timestamp>/ subdirectories to isolate experiment logs
- Fix: replace train_loss in checkpoint dict with global_step to avoid
  KeyError when loading; track global_step through train_one_epoch
- Fix: use global_step (not batch_idx) as TensorBoard x-axis for batch loss
- Fix: print average loss at end of each epoch

Generated by Mistral Vibe (ds-v4-flash).
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>

2026-06-04 22:55:31 +08:00
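
The resume behaviour listed in this commit message can be sketched roughly as follows. This is a minimal illustration, not the repository's code. It uses the standard-library `pickle` module in place of a PyTorch checkpoint file so that it runs without dependencies, and the `states` field is only a placeholder for the model, optimizer and scheduler state. The function names `save_checkpoint`, `load_checkpoint` and `fresh_run_id`, the dictionary keys other than those named in the message, and the timestamp format are all assumptions; only `train_one_epoch`, `global_step`, `best_val_loss` and `run_id` appear in the commit message, and the signature given to `train_one_epoch` here is hypothetical.

```python
import pickle
import time


def fresh_run_id():
    # Assumed format; the commit only says logs/run_<timestamp>/.
    return time.strftime("run_%Y%m%d_%H%M%S")


def save_checkpoint(path, epoch, global_step, best_val_loss, run_id, states):
    """Write everything needed to continue training later."""
    checkpoint = {
        "epoch": epoch,                  # last epoch that finished
        "global_step": global_step,      # batches processed so far, across epochs
        "best_val_loss": best_val_loss,  # kept so "best model" tracking survives a resume
        "run_id": run_id,                # reused so logs continue in the same run directory
        "states": states,                # placeholder for model/optimizer/scheduler state
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_checkpoint(path):
    """Read a checkpoint and return the values a resumed run starts from."""
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    return {
        "start_epoch": checkpoint["epoch"] + 1,  # continue from saved epoch + 1
        "global_step": checkpoint["global_step"],
        "best_val_loss": checkpoint["best_val_loss"],
        "run_id": checkpoint["run_id"],
        "states": checkpoint["states"],
    }


def train_one_epoch(num_batches, global_step):
    """Stand-in training loop: returns (average loss, updated global_step)."""
    total_loss = 0.0
    for _ in range(num_batches):
        loss = 1.0 / (global_step + 1)  # fake loss value, for illustration only
        total_loss += loss
        # A real loop would log `loss` here with global_step as the x-axis,
        # so the curve does not restart at 0 every epoch.
        global_step += 1
    return total_loss / num_batches, global_step
```

A resumed run would call `load_checkpoint`, start its epoch loop at `start_epoch`, pass the restored `global_step` into `train_one_epoch`, and write logs under `logs/<run_id>/`. A fresh run would call `fresh_run_id()` and start with `global_step = 0`.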

CaoWangrenbo

9f0321eff8

initial commit

2026-05-29 18:49:01 +08:00

2 Commits