Skip to content

Conversation

@daviswer
Copy link
Collaborator

@daviswer daviswer commented May 23, 2025

A collection of dataloader updates and fixes mirrored from the torchtitan repo. Changes include:

FIXES FOR HANGS AND FREEZES:

  • Truncate long text docs to 1M characters
  • Allow LCG to advance after reaching an empty doc

FIXES FOR CRASHES:

  • Check dataset paths for typos rather than returning empty subdatasets
  • Handle edge case where rescaling upward can cause an epoch to finish and resulting data shard to appear empty
  • Every worker now owns at least one doc from each owned file, unless file is empty (preventing empty shards on small datasets)
  • Skip empty data files

FIXES CONVERGENCE BEHAVIOR:

  • File sharding now accounts for file size (BREAKS BACKWARD COMPATIBILITY)

QUALITY OF LIFE:

  • Add support for FIM training (default off, enable by setting cfg.spm_rate or cfg.psm_rate to nonzero values). Precludes Add support for FIM training #125
  • Add support for multiple column names when reading from data files
  • Add manual alert message for completing an epoch without returning any data
  • User can specify how many document chunks to yield before manually inserting EOS and breaking the doc. Does not fully truncate, as the doc is returned to on the next iter() call.
  • More informative error reporting when a set of empty shards is detected

@daviswer daviswer requested a review from lchu6 May 23, 2025 19:14
@daviswer daviswer marked this pull request as ready for review May 23, 2025 19:15
daviswer added 9 commits May 23, 2025 15:16
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
daviswer added 2 commits May 23, 2025 15:19
Signed-off-by: Davis Wertheimer <[email protected]>
Signed-off-by: Davis Wertheimer <[email protected]>
Copy link
Contributor

@lchu6 lchu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG.

only comment is this PR is a little fat given both dataloader fix and FIM. separating the two might be better. Not a big deal though, approving the changes.

daviswer added 2 commits May 27, 2025 11:10
Signed-off-by: Davis Wertheimer <[email protected]>
daviswer added a commit that referenced this pull request May 27, 2025
Signed-off-by: Davis Wertheimer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants