some questions about the MultiModalIterableDataset

Great open-source work! I have a question about MultiModalIterableDataset. According to [the code](https://github.com/EvolvingLMMs-Lab/lmms-engine/blob/main/src/lmms_engine/datasets/iterable/multimodal_iterable_dataset.py#L139), MultiModalIterableDataset is a sharded version of HFDataset. However, HFDataset reads the entire list of files at once. Wouldn't this run out of memory when loading a very large training set?
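To make the concern concrete, here is a minimal sketch of the lazy, per-rank reading I would have expected from a sharded iterable dataset. It is not lmms-engine code: `iter_shard`, `write_demo_files`, `demo`, the JSONL layout, and the round-robin file assignment are all hypothetical and only illustrate the idea of keeping one record in memory at a time.

```python
import json
import os
import tempfile
from typing import Iterator, List


def iter_shard(files: List[str], rank: int, world_size: int) -> Iterator[dict]:
    """Lazily yield records from the files assigned to this rank (hypothetical helper)."""
    # Round-robin assignment: rank r reads files r, r + world_size, r + 2 * world_size, ...
    for path in files[rank::world_size]:
        with open(path, "r", encoding="utf-8") as f:
            # Iterating the file handle reads one line at a time,
            # so the full file list is never materialized in memory.
            for line in f:
                if line.strip():
                    yield json.loads(line)


def write_demo_files(directory: str, num_files: int = 4, rows_per_file: int = 3) -> List[str]:
    """Create a few small JSONL files so the sketch can run end to end."""
    paths = []
    for i in range(num_files):
        path = os.path.join(directory, f"part-{i}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for j in range(rows_per_file):
                f.write(json.dumps({"file": i, "row": j}) + "\n")
        paths.append(path)
    return paths


def demo() -> List[List[dict]]:
    """Read the demo files as two ranks would and return the records each rank sees."""
    with tempfile.TemporaryDirectory() as tmp:
        files = write_demo_files(tmp)
        return [list(iter_shard(files, rank, 2)) for rank in range(2)]


if __name__ == "__main__":
    for rank, records in enumerate(demo()):
        print(rank, records)
```

With two ranks and four files, each rank only ever opens its own two files. If HFDataset instead loads every file up front before sharding, peak memory would scale with the whole training set rather than with a single record, which is the behavior I am asking about.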


some questions about the MultiModalIterableDataset #102
