
Better understanding how data is loaded in the DataModule setup method for a multi-GPU setting in NLP #7186


I have prepared a DataModule that processes a JSON Lines file with pairs of sentences for a translation task. The file contains 10M lines.
In prepare_data I open the data file, read it into memory, do some basic filtering (remove sentences that are too long, and sort by length so that sentences of similar length are grouped together), then write the result to another file (filtered_data.json).

Do all of that either offline in a different script, or do it in the prepare_data hook.
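A minimal sketch of that filtering step, runnable either offline or from prepare_data. The field names (src, tgt), the whitespace-token length limit, and the output filename are assumptions for illustration, not from the original post:

```python
import json

MAX_LEN = 128  # assumed length limit in whitespace tokens; tune for your model


def filter_and_sort(lines):
    """Keep pairs where both sides fit the length limit, then sort by
    source length so similarly sized sentences end up near each other."""
    pairs = [json.loads(line) for line in lines]
    kept = [
        p for p in pairs
        if len(p["src"].split()) <= MAX_LEN and len(p["tgt"].split()) <= MAX_LEN
    ]
    kept.sort(key=lambda p: len(p["src"].split()))
    return kept


# Example: write the filtered file once, before training starts.
raw = [
    '{"src": "a b c", "tgt": "x y"}',
    '{"src": "' + "w " * 200 + '", "tgt": "too long"}',
]
filtered = filter_and_sort(raw)
with open("filtered_data.json", "w") as f:
    for p in filtered:
        f.write(json.dumps(p) + "\n")
```

Because prepare_data runs only on one process per node, writing the file here avoids every GPU repeating the same pass over 10M lines.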

Next, in the setup method I read filtered_data.json and split it into train and validation sets.

Sounds good. Each GPU/node will run the same setup, so you will have the same train and val split in all of them (initially). Don't split the d…
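One way to guarantee that every process computes the identical split inside setup is a fixed-seed shuffle. This is a hedged sketch, not from the thread; the seed and validation fraction are assumptions:

```python
import random


def deterministic_split(examples, val_fraction=0.1, seed=42):
    """Shuffle indices with a fixed seed so every GPU/node derives the
    identical split, then carve off the validation set."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)  # same seed -> same order on every rank
    n_val = int(len(examples) * val_fraction)
    val = [examples[i] for i in idx[:n_val]]
    train = [examples[i] for i in idx[n_val:]]
    return train, val


train, val = deterministic_split(list(range(100)))
```

Per-GPU sharding of the train set is then left to the DistributedSampler, which Lightning attaches to the DataLoader automatically in DDP mode.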

Answer selected by ksopyla