How does the validation loop work? #4133
Unanswered
Honzys
asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 1 reply
-
TL;DR - assuming you are only looking at the progress bar, this is the correct behavior; training and validation do not actually run in parallel. In more detail - AFAIK the progress bar shows the status of a full training + validation cycle, so when both sets have the same size it looks as if validation starts in the middle of an epoch and as if the validation set is half the size. Moreover, in the case of 2 GPUs the number of steps for the full cycle will equal the number of samples in your dataset, but only half of those steps are needed for the full training epoch and the rest are for validation.
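A back-of-the-envelope sketch of that arithmetic, assuming a dummy dataset of 1_000 samples, identical train/val sets, batch size 1, and 2 GPUs (all of these numbers are made up for illustration):

```python
# Hypothetical numbers: 1_000 samples, batch size 1, identical train/val sets.
dataset_size = 1_000
num_gpus = 2

# DDP shards the data, so each GPU sees half of each set per epoch.
train_batches_per_gpu = dataset_size // num_gpus  # 500 training steps
val_batches_per_gpu = dataset_size // num_gpus    # 500 validation steps

# The progress bar counts the full train + val cycle, so its total equals the
# dataset size even though only the first half of it is training.
progress_bar_total = train_batches_per_gpu + val_batches_per_gpu
print(progress_bar_total)  # 1000
```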
-
Hi,
I am using the latest PyTorch Lightning 1.0.0 and I have run into a problem with validation.
I moved from version 0.7.5. Now validation somehow appears to run in parallel with training: when I am in the middle of an epoch, validation starts. Can I disable this behaviour somehow? I don't want to run those two stages in parallel because of resources (I want to fully utilize the GPUs for training first and then for validation).
I also noticed that the validation dataset appears to be only half the size of the training dataset (it's a dummy case where the training and validation datasets are identical).
I am using a LightningDataModule to provide the dataloaders and running the test on 2 GPUs in DDP mode, roughly as in the sketch below.
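A minimal reproduction sketch of that setup (class name, tensor shapes, and sizes are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class DummyDataModule(pl.LightningDataModule):
    """Dummy DataModule whose train and val sets are the same 1_000 samples."""

    def setup(self, stage=None):
        self.dataset = TensorDataset(torch.randn(1_000, 32),
                                     torch.randint(0, 2, (1_000,)))

    def train_dataloader(self):
        return DataLoader(self.dataset, batch_size=1)

    def val_dataloader(self):
        # Same data as the training set, so both loaders have equal length.
        return DataLoader(self.dataset, batch_size=1)
```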
I also have another question:
Is it possible to run the training not epoch-wise but step-wise? For example, I want to do 10_000 steps (simply `max_steps=10_000`) and I don't care how many epochs that corresponds to. Furthermore, I would like to run validation at every 1_000th global step (not at a step count inside an epoch), so there would be only 10 validations over the whole training. And what if the dataset has only 900 samples? I can't use `val_check_interval`, because it only counts iterations inside each epoch and I would never reach 1_000 steps within one epoch. Is there a way to use the global step counter instead of the per-epoch step counter for this? I know I can use an IterableDataset, but the downside is that I cannot be sure the whole dataset is iterated through in every epoch. A rough sketch of what I mean follows below.
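Roughly what I am trying to express, as a sketch against the 1.0.x Trainer flags I know of (model and datamodule are omitted; the commented-out line is the piece I cannot find a flag for):

```python
import pytorch_lightning as pl

# Sketch of the desired setup: step-wise training with validation every
# 1_000 *global* steps.
trainer = pl.Trainer(
    gpus=2,
    distributed_backend="ddp",
    max_steps=10_000,  # stop after 10_000 global steps, however many epochs that is
    # val_check_interval=1_000,  # counts batches *within* one epoch, so with a
    #                            # 900-sample dataset this can never be reached
)
# trainer.fit(model, datamodule=dm)
```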
Thanks for your great work!