Conversation

@galrotem
Contributor

Summary:
In some cases, a user's dataloader does not have enough items to produce a batch on all ranks. When this happens during predict, the job succeeds but the output table is empty.

We already have a callback that detects two consecutive train epochs with no steps, but predict has no equivalent detection. Predict runs only a single epoch, so we should raise if no steps were performed.
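
For illustration, here is a minimal sketch of the behavior described above, not the code in this change. It assumes a plain Python predict loop over a per-rank dataloader; the names `NoPredictStepsError`, `run_predict_epoch`, and `predict_step` are hypothetical and do not refer to any existing API.

```python
from typing import Callable, Iterable, List, TypeVar

Batch = TypeVar("Batch")
Output = TypeVar("Output")


class NoPredictStepsError(RuntimeError):
    """Hypothetical error raised when a predict epoch performs zero steps."""


def run_predict_epoch(
    dataloader: Iterable[Batch],
    predict_step: Callable[[Batch], Output],
) -> List[Output]:
    """Run a single predict epoch and fail loudly if no batch was seen."""
    outputs: List[Output] = []
    num_steps = 0
    for batch in dataloader:
        outputs.append(predict_step(batch))
        num_steps += 1

    # Predict has only one epoch, so zero steps here means this rank
    # produced no output at all. Raise instead of succeeding silently.
    if num_steps == 0:
        raise NoPredictStepsError(
            "Predict finished without performing any steps; "
            "the dataloader yielded no batches on this rank."
        )
    return outputs
```

With this sketch, calling `run_predict_epoch([], predict_step)` raises instead of returning an empty result, while a non-empty dataloader behaves as before.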

Differential Revision: D87656691

@meta-cla meta-cla bot added the cla signed label Nov 24, 2025
@meta-codesync

meta-codesync bot commented Nov 24, 2025

@galrotem has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87656691.
