Description
nanochat/scripts/base_train.py, line 46 at commit c6b7ab7:
`total_batch_size = 524288 # total desired batch size, in #tokens`
total_batch_size is hardcoded to 524288, which comes from 32 * 2048 * 8 (device_batch_size = 32, max_seq_len = 2048, world_size = 8).
In my script I have modified the batch size and the sequence length, and my world size is 1 (single GPU), so all three parameters that make up total_batch_size differ from the defaults.
@karpathy, is there a better way to set total_batch_size? Should it be equal to my device_batch_size * max_seq_len * world_size?
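For context, here is how I currently understand the relationship, assuming a nanoGPT-style setup where gradient accumulation bridges the gap between the tokens processed per forward/backward pass and total_batch_size (the helper names below are mine, not necessarily the script's):

```python
# A minimal sketch of my reading; variable names other than
# total_batch_size are illustrative, not necessarily those in base_train.py.

device_batch_size = 32  # sequences per GPU per micro-batch (default)
max_seq_len = 2048      # tokens per sequence (default)
world_size = 8          # number of GPUs (default)

# tokens processed in one forward/backward pass across all GPUs
tokens_per_fwdbwd = device_batch_size * max_seq_len * world_size  # 524288

total_batch_size = 524288  # desired tokens per optimizer step

# total_batch_size should be divisible by tokens_per_fwdbwd; gradients are
# then accumulated over the remaining factor before each optimizer step
assert total_batch_size % tokens_per_fwdbwd == 0
grad_accum_steps = total_batch_size // tokens_per_fwdbwd
print(grad_accum_steps)  # 1 with the defaults above

# my single-GPU case: e.g. device_batch_size=16, max_seq_len=1024,
# world_size=1 gives tokens_per_fwdbwd = 16384, so keeping
# total_batch_size = 524288 would mean grad_accum_steps = 32
```

If that reading is right, total_batch_size is the optimization-level batch size, and on a single GPU it could stay at 524288 with gradient accumulation making up the difference, rather than shrinking to device_batch_size * max_seq_len * world_size. Is that the intended usage?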
Is there an industry standard for the optimal total batch size as a function of the number of parameters, similar to how total_tokens is calculated?
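(By "how total_tokens is calculated" I mean the Chinchilla-style rule of thumb of roughly 20 training tokens per model parameter, which is what I understand the script to be doing; the parameter count below is a made-up example:

```python
# Chinchilla-style heuristic, as I understand it (numbers approximate):
num_params = 560e6              # hypothetical ~560M-parameter model
total_tokens = 20 * num_params  # ~11.2B training tokens
```

I'm asking whether an analogous rule of thumb exists for total_batch_size.)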
I think there is an ideal batch-size range for a given model. For example, a d10 model presumably shouldn't use the same total_batch_size as a d30 model, right? Not sure. How is this handled currently? Any research papers for reference?
Would love to hear your thoughts on this. Thank you!