OOM experiments

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |

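
As an aid to reading the configuration columns, the sketch below computes the number of samples consumed per optimizer step. It assumes plain data-parallel training, where that number is the product of `per_device_train_batch_size`, `gradient_accumulation_steps`, and `num_gpus`. The helper name `effective_batch_size` is illustrative and is not part of the benchmark code.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_gpus: int,
) -> int:
    """Samples per optimizer step, assuming plain data parallelism."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


# Values taken from the Mixtral OOM row: batch size 1, 16 accumulation steps, 8 GPUs.
print(effective_batch_size(1, 16, 8))
```

Under this assumption, the Mixtral OOM configuration corresponds to 128 samples per optimizer step.
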
Failed experiments - number of gpus not divisible by ep degree

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |

Failed experiments - number of experts not divisible by ep degree

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 4 | 2564.5 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |

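
The two failure categories above both come down to a divisibility constraint on the expert-parallel (ep) degree. The following is a minimal sketch of a pre-flight check that could flag such configurations before a run is launched. It assumes the ep degree can be read from the `-epN` part of the `framework_config` name, as the names in these tables suggest. The helpers `parse_ep_degree` and `check_ep_degree` are hypothetical and are not part of the benchmark code, and the expert counts passed in the usage lines are illustrative values, not properties of the models listed.

```python
import re
from typing import List, Optional


def parse_ep_degree(framework_config: str) -> Optional[int]:
    """Return N from a name containing '-epN', or None if there is no such part."""
    match = re.search(r"-ep(\d+)(?:-|$)", framework_config)
    return int(match.group(1)) if match else None


def check_ep_degree(ep_degree: int, num_gpus: int, num_experts: int) -> List[str]:
    """Collect the divisibility problems for one configuration (empty list if none)."""
    problems = []
    if num_gpus % ep_degree != 0:
        problems.append(
            f"num_gpus={num_gpus} is not divisible by ep degree {ep_degree}"
        )
    if num_experts % ep_degree != 0:
        problems.append(
            f"num_experts={num_experts} is not divisible by ep degree {ep_degree}"
        )
    return problems


# Illustrative usage; num_experts values here are made up for the example.
ep = parse_ep_degree("moe-scattermoe-granite-ep4-padding-free")
print(check_ep_degree(ep, num_gpus=2, num_experts=8))
print(check_ep_degree(ep, num_gpus=4, num_experts=6))
```

Returning a list of messages rather than raising on the first problem lets a benchmark driver report every violated constraint for a configuration at once and skip it.
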
Delta with previous experiments on OOM

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |

The regression test was run as part of PR #126. The change in metrics may be attributable to `transformers==4.49`, but this needs investigation.