test: Regression tests show increased use of memory causing 1 additional experiment to OOM over last tests #128

@kmehant

OOM experiments

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |

Failed experiments - number of gpus not divisible by ep degree

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |

Failed experiments - number of experts not divisible by ep degree

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 4 | 2564.5 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
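
Both failure categories above come down to a divisibility check against the expert-parallel (ep) degree. A minimal sketch of such a pre-flight check follows, so that invalid combinations could be skipped before launch. The helper names are hypothetical, the ep degree is assumed to be encoded as `-epN` in `framework_config` (as in the names listed above), and `num_experts` is assumed to be read from the model config by the caller; this is not the benchmark harness's actual code.

```python
import re


def ep_degree_from_config(framework_config):
    """Extract N from an '-epN' segment of the config name, or None if absent.

    Assumption: the ep degree is encoded in the name, as in
    'moe-scattermoe-granite-ep4-padding-free'.
    """
    match = re.search(r"-ep(\d+)(?:-|$)", framework_config)
    return int(match.group(1)) if match else None


def ep_degree_problems(num_gpus, num_experts, ep_degree):
    """Return the reasons (possibly none) why this combination cannot run."""
    problems = []
    if num_gpus % ep_degree != 0:
        problems.append("number of gpus not divisible by ep degree")
    if num_experts % ep_degree != 0:
        problems.append("number of experts not divisible by ep degree")
    return problems


if __name__ == "__main__":
    ep = ep_degree_from_config("moe-scattermoe-granite-ep4-padding-free")
    # num_experts=8 is a placeholder value, not taken from any model in this issue.
    print(ep_degree_problems(num_gpus=2, num_experts=8, ep_degree=ep))
```

An empty list means the combination passes both checks; a run can fail both at once, in which case both reasons are returned.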

Delta with previous experiments on OOM

| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |
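
A delta like the one above can be computed by keying each experiment on its configuration columns and taking a set difference between two runs. The sketch below is only an illustration: the function names and the choice of key columns are assumptions, and the previous run's OOM list is a placeholder because it is not included in this issue.

```python
# Columns assumed to identify an experiment across runs (hypothetical choice).
KEY_COLUMNS = (
    "framework_config",
    "model_name_or_path",
    "num_gpus",
    "per_device_train_batch_size",
)


def experiment_key(row):
    """Build a hashable identity for an experiment from its config columns."""
    return tuple(row[column] for column in KEY_COLUMNS)


def newly_failing(previous_oom, current_oom):
    """Return rows that OOM in the current run but did not in the previous one."""
    previous_keys = {experiment_key(row) for row in previous_oom}
    return [row for row in current_oom if experiment_key(row) not in previous_keys]


if __name__ == "__main__":
    current_oom = [
        {
            "framework_config": "none",
            "model_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            "num_gpus": 8,
            "per_device_train_batch_size": 1,
        }
    ]
    previous_oom = []  # placeholder: the earlier run's OOM rows are not in this issue
    for row in newly_failing(previous_oom, current_oom):
        print(row["framework_config"], row["model_name_or_path"], row["num_gpus"])
```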

The regression tests were run as part of PR #126. The change in metrics may be attributable to transformers==4.49, but this needs investigation.
