Out of Memory (OOM)#
Introduction#
Out of Memory (OOM) errors are a common challenge when working with large language models and datasets.
In this guide, we will discuss a few strategies to reduce GPU memory requirements.
Best Practices
- Always monitor memory usage and performance metrics when applying these optimizations, using nvidia-smi and Oumi's telemetry output (see the monitoring sketch below).
- Combine multiple techniques for best results, but introduce changes gradually to isolate their effects.
- Some techniques may trade off speed and model accuracy for memory efficiency. Choose the right balance for your specific use case.
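For quick spot checks alongside nvidia-smi, PyTorch's built-in memory counters are often sufficient. A minimal sketch in plain PyTorch (this helper is illustrative and not part of Oumi's API):

import torch

def log_gpu_memory(tag: str) -> None:
    """Print current and peak GPU memory usage (illustrative helper)."""
    if not torch.cuda.is_available():
        return
    current_gb = torch.cuda.memory_allocated() / 1e9
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"[{tag}] current={current_gb:.2f} GB, peak={peak_gb:.2f} GB")

# Example usage around a training step:
# torch.cuda.reset_peak_memory_stats()
# ... run one training step ...
# log_gpu_memory("after step")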
Training Optimizations#
Reduce batch size:
from oumi.core.configs import TrainingConfig, TrainingParams

config = TrainingConfig(
    training=TrainingParams(
        per_device_train_batch_size=8,  # Decrease this value
        gradient_accumulation_steps=4,  # Increase this value
    ),
)

training:
  per_device_train_batch_size: 8  # Decrease this value
  gradient_accumulation_steps: 4  # Increase this value
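When you lower per_device_train_batch_size, you can raise gradient_accumulation_steps to keep the effective (global) batch size the same while reducing peak activation memory. A quick check of the arithmetic, using illustrative numbers (the 4-GPU setup is an assumption for the example):

# Effective batch size = per-device batch size * accumulation steps * number of devices.
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 4  # hypothetical GPU count

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 128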
Enable gradient checkpointing:
config = TrainingConfig(
    training=TrainingParams(
        enable_gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    ),
)

training:
  enable_gradient_checkpointing: true
  gradient_checkpointing_kwargs:
    use_reentrant: false
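Gradient checkpointing trades compute for memory: activations are dropped during the forward pass and recomputed during the backward pass. A minimal sketch of the underlying idea in plain PyTorch (not Oumi's internal implementation):

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()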
Use fused optimizers:
config = TrainingConfig(
    training=TrainingParams(
        optimizer="adamw_torch_fused",
    ),
)

training:
  optimizer: adamw_torch_fused
Use mixed precision training:
config = TrainingConfig(
    training=TrainingParams(
        mixed_precision_dtype="bf16",  # or "fp16"
    ),
)

training:
  mixed_precision_dtype: bf16  # or fp16
Train in half-precision:
config = TrainingConfig(
    model=ModelParams(
        torch_dtype_str="bfloat16",  # or "float16"
    ),
)

model:
  torch_dtype_str: bfloat16  # or float16
Empty GPU cache more frequently:
config = TrainingConfig(
    training=TrainingParams(
        empty_device_cache_steps=50,  # Clear GPU cache every 50 steps
    ),
)

training:
  empty_device_cache_steps: 50  # Clear GPU cache every 50 steps
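Clearing the device cache periodically returns unused blocks held by PyTorch's caching allocator, which can help when fragmentation, rather than live tensors, is causing OOMs. Conceptually, the setting above is similar to doing something like this in a manual training loop (a sketch; Oumi's exact behavior may differ):

import torch

empty_device_cache_steps = 50  # mirrors the setting above

for step in range(1000):
    # ... forward / backward / optimizer step would go here ...
    if torch.cuda.is_available() and step % empty_device_cache_steps == 0:
        torch.cuda.empty_cache()  # release cached, unused blocks back to the driver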
Tune CUDA allocator settings:
It is sometimes possible to eliminate OOM errors (e.g., those caused by GPU VRAM fragmentation) by tuning the CUDA allocator configuration as described in PyTorch's Optimizing Memory Usage guide, e.g., by switching to a different allocator or tuning garbage collection settings. Example:
envs:
  PYTORCH_CUDA_ALLOC_CONF: "garbage_collection_threshold:0.8,max_split_size_mb:128"
export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.8,max_split_size_mb:128"
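To check whether fragmentation is actually the issue, compare allocated vs. reserved memory and inspect PyTorch's allocator statistics. A small sketch in plain PyTorch:

import torch

if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1e9
    reserved_gb = torch.cuda.memory_reserved() / 1e9
    # A large gap between reserved and allocated memory suggests fragmentation,
    # which allocator settings such as max_split_size_mb can help mitigate.
    print(f"allocated={allocated_gb:.2f} GB, reserved={reserved_gb:.2f} GB")
    print(torch.cuda.memory_summary())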
Use Paged Adam:
config = TrainingConfig(
    training=TrainingParams(
        optimizer="paged_adamw_32bit",
    ),
)

training:
  optimizer: paged_adamw_32bit
Note
Paged Adam requires bitsandbytes to be installed.
Model Configuration#
Use flash attention:
config = TrainingConfig(
    model=ModelParams(
        attn_implementation="sdpa",  # or "flash_attention2"
    ),
)

model:
  attn_implementation: sdpa  # or flash_attention2
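Both options avoid materializing the full seq x seq attention matrix, which is where most of the attention memory goes at long sequence lengths. For reference, SDPA is exposed directly in PyTorch as a standalone function (an illustration, independent of the Oumi config):

import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# PyTorch picks a memory-efficient or flash backend when one is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])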
Enable model compilation:
config = TrainingConfig(
    training=TrainingParams(
        compile=True,
    ),
)

training:
  compile: true
Enable Liger Kernels:
from oumi.core.configs import ModelParams

config = TrainingConfig(
    model=ModelParams(
        enable_liger_kernel=True,
    ),
)

model:
  enable_liger_kernel: true
Reduce training sequence length:
config = TrainingConfig(
    model=ModelParams(
        model_max_length=2048,  # Reduce sequence length
    ),
)

model:
  model_max_length: 2048  # Reduce sequence length
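Activation memory grows with sequence length, and a naive attention implementation materializes a seq x seq score matrix per head, so that term grows quadratically. A rough back-of-the-envelope estimate with illustrative numbers (the helper below is hypothetical, for illustration only):

# Memory for a single attention score matrix, assuming a naive (non-fused)
# implementation that materializes [batch, heads, seq, seq] in bf16 (2 bytes).
def attention_scores_gb(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> float:
    return batch * heads * seq_len * seq_len * bytes_per_el / 1e9

print(attention_scores_gb(batch=8, heads=32, seq_len=4096))  # ~8.6 GB per layer
print(attention_scores_gb(batch=8, heads=32, seq_len=2048))  # ~2.1 GB per layer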
Selectively freeze layers:
config = TrainingConfig(
    model=ModelParams(
        freeze_layers=["layer.0", "layer.1", "layer.2"],
    ),
)

model:
  freeze_layers:
    - layer.0
    - layer.1
    - layer.2
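Frozen parameters need no gradients and no optimizer state, which is where the memory savings come from. In plain PyTorch terms, freezing by name prefix looks roughly like this (a conceptual sketch with hypothetical layer names, not Oumi's implementation):

import torch

model = torch.nn.Transformer(num_encoder_layers=4, num_decoder_layers=4)

frozen_prefixes = ("encoder.layers.0", "encoder.layers.1")  # hypothetical layer names
for name, param in model.named_parameters():
    if name.startswith(frozen_prefixes):
        param.requires_grad = False  # no gradients or optimizer state for these weights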
Enable ring attention:
Added in version 0.2.0: (Coming soon)
config = TrainingConfig(
    model=ModelParams(
        attn_implementation="ring_attention",
    ),
)

model:
  attn_implementation: ring_attention
Parameter-Efficient Fine-Tuning (PEFT)#
Enable LoRA:
from oumi.core.configs import PeftParams

config = TrainingConfig(
    training=TrainingParams(use_peft=True),
    peft=PeftParams(
        lora_r=16,
        lora_alpha=32,
        lora_dropout=0.05,
    ),
)

training:
  use_peft: true
peft:
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.05
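LoRA reduces memory primarily by shrinking the number of trainable parameters, and with them the gradients and optimizer states. A rough count for a single linear layer, using illustrative dimensions:

# For a frozen weight of shape (d_out, d_in), LoRA trains two small matrices
# of shapes (d_out, r) and (r, d_in), i.e. r * (d_out + d_in) parameters.
d_out, d_in, r = 4096, 4096, 16

full_params = d_out * d_in
lora_params = r * (d_out + d_in)

print(full_params)                # 16777216
print(lora_params)                # 131072
print(lora_params / full_params)  # ~0.0078 -> under 1% of the layer is trainable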
Distributed Training with FSDP#
If you have access to multiple GPUs, you can leverage FSDP to distribute the training process across them. To run FSDP jobs, make sure to invoke your training job with torchrun so that it runs on multiple GPUs/nodes. We also provide the oumi distributed wrapper, which automatically tries to set the flags needed by torchrun. For example, you can simply run oumi distributed torchrun -m oumi train -c path/to/train.yaml.
Enable distributed training:
from oumi.core.configs import FSDPParams
from oumi.core.configs.params.fsdp_params import ShardingStrategy

config = TrainingConfig(
    fsdp=FSDPParams(
        enable_fsdp=True,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
    ),
)

fsdp:
  enable_fsdp: true
  sharding_strategy: FULL_SHARD
Enable CPU offloading:
config = TrainingConfig(
    fsdp=FSDPParams(
        enable_fsdp=True,
        cpu_offload=True,
    ),
)

fsdp:
  enable_fsdp: true
  cpu_offload: true
Disable Forward Prefetch:
config = TrainingConfig(
    fsdp=FSDPParams(
        enable_fsdp=True,
        forward_prefetch=False,
    ),
)

fsdp:
  enable_fsdp: true
  forward_prefetch: false
Disable Backward Prefetch:
config = TrainingConfig(
    fsdp=FSDPParams(
        enable_fsdp=True,
        backward_prefetch=BackwardPrefetch.NO_PREFETCH,
    ),
)

fsdp:
  enable_fsdp: true
  backward_prefetch: NO_PREFETCH
Attention
Disabling FSDP’s forward and backward prefetch can lead to significantly slower training times; use with caution.