Training#
Overview#
Oumi provides an end-to-end training framework designed to handle everything from small fine-tuning experiments to large-scale pre-training runs.
Oumi enables you to start small, in a notebook or on a local machine, and easily scale up as your needs grow, while maintaining a consistent interface across different training scenarios and environments.
Key features include:
- **Multiple Training Methods**: Supervised Fine-Tuning (SFT) to adapt models to your specific tasks, Vision-Language SFT for multimodal models, Pretraining for training from scratch, and Direct Preference Optimization (DPO) for preference-based fine-tuning
- **Parameter-Efficient Fine-Tuning (PEFT) & Full Fine-Tuning (FFT)**: Support for multiple PEFT methods, including LoRA for efficient adapter training and QLoRA for quantized fine-tuning with 4-bit precision, as well as full fine-tuning for maximum performance
- **Flexible Environments**: Train on a local machine, with VSCode integration, in Jupyter notebooks, or in a cloud environment
- **Production-Ready**: Ensure reproducibility through YAML-based configurations and gain insights with comprehensive monitoring & debugging tools
- **Scalable Training**: Scale from single-GPU training to multi-node distributed training using Distributed Data Parallel (DDP) or Fully Sharded Data Parallel (FSDP)
Quick Start#
The fastest way to get started with training is using one of our pre-configured recipes.
For example, to train a small model (SmolLM-135M) on a sample dataset (tatsu-lab/alpaca), you can use the following command:
```bash
# Train a small model (SmolLM-135M)
oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml
```
Or, equivalently, using the Python API:

```python
from oumi import train
from oumi.core.configs import TrainingConfig

# Load config from file
config = TrainingConfig.from_yaml("configs/recipes/smollm/sft/135m/quickstart_train.yaml")

# Start training
train(config)
```
Running this config will:
- Download a small pre-trained model: SmolLM-135M
- Load a sample dataset: tatsu-lab/alpaca
- Run supervised fine-tuning using the `TRL_SFT` trainer
- Save the trained model to `config.output_dir`
Configuration Guide#
At the heart of Oumi’s training system is a YAML-based configuration framework. This allows you to define all aspects of your training run in a single, version-controlled file.
Here’s a basic example with key parameters explained:
```yaml
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"  # Base model to fine-tune
  trust_remote_code: true  # Required for some model architectures
  dtype: "bfloat16"        # Training precision (float32, float16, or bfloat16)

data:
  train:  # Training dataset mixture
    datasets:
      - dataset_name: "tatsu-lab/alpaca"  # Training dataset
        split: "train"                    # Dataset split to use

training:
  output_dir: "output/my_training_run"  # Where to save outputs
  num_train_epochs: 3                   # Number of training epochs
  learning_rate: 5e-5                   # Learning rate
  save_steps: 100                       # Checkpoint frequency
```
You can override any value either through the CLI or programmatically:
```bash
oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml \
  --training.learning_rate 1e-4 \
  --training.max_steps 30
```
```python
from oumi import train
from oumi.core.configs import TrainingConfig

# Load base config
config = TrainingConfig.from_yaml("configs/recipes/smollm/sft/135m/quickstart_train.yaml")

# Override specific values
config.training.learning_rate = 1e-4
config.training.max_steps = 30

# Start training
train(config)
```
Common Workflows#
In the following sections, we’ll cover some common workflows for training.
Fine-tuning a Pre-trained Model#
The simplest workflow is to fine-tune a pre-trained model on a dataset. The following config will fully fine-tune the model using SFT (supervised fine-tuning).
```yaml
model:
  model_name: "meta-llama/Llama-3.2-3B-Instruct"  # Replace with your model
  trust_remote_code: true
  dtype: "bfloat16"

data:
  train:  # Training dataset mixture; can be a single dataset or a list of datasets
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"  # Replace with your dataset, or add more datasets
        split: "train"

training:
  output_dir: "output/llama-finetuned"  # Where to save outputs
  optimizer: "adamw_torch_fused"
  learning_rate: 2e-5
  max_steps: 10  # Number of training steps
```
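To launch the run, save the config to a file (here we use the hypothetical name `my_finetune.yaml`) and point `oumi train` at it, just like the quickstart recipe:

```bash
oumi train -c my_finetune.yaml
```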
Using Parameter-Efficient Fine-tuning (PEFT)#
Excellent results can be achieved at a fraction of the computational cost by fine-tuning your network with Low-Rank Adaptation (LoRA) adapters instead of updating all of the original parameters. The following adaptation enables parameter-efficient fine-tuning with only a few additions:
```yaml
model:
  model_name: "meta-llama/Llama-3.2-3B-Instruct"  # Replace with your model
  trust_remote_code: true
  dtype: "bfloat16"

data:
  train:  # Training dataset mixture; can be a single dataset or a list of datasets
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"  # Replace with your dataset, or add more datasets
        split: "train"

training:
  output_dir: "output/llama-finetuned"  # Where to save outputs
  optimizer: "adamw_torch_fused"
  learning_rate: 2e-5
  max_steps: 10    # Number of training steps
  use_peft: True   # Activate Parameter-Efficient Fine-Tuning

peft:  # Control key hyper-parameters of the PEFT training process
  lora_r: 64
  lora_alpha: 128
  lora_target_modules:  # Select the modules for which adapters will be added
    - "q_proj"
    - "v_proj"
    - "o_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"
```
Fine-tuning a Vision-Language Model#
Multimodal support in Oumi works much like text-only training, with only a few config changes (e.g., for data collation). You can find more details in Vision-Language SFT, VL SFT Datasets, Multi-modal Inference, and Multi-modal Benchmarks.
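As a rough sketch of what those changes look like, the config below swaps in a vision-language model, a multimodal dataset, and a vision-aware collator. The specific model, dataset, and collator names are illustrative assumptions; see the linked pages for tested recipes:

```yaml
model:
  model_name: "Qwen/Qwen2-VL-2B-Instruct"  # Illustrative vision-language model
  trust_remote_code: true
  dtype: "bfloat16"

data:
  train:
    collator_name: "vision_language_with_padding"  # Multimodal data collation (name assumed)
    datasets:
      - dataset_name: "merve/vqav2-small"  # Illustrative VQA dataset
        split: "validation"

training:
  output_dir: "output/vlm-finetuned"
```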
Multi-GPU Training#
To train with multiple GPUs, we can extend the same configuration to distributed training, using either DDP or FSDP:
```bash
# Using DDP (DistributedDataParallel)
oumi distributed torchrun \
  -m oumi train \
  -c configs/recipes/llama3_2/sft/3b_full/train.yaml
```

```bash
# Using FSDP (Fully Sharded Data Parallel)
oumi distributed torchrun \
  -m oumi train \
  -c configs/recipes/llama3_2/sft/3b_full/train.yaml \
  --fsdp.enable_fsdp true \
  --fsdp.sharding_strategy FULL_SHARD
```
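The FSDP settings can also live in the YAML config rather than being passed as CLI overrides. The snippet below mirrors the two flags used above, assuming they map to a top-level `fsdp` section (the same convention as the `--training.*` overrides earlier):

```yaml
fsdp:
  enable_fsdp: true
  sharding_strategy: "FULL_SHARD"  # Shard parameters, gradients, and optimizer state
```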
Launch Remote Training#
To kick off a training run in a cloud environment, you can use the launcher system.
This will create a GCP job with the specified configuration and start training:
```bash
oumi launch up -c configs/recipes/llama3_2/sft/3b_full/gcp_job.yaml --cluster llama3b-sft
```
Thanks to the SkyPilot integration, most cloud providers are supported; make sure to check out Running Jobs on Clusters for more details.
Multi-node Training#
To train with multiple nodes using the Oumi launcher, set `num_nodes` to your desired number of nodes.
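For example, a hypothetical excerpt of a job config (such as the GCP job config referenced above) might request two nodes like this; the exact field placement may differ in your config:

```yaml
num_nodes: 2  # Number of nodes to launch for the training job

resources:
  cloud: gcp
  accelerators: "A100:4"  # Illustrative per-node accelerator request
```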
Using Custom Datasets#
To use your own datasets, you can specify the path to the dataset in the configuration.
```yaml
data:
  train:
    datasets:
      - dataset_name: "text_sft"
        dataset_path: "/path/to/dataset.jsonl"
```
In this case, the dataset is expected to be in the `conversation` format. See Chat Formats for all the supported formats.
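For reference, each line of such a JSONL file typically holds one conversation. A minimal illustrative record (the exact schema is described in Chat Formats) could look like:

```json
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
```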
Training Output#
Throughout the training process, we generate logs and artifacts in the `config.output_dir` directory to help you track progress and debug issues.
This includes model checkpoints for resuming training, detailed training logs, TensorBoard events for visualization, and a backup of the training configuration.
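As a rough illustration only (the exact files depend on the trainer and your settings), the output directory might contain something like:

```text
output/my_training_run/
├── checkpoint-100/       # Intermediate checkpoint (saved every `save_steps`)
├── logs/                 # Detailed training logs
├── tensorboard/          # TensorBoard event files for visualization
└── training_config.yaml  # Backup of the training configuration
```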
Next Steps#
Now that we've covered the basics, as a next step you can:

- Learn about the different training methods
- Set up your training environment and get started with training
- Explore the configuration options