Local Training#

This guide covers how to train models on your local machine or server using Oumi’s command-line interface, whether you’re working on a laptop or a multi-GPU server.

For cloud-based training options, see Running Jobs on Clusters.

Prerequisites#

Before starting local training, ensure you have:

  1. Hardware Requirements

    • CUDA-capable GPU(s) recommended

    • Sufficient RAM (16GB minimum)

    • Adequate disk space for storing your models and datasets

  2. Software Setup

    • Python environment configured and oumi installed

For detailed installation instructions, refer to our Installation guide.

Basic Usage#

Command Line Interface#

The main command for training is oumi train. The CLI provides a flexible way to configure your training runs through both YAML configs and command-line parameter overrides.

# Basic usage
oumi train -c path/to/config.yaml

# With parameter overrides
oumi train -c path/to/config.yaml \
  --training.learning_rate 1e-4 \
  --training.num_train_epochs 5

For a complete reference of configuration options, see Training Configuration.
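Under the hood, each --section.field override addresses a nested key in the YAML config. A minimal illustrative sketch of how such a dotted override can be merged into a nested config dict (this is not Oumi’s actual parser; the function name is hypothetical):

```python
def apply_override(config: dict, dotted_key: str, value):
    """Set a nested config value addressed by a dotted key like 'training.learning_rate'."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        # Descend into (or create) each intermediate section.
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

# Values mirroring the override example above.
config = {"training": {"learning_rate": 2e-5, "num_train_epochs": 3}}
apply_override(config, "training.learning_rate", 1e-4)
apply_override(config, "training.num_train_epochs", 5)
print(config["training"])
```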

Training with GPUs#

Oumi supports both single and multi-GPU training setups.

Single GPU Training#

For training on a specific GPU:

# Using CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 oumi train -c config.yaml

# Using device parameter
oumi train -c config.yaml --model.device_map cuda:0
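The same device selection can also be pinned in the config file rather than passed on the command line (the field name below is taken from the override above):

```yaml
model:
  device_map: "cuda:0"
```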

Multi-GPU Training#

For distributed training across multiple GPUs:

# Using DDP
torchrun --standalone --nproc-per-node=<NUM_GPUS> -m oumi train -c config.yaml

# Using FSDP
oumi distributed torchrun -m oumi train -c config.yaml --fsdp.enable_fsdp true

For more details on distributed training options, see Training.
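As with other settings, the FSDP flag can live in your config file instead of being passed as a command-line override (only enable_fsdp is shown here, since it is the field used in the command above; other FSDP options vary by Oumi version):

```yaml
fsdp:
  enable_fsdp: true
```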

Monitoring#

Effective monitoring is crucial for understanding your model’s training progress. You have several options:

Terminal Output#

Monitor training progress directly in the terminal:

# Log training metrics every 10 steps
oumi train -c config.yaml --training.logging_steps 10

TensorBoard#

Monitor metrics with TensorBoard for rich visualizations:

First, add the following to your train.yaml config file:

training:
  enable_tensorboard: true
  output_dir: oumi_output_dir
  logging_steps: 10

Then run the following command to start TensorBoard:

# Start TensorBoard
tensorboard --logdir oumi_output_dir

Weights & Biases#

You can also track experiments with Weights & Biases (W&B), which is especially useful for collaborative projects. Make sure you are authenticated first (for example, by running wandb login once on your machine), then enable it in your config:

training:
  enable_wandb: true
  run_name: "experiment-1"
  logging_steps: 10

For more monitoring options and best practices, see Monitoring & Debugging.

Next Steps#