Quickstart#

📋 Prerequisites#

Let’s start by installing Oumi. You can install the latest stable version with the following commands:

pip install oumi

# Optional: If you have an Nvidia or AMD GPU, you can install the GPU dependencies
pip install oumi[gpu]

If you need help setting up your environment (Python, pip, Git, etc.), you can find detailed instructions in the Dev Environment Setup guide. The installation guide offers more details on how to install Oumi for your specific environment and use case.
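
To double-check that the installation succeeded, you can ask pip for the installed package details:

pip show oumi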

👋 Introduction#

Now that we have Oumi installed, let’s get started with the basics! We’re going to use the oumi command-line interface (CLI) to train, evaluate, and run inference with a model.

We’ll use a small model (SmolLM-135M) so that the examples run fast on both CPU and GPU. SmolLM is a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. You can learn more about them in this blog post.

For a full list of recipes, including larger models like Llama 3.2, you can explore the recipes page.

💻 Oumi CLI#

The general structure of Oumi CLI commands is:

oumi <command> [options]

For detailed help on any command, you can use the --help option:

oumi --help            # for general help
oumi <command> --help  # for command-specific help

The available commands are:

  • train

  • evaluate

  • infer

  • launch

  • judge

Let’s go through some examples of each command.

📚 Training#

You can quickly start training a model using any of the existing recipes or your own custom configs. The following command will start training using the recipe in configs/recipes/smollm/sft/135m/quickstart_train.yaml:

configs/recipes/smollm/sft/135m/quickstart_train.yaml
# Class: oumi.core.configs.TrainingConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py

# SFT config for SmolLM 135M Instruct.

model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  load_pretrained_weights: True
  trust_remote_code: True

data:
  train:
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"
    target_col: "prompt"

training:
  trainer_type: TRL_SFT
  save_final_model: True
  save_steps: 100
  max_steps: 10
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4

  ddp_find_unused_parameters: False
  optimizer: "adamw_torch"
  learning_rate: 2.0e-05
  compile: False

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  logging_steps: 5
  log_model_summary: False
  empty_device_cache_steps: 50
  output_dir: "output/smollm135m.fft"
  include_performance_metrics: True

oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml

You can easily override any parameters directly in the command line, for example:

oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml \
  --training.max_steps 20 \
  --training.learning_rate 1e-4 \
  --training.output_dir output/smollm-135m-sft

To run the same recipe on your own dataset (e.g., in our supported JSON or JSONL formats), you can override the dataset name and path. You can try this functionality out by downloading the alpaca-cleaned dataset manually via the Hugging Face CLI, then including that local path in your run.

huggingface-cli download yahma/alpaca-cleaned --repo-type dataset --local-dir /path/to/local/dataset

oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml \
  --data.train.datasets "[{dataset_name: text_sft, dataset_path: /path/to/local/dataset}]" \
  --training.output_dir output/smollm-135m-sft-custom
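
After the download completes, you can sanity-check the local copy before pointing your run at it (the exact file names will vary by dataset):

ls -lh /path/to/local/dataset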

You can also train on multiple GPUs (make sure you have installed the GPU dependencies first).

For example, if you have a machine with 4 GPUs, you can run this command to launch a local distributed training run:

oumi distributed torchrun \
  -m oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml \
  --training.output_dir output/smollm-135m-sft-dist

You can also use torchrun directly in standalone mode:

torchrun --standalone --nproc-per-node 4 --log-dir ./logs \
  -m oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml \
  --training.output_dir output/smollm-135m-sft-dist
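
While a distributed run is in progress, it can be useful to confirm that all GPUs are actually busy. On Nvidia machines, you can watch utilization from a second terminal (nvidia-smi comes with the Nvidia drivers, not with Oumi):

watch -n 1 nvidia-smi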

📊 Evaluation#

To evaluate a trained model:

configs/recipes/smollm/evaluation/135m/quickstart_eval.yaml
# Class: oumi.core.configs.EvaluationConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/evaluation_config.py

# Eval config for SmolLM 135M Instruct.

model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  load_pretrained_weights: True
  trust_remote_code: True

generation:
  batch_size: 4

tasks:
  # For all available tasks, see https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html
  - evaluation_platform: lm_harness
    task_name: mmlu_college_computer_science
    eval_kwargs:
      num_fewshot: 5

Using a model downloaded from HuggingFace:

oumi evaluate -c configs/recipes/smollm/evaluation/135m/quickstart_eval.yaml \
  --model.model_name HuggingFaceTB/SmolLM2-135M-Instruct

Or, with our newly trained model saved on disk:

oumi evaluate -c configs/recipes/smollm/evaluation/135m/quickstart_eval.yaml \
  --model.model_name output/smollm135m.fft

If you saved your model to a different directory such as output/smollm-135m-sft-dist, you need only change --model.model_name.

To explore the benchmarks that our evaluations support, including HuggingFace leaderboards and AlpacaEval, visit our evaluation guide.

🧠 Inference#

To run inference with a trained model:

configs/recipes/smollm/inference/135m_infer.yaml
# Class: oumi.core.configs.InferenceConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/inference_config.py

# Inference config for SmolLM 135M Instruct.

model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  adapter_model: null # Update for LoRA-tuned models.
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  load_pretrained_weights: True
  trust_remote_code: True

generation:
  max_new_tokens: 100
  batch_size: 4

engine: NATIVE

Using a model downloaded from HuggingFace:

oumi infer -c configs/recipes/smollm/inference/135m_infer.yaml \
  --generation.max_new_tokens 40 \
  --generation.temperature 0.7 \
  --interactive

Or, with our newly trained model saved on disk:

oumi infer -c configs/recipes/smollm/inference/135m_infer.yaml \
  --model.model_name output/smollm135m.fft \
  --generation.max_new_tokens 40 \
  --generation.temperature 0.7 \
  --interactive

To learn more about running inference locally or remotely (including OpenAI, Google, Anthropic APIs) and leveraging inference engines to parallelize and speed up your jobs, visit our inference guide.

☁️ Launching Jobs in the Cloud#

So far, we have been using Oumi locally. But one of Oumi’s most exciting features, compared to similar frameworks, is its integrated ability to launch jobs directly to the cloud (GCP, AWS, Azure, etc.).

This section of the quickstart is a little different from the others, so please read the next bit carefully before you proceed.

  • This tutorial uses GCP; you’ll need a GCP account. You can also use other cloud providers, such as AWS, Azure, etc. See running jobs remotely for more details.

Configuring your GCP account:

  • Oumi uses SkyPilot under the hood, and the recommended way to use SkyPilot with GCP is via a GCP service account.

  • You will need to install Oumi with GCP support: pip install oumi[gcp]. Please note that we recommend setting up a different environment for each cloud provider you wish to use.

  • Depending on your precise use case, you may also need to install a few other packages from Google:

conda install -c conda-forge google-cloud-sdk -y
conda install -c conda-forge google-api-python-client -y
conda install -c conda-forge google-cloud-storage -y

  • There are multiple ways to handle credentials with GCP service accounts. We recommend creating a service account key in JSON format and downloading it to the machine from which you plan to launch the cloud job. After that, you’ll need to run a few more setup commands:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS
gcloud config set project <YOUR_PROJECT>

You can now run sky check to confirm that GCP is enabled:
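
sky check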

If you get stuck, please refer to our running jobs remotely section, as well as the documentation for GCP and SkyPilot linked above, for more information.

Launching your first cloud job with Oumi#

Once the one-time setup is out of the way, launching a new cloud job with Oumi is very simple.

configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml
# Class: oumi.core.configs.JobConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/job_config.py

# Config to tune smollm 135M on 1 GCP node.
# Example command:
# oumi launch up -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml --cluster smollm-135m-fft
name: smollm-135m-sft

resources:
  cloud: gcp
  accelerators: "A100:1"
  use_spot: false
  disk_size: 100 # Disk size in GBs

working_dir: .

envs:
  OUMI_RUN_NAME: smollm135m.train
  # https://github.com/huggingface/tokenizers/issues/899#issuecomment-1027739758
  TOKENIZERS_PARALLELISM: false

setup: |
  set -e
  pip install uv && uv pip install oumi[gpu]

run: |
  set -e  # Exit if any command failed.
  source ./configs/examples/misc/sky_init.sh

  set -x
  oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml

  echo "Training complete!"

oumi launch up -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml

To launch an evaluation job:

configs/recipes/smollm/evaluation/135m/quickstart_gcp_job.yaml
# Class: oumi.core.configs.JobConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/job_config.py

# Config to evaluate smollm 135M on 1 GCP node.
# Example command:
# oumi launch up -c configs/recipes/smollm/evaluation/135m/quickstart_gcp_job.yaml --cluster smollm-135m-eval
name: smollm-135m-eval

resources:
  cloud: gcp
  accelerators: "A100:1"
  use_spot: false
  disk_size: 100 # Disk size in GBs

working_dir: .

envs:
  OUMI_RUN_NAME: smollm135m.eval
  # https://github.com/huggingface/tokenizers/issues/899#issuecomment-1027739758
  TOKENIZERS_PARALLELISM: false

setup: |
  set -e
  pip install uv && uv pip install oumi[gpu,evaluation]

run: |
  set -e  # Exit if any command failed.
  source ./configs/examples/misc/sky_init.sh

  set -x
  oumi evaluate -c configs/recipes/smollm/evaluation/135m/quickstart_eval.yaml

  echo "Evaluation complete!"

oumi launch up -c configs/recipes/smollm/evaluation/135m/quickstart_gcp_job.yaml

After you run one of the above commands, you should see some console output from Oumi which describes how your job is being provisioned and how the cloud installation is proceeding. In particular, your cluster will be assigned a semi-random name such as sky-7fdd-ab183, which you should take note of.

After 15 minutes or so, Oumi should tell you that the run is complete.
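
You can also check on your cluster at any time using SkyPilot directly (the sky CLI comes from SkyPilot, which Oumi uses under the hood):

sky status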

If you want to see the logs from your cloud run, you can pull them down to your local machine:

sky logs --sync-down sky-7fdd-ab183

Cloud services can be expensive! Please keep an eye on your costs, and don’t forget to tear down your cluster when you’re done with this tutorial.

sky down sky-7fdd-ab183

This command will destroy your cluster, including all data on those remote machines, so save your logs and artifacts first!

🧭 What’s next?#

Although this example used GCP, Oumi natively supports a wide range of cloud providers. To explore the cloud providers that we support, visit running jobs remotely.

🔗 Community#

⭐ If you like Oumi and you would like to support it, please give it a star on GitHub.

👋 If you are interested in contributing, please read the Contributor’s Guide.