Running Jobs on Clusters#
In addition to running Oumi locally, you can use the oumi launch
command in the Oumi CLI to run jobs on remote clusters. It provides a unified interface for running your code, allowing you to seamlessly switch between popular cloud providers and your own custom clusters!
Overview#
The Oumi Launcher operates using three key concepts:

Jobs: A job is a unit of work, such as running training or model evaluation. This can be any script you'd like!

Clusters: A cluster is a set of dedicated hardware upon which jobs are run. A cluster could be as simple as a cloud VM environment.

Clouds: A cloud is a resource provider that manages clusters. These include GCP, AWS, Lambda, RunPod, etc.
When you submit a job to the launcher, it queues your job on the proper cluster. If your desired cloud does not have an appropriate cluster for running your job, the launcher will try to create one on the fly!
Setup#
The Oumi launcher integrates with SkyPilot to launch jobs on various cloud providers. To run on a cloud GPU cluster, first make sure you have the dependencies installed for your desired cloud provider:
pip install oumi[aws] # For Amazon Web Services
pip install oumi[azure] # For Microsoft Azure
pip install oumi[gcp] # For Google Cloud Platform
pip install oumi[lambda] # For Lambda Cloud
pip install oumi[runpod] # For RunPod
Then, you need to enable your desired cloud provider in SkyPilot. Run sky check
to see which providers are already enabled, along with instructions on how to enable the ones that aren't. More detailed setup instructions can be found in SkyPilot's documentation.
Quickstart#
Got a TrainingConfig you want to run on the cloud?
Just replace the run section of one of the configs below with your training command and kick off the job via our CLI:
oumi launch up -c ./your_job.yaml
sample-gcp-job.yaml
name: sample-gcp-job

resources:
  cloud: gcp
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.gcp.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-aws-job.yaml
name: sample-aws-job

resources:
  cloud: aws
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.aws.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-azure-job.yaml
name: sample-azure-job

resources:
  cloud: azure
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.azure.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-runpod-job.yaml
name: sample-runpod-job

resources:
  cloud: runpod
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.runpod.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-lambda-job.yaml
name: sample-lambda-job

resources:
  cloud: lambda
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.lambda.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e # Exit if any command failed.
  oumi train -c ./path/to/your/config
Note

Don't forget:

- Make sure your training config is saved under working_dir so it will be copied by your job.
- Update the setup section if you need to install any custom dependencies.
- Update accelerators if you need to run on a specific set of GPUs (e.g. "A100-80GB:4" creates a job with 4x A100-80GBs).
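If you'd rather make those tweaks programmatically, you can load a sample config with the Python API covered later in this guide and adjust the same fields before launching. The sketch below is illustrative only: the config path, cluster name, and the extra wandb dependency are placeholders, not part of the sample configs.

import oumi.launcher as launcher

# Load one of the sample job configs from this page.
job_config = launcher.JobConfig.from_yaml("./your_job.yaml")

# Request 4x A100-80GB GPUs and two nodes instead of the defaults.
job_config.resources.accelerators = "A100-80GB:4"
job_config.num_nodes = 2

# Install any custom dependencies your training command needs.
job_config.setup = "set -e\npip install uv && uv pip install 'oumi[gpu]' wandb"

# Launch on a cluster named "my-cluster" (created if it doesn't exist).
launcher.up(job_config, "my-cluster")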
Defining a Job#
Like most configurable pieces of Oumi, jobs are defined via YAML configs. In this case, every job is defined by a JobConfig.

When creating a job, there are several important fields you should be aware of:

- resources: where you specify resource requirements (cloud to use, GPUs, disk size, etc.) via oumi.launcher.JobResources
- setup: an optional script that is run when a cluster is created
- run: the main script to run for your job
- working_dir: the local directory to be copied to the cluster for use during execution
A sample job is provided below:
configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml
# Class: oumi.core.configs.JobConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/job_config.py

# Config to tune smollm 135M on 1 GCP node.
# Example command:
# oumi launch up -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml --cluster smollm-135m-fft

name: smollm-135m-sft

resources:
  cloud: gcp
  accelerators: "A100:1"
  use_spot: false
  disk_size: 100 # Disk size in GBs

working_dir: .

envs:
  OUMI_RUN_NAME: smollm135m.train
  # https://github.com/huggingface/tokenizers/issues/899#issuecomment-1027739758
  TOKENIZERS_PARALLELISM: false

setup: |
  set -e
  pip install uv && uv pip install oumi[gpu]

run: |
  set -e # Exit if any command failed.
  source ./configs/examples/misc/sky_init.sh

  set -x
  oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml

  echo "Training complete!"
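If you prefer to define the same job in Python rather than YAML, you can construct it directly with the launcher API. This is a minimal sketch, assuming JobConfig and oumi.launcher.JobResources accept these keyword arguments mirroring the YAML keys above; check the JobConfig class linked in the config header for the authoritative field list.

import oumi.launcher as launcher

# A sketch of the same job built programmatically; field names mirror the YAML keys.
job_config = launcher.JobConfig(
    name="smollm-135m-sft",
    resources=launcher.JobResources(
        cloud="gcp",
        accelerators="A100:1",
        use_spot=False,
        disk_size=100,
    ),
    working_dir=".",
    envs={"OUMI_RUN_NAME": "smollm135m.train"},
    setup="set -e\npip install uv && uv pip install oumi[gpu]",
    run="set -e\noumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml",
)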
Core Functionality#
Launching jobs remotely is available via both the Oumi CLI and our Python API (launcher).

oumi launch provides you with all the capabilities you need to kick off and monitor jobs running on remote machines.

We'll cover the most common use case here, which boils down to:

- Using oumi launch up to create a cluster and run a job.
- Using oumi launch status to check the status of your job and cluster.
- Canceling jobs using oumi launch cancel.
- Turning down a cluster manually using oumi launch down.
For a quick overview of all oumi launch
commands, see our CLI Launch Reference
Launching Jobs#
To launch a job on your desired cloud, run:
oumi launch up --cluster my-cluster -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml
This command will create the cluster if it doesn’t exist, and then execute the job on it. It can also run the job on an existing cluster with that name.
To launch on the cloud of your choice, use the --resources.cloud flag, e.g. --resources.cloud lambda. Most of our configs run on GCP by default. See cloud
for all supported clouds, or run:
oumi launch which
To return immediately when the job is scheduled and not poll for the job’s completion, specify the --detach
flag:
oumi launch up --cluster my-cluster -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml --detach
To find out more about the GPUs available on your cloud provider, you can use SkyPilot:
sky show-gpus
To launch a job on your desired cloud, run:
from pathlib import Path

import oumi.launcher as launcher

# Read our JobConfig from the YAML file.
job_config = launcher.JobConfig.from_yaml(str(Path("configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml")))

# Start the job.
launcher.up(job_config, "your_cluster_name")
This command will create the cluster if it doesn’t exist, and then execute the job on it. It can also run the job on an existing cluster with that name.
To launch on the cloud of your choice, simply set job_config.resources.cloud, e.g. job_config.resources.cloud = "gcp". Most of our configs run on GCP by default. See cloud
for all supported clouds, or run:
import oumi.launcher as launcher
# Print all available clouds
print(launcher.which_clouds())
To find out more about the GPUs available on your cloud provider, you can use SkyPilot:
sky show-gpus
Code Development#
You can use the Oumi job launcher as part of your development workflow when your code changes need to be tested outside your local machine. First, follow the Dev Environment Setup guide to install Oumi from source. Then, make sure your job config uses pip install -e . instead of pip install oumi in its setup section. Because your working_dir is copied to the cluster, the editable install picks up your local code changes and applies them on the remote machine.
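As a concrete illustration, here's a minimal sketch of swapping the setup script from Python before launching. The '[gpu]' extra and the cluster name are assumptions borrowed from the sample configs above; adjust them to your job.

import oumi.launcher as launcher

# Load your job config and swap the setup script for an editable install.
job_config = launcher.JobConfig.from_yaml("./your_job.yaml")
job_config.setup = (
    "set -e\n"
    # Install the copied working_dir in editable mode so your local changes apply.
    "pip install uv && uv pip install -e '.[gpu]'\n"
)
launcher.up(job_config, "my-dev-cluster")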
Spot instances#
On some cloud providers, you can use spot/preemptible instances instead of on-demand instances. These instances often have more quota available and are much cheaper (e.g. ~3x cheaper on GCP). However, they may be shut down at any time, losing their disk. To mitigate this, follow the next section to mount cloud storage that persists your job's output.
To use spot instances, set use_spot to True in the JobResources of your JobConfig.
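In Python, this is a one-line change on a loaded config (a sketch following the mutation pattern shown earlier; the config path is a placeholder):

import oumi.launcher as launcher

job_config = launcher.JobConfig.from_yaml("./your_job.yaml")
# Request a spot/preemptible VM. Remember to mount cloud storage (next section)
# so your output survives preemption.
job_config.resources.use_spot = True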
Mount Cloud Storage#
You can mount cloud storage containers like GCS or S3 to your job, which maps their remote paths to a directory on your job’s disk. This is a fantastic way to write important information (such as data or model checkpoints) to a persistent disk that outlives your cluster’s lifetime.
Tip
Writing your job’s output to cloud storage is recommended for preemptible cloud instances, or jobs outputting a large amount of data like large model checkpoints. Data on local disk will be lost on job preemption, and your job’s local disk may not have enough storage for multiple large model checkpoints.
To resume training from your last saved checkpoint after your instance is preempted, set training.try_resume_from_last_checkpoint
to True in your TrainingConfig
.
For example, to mount your GCS bucket gs://my-bucket
, add the following to your JobConfig
:
storage_mounts:
  /gcs_dir:
    source: gs://my-bucket
    store: gcs
You can now access files in your bucket as if they're on your local disk's file system! For example, gs://my-bucket/path/to/file can be accessed in your jobs as /gcs_dir/path/to/file.
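With the bucket mounted, you can point your training output at it and enable checkpoint resumption. The sketch below assumes TrainingConfig is importable from oumi.core.configs and that training.output_dir controls where checkpoints are written; the /gcs_dir/runs/... path is just an example. The same two settings can equivalently be set directly in your training YAML.

from oumi.core.configs import TrainingConfig

# Load your training config and redirect output to the mounted bucket.
config = TrainingConfig.from_yaml("configs/recipes/smollm/sft/135m/quickstart_train.yaml")
config.training.output_dir = "/gcs_dir/runs/smollm135m"  # example path inside the mount
# Pick up from the last checkpoint in output_dir after a preemption.
config.training.try_resume_from_last_checkpoint = True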
Tip
To improve I/O speeds, prefer using a bucket in the same cloud region as your job!
Check Cluster and Job Status#
To quickly check the status of all jobs and clusters, run:
oumi launch status
This will return a list of all jobs and clusters you’ve created across all registered cloud providers.
To further filter this list, you can optionally specify a cloud provider, cluster name, and/or job id. The results will be filtered to only jobs / clusters meeting the specified criteria. For example, the following command will return a list of jobs from all cloud providers running on a cluster named my-cluster
with a job id of my-job-id
:
oumi launch status --cluster my-cluster --id my-job-id
To quickly check the status of all jobs and clusters, run:
import oumi.launcher as launcher
status_list = launcher.status()
print(status_list)
This will return a list of all jobs and clusters you’ve created across all registered cloud providers.
To further filter this list, you can optionally specify a cloud provider, cluster name, and/or job id. The results will be filtered to only jobs / clusters meeting the specified criteria. For example, the following command will return a list of jobs from all cloud providers running on a cluster named my-cluster
with a job id of my-job-id
:
import oumi.launcher as launcher
status_list = launcher.status(cluster="my-cluster", id="my-job-id")
print(status_list)
View Logs#
Often you’ll want to view logs of running or terminated jobs. To view the logs of your jobs on clouds supported by SkyPilot, run:
sky logs my-cluster
Cancel Jobs#
To cancel a running job without stopping the cluster, run:
oumi launch cancel --cluster my-cluster --cloud gcp --id my-job-id
The id of the job can be obtained by running oumi launch status
.
To cancel a running job without stopping the cluster, run:
import oumi.launcher as launcher
launcher.cancel(job_id="my-job-id", cloud_name="gcp", cluster_name="my-cluster")
The id of the job can be obtained by using launcher.status()
as in the previous
section.
Stop/Turn Down Clusters#
To stop the cluster when you are done to avoid extra charges, run:
oumi launch stop --cluster my-cluster
In addition, the Oumi launcher automatically sets idle_minutes_to_autostop
to 60, i.e. clusters will stop automatically after 60 minutes of no jobs running.
Stopped clusters preserve their disk, and are quicker to initialize than turning up a brand new cluster. Stopped clusters can be automatically restarted by specifying them in an oumi launch up
command.
To turn down a cluster, which deletes its associated disk and removes it from the list of existing clusters, run:
oumi launch down --cluster my-cluster
To stop the cluster when you are done to avoid extra charges, run:
import oumi.launcher as launcher
launcher.stop(cloud_name="gcp", cluster_name="my-cluster")
In addition, Oumi automatically sets idle_minutes_to_autostop
to 60, i.e. clusters will stop automatically after 60 minutes of no jobs running.
Stopped clusters preserve their disk, and are quicker to initialize than turning up a brand new cluster. Stopped clusters can be automatically restarted by specifying them in a launcher.up(...) call.
To turn down a cluster, which deletes its associated disk and removes it from the list of existing clusters, run:
import oumi.launcher as launcher
launcher.down(cloud_name="gcp", cluster_name="my-cluster")
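Putting the pieces together, a typical launcher session from Python looks roughly like the sketch below. The cluster name, job id, and config path are placeholders; every call used here appears in the sections above.

import oumi.launcher as launcher

# Define and launch the job (this creates the cluster if it doesn't exist).
job_config = launcher.JobConfig.from_yaml("configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml")
launcher.up(job_config, "my-cluster")

# Check on your jobs and clusters.
print(launcher.status(cluster="my-cluster"))

# Cancel the job if needed (the job id comes from the status output).
# launcher.cancel(job_id="my-job-id", cloud_name="gcp", cluster_name="my-cluster")

# Turn down the cluster when you're completely done with it.
launcher.down(cloud_name="gcp", cluster_name="my-cluster")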