Running Jobs on Clusters#

In addition to running Oumi locally, you can use the oumi launch command in the Oumi CLI to run jobs on remote clusters. It provides a unified interface for running your code, allowing you to seamlessly switch between popular cloud providers and your own custom clusters!

Overview#

The Oumi Launcher operates using three key concepts:

  1. Jobs: A job is a unit of work, such as running training or model evaluation. This can be any script you’d like!

  2. Clusters: A cluster is a set of dedicated hardware upon which jobs are run. A cluster could be as simple as a cloud VM environment.

  3. Clouds: A cloud is a resource provider that manages clusters. Examples include GCP, AWS, Lambda, and RunPod.

When you submit a job to the launcher, it queues the job on the appropriate cluster. If your desired cloud doesn't have a suitable cluster for running your job, the launcher will try to create one on the fly!
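These concepts map directly onto a job config (covered in detail below): the job and its cloud are named in YAML, while the cluster is picked, or created, at launch time. A minimal sketch:

```yaml
name: my-job        # the job: a unit of work
resources:
  cloud: gcp        # the cloud: the resource provider that manages clusters
run: |
  echo "Hello from the cluster!"
# The cluster is chosen when you launch:
#   oumi launch up -c ./my_job.yaml --cluster my-cluster
```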

Setup#

The Oumi launcher integrates with SkyPilot to launch jobs on various cloud providers. To run on a cloud GPU cluster, first make sure to have all the dependencies installed for your desired cloud provider:

pip install 'oumi[aws]'     # For Amazon Web Services
pip install 'oumi[azure]'   # For Microsoft Azure
pip install 'oumi[gcp]'     # For Google Cloud Platform
pip install 'oumi[lambda]'  # For Lambda Cloud
pip install 'oumi[runpod]'  # For RunPod

Then, enable your desired cloud provider in SkyPilot. Run sky check to see which providers you have enabled, along with instructions for enabling the ones you don't. More detailed setup instructions can be found in SkyPilot's documentation.

Quickstart#

Got a TrainingConfig you want to run on the cloud? Just replace the run section of one of the configs below with your training command and kick off the job via our CLI:

oumi launch up -c ./your_job.yaml
sample-gcp-job.yaml
name: sample-gcp-job

resources:
  cloud: gcp
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.gcp.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e  # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-aws-job.yaml
name: sample-aws-job

resources:
  cloud: aws
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.aws.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e  # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-azure-job.yaml
name: sample-azure-job

resources:
  cloud: azure
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.azure.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e  # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-runpod-job.yaml
name: sample-runpod-job

resources:
  cloud: runpod
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.runpod.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e  # Exit if any command failed.
  oumi train -c ./path/to/your/config
sample-lambda-job.yaml
name: sample-lambda-job

resources:
  cloud: lambda
  accelerators: "A100"
  # If you don't have quota for a non-spot VM, try setting use_spot to true.
  # However, make sure you are saving your output to a mounted cloud storage in case of
  # preemption. For more information, see:
  # https://oumi.ai/docs/en/latest/user_guides/launch/launch.html#mount-cloud-storage
  use_spot: false
  disk_size: 500 # Disk size in GBs

num_nodes: 1 # Set it to a larger number for multi-node training.

working_dir: .

# NOTE: Uncomment the following lines to download locked-down models from HF Hub.
# file_mounts:
#   ~/.cache/huggingface/token: ~/.cache/huggingface/token

# NOTE: Uncomment the following lines to mount a cloud bucket to your VM.
# For more details, see https://oumi.ai/docs/en/latest/user_guides/launch/launch.html.
# storage_mounts:
#   /gcs_dir:
#     source: gs://<your-bucket>
#     store: gcs
#   /s3_dir:
#     source: s3://<your-bucket>
#     store: s3
#   /r2_dir:
#     source: r2://<your-bucket>
#     store: r2

envs:
  OUMI_RUN_NAME: sample.lambda.job

setup: |
  set -e
  pip install uv && uv pip install 'oumi[gpu]'

# NOTE: Update this section with your training command.
run: |
  set -e  # Exit if any command failed.
  oumi train -c ./path/to/your/config

Note

Don’t forget:

  • Make sure your training config is saved under working_dir so it's copied to the cluster with your job

  • Update the setup section if you need to install any custom dependencies

  • Update accelerators if you need to run on a specific set of GPUs (e.g. “A100-80GB:4” creates a job with 4x A100-80GBs)
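For example, the accelerator spec from the note above looks like this in a job's resources section:

```yaml
resources:
  cloud: gcp
  accelerators: "A100-80GB:4"  # runs on 4x A100-80GB GPUs
```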

Defining a Job#

Like most configurable pieces of Oumi, Jobs are defined via YAML configs. In this case, every job is defined by a JobConfig.

When creating a job, there are several important fields you should be aware of:

  • resources: where you specify resource requirements (cloud to use, GPUs, disk size, etc) via oumi.launcher.JobResources

  • setup: an optional script that is run when a cluster is created

  • run: the main script to run for your job

  • working_dir: the local directory to be copied to the cluster for use during execution

A sample job is provided below:

configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml
# Class: oumi.core.configs.JobConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/job_config.py

# Config to tune smollm 135M on 1 GCP node.
# Example command:
# oumi launch up -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml --cluster smollm-135m-fft
name: smollm-135m-sft

resources:
  cloud: gcp
  accelerators: "A100:1"
  use_spot: false
  disk_size: 100 # Disk size in GBs

working_dir: .

envs:
  OUMI_RUN_NAME: smollm135m.train
  # https://github.com/huggingface/tokenizers/issues/899#issuecomment-1027739758
  TOKENIZERS_PARALLELISM: false

setup: |
  set -e
  pip install uv && uv pip install oumi[gpu]

run: |
  set -e  # Exit if any command failed.
  source ./configs/examples/misc/sky_init.sh

  set -x
  oumi train -c configs/recipes/smollm/sft/135m/quickstart_train.yaml

  echo "Training complete!"

Core Functionality#

Launching jobs remotely is available via both the Oumi CLI and our Python API (launcher).

oumi launch provides all the capabilities you need to kick off and monitor jobs running on remote machines.

We’ll cover the most common use case here, which boils down to:

  1. Using oumi launch up to create a cluster and run a job.

  2. Using oumi launch status to check the status of your job and cluster.

  3. Using oumi launch cancel to cancel jobs.

  4. Using oumi launch down to turn down a cluster manually.

For a quick overview of all oumi launch commands, see our CLI Launch Reference

Launching Jobs#

To launch a job on your desired cloud, run:

oumi launch up --cluster my-cluster -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml

This command will create the cluster if it doesn't exist, then execute the job on it. If a cluster with that name already exists, the job runs on the existing cluster.

To launch on the cloud of your choice, use the --resources.cloud flag, e.g. --resources.cloud lambda. Most of our configs run on GCP by default. See cloud for all supported clouds, or run:

oumi launch which

To return immediately when the job is scheduled and not poll for the job’s completion, specify the --detach flag:

oumi launch up --cluster my-cluster -c configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml --detach

To find out more about the GPUs available on your cloud provider, you can use SkyPilot:

sky show-gpus

To launch a job on your desired cloud, run:

import oumi.launcher as launcher

# Read our JobConfig from the YAML file
job_config = launcher.JobConfig.from_yaml("configs/recipes/smollm/sft/135m/quickstart_gcp_job.yaml")
# Start the job
launcher.up(job_config, "your_cluster_name")

This command will create the cluster if it doesn't exist, then execute the job on it. If a cluster with that name already exists, the job runs on the existing cluster.

To launch on the cloud of your choice, simply set job_config.resources.cloud, e.g. job_config.resources.cloud = "gcp". Most of our configs run on GCP by default. See cloud for all supported clouds, or run:

import oumi.launcher as launcher

# Print all available clouds
print(launcher.which_clouds())

To find out more about the GPUs available on your cloud provider, you can use SkyPilot:

sky show-gpus

Code Development#

You can use the Oumi job launcher as part of your development workflow when your code changes need to be tested beyond your local machine. First, follow the Dev Environment Setup guide to install Oumi from source. Then, in the setup section of your job config, use pip install -e . instead of pip install oumi. Since working_dir copies your local code to the remote machine, the editable install ensures the job picks up your local changes.

Spot instances#

On some cloud providers, you can use spot/preemptible instances instead of on-demand instances. These instances often have more quota available and are much cheaper (e.g. ~3x cheaper on GCP). However, they may be shut down at any time, and their disk contents are lost. To mitigate this, follow the next section to mount cloud storage that persists your job's output.

To use spot instances, set use_spot to True in the JobResources of your JobConfig.
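In the YAML resources section, this looks like:

```yaml
resources:
  cloud: gcp
  accelerators: "A100"
  use_spot: true  # preemptible, but cheaper and often easier to get quota for
```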

Mount Cloud Storage#

You can mount cloud storage containers like GCS or S3 to your job, which maps their remote paths to a directory on your job’s disk. This is a fantastic way to write important information (such as data or model checkpoints) to a persistent disk that outlives your cluster’s lifetime.

Tip

Writing your job’s output to cloud storage is recommended for preemptible cloud instances, or jobs outputting a large amount of data like large model checkpoints. Data on local disk will be lost on job preemption, and your job’s local disk may not have enough storage for multiple large model checkpoints.

To resume training from your last saved checkpoint after your instance is preempted, set training.try_resume_from_last_checkpoint to True in your TrainingConfig.
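Putting the two together, a preemption-tolerant TrainingConfig fragment might look like this (paths are illustrative, and assume a /gcs_dir storage mount like the one shown below):

```yaml
training:
  # Write checkpoints to the mounted bucket so they survive preemption.
  output_dir: /gcs_dir/runs/my-run
  # Resume from the latest checkpoint after a restart.
  try_resume_from_last_checkpoint: true
```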

For example, to mount your GCS bucket gs://my-bucket, add the following to your JobConfig:

storage_mounts:
  /gcs_dir:
    source: gs://my-bucket
    store: gcs

You can now access files in your bucket as if they’re on your local disk’s file system! For example, gs://my-bucket/path/to/file can be accessed in your jobs with /gcs_dir/path/to/file.
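The translation is purely mechanical: the bucket prefix is swapped for the mount point. A tiny illustrative helper showing the mapping (this is not part of the Oumi API; `bucket_to_mounted_path` is a hypothetical name):

```python
def bucket_to_mounted_path(uri: str, mounts: dict) -> str:
    """Translate a bucket URI into its path under a storage mount.

    `mounts` maps mount points to bucket sources, mirroring the
    storage_mounts section of a JobConfig,
    e.g. {"/gcs_dir": "gs://my-bucket"}.
    """
    for mount_point, source in mounts.items():
        prefix = source.rstrip("/")
        if uri == prefix or uri.startswith(prefix + "/"):
            return mount_point + uri[len(prefix):]
    raise ValueError(f"no storage mount covers {uri!r}")


mounts = {"/gcs_dir": "gs://my-bucket"}
print(bucket_to_mounted_path("gs://my-bucket/path/to/file", mounts))
# → /gcs_dir/path/to/file
```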

Tip

To improve I/O speeds, prefer using a bucket in the same cloud region as your job!

Check Cluster and Job Status#

To quickly check the status of all jobs and clusters, run:

oumi launch status

This will return a list of all jobs and clusters you’ve created across all registered cloud providers.

To further filter this list, you can optionally specify a cloud provider, cluster name, and/or job id. The results will be filtered to only the jobs and clusters meeting the specified criteria. For example, the following command will return a list of jobs from all cloud providers running on a cluster named my-cluster with a job id of my-job-id:

oumi launch status --cluster my-cluster --id my-job-id

To quickly check the status of all jobs and clusters, run:

import oumi.launcher as launcher

status_list = launcher.status()

print(status_list)

This will return a list of all jobs and clusters you’ve created across all registered cloud providers.

To further filter this list, you can optionally specify a cloud provider, cluster name, and/or job id. The results will be filtered to only the jobs and clusters meeting the specified criteria. For example, the following code will return a list of jobs from all cloud providers running on a cluster named my-cluster with a job id of my-job-id:

import oumi.launcher as launcher

status_list = launcher.status(cluster="my-cluster", id="my-job-id")

print(status_list)

View Logs#

Often you’ll want to view logs of running or terminated jobs. To view the logs of your jobs on clouds supported by SkyPilot, run:

sky logs my-cluster

Cancel Jobs#

To cancel a running job without stopping the cluster, run:

oumi launch cancel --cluster my-cluster --cloud gcp --id my-job-id

The id of the job can be obtained by running oumi launch status.

To cancel a running job without stopping the cluster, run:

import oumi.launcher as launcher

launcher.cancel(job_id="my-job-id", cloud_name="gcp", cluster_name="my-cluster")

The id of the job can be obtained by using launcher.status() as in the previous section.

Stop/Turn Down Clusters#

To stop the cluster when you are done to avoid extra charges, run:

oumi launch stop --cluster my-cluster

In addition, the Oumi launcher automatically sets idle_minutes_to_autostop to 60, i.e., clusters stop automatically after 60 minutes with no jobs running.

Stopped clusters preserve their disk, and are quicker to initialize than turning up a brand new cluster. Stopped clusters can be automatically restarted by specifying them in an oumi launch up command.

To turn down a cluster, which deletes its associated disk and removes it from the list of existing clusters, run:

oumi launch down --cluster my-cluster

To stop the cluster when you are done to avoid extra charges, run:

import oumi.launcher as launcher

launcher.stop(cloud_name="gcp", cluster_name="my-cluster")

In addition, Oumi automatically sets idle_minutes_to_autostop to 60, i.e., clusters stop automatically after 60 minutes with no jobs running.

Stopped clusters preserve their disk, and are quicker to initialize than turning up a brand new cluster. Stopped clusters can be automatically restarted by specifying them in a launcher.up(...) command.

To turn down a cluster, which deletes its associated disk and removes it from the list of existing clusters, run:

import oumi.launcher as launcher

launcher.down(cloud_name="gcp", cluster_name="my-cluster")