Deploying a Job

In this tutorial we’ll take a working JobConfig and deploy it remotely on a cluster of your choice.

This guide dovetails with our Finetuning Tutorial, where you create your own TrainingConfig and run it locally. Give it a try if you haven’t already!

Launching Your Job

Note

Try using our sample hello-world job for this tutorial:

configs/examples/misc/hello_world_gcp_job.yaml
# Class: oumi.core.configs.JobConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/job_config.py

name: hello-world
resources:
  cloud: gcp
  accelerators: "A100:1"

# Upload working directory to remote.
working_dir: .

envs:
  TEST_ENV_VARIABLE: '"Hello, World!"'

# For GCP, setup is only run once at cluster creation.
setup: |
  echo "Running setup..."

run: |
  set -e  # Exit if any command failed.

  echo "$TEST_ENV_VARIABLE"

Let’s get started with launching a job! Don’t worry about the nitty-gritty; we’ll address configuring your job in the following sections.

You can easily kick off a job directly from the CLI:

oumi launch up --cluster my-cluster -c configs/examples/misc/hello_world_gcp_job.yaml

At any point you can easily change the cloud where your job will run by modifying the job’s resources.cloud parameter:

oumi launch up --cluster my-cluster -c configs/examples/misc/hello_world_gcp_job.yaml --resources.cloud local
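The dotted flag in the command above targets a nested field of the job config (the cloud entry under resources). As a rough illustration of that idea only, and not how Oumi itself parses its flags, the sketch below applies a dotted key to a plain dictionary. The helper name apply_override and the dictionary layout are assumptions made for this example.

```python
from copy import deepcopy
from typing import Any


def apply_override(config: dict[str, Any], dotted_key: str, value: Any) -> dict[str, Any]:
    """Return a copy of `config` with the nested field at `dotted_key` set to `value`.

    Hypothetical helper for illustration; it is not part of the Oumi API.
    """
    updated = deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate mappings on demand so new sections can be added.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


# A plain-dict stand-in for the hello-world job config shown earlier.
job = {"name": "hello-world", "resources": {"cloud": "gcp", "accelerators": "A100:1"}}
local_job = apply_override(job, "resources.cloud", "local")
print(local_job["resources"])  # {'cloud': 'local', 'accelerators': 'A100:1'}
```

The original dictionary is left untouched, which mirrors the fact that a command-line override does not modify the YAML file on disk.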

To launch the job from Python instead of the CLI, first load your JobConfig:

from pathlib import Path

import oumi.launcher as launcher

# Read our JobConfig from the YAML file.
working_dir = "YOUR_WORKING_DIRECTORY"  # Specify this value
job_config = launcher.JobConfig.from_yaml(str(Path(working_dir) / "job.yaml"))

As with the CLI, you can change the cloud where your job will run at any point, here by setting the job’s resources.cloud attribute:

# Manually set the cloud to use.
job_config.resources.cloud = "local"

Once you have a job config, kicking off your job is simple:

# You can optionally specify a cluster name here. If not specified, a random name will
# be generated. Reusing a name lets you launch multiple jobs on the same cluster.
cluster_name = None

# Launch the job!
cluster, job_status = launcher.up(job_config, cluster_name)
print(f"Job status: {job_status}")

Don’t worry if you see errors from the launcher: you may need to configure permissions to run a job on your specified cloud. The error message should tell you which command to run to authenticate (for GCP this is often gcloud auth application-default login).

We can quickly check on the status of our job. From the CLI:

oumi launch status

From Python, poll the cluster returned by launcher.up until the job is done:

import time

while job_status and not job_status.done:
    print("Job is running...")
    time.sleep(15)
    job_status = cluster.get_job(job_status.id)

print("Job is done!")
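The polling loop above runs until the job reports that it is done, so a stuck job would keep it waiting indefinitely. Below is a minimal sketch of the same loop with a deadline. The helper wait_until_done and its parameters are hypothetical, not part of the Oumi API; it only assumes that the status object exposes a done attribute, as in the loop above. A fake status sequence stands in for a real cluster so the snippet runs on its own.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Optional


def wait_until_done(
    fetch_status: Callable[[], Optional[Any]],
    poll_seconds: float = 15.0,
    timeout_seconds: float = 3600.0,
) -> Optional[Any]:
    """Poll `fetch_status` until the status is done, missing, or the deadline passes.

    Hypothetical helper: `fetch_status` is any zero-argument callable that returns
    an object with a boolean `done` attribute, or None if the job is unknown.
    """
    deadline = time.monotonic() + timeout_seconds
    status = fetch_status()
    while status is not None and not status.done:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job not done after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = fetch_status()
    return status


# Stand-in for a real job: first poll reports "running", second reports "done".
fake_statuses = iter([SimpleNamespace(done=False), SimpleNamespace(done=True)])
final = wait_until_done(lambda: next(fake_statuses), poll_seconds=0.0)
print(final.done)  # True
```

With a real cluster, the callable would presumably wrap the call from the loop above, for example lambda: cluster.get_job(job_status.id).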

Now that we’re done with the cluster, let’s turn it down to stop billing for non-local clouds.

oumi launch down --cluster my-cluster

Or, from Python:

cluster.down()

Choosing a Cloud

We’ll be using the Oumi Launcher to run remote training. To use the launcher, you need to specify which cloud you’d like to run training on. To list the available clouds from the CLI:

oumi launch which

Or, from Python:

import oumi.launcher as launcher

# Print all available clouds
print(launcher.which_clouds())

Local Cloud

If you don’t have any clouds set up yet, feel free to use the local cloud. This will simply execute your job on your current device as if it were a remote cluster. Hardware requirements are ignored for the local cloud.

Other Providers

Note that to use a cloud you must already have an account registered with that cloud provider.

For example, GCP, RunPod, and Lambda require accounts with billing enabled.

Once you’ve picked a cloud, move on to the next step.

Preparing Your JobConfig

Let’s get started by creating your JobConfig. In the config below, feel free to change cloud: local to the cloud you chose in the previous step.

A sample job is provided below:

job.yaml
name: job-tutorial
resources:
  cloud: local
  # Accelerators is ignored for the local cloud.
  # This is required for other clouds like GCP, AWS, etc.
  accelerators: A100

# Upload working directory to remote.
# If on the local cloud, we CD into the working directory before running the job.
working_dir: .

envs:
  TEST_ENV_VARIABLE: '"Hello, World!"'
  OUMI_LOGGING_DIR: "deploy_tutorial/logs"

# `setup` will always be executed once when a cluster is created
setup: |
  echo "Running setup..."

run: |
  set -e  # Exit if any command failed.

  echo "$TEST_ENV_VARIABLE"
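To build intuition for how the working_dir, envs, setup, and run fields of the sample job fit together, here is a rough stand-alone emulation using only the Python standard library. It is a sketch under stated assumptions and not the launcher's implementation: it assumes a POSIX sh is available, that envs are added to the process environment as-is, and that each script runs from inside the working directory. The run_step helper and the dictionary layout are hypothetical.

```python
import os
import subprocess
import tempfile

# Plain-dict stand-in for the sample job.yaml above. A temporary directory
# replaces `working_dir: .` so the example does not depend on where it is run.
job = {
    "working_dir": tempfile.mkdtemp(),
    "envs": {"TEST_ENV_VARIABLE": '"Hello, World!"'},
    "setup": 'echo "Running setup..."',
    "run": 'set -e  # Exit if any command failed.\necho "$TEST_ENV_VARIABLE"',
}


def run_step(script: str, job: dict) -> str:
    """Run one script from the job inside its working directory and return stdout.

    Hypothetical helper for illustration; not part of the Oumi API.
    """
    env = {**os.environ, **job["envs"]}  # job envs override inherited values
    result = subprocess.run(
        ["sh", "-c", script],
        cwd=job["working_dir"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return result.stdout


setup_output = run_step(job["setup"], job)  # setup runs first
run_output = run_step(job["run"], job)      # then the job's run script
print(setup_output, end="")
print(run_output, end="")
```

A real cluster adds provisioning, file upload, and log collection on top of this, so treat the sketch purely as a mental model of the config fields.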

Deploying a Training Config

In our Finetuning Tutorial, we created and saved a TrainingConfig. We then invoked training by running:

oumi train -c "$tutorial_dir/train.yaml"

You can also run that command as a job! Simply update the “run” section of the JobConfig with your desired command:

export PATH_TO_YOUR_TRAIN_CONFIG="deploy_tutorial/train.yaml" # Make sure this exists!
oumi launch up --cluster my-new-cluster -c deploy_tutorial/job.yaml --run "oumi train -c $PATH_TO_YOUR_TRAIN_CONFIG" --setup "pip install oumi"

Or, from Python, update the job_config you loaded earlier:

from pathlib import Path

working_dir = "YOUR_WORKING_DIRECTORY"  # Specify this value
path_to_your_train_config = Path(working_dir) / "train.yaml"  # Make sure this exists!

# Set the `run` command to run your training script.
job_config.run = f'oumi train -c "{path_to_your_train_config}"'
# Make sure we install oumi
job_config.setup = "pip install oumi"
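Because the run value is a shell script, a config path containing spaces or shell metacharacters can break a hand-written command. One option, sketched below, is to quote the path with shlex.quote from the standard library. The helper name build_train_command is an assumption made for this example, and PurePosixPath is used only so the output looks the same on every platform.

```python
import shlex
from pathlib import PurePosixPath


def build_train_command(train_config: PurePosixPath) -> str:
    """Build an `oumi train` command line with the config path safely shell-quoted.

    Hypothetical helper for illustration; not part of the Oumi API.
    """
    return f"oumi train -c {shlex.quote(str(train_config))}"


# A path without special characters is left unquoted.
print(build_train_command(PurePosixPath("deploy_tutorial") / "train.yaml"))
# oumi train -c deploy_tutorial/train.yaml

# A path with a space is wrapped in single quotes.
print(build_train_command(PurePosixPath("my tutorial") / "train.yaml"))
# oumi train -c 'my tutorial/train.yaml'
```

The resulting string could then be assigned to job_config.run in place of the hand-built f-string.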