oumi.launcher.clusters#

Submodules#

oumi.launcher.clusters.local_cluster module#

class oumi.launcher.clusters.local_cluster.LocalCluster(name: str, client: LocalClient)[source]#

Bases: BaseCluster

A cluster implementation for running jobs locally.

__eq__(other: Any) → bool[source]#

Checks if two LocalClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#

Cancels the specified job on this cluster.

down() → None[source]#

Cancels all jobs, running or queued.

get_job(job_id: str) → JobStatus | None[source]#

Gets the specified job on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#

Lists the jobs on this cluster.

name() → str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

Parameters:

job – The job to run.

Returns:

The job status.

stop() → None[source]#

Cancels all jobs, running or queued.
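Every cluster class on this page documents `name()` and a `get_job()` that returns None when the job does not exist. The sketch below shows one way calling code might rely on that contract to locate a job across several clusters. `ClusterLike`, `find_job`, and `FakeCluster` are hypothetical names introduced for illustration; they are not part of oumi, and the fake cluster stores plain strings rather than real `JobStatus` objects.

```python
from typing import Any, Iterable, Optional, Protocol


class ClusterLike(Protocol):
    """Structural type covering two methods documented on this page."""

    def name(self) -> str: ...

    def get_job(self, job_id: str) -> Optional[Any]: ...


def find_job(
    clusters: Iterable[ClusterLike], job_id: str
) -> Optional[tuple[str, Any]]:
    """Returns (cluster name, job status) for the first cluster that has job_id."""
    for cluster in clusters:
        status = cluster.get_job(job_id)
        # get_job is documented to return None when the job does not exist.
        if status is not None:
            return cluster.name(), status
    return None


class FakeCluster:
    """Hypothetical in-memory stand-in, used only to exercise find_job."""

    def __init__(self, name: str, jobs: dict[str, str]):
        self._name = name
        self._jobs = jobs

    def name(self) -> str:
        return self._name

    def get_job(self, job_id: str) -> Optional[str]:
        return self._jobs.get(job_id)
```

Because the helper depends only on the two documented methods, the same loop should work unchanged with any of the cluster classes described here.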

oumi.launcher.clusters.polaris_cluster module#

class oumi.launcher.clusters.polaris_cluster.PolarisCluster(name: str, client: PolarisClient)[source]#

Bases: BaseCluster

A cluster implementation backed by Polaris.

__eq__(other: Any) → bool[source]#

Checks if two PolarisClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#

Cancels the specified job on this cluster.

down() → None[source]#

This is a no-op for Polaris clusters.

get_job(job_id: str) → JobStatus | None[source]#

Gets the specified job on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#

Lists the jobs on this cluster.

name() → str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

For Polaris this method consists of 5 parts:

  1. Copy the working directory to /home/$USER/oumi_launcher/$JOB_NAME.

  2. Check if there is a conda installation at /home/$USER/miniconda3/envs/oumi. If not, install it.

  3. Copy all file mounts.

  4. Create a job script with all env vars, setup, and run commands.

  5. CD into the working directory and submit the job.

Parameters:

job – The job to run.

Returns:

The job status.

Return type:

JobStatus
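Step 4 above (creating a job script from env vars, setup, and run commands) can be illustrated with a minimal sketch. This is not the oumi implementation: `build_job_script` and its parameters `envs`, `setup`, and `run` are hypothetical names, and scheduler directives that a real submission script would likely carry are left out.

```python
import shlex
from typing import Mapping, Optional


def build_job_script(envs: Mapping[str, str], setup: Optional[str], run: str) -> str:
    """Assembles a bash script: env var exports, then setup, then run commands."""
    lines = ["#!/bin/bash", ""]
    for key, value in envs.items():
        # shlex.quote guards values containing spaces or shell metacharacters.
        lines.append(f"export {key}={shlex.quote(value)}")
    if setup:
        lines += ["", setup.strip()]
    lines += ["", run.strip(), ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_job_script({"RUN_NAME": "my run"}, "pip install -e .", "python train.py"))
```

Emitting the exports first means both the setup and run commands see the same environment, which matches the ordering listed in step 4.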

stop() → None[source]#

This is a no-op for Polaris clusters.

oumi.launcher.clusters.sky_cluster module#

class oumi.launcher.clusters.sky_cluster.SkyCluster(name: str, client: SkyClient)[source]#

Bases: BaseCluster

A cluster implementation backed by SkyPilot.

__eq__(other: Any) → bool[source]#

Checks if two SkyClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#

Cancels the specified job on this cluster.

down() → None[source]#

Tears down the current cluster.

get_job(job_id: str) → JobStatus | None[source]#

Gets the specified job on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#

Lists the jobs on this cluster.

name() → str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

stop() → None[source]#

Stops the current cluster.

oumi.launcher.clusters.slurm_cluster module#

class oumi.launcher.clusters.slurm_cluster.SlurmCluster(name: str, client: SlurmClient)[source]#

Bases: BaseCluster

A cluster implementation backed by a Slurm scheduler.

class ConnectionInfo(hostname: str, user: str)[source]#

Bases: object

Dataclass to hold information about a connection.

hostname: str#

property name#

Gets the name of the connection in the form user@hostname.

user: str#

__eq__(other: Any) → bool[source]#

Checks if two SlurmClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#

Cancels the specified job on this cluster.

down() → None[source]#

This is a no-op for Slurm clusters.

get_job(job_id: str) → JobStatus | None[source]#

Gets the specified job on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#

Lists the jobs on this cluster.

static get_slurm_connections() → list[ConnectionInfo][source]#

Gets Slurm connections from the OUMI_SLURM_CONNECTIONS env variable.

name() → str[source]#

Gets the name of the cluster.

static parse_cluster_name(name: str) → ConnectionInfo[source]#

Parses the cluster name into hostname and user components.

Parameters:

name – The name of the cluster.

Returns:

The parsed cluster information.

Return type:

ConnectionInfo
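To make the relationship between the connection dataclass, `parse_cluster_name`, and `get_slurm_connections` concrete, here is a minimal standalone sketch. It rests on two assumptions not stated on this page: that a cluster name uses the same user@hostname form as the `name` property, and that OUMI_SLURM_CONNECTIONS holds a comma-separated list of such names. The functions below are illustrative stand-ins, not the library's code.

```python
import os
from dataclasses import dataclass


@dataclass
class ConnectionInfo:
    """Stand-in mirroring the documented fields of SlurmCluster.ConnectionInfo."""

    hostname: str
    user: str

    @property
    def name(self) -> str:
        # Documented form: user@hostname.
        return f"{self.user}@{self.hostname}"


def parse_cluster_name(name: str) -> ConnectionInfo:
    """Splits 'user@hostname' into its two parts (assumed input format)."""
    user, sep, hostname = name.partition("@")
    if not sep or not user or not hostname:
        raise ValueError(f"Expected 'user@hostname', got: {name!r}")
    return ConnectionInfo(hostname=hostname, user=user)


def get_slurm_connections() -> list[ConnectionInfo]:
    """Reads OUMI_SLURM_CONNECTIONS, assumed here to be comma-separated."""
    raw = os.environ.get("OUMI_SLURM_CONNECTIONS", "")
    return [parse_cluster_name(item.strip()) for item in raw.split(",") if item.strip()]
```

Under these assumptions, parsing a name and reading its `name` property round-trips to the original string, which is a convenient invariant to test against.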

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

For Slurm this method consists of 5 parts:

  1. Copy the working directory to ~/oumi_launcher/$JOB_NAME.

  2. Check if there is a conda installation at /home/$USER/miniconda3/envs/oumi. If not, install it.

  3. Copy all file mounts.

  4. Create a job script with all env vars, setup, and run commands.

  5. CD into the working directory and submit the job.

Parameters:

job – The job to run.

Returns:

The job status.

Return type:

JobStatus

stop() → None[source]#

This is a no-op for Slurm clusters.