oumi.launcher.clusters

oumi.launcher.clusters#

Submodules#

oumi.launcher.clusters.frontier_cluster module#

class oumi.launcher.clusters.frontier_cluster.FrontierCluster(name: str, client: SlurmClient)[source]#

Bases: BaseCluster

A cluster implementation backed by OLCF Frontier.

class SupportedQueues(value)[source]#

Bases: Enum

Enum representing the supported partitions (queues) on Frontier.

For more details, see: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#batch-partition-queue-policy

BATCH = 'batch'#

EXTENDED = 'extended'#

__eq__(other: Any) → bool[source]#: Checks if two FrontierClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#: Cancels the specified job on this cluster.

down() → None[source]#: This is a no-op for Frontier clusters.

get_job(job_id: str) → JobStatus | None[source]#: Gets the jobs on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#: Lists the jobs on this cluster.

name() → str[source]#: Gets the name of the cluster.

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

For Frontier this method consists of 5 parts:

Copy the working directory to /lustre/orion/lrn081/scratch/$USER/oumi_launcher/$JOB_NAME.
Check if there is a conda installation at /lustre/orion/lrn081/scratch/$USER/miniconda3/envs/oumi. If not, install it.
Copy all file mounts.
Create a job script with all env vars, setup, and run commands.
CD into the working directory and submit the job.

Parameters:: job – The job to run.
Returns:: The job status.
Return type:: JobStatus

stop() → None[source]#: This is a no-op for Frontier clusters.

oumi.launcher.clusters.local_cluster module#

class oumi.launcher.clusters.local_cluster.LocalCluster(name: str, client: LocalClient)[source]#

Bases: BaseCluster

A cluster implementation for running jobs locally.

__eq__(other: Any) → bool[source]#: Checks if two LocalClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#: Cancels the specified job on this cluster.

down() → None[source]#: Cancels all jobs, running or queued.

get_job(job_id: str) → JobStatus | None[source]#: Gets the jobs on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#: Lists the jobs on this cluster.

name() → str[source]#: Gets the name of the cluster.

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

Parameters:: job – The job to run.
Returns:: The job status.

stop() → None[source]#: Cancels all jobs, running or queued.

oumi.launcher.clusters.polaris_cluster module#

class oumi.launcher.clusters.polaris_cluster.PolarisCluster(name: str, client: PolarisClient)[source]#

Bases: BaseCluster

A cluster implementation backed by Polaris.

__eq__(other: Any) → bool[source]#: Checks if two PolarisClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#: Cancels the specified job on this cluster.

down() → None[source]#: This is a no-op for Polaris clusters.

get_job(job_id: str) → JobStatus | None[source]#: Gets the jobs on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#: Lists the jobs on this cluster.

name() → str[source]#: Gets the name of the cluster.

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

For Polaris this method consists of 5 parts:

Copy the working directory to /home/$USER/oumi_launcher/<submission_time>.
Check if there is a conda installation at /home/$USER/miniconda3/envs/oumi. If not, install it.
Copy all file mounts.
Create a job script with all env vars, setup, and run commands.
CD into the working directory and submit the job.

Parameters:: job – The job to run.
Returns:: The job status.
Return type:: JobStatus

stop() → None[source]#: This is a no-op for Polaris clusters.

oumi.launcher.clusters.sky_cluster module#

class oumi.launcher.clusters.sky_cluster.SkyCluster(name: str, client: SkyClient)[source]#

Bases: BaseCluster

A cluster implementation backed by Sky Pilot.

__eq__(other: Any) → bool[source]#: Checks if two SkyClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#: Cancels the specified job on this cluster.

down() → None[source]#: Tears down the current cluster.

get_job(job_id: str) → JobStatus | None[source]#: Gets the jobs on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#: Lists the jobs on this cluster.

name() → str[source]#: Gets the name of the cluster.

run_job(job: JobConfig) → JobStatus[source]#: Runs the specified job on this cluster.

stop() → None[source]#: Stops the current cluster.

oumi.launcher.clusters.slurm_cluster module#

class oumi.launcher.clusters.slurm_cluster.SlurmCluster(name: str, client: SlurmClient)[source]#

Bases: BaseCluster

A cluster implementation backed by a Slurm scheduler.

class ConnectionInfo(hostname: str, user: str)[source]#

Bases: object

Dataclass to hold information about a connection.

hostname: str#

property name#: Gets the name of the connection in the form user@hostname.

user: str#

__eq__(other: Any) → bool[source]#: Checks if two SlurmClusters are equal.

cancel_job(job_id: str) → JobStatus[source]#: Cancels the specified job on this cluster.

down() → None[source]#: This is a no-op for Slurm clusters.

get_job(job_id: str) → JobStatus | None[source]#: Gets the jobs on this cluster if it exists, else returns None.

get_jobs() → list[JobStatus][source]#: Lists the jobs on this cluster.

static get_slurm_connections() → list[ConnectionInfo][source]#: Gets Slurm connections from the OUMI_SLURM_CONNECTIONS env variable.

name() → str[source]#: Gets the name of the cluster.

static parse_cluster_name(name: str) → ConnectionInfo[source]#

Parses the cluster name into queue and user components.

Parameters:: name – The name of the cluster.
Returns:: The parsed cluster information.
Return type:: _ConnectionInfo

run_job(job: JobConfig) → JobStatus[source]#

Runs the specified job on this cluster.

For Slurm this method consists of 4 parts:

Copy the working directory to ~/oumi_launcher/<submission_time>.
Copy all file mounts.
Create a job script with all env vars, setup, and run commands.
CD into the working directory and submit the job.

Parameters:: job – The job to run.
Returns:: The job status.
Return type:: JobStatus

stop() → None[source]#: This is a no-op for Slurm clusters.

oumi.launcher.clusters

Contents

oumi.launcher.clusters#

Submodules#

oumi.launcher.clusters.frontier_cluster module#

oumi.launcher.clusters.local_cluster module#

oumi.launcher.clusters.polaris_cluster module#

oumi.launcher.clusters.sky_cluster module#

oumi.launcher.clusters.slurm_cluster module#