oumi.launcher.clusters#

Submodules#

oumi.launcher.clusters.frontier_cluster module#

class oumi.launcher.clusters.frontier_cluster.FrontierCluster(name: str, client: SlurmClient)[source]#

Bases: BaseCluster

A cluster implementation backed by OLCF Frontier.

class SupportedQueues(value)[source]#

Bases: Enum

Enum representing the supported partitions (queues) on Frontier.

For more details, see: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#batch-partition-queue-policy

BATCH = 'batch'#
EXTENDED = 'extended'#
__eq__(other: Any) bool[source]#

Checks if two FrontierClusters are equal.

cancel_job(job_id: str) JobStatus[source]#

Cancels the specified job on this cluster.

down() None[source]#

This is a no-op for Frontier clusters.

get_job(job_id: str) JobStatus | None[source]#

Gets the job on this cluster if it exists, else returns None.

get_jobs() list[JobStatus][source]#

Lists the jobs on this cluster.

get_logs_stream(cluster_name: str, job_id: str | None = None) TextIOBase[source]#

Gets a stream that tails the logs of the target job.

Parameters:
  • cluster_name – The name of the cluster the job was run in.

  • job_id – The ID of the job to tail the logs of.
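The returned stream is a TextIOBase, so it can be read like any text file. A minimal sketch, assuming `cluster` is an existing FrontierCluster and `job_id` was saved when the job was submitted (both placeholder names):

    # Tail the job's logs line by line; this may block while the job
    # is still producing output.
    stream = cluster.get_logs_stream(cluster_name=cluster.name(), job_id=job_id)
    for line in stream:
        print(line, end="")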

name() str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) JobStatus[source]#

Runs the specified job on this cluster.

For Frontier this method consists of 5 parts:

  1. Copy the working directory to /lustre/orion/lrn081/scratch/$USER/oumi_launcher/$JOB_NAME.

  2. Check if there is a conda installation at /lustre/orion/lrn081/scratch/$USER/miniconda3/envs/oumi. If not, install it.

  3. Copy all file mounts.

  4. Create a job script with all env vars, setup, and run commands.

  5. CD into the working directory and submit the job.

Parameters:

job – The job to run.

Returns:

The job status.

Return type:

JobStatus
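A minimal usage sketch, assuming `cluster` is an existing FrontierCluster and that "frontier" is the matching cloud identifier (both assumptions; field values are illustrative):

    from oumi.core.configs import JobConfig, JobResources

    job = JobConfig(
        name="llm-finetune",
        working_dir=".",                         # copied to Lustre scratch (step 1)
        resources=JobResources(cloud="frontier"),
        setup="pip install -e .",                # runs inside the conda env (step 2)
        run="python train.py",
    )
    status = cluster.run_job(job)
    print(status.id, status.status)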

stop() None[source]#

This is a no-op for Frontier clusters.

oumi.launcher.clusters.local_cluster module#

class oumi.launcher.clusters.local_cluster.LocalCluster(name: str, client: LocalClient)[source]#

Bases: BaseCluster

A cluster implementation for running jobs locally.

__eq__(other: Any) bool[source]#

Checks if two LocalClusters are equal.

cancel_job(job_id: str) JobStatus[source]#

Cancels the specified job on this cluster.

down() None[source]#

Cancels all jobs, running or queued.

get_job(job_id: str) JobStatus | None[source]#

Gets the job on this cluster if it exists, else returns None.

get_jobs() list[JobStatus][source]#

Lists the jobs on this cluster.

get_logs_stream(cluster_name: str, job_id: str | None = None) TextIOBase[source]#

Gets a stream that tails the logs of the target job.

Parameters:
  • cluster_name – The name of the cluster the job was run in.

  • job_id – The ID of the job to tail the logs of.

name() str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) JobStatus[source]#

Runs the specified job on this cluster.

Parameters:

job – The job to run.

Returns:

The job status.

stop() None[source]#

Cancels all jobs, running or queued.

oumi.launcher.clusters.modal_cluster module#

Modal-backed cluster implementation.

Modal has no native cluster concept — every job is a single Sandbox. ModalCluster is a thin façade that maps a logical cluster name (the SkyPilot-style identifier callers like the Oumi worker pass to oumi.launcher.up) onto sandbox lookups by object_id. Job lookups use the job_id argument directly so callers don’t need to know the mapping.

stop() and down() cancel every sandbox the in-process ModalClient has launched under this cluster name. Across worker restarts the mapping is lost; cleanup at that point should fall back to per-sandbox cancel_job using the job_id persisted by the caller alongside the cluster name.
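A sketch of that fallback path; `persisted_cluster_name`, `persisted_job_id`, and `client` are placeholder names for state the caller restored after the restart:

    from oumi.launcher.clusters.modal_cluster import ModalCluster

    # After a worker restart the in-process sandbox mapping is gone, so
    # stop()/down() have nothing to cancel. Cancel the sandbox directly
    # by its persisted object_id instead.
    cluster = ModalCluster(name=persisted_cluster_name, client=client)
    status = cluster.cancel_job(persisted_job_id)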

class oumi.launcher.clusters.modal_cluster.ModalCluster(name: str, client: ModalClient)[source]#

Bases: BaseCluster

A cluster implementation backed by Modal sandboxes.

__eq__(other: Any) bool[source]#

Checks if two ModalClusters are equal.

__hash__() int[source]#

Hashes by cluster name so instances can live in sets/dicts.

cancel_job(job_id: str) JobStatus[source]#

Cancels the sandbox identified by job_id and returns its status.

down() None[source]#

Alias for stop(); Modal is serverless, so there is nothing else to tear down.

get_job(job_id: str) JobStatus | None[source]#

Gets the status of the sandbox identified by job_id.

job_id is the opaque Sandbox.object_id returned at launch time (and persisted by the caller). The cluster name is purely logical, so this method ignores self._name and goes straight to the sandbox lookup.

get_jobs() list[JobStatus][source]#

Lists the jobs spawned under this cluster name in this process.

get_logs_stream(cluster_name: str, job_id: str | None = None) ModalLogStream[source]#

Returns a stream of logs for job_id (sandbox object_id).

cluster_name is accepted for interface compatibility and ignored. job_id is the canonical handle. If job_id is omitted, falls back to the most recently launched sandbox under this cluster name (in this process).

name() str[source]#

Gets the cluster name.

run_job(job: JobConfig) JobStatus[source]#

Re-running on a Modal cluster is unsupported.

Modal jobs are 1:1 with sandboxes. To run a new job, allocate a new sandbox via ModalCloud.up_cluster.

stop() None[source]#

Best-effort cancel of every sandbox tracked under this cluster name.

oumi.launcher.clusters.perlmutter_cluster module#

class oumi.launcher.clusters.perlmutter_cluster.PerlmutterCluster(name: str, client: SlurmClient)[source]#

Bases: BaseCluster

A cluster implementation backed by NERSC Perlmutter.

class SupportedQueues(value)[source]#

Bases: Enum

Enum representing the supported queues on Perlmutter.

Unlike most other research clusters, Perlmutter refers to queues as quality of service (QoS) levels; we use the term queue for consistency with the other clusters. For more details, see: https://docs.nersc.gov/jobs/policy/#perlmutter-gpu. A short sketch of inspecting these values follows the list below.

DEBUG = 'debug'#
DEBUG_PREEMPT = 'debug_preempt'#
INTERACTIVE = 'interactive'#
JUPYTER = 'jupyter'#
OVERRUN = 'overrun'#
PREEMPT = 'preempt'#
PREMIUM = 'premium'#
REALTIME = 'realtime'#
REGULAR = 'regular'#
SHARED = 'shared'#
SHARED_INTERACTIVE = 'shared_interactive'#
SHARED_OVERRUN = 'shared_overrun'#
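
The enum values mirror the QoS names accepted by Perlmutter's scheduler (an assumption based on the NERSC policy page linked above). A minimal sketch of inspecting them:

    from oumi.launcher.clusters.perlmutter_cluster import PerlmutterCluster

    Q = PerlmutterCluster.SupportedQueues
    print(sorted(q.value for q in Q))     # all supported queue names
    assert Q.REGULAR.value == "regular"   # each value is the raw QoS string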
__eq__(other: Any) bool[source]#

Checks if two PerlmutterClusters are equal.

cancel_job(job_id: str) JobStatus[source]#

Cancels the specified job on this cluster.

down() None[source]#

This is a no-op for Perlmutter clusters.

get_job(job_id: str) JobStatus | None[source]#

Gets the job on this cluster if it exists, else returns None.

get_jobs() list[JobStatus][source]#

Lists the jobs on this cluster.

get_logs_stream(cluster_name: str, job_id: str | None = None) TextIOBase[source]#

Gets a stream that tails the logs of the target job.

Parameters:
  • cluster_name – The name of the cluster the job was run in.

  • job_id – The ID of the job to tail the logs of.

name() str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) JobStatus[source]#

Runs the specified job on this cluster.

For Perlmutter this method consists of 5 parts:

  1. Copy the working directory to the remote host’s $HOME/oumi_launcher/$JOB_NAME.

  2. Check if there is a conda installation. If not, install it.

  3. Copy all file mounts.

  4. Create a job script with all env vars, setup, and run commands.

  5. CD into the working directory and submit the job.

Parameters:

job – The job to run.

Returns:

The job status.

Return type:

JobStatus

stop() None[source]#

This is a no-op for Perlmutter clusters.

oumi.launcher.clusters.polaris_cluster module#

class oumi.launcher.clusters.polaris_cluster.PolarisCluster(name: str, client: PolarisClient)[source]#

Bases: BaseCluster

A cluster implementation backed by Polaris.

__eq__(other: Any) bool[source]#

Checks if two PolarisClusters are equal.

cancel_job(job_id: str) JobStatus[source]#

Cancels the specified job on this cluster.

down() None[source]#

This is a no-op for Polaris clusters.

get_job(job_id: str) JobStatus | None[source]#

Gets the job on this cluster if it exists, else returns None.

get_jobs() list[JobStatus][source]#

Lists the jobs on this cluster.

get_logs_stream(cluster_name: str, job_id: str | None = None) TextIOBase[source]#

Gets a stream that tails the logs of the target job.

Parameters:
  • cluster_name – The name of the cluster the job was run in.

  • job_id – The ID of the job to tail the logs of.

name() str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) JobStatus[source]#

Runs the specified job on this cluster.

For Polaris this method consists of 5 parts:

  1. Copy the working directory to /home/$USER/oumi_launcher/<submission_time>.

  2. Check if there is a conda installation at /home/$USER/miniconda3/envs/oumi. If not, install it.

  3. Copy all file mounts.

  4. Create a job script with all env vars, setup, and run commands.

  5. CD into the working directory and submit the job.

Parameters:

job – The job to run.

Returns:

The job status.

Return type:

JobStatus

stop() None[source]#

This is a no-op for Polaris clusters.

oumi.launcher.clusters.sky_cluster module#

class oumi.launcher.clusters.sky_cluster.SkyCluster(name: str, client: SkyClient)[source]#

Bases: BaseCluster

A cluster implementation backed by SkyPilot.

__eq__(other: Any) bool[source]#

Checks if two SkyClusters are equal.

cancel_job(job_id: str) JobStatus[source]#

Cancels the specified job on this cluster.

down() None[source]#

Tears down the current cluster.

get_job(job_id: str) JobStatus | None[source]#

Gets the job on this cluster if it exists, else returns None.

get_jobs() list[JobStatus][source]#

Lists the jobs on this cluster.

get_logs_stream(cluster_name: str, job_id: str | None = None) SkyLogStream[source]#

Gets a stream that tails the logs of the target job.

Parameters:
  • cluster_name – The name of the cluster the job was run in.

  • job_id – The ID of the job to tail the logs of.

name() str[source]#

Gets the name of the cluster.

run_job(job: JobConfig) JobStatus[source]#

Runs the specified job on this cluster.

stop() None[source]#

Stops the current cluster.
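
Unlike the HPC clusters above, stop() and down() are real operations here. A sketch of the distinction, assuming an existing `cluster` instance (SkyPilot semantics: stop preserves state, down terminates):

    # Pause the cloud VMs; the cluster can be restarted later:
    cluster.stop()
    # ...or release everything permanently:
    cluster.down()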

oumi.launcher.clusters.slurm_cluster module#

class oumi.launcher.clusters.slurm_cluster.SlurmCluster(name: str, client: SlurmClient)[source]#

Bases: BaseCluster

A cluster implementation backed by a Slurm scheduler.

class ConnectionInfo(hostname: str, user: str)[source]#

Bases: object

Dataclass to hold information about a connection.

hostname: str#
property name#

Gets the name of the connection in the form user@hostname.

user: str#
__eq__(other: Any) bool[source]#

Checks if two SlurmClusters are equal.

cancel_job(job_id: str) JobStatus[source]#

Cancels the specified job on this cluster.

down() None[source]#

This is a no-op for Slurm clusters.

get_job(job_id: str) JobStatus | None[source]#

Gets the job on this cluster if it exists, else returns None.

get_jobs() list[JobStatus][source]#

Lists the jobs on this cluster.

get_logs_stream(cluster_name: str, job_id: str | None = None) SlurmLogStream[source]#

Gets a stream that tails the logs of the target job.

Parameters:
  • cluster_name – The name of the cluster the job was run in.

  • job_id – The ID of the job to tail the logs of.

Returns:

A SlurmLogStream object that can be used to read the logs.

static get_slurm_connections() list[ConnectionInfo][source]#

Gets Slurm connections from the OUMI_SLURM_CONNECTIONS env variable.
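
A sketch, assuming the variable holds a comma-separated list of user@hostname entries (the form used by ConnectionInfo above); the hostnames are illustrative:

    import os

    from oumi.launcher.clusters.slurm_cluster import SlurmCluster

    os.environ["OUMI_SLURM_CONNECTIONS"] = (
        "alice@login1.example.edu,alice@login2.example.edu"
    )
    for conn in SlurmCluster.get_slurm_connections():
        print(conn.name)  # e.g. "alice@login1.example.edu"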

name() str[source]#

Gets the name of the cluster.

static parse_cluster_name(name: str) ConnectionInfo[source]#

Parses the cluster name into hostname and user components.

Parameters:

name – The name of the cluster.

Returns:

The parsed cluster information.

Return type:

ConnectionInfo
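
A sketch, assuming cluster names take the user@hostname form described by ConnectionInfo:

    from oumi.launcher.clusters.slurm_cluster import SlurmCluster

    info = SlurmCluster.parse_cluster_name("alice@login1.example.edu")
    print(info.user)      # "alice"
    print(info.hostname)  # "login1.example.edu"
    print(info.name)      # "alice@login1.example.edu"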

run_job(job: JobConfig) JobStatus[source]#

Runs the specified job on this cluster.

For Slurm this method consists of 4 parts:

  1. Copy the working directory to ~/oumi_launcher/<submission_time>.

  2. Copy all file mounts.

  3. Create a job script with all env vars, setup, and run commands.

  4. CD into the working directory and submit the job.

Parameters:

job – The job to run.

Returns:

The job status.

Return type:

JobStatus
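
An end-to-end sketch using the top-level launcher API mentioned in the modal_cluster notes above; it assumes oumi.launcher.up submits the job and returns the cluster together with the initial job status, and that "slurm" is the matching cloud identifier:

    import oumi.launcher as launcher
    from oumi.core.configs import JobConfig, JobResources

    job = JobConfig(
        name="hello-slurm",
        working_dir=".",
        resources=JobResources(cloud="slurm"),
        run="srun hostname",
    )
    cluster, status = launcher.up(job, cluster_name="alice@login1.example.edu")
    print(status.id, status.status)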

stop() None[source]#

This is a no-op for Slurm clusters.