Deploying Oumi on Kubernetes#

This guide covers deploying Oumi on Kubernetes (k8s) clusters. For automated cluster provisioning and job management using the Oumi launcher, see the Running Jobs on Clusters guide instead.

Follow this guide to deploy Oumi onto an existing k8s cluster with GPU nodes. If you first need to set up a k8s cluster on a cloud provider, see the platform-specific examples at the end of this guide.

Prerequisites#

  • A running k8s cluster with GPU nodes

  • kubectl configured to access your cluster

  • For GPU workloads: NVIDIA Device Plugin installed

  • Cluster must have internet access to pull Oumi container images from ghcr.io/oumi-ai/oumi

Note

Most cloud k8s clusters (EKS, GKE, AKS) use amd64/x86_64 architecture. Verify your node architecture with kubectl get nodes -o wide and choose the matching image from the container registry referenced below.
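
You can also confirm that the NVIDIA Device Plugin is advertising GPUs to the scheduler by listing each node's allocatable nvidia.com/gpu resource (a minimal check; the escaped dots are required by kubectl's custom-columns syntax for keys that contain dots):

# Show allocatable GPUs per node; empty or <none> means the device plugin is not running there
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'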

Quick Start#

1. Create Namespace#

kubectl create namespace oumi

2. Deploy Oumi#

Create oumi-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oumi
  namespace: oumi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oumi
  template:
    metadata:
      labels:
        app: oumi
    spec:
      containers:
      - name: oumi
        # Use the linux/amd64 image for most cloud providers
        # Get latest images from: https://github.com/oumi-ai/oumi/pkgs/container/oumi
        image: ghcr.io/oumi-ai/oumi:latest
        command: ["sleep", "infinity"]
        # Adjust gpu, memory, and storage based on the model you want to run.
        # Below is a configuration for a single GPU per pod.
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
            ephemeral-storage: "100Gi"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            ephemeral-storage: "1Ti"
      # Configure for GPU nodes
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu: "true"  # Adjust based on your node labels
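
If you are unsure which labels your GPU nodes carry, list them before setting nodeSelector; the nvidia.com/gpu label above is only an example and your cluster may use a different convention:

kubectl get nodes --show-labels
# Or print a single label as its own column:
kubectl get nodes -L nvidia.com/gpu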

Apply the deployment:

kubectl apply -f oumi-deployment.yaml

The first deployment can take around 15 minutes while the cluster pulls and loads the Oumi container image.
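
To monitor progress, you can watch the rollout until the pod is ready (the 20-minute timeout below is an arbitrary allowance for the image pull; adjust as needed):

kubectl rollout status deployment/oumi -n oumi --timeout=20m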

3. Access Oumi#

# Get pod name
POD_NAME=$(kubectl get pods -n oumi -l app=oumi -o jsonpath='{.items[0].metadata.name}')

# Execute commands in the pod
kubectl exec -it $POD_NAME -n oumi -- /bin/bash

Inside the pod, run Oumi commands:

oumi train -c /path/to/config.yaml
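
If your config lives outside the image, one option is to copy it into the pod first and point oumi train at the copied path (a sketch; my_config.yaml and the /tmp destination are placeholders):

# Copy a local config into the running pod, then train with it
kubectl cp ./my_config.yaml oumi/$POD_NAME:/tmp/my_config.yaml
kubectl exec -it $POD_NAME -n oumi -- oumi train -c /tmp/my_config.yaml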

Platform Examples#

EKS Setup Example

Prerequisites

  • AWS CLI and eksctl installed

  • AWS account configured

  • Sufficient quota for a GPU node
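
Before creating anything, a quick sanity check that the tooling is installed and the AWS account is configured:

aws sts get-caller-identity
eksctl version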

Create Cluster

Create gpu-cluster.yaml:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gpu-cluster
  region: us-west-2
  version: "1.31"

vpc:
  cidr: "10.0.0.0/16"
  nat:
    gateway: Single

iam:
  withOIDC: true

nodeGroups:
  - name: cpu-workers
    instanceType: m5.large
    desiredCapacity: 2
    minSize: 1
    maxSize: 4
    volumeSize: 20
    ssh:
      allow: true
      publicKeyName: your-key-name  # Replace with your SSH key
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

  - name: gpu-workers
    instancesDistribution:
      instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"] # Adjust accordingly
      maxPrice: 0.50
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotInstancePools: 4
    desiredCapacity: 1
    minSize: 0
    maxSize: 3
    volumeSize: 50
    ssh:
      allow: true
      publicKeyName: your-key-name  # Replace with your SSH key
    labels:
      nvidia.com/gpu: "true"
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
  - name: aws-ebs-csi-driver
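
Before provisioning, you can sanity-check the config with eksctl's dry-run mode, which prints the fully expanded cluster config without creating resources (available in recent eksctl versions):

eksctl create cluster -f gpu-cluster.yaml --dry-run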

Create cluster and deploy:

# Create cluster (takes 15-20 minutes)
eksctl create cluster -f gpu-cluster.yaml

# Create namespace
kubectl create namespace oumi

# Apply Oumi deployment
kubectl apply -f oumi-deployment.yaml

# Access pod
POD_NAME=$(kubectl get pods -n oumi -l app=oumi -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD_NAME -n oumi -- /bin/bash
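
If GPUs do not appear as allocatable resources after the nodes join, install the NVIDIA Device Plugin manually; eksctl prints a reminder for GPU instance types but does not install the plugin for you. The manifest URL below is an example, so check the NVIDIA/k8s-device-plugin releases for the current version:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml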

Cleanup

eksctl delete cluster -f gpu-cluster.yaml

GKE Setup Example

Prerequisites

  • gcloud CLI installed and configured

Create Cluster

export PROJECT_ID=your-project-id
export ZONE=us-central1-a
export CLUSTER_NAME=oumi-cluster

gcloud config set project $PROJECT_ID

# Create cluster with GPU nodes
gcloud container clusters create $CLUSTER_NAME \
  --zone=$ZONE \
  --machine-type=n1-standard-4 \
  --num-nodes=2 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=3 \
  --disk-size=50

# Get credentials
gcloud container clusters get-credentials $CLUSTER_NAME --zone=$ZONE

# Install NVIDIA drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
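
Driver installation runs as a DaemonSet on the GPU nodes; you can wait for it to finish before deploying (daemonset name as defined in the manifest above):

kubectl rollout status daemonset/nvidia-driver-installer -n kube-system --timeout=10m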

Deploy Oumi

Create oumi-deployment.yaml (adjust nodeSelector):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oumi
  namespace: oumi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oumi
  template:
    metadata:
      labels:
        app: oumi
    spec:
      containers:
      - name: oumi
        image: ghcr.io/oumi-ai/oumi:latest
        command: ["sleep", "infinity"]
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
Apply the deployment:

kubectl create namespace oumi
kubectl apply -f oumi-deployment.yaml
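
To confirm the pod was scheduled onto a GPU node:

kubectl get pods -n oumi -o wide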

Cleanup

gcloud container clusters delete $CLUSTER_NAME --zone=$ZONE