Deploying Models#
Oumi provides a top-level oumi deploy command for taking a trained or downloaded model and standing it up as a managed inference endpoint on a third-party provider. Today it supports Fireworks AI and Parasail.io.
Related
To launch training on remote clusters, see Running Jobs on Clusters.
To call a deployed endpoint, see Inference Engines.
Overview#
The deploy workflow has three stages, each exposed as a sub-command:
Upload — push the model (full weights or a LoRA adapter) to the provider.
Create endpoint — provision hardware and start serving the uploaded model.
Test / use — smoke-test the endpoint and then call it with any inference engine.
For the common case, `oumi deploy up` runs all three stages end-to-end from a single YAML config.
Prerequisites#
A provider account and API key exported in your shell:
Fireworks: `FIREWORKS_API_KEY`
Parasail: `PARASAIL_API_KEY`
For Fireworks, the model must exist on your local disk (HuggingFace download or an Oumi training output).
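For example, setup might look like the following sketch; the key value and model ID are placeholders, and the Hugging Face CLI is just one way to get weights onto local disk:

# Placeholder key value; use your real provider key
export FIREWORKS_API_KEY="fw-..."

# Example only: download weights to a local directory
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir /tmp/llama3-8b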
Quick Start: End-to-End Deploy#
oumi deploy up --config configs/examples/deploy/fireworks_deploy.yaml
The `--config` YAML matches the `DeploymentConfig` schema:
# configs/examples/deploy/fireworks_deploy.yaml
model_source: /path/to/my-finetuned-model/  # local directory
provider: fireworks                         # fireworks | parasail
model_name: my-finetuned-model-v1           # display name on the provider
model_type: full                            # full | adapter
# base_model: accounts/fireworks/models/llama-v3p1-8b-instruct  # required if adapter

hardware:
  accelerator: nvidia_h100_80gb  # see `oumi deploy list-hardware`
  count: 2

autoscaling:
  min_replicas: 1
  max_replicas: 4

test_prompts:
  - "Hello, how are you?"
Any of `model_source`, `provider`, and `hardware` can be overridden on the CLI, e.g.:
oumi deploy up \
  --config fireworks_deploy.yaml \
  --model-path /tmp/llama3-8b \
  --hardware nvidia_a100_80gb
`oumi deploy up` will upload the model, wait for it to be ready, create an endpoint, optionally run any `test_prompts`, and print the endpoint URL.
Sub-Commands#
| Command | What it does |
|---|---|
| `up` | Full pipeline: upload → create endpoint → test |
| `upload` | Upload a model only |
| `create` | Create an endpoint for a previously uploaded model |
| `list` | List all deployments on the provider |
| `list-models` | List uploaded models |
| `list-hardware` | List hardware options available for a provider |
| `status` | Show endpoint state, replica counts, URL |
| `start` / `stop` | Start or stop an existing endpoint (pause to save cost) |
| `delete` | Delete an endpoint |
| `delete-model` | Delete an uploaded model |
| `test` | Send a sample request to an endpoint |
Add `--help` to any sub-command for the exact flags it accepts, or see CLI Reference.
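For example, pausing an idle endpoint and resuming it later might look like this sketch; the endpoint identifier is a placeholder, so check `--help` for the exact argument:

# Pause to stop accruing cost (placeholder endpoint name)
oumi deploy stop my-finetuned-model-v1

# Bring it back online
oumi deploy start my-finetuned-model-v1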
Using a Deployed Endpoint#
Once `oumi deploy up` reports `RUNNING`, you can call the deployed model with any Oumi inference engine. For Fireworks:
from oumi.inference import FireworksInferenceEngine
from oumi.core.configs import ModelParams

engine = FireworksInferenceEngine(
    model_params=ModelParams(model_name="my-finetuned-model-v1")
)
For Parasail:
from oumi.inference import ParasailInferenceEngine
from oumi.core.configs import ModelParams

engine = ParasailInferenceEngine(
    model_params=ModelParams(model_name="my-finetuned-model-v1")
)
Both engines are documented in Inference Engines.
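Either engine is then called through Oumi's conversation-based `infer` API; a minimal sketch, assuming the standard `Conversation` types (import paths can vary across Oumi versions):

from oumi.core.types.conversation import Conversation, Message, Role

# One-turn conversation sent to the deployed model
conversation = Conversation(
    messages=[Message(role=Role.USER, content="Hello, how are you?")]
)
results = engine.infer([conversation])
print(results[0].messages[-1].content)  # the model's reply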
Tips#
Cost control. Use `oumi deploy stop <endpoint>` to pause an endpoint without deleting it; `start` brings it back online. Set `autoscaling.min_replicas: 0` if the provider supports scale-to-zero (see the sketch after this list).
LoRA adapters. Set `model_type: adapter` and a matching `base_model` to deploy a LoRA adapter on top of a hosted base model. This is usually cheaper than deploying full weights.
Smoke tests. The `test_prompts` at the bottom of the YAML run automatically after `oumi deploy up` finishes, giving you a quick sanity check before sending real traffic.
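A scale-to-zero autoscaling block might look like this sketch; whether `min_replicas: 0` is accepted depends on the provider:

autoscaling:
  min_replicas: 0  # requires provider support for scale-to-zero
  max_replicas: 4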
See Also#
Inference Engines — calling the deployed endpoint
Running Jobs on Clusters — launching training jobs on remote clusters
CLI Reference — exact flags for every `oumi deploy` sub-command