Oumi OSS v0.8: Deploy, MCP, and Batch Inference Everywhere
By Stefan Webb
May 15, 2026
We’re excited to announce Oumi OSS (Open Source Stack) v0.8, a release focused on closing the loop from training to production. This version lands a brand-new oumi deploy CLI for shipping models to dedicated inference endpoints, an oumi-mcp server that puts Oumi inside any MCP-capable assistant, batch-API parity across the major hosted providers, sliding-window rate limiting on remote engines, a new Cerebras integration, multi-turn conversation synthesis, and a major dependency push to Transformers v5, TRL 0.24+, and vLLM 0.14+.
What’s New in v0.8
1. oumi deploy — One-Command Dedicated Endpoints
Going from a fine-tuned checkpoint to a live, autoscaling endpoint used to mean a tour through provider docs and a custom script. Oumi v0.8 introduces oumi deploy, a first-class deployment CLI that validates your model, uploads the weights, stands up a dedicated endpoint, polls it until it’s live, and (optionally) fires test prompts — all in one command.
Key Features:
🚀 Single-command deploy from a YAML config or CLI flags
🧩 Full-model or LoRA-adapter uploads, with provider-specific base-model linking
⚙️ Autoscaling: configurable min_replicas, max_replicas, hardware, and GPU count
🔄 Lifecycle commands: upload, create-endpoint, status, list, list-models, list-hardware
🧱 Pluggable provider architecture: Fireworks.ai and Parasail ship in this release, more on the way
Quick Start:
# Single-command deploy from a YAML config
oumi deploy up --config configs/examples/deploy/fireworks_deploy.yaml
# Or assemble the deploy on the CLI
oumi deploy up \
  --model-path /path/to/my-finetuned-model/ \
  --provider fireworks \
  --hardware nvidia_h100_80gb \
  --gpu-count 2 \
  --min-replicas 1 \
  --max-replicas 4

Example fireworks_deploy.yaml:
model_source: /path/to/my-finetuned-model/
provider: fireworks
model_name: my-finetuned-model-v1
model_type: full # or "adapter" for LoRA + base_model: ...
hardware:
  accelerator: nvidia_h100_80gb
  count: 2
autoscaling:
  min_replicas: 1
  max_replicas: 4
test_prompts:
  - "Hello, how are you?"

A typed exception hierarchy and a shared base_client.py make it straightforward to plug additional providers into the same workflow. If you’re tired of writing the “upload → create endpoint → poll → smoke test” script for every model you ship, this one’s for you.
Learn More: Deploying Models Guide
2. oumi-mcp — Oumi from Inside Your Assistant
Oumi v0.8 ships an MCP (Model Context Protocol) server so that MCP-capable assistants — Claude Desktop, Claude Code, Cursor, and others — can browse Oumi’s ~500 bundled YAML configs, validate them, launch and monitor training, eval, and inference jobs (local or cloud), and read built-in workflow guidance, all without leaving chat.
Key Features:
🔍 Fuzzy search across ~500 ready-to-use configs by path, filename, or content
✅ Pre-flight validation: HF auth, gated repo access, hardware, local paths, SkyPilot setup
🛰️ Launch and babysit training/eval/inference jobs on local or cloud (SkyPilot)
📖 Built-in workflow prompts for get-started, train, infer, eval, synth, analyze, post-training, cloud-launch, and an end-to-end mle_workflow
🛡️ Safe defaults: dry-run on every job launch, typed confirmation on destructive cluster teardown
Quick Start:
pip install "oumi[mcp]"
# Claude Code (run from your project directory)
claude mcp add oumi oumi-mcp

Or wire it up manually in claude_desktop_config.json / ~/.claude.json / Cursor settings:
{
  "mcpServers": {
    "oumi": { "command": "oumi-mcp" }
  }
}

Once it’s connected, the assistant can do things like “find me a LoRA config for Llama 3.1 8B on an A100, validate it, and dry-run it on GCP” — and the server gates every action through pre-flight checks and dry-run previews before anything real happens.
Learn More: MCP Server Guide
3. Batch API Support Across Hosted Providers
Batch APIs cut hosted-inference cost by ~50% and are perfect for offline eval, synthetic data generation, and large-scale judging. Oumi v0.8 brings batch-inference parity to Anthropic, Fireworks, and Together, with unified job control (submit, poll, cancel, partial retry) and progress tracking across providers.
Key Features:
🪄 Single infer_batch() call handles the full submit-poll-fetch lifecycle
⏱️ Configurable batch completion windows (e.g., "24h")
🔁 Job control: cancel, partial retry, progress tracking
🧮 Cache-token usage now reported for Anthropic and Together (prompt caching is on by default for Anthropic)
Quick Start:
from oumi.core.configs import InferenceConfig
from oumi.inference import AnthropicInferenceEngine
config = InferenceConfig.from_yaml("infer.yaml")
engine = AnthropicInferenceEngine(model_params=config.model)
# Submit, poll, fetch results — engine handles the batch lifecycle
results = engine.infer_batch(input=conversations, inference_config=config)

Or flip it on from YAML:
# infer.yaml
remote_params:
  use_batch_api: true
  batch_completion_window: "24h"

Learn More: Inference Engines Guide
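One more note on the Python example above: the conversations argument is just a list of standard Oumi Conversation objects. A minimal sketch of building an input batch (the prompt text is illustrative):

from oumi.core.types import Conversation, Message, Role

# Two single-message conversations as a tiny batch; swap in real prompts.
conversations = [
    Conversation(messages=[Message(role=Role.USER, content="Summarize this support ticket: ...")]),
    Conversation(messages=[Message(role=Role.USER, content="Classify this review as positive or negative: ...")]),
]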
4. Built-in RPM/TPM Rate Limiting
Every RemoteInferenceEngine now has sliding-window rate limiting built in, independently tracking requests per minute, input tokens per minute, and output tokens per minute using the actual usage reported in each provider response. This means you can pin API budgets directly in your config — no external proxy, no homegrown semaphore wrapper.
Key Features:
🚦 Independent RPM, input-TPM, and output-TPM limits
🧠 Provider-aware: reads actual usage from each response
🔧 Drop-in replacement for the now-deprecated politeness_policy
Quick Start:
# infer.yaml
model:
  model_name: claude-opus-4-7
engine: ANTHROPIC
remote_params:
  num_workers: 16
  requests_per_minute: 4000
  input_tokens_per_minute: 400_000
  output_tokens_per_minute: 80_000

If you’ve ever burned an afternoon debugging 429s in the middle of a large eval, this is the upgrade you’ve been waiting for.
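If you’re curious how sliding-window limiting behaves, here is a small self-contained illustration. This is a conceptual sketch of the idea, not Oumi’s internal implementation (the real one also feeds in the usage numbers returned by each provider response):

import time
from collections import deque

class SlidingWindowLimiter:
    """Conceptual sliding-window budget over the last `window` seconds."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit      # e.g. 4000 requests or 400_000 tokens
        self.window = window    # window length in seconds
        self.events = deque()   # (timestamp, amount) pairs
        self.total = 0

    def _evict(self, now: float) -> None:
        # Forget events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.total -= self.events.popleft()[1]

    def acquire(self, amount: int = 1) -> None:
        # Block until `amount` fits under the per-window budget.
        if amount > self.limit:
            raise ValueError("amount can never fit in the window")
        while True:
            now = time.monotonic()
            self._evict(now)
            if self.total + amount <= self.limit:
                self.events.append((now, amount))
                self.total += amount
                return
            # Sleep until the oldest event leaves the window.
            time.sleep(self.window - (now - self.events[0][0]) + 0.01)

# One independent limiter per dimension, mirroring the config above:
rpm = SlidingWindowLimiter(4000)
input_tpm = SlidingWindowLimiter(400_000)
output_tpm = SlidingWindowLimiter(80_000)
rpm.acquire()            # one request
input_tpm.acquire(1200)  # about to send ~1,200 input tokens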
5. Cerebras Inference Engine
A new CerebrasInferenceEngine lands in v0.8, registered with the standard InferenceEngineType factory so it plugs into every Oumi workflow (eval, judge, synth, infer) the same way every other engine does.
Quick Start:
model:
  model_name: llama-3.3-70b
engine: CEREBRAS
remote_params:
  api_key_env_varname: CEREBRAS_API_KEY

from oumi.inference import CerebrasInferenceEngine

engine = CerebrasInferenceEngine(model_params=...)

6. Multi-Turn Conversation Synthesis
oumi synth learned to chain conversation synthesizers into full multi-turn dialogues — a major step up for anyone distilling assistant or customer-support style training data.
Key Features:
💬 Compose multiple synthesizers into a single multi-turn conversation
🎭 Structured action blocks (CLARIFY, LOOKUP_ORDER, INITIATE_RETURN, …) for realistic agent traces
🧱 Same oumi synth CLI and config schema you already know
📊 Token-usage accumulation across synthesizer steps
Quick Start:
oumi synth -c oumi://configs/examples/synthesis/multiturn_conversation_synth.yaml

The bundled config generates five customer-support dialogues end-to-end; use it as a starting point and swap in your own scenarios, personas, and action vocabulary.
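To make the structured-action idea concrete, here is one plausible shape for a generated trace. This is purely illustrative; the actual message schema and action syntax are defined by the config and may differ:

# Hypothetical synthesized support dialogue with inline action blocks.
dialogue = [
    {"role": "user", "content": "My espresso machine arrived with a cracked water tank."},
    {"role": "assistant", "content": "[CLARIFY] Sorry to hear that! Could you confirm the order number?"},
    {"role": "user", "content": "It's the one from last Tuesday."},
    {"role": "assistant", "content": "[LOOKUP_ORDER] Found it. [INITIATE_RETURN] I've started a replacement for you."},
]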
Learn More: Data Synthesis Guide
Plus: New Model Recipes and Transformers v5
A few more things worth calling out:
Transformers v5 — Oumi v0.8 absorbs the upgrade so you don’t have to. Transformers is now >=4.57,<5.7, with TRL >=0.24,<1.4, vLLM >=0.14,<0.21, veRL >=0.5,<0.8, and PEFT >=0.17,<0.20. KTO/GKD import paths, DPO preprocessing, warmup_ratio deprecation, and VL processor message handling are all handled internally.
New model configs — Qwen3.5 0.8B (full-finetune, LoRA, HF + vLLM inference), Qwen3-VL (2B / 4B / 8B / 30B-A3B), Qwen3 MoE (235B, 30B-A3B, 80B-A3B-Instruct) LoRA, GPT-OSS 120B LoRA multi-GPU, and a Llama 4 Scout LoRA refresh.
Judge framework upgrades — batch inference, token-usage accounting, and the ability to take pre-built Conversation objects.
Tool-use SFT, fixed — DataCollatorForCompletionOnlyLM now correctly masks tool-result tokens during completion-only training, so SFT on tool-use traces actually trains the assistant turn instead of the (already-known) tool output.
Inference quality of life — list_models() on every engine, finish_reason surfaced across engines, a typed QuotaError for catchable rate-limit failures (see the sketch after this list), retries on transient HTTP 400s, api_input attached to APIStatusError for debugging, and vllm_config_overrides for arbitrary kwargs.
Cleaner config errors — raw OmegaConf stack traces are replaced by a typed OumiConfigParsingError hierarchy with human-readable messages, and DatasetParams.finalize_and_validate now checks that dataset_path actually exists.
Launcher — start_at, end_at, and cost_per_hour on the base cluster, useful for time-boxed runs and pricing-aware cluster selection.
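As promised, a sketch of handling the typed QuotaError. The import location of QuotaError is an assumption here, and the engine setup simply mirrors section 3; check the Oumi API reference for the real path:

import time

from oumi.core.configs import InferenceConfig
from oumi.inference import AnthropicInferenceEngine
from oumi.inference import QuotaError  # assumed location; verify in the docs

config = InferenceConfig.from_yaml("infer.yaml")
engine = AnthropicInferenceEngine(model_params=config.model)
conversations = [...]  # build as in the batch example above

try:
    results = engine.infer(input=conversations, inference_config=config)
except QuotaError:
    # Catch rate-limit failures as a type instead of string-matching 429s,
    # then back off and retry or reschedule the job.
    time.sleep(60)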
New Contributors
A huge welcome to our new contributors who helped make v0.8 possible:
@idoudali
@AnandVishesh1301
Thank you for your contributions!
Get Started with Oumi v0.8
Installation
# Core installation
pip install oumi
# With the MCP server
pip install "oumi[mcp]"
# With deployment providers
pip install oumi
# (no extra needed — providers are picked up at runtime)
# Per-cloud extras for cloud launches via SkyPilot
pip install "oumi[aws]"
pip install "oumi[gcp]"
pip install "oumi[azure]"
pip install "oumi[kubernetes]"
Documentation
Example Configs
Check out the example configs in the repository:
Fireworks Deploy — oumi deploy up examples
Multi-Turn Synthesis — multi-turn customer-support and other scenarios
Qwen3.5 0.8B — full-finetune, LoRA, HF + vLLM inference recipes
Qwen3-VL — 2B / 4B / 8B / 30B-A3B vision-language configs
GPT-OSS 120B LoRA — multi-GPU LoRA training
Full Changelog
For a complete list of changes, see the full changelog.
What’s Next?
We’re constantly improving Oumi based on your feedback. Have ideas or feature requests? Open an issue on GitHub or join our community discussions on Discord.
Happy training (and deploying)!
— The Oumi Team
