Oumi AI

Oumi OSS v0.8: Deploy, MCP, and Batch Inference Everywhere

By Stefan Webb

May 15, 2026

We’re excited to announce Oumi OSS (Open Source Stack) v0.8, a release focused on closing the loop from training to production. This version lands a brand-new oumi deploy CLI for shipping models to dedicated inference endpoints, an oumi-mcp server that puts Oumi inside any MCP-capable assistant, batch-API parity across the major hosted providers, sliding-window rate limiting on remote engines, a new Cerebras integration, multi-turn conversation synthesis, and a major dependency push to Transformers v5, TRL 0.24+, and vLLM 0.14+.

What’s New in v0.8

1. oumi deploy — One-Command Dedicated Endpoints

Going from a fine-tuned checkpoint to a live, autoscaling endpoint used to mean a tour through provider docs and a custom script. Oumi v0.8 introduces oumi deploy, a first-class deployment CLI that validates your model, uploads the weights, stands up a dedicated endpoint, polls it until it’s live, and (optionally) fires test prompts — all in one command.

Key Features:

  • 🚀 Single-command deploy from a YAML config or CLI flags

  • 🧩 Full-model or LoRA-adapter uploads, with provider-specific base-model linking

  • ⚙️ Autoscaling: configurable min_replicas, max_replicas, hardware, and GPU count

  • 🔄 Lifecycle commands: upload, create-endpoint, status, list, list-models, list-hardware

  • 🧱 Pluggable provider architecture: Fireworks.ai and Parasail ship in this release, more on the way

Quick Start:

# Single-command deploy from a YAML config
oumi deploy up --config configs/examples/deploy/fireworks_deploy.yaml

# Or assemble the deploy on the CLI
oumi deploy up \
  --model-path /path/to/my-finetuned-model/ \
  --provider fireworks \
  --hardware nvidia_h100_80gb \
  --gpu-count 2 \
  --min-replicas 1 \
  --max-replicas 4

Example fireworks_deploy.yaml:

model_source: /path/to/my-finetuned-model/
provider: fireworks
model_name: my-finetuned-model-v1
model_type: full          # or "adapter" for LoRA + base_model: ...
hardware:
  accelerator: nvidia_h100_80gb
  count: 2
autoscaling:
  min_replicas: 1
  max_replicas: 4
test_prompts:
  - "Hello, how are you?"

A typed exception hierarchy and a shared base_client.py make it straightforward to plug additional providers into the same workflow. If you’re tired of writing the “upload → create endpoint → poll → smoke test” script for every model you ship, this one’s for you.
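For a sense of what plugging in a new provider looks like, here's a purely illustrative sketch; the class and method names are assumptions modeled on the lifecycle described above, not the actual base_client.py interface:

class MyProviderDeployClient:
    """Hypothetical provider client for the upload -> create endpoint -> poll lifecycle."""

    def upload_model(self, model_path: str) -> str:
        """Upload full weights or a LoRA adapter; return the provider-side model ID."""
        raise NotImplementedError

    def create_endpoint(self, model_id: str, hardware: str, gpu_count: int,
                        min_replicas: int, max_replicas: int) -> str:
        """Provision a dedicated endpoint and return its ID."""
        raise NotImplementedError

    def get_endpoint_status(self, endpoint_id: str) -> str:
        """Poll until the provider reports the endpoint is live."""
        raise NotImplementedError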

Learn More: Deploying Models Guide


2. oumi-mcp — Oumi from Inside Your Assistant

Oumi v0.8 ships an MCP (Model Context Protocol) server so that MCP-capable assistants — Claude Desktop, Claude Code, Cursor, and others — can browse Oumi’s ~500 bundled YAML configs, validate them, launch and monitor training, eval, and inference jobs (local or cloud), and read built-in workflow guidance, all without leaving chat.

Key Features:

  • 🔍 Fuzzy search across ~500 ready-to-use configs by path, filename, or content

  • ✅ Pre-flight validation: HF auth, gated repo access, hardware, local paths, SkyPilot setup

  • 🛰️ Launch and babysit training/eval/inference jobs on local or cloud (SkyPilot)

  • 📖 Built-in workflow prompts for get-started, train, infer, eval, synth, analyze, post-training, cloud-launch, and an end-to-end mle_workflow

  • 🛡️ Safe defaults: dry-run on every job launch, typed confirmation on destructive cluster teardown

Quick Start:

pip install "oumi[mcp]"

# Claude Code (run from your project directory)
claude mcp add oumi oumi-mcp

Or wire it up manually in claude_desktop_config.json / ~/.claude.json / Cursor settings:

{
  "mcpServers": {
    "oumi": { "command": "oumi-mcp" }
  }
}

Once it’s connected, the assistant can do things like “find me a LoRA config for Llama 3.1 8B on an A100, validate it, and dry-run it on GCP” — and the server gates every action through pre-flight checks and dry-run previews before anything real happens.

Learn More: MCP Server Guide


3. Batch API Support Across Hosted Providers

Batch APIs cut hosted-inference cost by ~50% and are perfect for offline eval, synthetic data generation, and large-scale judging. Oumi v0.8 brings batch-inference parity to Anthropic, Fireworks, and Together, with unified job control (submit, poll, cancel, partial retry) and progress tracking across providers.

Key Features:

  • 🪄 Single infer_batch() call handles the full submit-poll-fetch lifecycle

  • ⏱️ Configurable batch completion windows (e.g., "24h")

  • 🔁 Job control: cancel, partial retry, progress tracking

  • 🧮 Cache-token usage now reported for Anthropic and Together (prompt caching is on by default for Anthropic)

Quick Start:

from oumi.core.configs import InferenceConfig
from oumi.inference import AnthropicInferenceEngine

config = InferenceConfig.from_yaml("infer.yaml")
engine = AnthropicInferenceEngine(model_params=config.model)

# Submit, poll, fetch results — the engine handles the full batch lifecycle.
# `conversations` is a list of Oumi Conversation objects to run through the batch.
results = engine.infer_batch(input=conversations, inference_config=config)

Or flip it on from YAML:

# infer.yaml
remote_params:
  use_batch_api: true
  batch_completion_window: "24h"

Learn More: Inference Engines Guide


4. Built-in RPM/TPM Rate Limiting

Every RemoteInferenceEngine now has sliding-window rate limiting built in, tracking requests per minute, input tokens per minute, and output tokens per minute independently, with actual usage read from each provider response. This means you can pin API budgets directly in your config — no external proxy, no homegrown semaphore wrapper.

Key Features:

  • 🚦 Independent RPM, input-TPM, and output-TPM limits

  • 🧠 Provider-aware: reads actual usage from each response

  • 🔧 Drop-in replacement for the now-deprecated politeness_policy

Quick Start:

# infer.yaml
model:
  model_name: claude-opus-4-7

engine: ANTHROPIC

remote_params:
  num_workers: 16
  requests_per_minute: 4000
  input_tokens_per_minute: 400_000
  output_tokens_per_minute: 80_000
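The same budgets can be set programmatically. Here's a minimal sketch mirroring the YAML above, assuming the new rate-limit fields live on RemoteParams and that remote engines accept remote_params at construction:

from oumi.core.configs import ModelParams, RemoteParams
from oumi.inference import AnthropicInferenceEngine

# Mirror the YAML budgets: 16 workers, 4k requests/min, 400k input tokens/min, 80k output tokens/min.
remote_params = RemoteParams(
    num_workers=16,
    requests_per_minute=4000,
    input_tokens_per_minute=400_000,
    output_tokens_per_minute=80_000,
)
engine = AnthropicInferenceEngine(
    model_params=ModelParams(model_name="claude-opus-4-7"),
    remote_params=remote_params,
)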

If you’ve ever burned an afternoon debugging 429s in the middle of a large eval, this is the upgrade you’ve been waiting for.


5. Cerebras Inference Engine

A new CerebrasInferenceEngine lands in v0.8, registered with the standard InferenceEngineType factory so it plugs into every Oumi workflow (eval, judge, synth, infer) the same way every other engine does.

Quick Start:

# infer.yaml
model:
  model_name: llama-3.3-70b
engine: CEREBRAS
remote_params:
  api_key_env_varname: CEREBRAS_API_KEY

Or construct the engine directly in Python:

from oumi.inference import CerebrasInferenceEngine
engine = CerebrasInferenceEngine(model_params=...)
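From there, inference looks the same as with any other remote engine. A minimal sketch, assuming the standard infer() entry point and the usual Conversation/Message import paths:

from oumi.core.configs import InferenceConfig
from oumi.core.types import Conversation, Message, Role
from oumi.inference import CerebrasInferenceEngine

config = InferenceConfig.from_yaml("infer.yaml")
engine = CerebrasInferenceEngine(model_params=config.model)

# One single-turn conversation; Cerebras handles it like any other remote provider.
conversation = Conversation(
    messages=[Message(role=Role.USER, content="Summarize Oumi v0.8 in one sentence.")]
)
results = engine.infer(input=[conversation], inference_config=config)
print(results[0].messages[-1].content)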

6. Multi-Turn Conversation Synthesis

oumi synth learned to chain conversation synthesizers into full multi-turn dialogues — a major step up for anyone distilling assistant or customer-support style training data.

Key Features:

  • 💬 Compose multiple synthesizers into a single multi-turn conversation

  • 🎭 Structured action blocks (CLARIFY, LOOKUP_ORDER, INITIATE_RETURN, …) for realistic agent traces

  • 🧱 Same oumi synth CLI and config schema you already know

  • 📊 Token-usage accumulation across synthesizer steps

Quick Start:

oumi synth -c oumi://configs/examples/synthesis/multiturn_conversation_synth.yaml

The bundled config generates five customer-support dialogues end-to-end; use it as a starting point and swap in your own scenarios, personas, and action vocabulary.

Learn More: Data Synthesis Guide


Plus: New Model Recipes and Transformers v5

A few more things worth calling out:

  • Transformers v5 — Oumi v0.8 absorbs the upgrade so you don’t have to. Transformers is now >=4.57,<5.7, with TRL >=0.24,<1.4, vLLM >=0.14,<0.21, veRL >=0.5,<0.8, and PEFT >=0.17,<0.20. KTO/GKD import paths, DPO preprocessing, warmup_ratio deprecation, and VL processor message handling are all handled internally.

  • New model configs — Qwen3.5 0.8B (full-finetune, LoRA, HF + vLLM inference), Qwen3-VL (2B / 4B / 8B / 30B-A3B), Qwen3 MoE (235B, 30B-A3B, 80B-A3B-Instruct) LoRA, GPT-OSS 120B LoRA multi-GPU, and a Llama 4 Scout LoRA refresh.

  • Judge framework upgrades — batch inference, token-usage accounting, and the ability to take pre-built Conversation objects.

  • Tool-use SFT fix — DataCollatorForCompletionOnlyLM now correctly masks tool-result tokens during completion-only training, so SFT on tool-use traces actually trains the assistant turn instead of the (already-known) tool output.

  • Inference quality of life — list_models() on every engine, finish_reason surfaced across engines, typed QuotaError for catchable rate-limit failures, retries on transient HTTP 400s, api_input attached to APIStatusError for debugging, and vllm_config_overrides for arbitrary kwargs.

  • Cleaner config errors — raw OmegaConf stack traces are replaced by a typed OumiConfigParsingError hierarchy with human-readable messages, and DatasetParams.finalize_and_validate now checks that dataset_path actually exists; a small catch example follows this list.

  • Launcher — start_at, end_at, and cost_per_hour on the base cluster, useful for time-boxed runs and pricing-aware cluster selection.
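For scripts that load configs programmatically, this means parse failures can be caught and reported cleanly. A minimal sketch; the broad except below is deliberate, since this post doesn't spell out which module OumiConfigParsingError lives in:

from oumi.core.configs import InferenceConfig

try:
    config = InferenceConfig.from_yaml("infer.yaml")
except Exception as err:  # in v0.8 this surfaces as a typed OumiConfigParsingError
    print(f"Config failed to parse: {err}")
    raise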


New Contributors

A huge welcome to our new contributors who helped make v0.8 possible:

  • @idoudali

  • @AnandVishesh1301

Thank you for your contributions!


Get Started with Oumi v0.8

Installation

# Core installation
pip install oumi

# With the MCP server
pip install "oumi[mcp]"

# With deployment providers
pip install oumi
# (no extra needed — providers are picked up at runtime)

# Per-cloud extras for cloud launches via SkyPilot
pip install "oumi[aws]"
pip install "oumi[gcp]"
pip install "oumi[azure]"
pip install "oumi[kubernetes]"

Documentation

Example Configs

Check out the example configs in the repository:

Full Changelog

For a complete list of changes, see the full changelog.


What’s Next?

We’re constantly improving Oumi based on your feedback. Have ideas or feature requests? Open an issue on GitHub or join our community discussions on Discord.

Happy training (and deploying)!

— The Oumi Team