Inference Engines#
Oumi’s inference API provides a unified interface for multiple inference engines through the InferenceEngine class.
In this guide, we’ll go through each supported engine, what they are best for, and how to get started using them.
Introduction#
Before digging into specific engines, let’s look at the basic patterns for initializing both local and remote inference engines.
These patterns will be consistent across all engine types, making it easy to switch between them as your needs change.
Local Inference
Let’s start with a basic example of how to use the VLLMInferenceEngine to run inference on a local model.
from oumi.inference import VLLMInferenceEngine
from oumi.core.configs import ModelParams
# Local inference with vLLM
engine = VLLMInferenceEngine(
    ModelParams(
        model_name="meta-llama/Llama-3.2-1B-Instruct",
    )
)
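Once the engine is constructed, inference is run by passing it a list of conversations. The types and method used below (Conversation, Message, and Role from oumi.core.types, and the engine's infer() method) are a minimal sketch based on recent Oumi releases and may differ slightly across versions:
from oumi.core.types import Conversation, Message, Role

# Build a single-turn conversation and run it through the engine.
conversation = Conversation(
    messages=[Message(role=Role.USER, content="What is the capital of France?")]
)
results = engine.infer([conversation])

# Each returned conversation contains the model's reply as its final message.
print(results[0].messages[-1].content)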
Using the CLI
You can also specify configuration in YAML, and use the CLI to run inference:
oumi infer --engine VLLM --model.model_name meta-llama/Llama-3.2-1B-Instruct
Check out the Inference CLI for more information on using the CLI.
Cloud APIs
Remote inference engines (i.e., API-based) require a RemoteParams object to be passed in. The RemoteParams object contains the API URL and any necessary API keys. For example, here is how to use Claude 3.5 Sonnet:
from oumi.inference import AnthropicInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = AnthropicInferenceEngine(
    model_params=ModelParams(
        model_name="claude-3-5-sonnet-20240620",
    ),
    remote_params=RemoteParams(
        api_key_env_varname="ANTHROPIC_API_KEY",
    )
)
Supported Parameters
Each inference engine supports a different set of parameters (for example, different generation parameters, or specific model kwargs).
Make sure to check the Inference Configuration for an exhaustive list of supported parameters, and the reference page for the specific engine you are using to find the parameters it supports.
For example, the supported parameters for the VLLMInferenceEngine can be found in get_supported_params().
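You can also query an engine instance directly. The snippet below is a minimal sketch assuming get_supported_params() is callable on the instance and returns the names of the generation parameters that engine honors:
from oumi.inference import VLLMInferenceEngine
from oumi.core.configs import ModelParams

engine = VLLMInferenceEngine(
    ModelParams(model_name="meta-llama/Llama-3.2-1B-Instruct")
)

# Print the generation parameters supported by this engine.
print(engine.get_supported_params())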
Local Inference#
This next section covers setting up and optimizing local inference engines for running models directly on your machine, whether that machine is a laptop or a multi-GPU server.
Local inference is ideal for running your own fine-tuned models, and in general for development, testing, and scenarios where you need complete control over your inference environment.
Hardware Recommendations#
The following tables provide a rough estimate of the memory requirements for different model sizes using both BF16 and Q4 quantization.
The actual memory requirements might vary based on the specific quantization implementation and additional optimizations used.
Also note that Q4 quantization typically comes with some degradation in model quality, though the impact varies by model architecture and task.
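These estimates come from a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter (2 bytes for BF16/FP16, roughly 0.5 bytes for Q4), with the KV cache and activations adding overhead on top. A minimal sketch of that calculation:
def estimate_weight_memory_gb(num_params_billions: float, bytes_per_param: float) -> float:
    # Memory for the weights alone, in GB; KV cache and activations add overhead on top.
    return num_params_billions * bytes_per_param

# A 7B model: ~14 GB in BF16 (2 bytes/param), ~3.5 GB in Q4 (~0.5 bytes/param).
print(estimate_weight_memory_gb(7, 2.0))  # 14.0
print(estimate_weight_memory_gb(7, 0.5))  # 3.5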
BF16 / FP16 (16-bit)

| Model Size | GPU VRAM | Notes |
|---|---|---|
| 1B | ~2 GB | Can run on most modern GPUs |
| 3B | ~6 GB | Can run on mid-range GPUs |
| 7B | ~14 GB | Can run on consumer GPUs like RTX 3090 or RX 7900 XTX |
| 13B | ~26 GB | Requires high-end GPU or multiple GPUs |
| 33B | ~66 GB | Requires enterprise GPUs or multi-GPU setup |
| 70B | ~140 GB | Typically requires multiple A100s or H100s |
Q4 (4-bit)

| Model Size | GPU VRAM | Notes |
|---|---|---|
| 1B | ~0.5 GB | Can run on most integrated GPUs |
| 3B | ~1.5 GB | Can run on entry-level GPUs |
| 7B | ~3.5 GB | Can run on most gaming GPUs |
| 13B | ~6.5 GB | Can run on mid-range GPUs |
| 33B | ~16.5 GB | Can run on high-end consumer GPUs |
| 70B | ~35 GB | Can run on professional GPUs |
vLLM Engine#
vLLM is a high-performance inference engine that implements state-of-the-art serving techniques like PagedAttention for optimal memory usage and throughput.
vLLM is our recommended choice for production deployments on GPUs.
Installation
First, make sure to install the vLLM package:
pip install vllm
Basic Usage
engine = VLLMInferenceEngine(
    ModelParams(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
    )
)
Tensor Parallel Inference
For multi-GPU setups, you can leverage tensor parallelism:
# Tensor parallel inference
model_params = ModelParams(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    model_kwargs={
        "tensor_parallel_size": 2,      # Set to the number of GPUs
        "gpu_memory_utilization": 1.0,  # Fraction of GPU memory to use
        "enable_prefix_caching": True,  # Enable prefix caching
    }
)
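The tensor-parallel model_params are then passed to the engine exactly as in the basic example. Generation settings such as max_new_tokens and temperature live in a separate GenerationParams object; the sketch below assumes the GenerationParams and InferenceConfig classes from oumi.core.configs and the infer() method described in the Introduction:
from oumi.inference import VLLMInferenceEngine
from oumi.core.configs import GenerationParams, InferenceConfig
from oumi.core.types import Conversation, Message, Role

engine = VLLMInferenceEngine(model_params)

# Generation settings are supplied separately from the model settings.
inference_config = InferenceConfig(
    model=model_params,
    generation=GenerationParams(max_new_tokens=256, temperature=0.7),
)
conversations = [
    Conversation(messages=[Message(role=Role.USER, content="Explain PagedAttention in one paragraph.")])
]
results = engine.infer(conversations, inference_config=inference_config)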
LlamaCPP Engine#
For scenarios where GPU resources are limited or unavailable, the LlamaCPP engine provides an excellent alternative.
Built on the highly optimized llama.cpp library, this engine excels at CPU inference and quantized models, making it particularly suitable for edge deployment and resource-constrained environments, and allowing capable models to run even on modest hardware. LlamaCPP is a great choice for CPU inference and for inference with quantized models.
Installation
pip install llama-cpp-python
Basic Usage
from oumi.inference import LlamaCppInferenceEngine
from oumi.core.configs import ModelParams

engine = LlamaCppInferenceEngine(
    ModelParams(
        model_name="model.gguf",
        model_kwargs={
            "n_gpu_layers": 0,  # CPU only
            "n_ctx": 2048,      # Context window
            "n_batch": 512,     # Batch size
            "low_vram": True    # Memory optimization
        }
    )
)
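The model_name here points to a local GGUF file. One common way to obtain one is to download a quantized checkpoint from the Hugging Face Hub; the repository and filename below are illustrative placeholders, so substitute the model you actually want:
from huggingface_hub import hf_hub_download

from oumi.core.configs import ModelParams
from oumi.inference import LlamaCppInferenceEngine

# Download a quantized GGUF file (illustrative repo/filename).
gguf_path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf",
)

engine = LlamaCppInferenceEngine(ModelParams(model_name=gguf_path))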
Native Engine#
The Native engine uses HuggingFace’s 🤗 Transformers library directly, providing maximum compatibility and ease of use.
While it may not offer the same performance optimizations as vLLM or LlamaCPP, its simplicity and compatibility make it an excellent choice for prototyping and testing.
Basic Usage
from oumi.inference import NativeTextInferenceEngine
from oumi.core.configs import ModelParams

engine = NativeTextInferenceEngine(
    ModelParams(
        model_name="meta-llama/Llama-3.2-1B-Instruct",
        model_kwargs={
            "device_map": "auto",
            "torch_dtype": "float16"
        }
    )
)
4-bit Quantization
For memory-constrained environments, 4-bit quantization is available:
model_params = ModelParams(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    model_kwargs={
        "load_in_4bit": True,  # Requires the bitsandbytes package
    }
)
Remote vLLM#
vLLM can be deployed as a server, providing high-performance inference capabilities over HTTP. This section covers different deployment scenarios and configurations.
Server Setup#
Basic Server - Suitable for development and testing:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 6864
Multi-GPU Server - For large models requiring multiple GPUs:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--port 6864 \
--tensor-parallel-size 4
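Before wiring up the client, you can confirm the server is reachable. The vLLM OpenAI-compatible server exposes a /v1/models endpoint that lists the models it is serving; here is a quick check using the requests library:
import requests

# List the models served by the vLLM server started above.
response = requests.get("http://localhost:6864/v1/models", timeout=10)
response.raise_for_status()
print(response.json())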
Client Configuration#
The client can be configured with different reliability and performance options similar to any other remote engine:
from oumi.inference import RemoteVLLMInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams

# Basic client with retry and concurrency settings
engine = RemoteVLLMInferenceEngine(
    model_params=ModelParams(
        model_name="meta-llama/Llama-3.1-8B-Instruct"
    ),
    remote_params=RemoteParams(
        api_url="http://localhost:6864",
        max_retries=3,   # Maximum number of retries per request
        num_workers=10,  # Number of parallel request threads
    )
)
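With num_workers set, the client fans requests out across parallel worker threads, which pays off when processing many conversations at once. A sketch of batch inference, using the same (assumed) infer() API as in the Introduction:
from oumi.core.types import Conversation, Message, Role

# Independent conversations are dispatched in parallel across the worker threads.
conversations = [
    Conversation(messages=[Message(role=Role.USER, content=f"Write a haiku about request #{i}.")])
    for i in range(100)
]
results = engine.infer(conversations)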
Remote SGLang#
SGLang is another model server that provides high-performance LLM inference.
Server Setup#
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--port 6864 \
--disable-cuda-graph \
--mem-fraction-static=0.99
Please refer to the SGLang documentation for more advanced configuration options.
Client Configuration#
The client can be configured with different reliability and performance options, similar to any other remote engine:
from oumi.inference import SGLangInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams

engine = SGLangInferenceEngine(
    model_params=ModelParams(
        model_name="meta-llama/Llama-3.1-8B-Instruct"
    ),
    remote_params=RemoteParams(
        api_url="http://localhost:6864"
    )
)
To run inference interactively, use the oumi infer command with the -i flag.
oumi infer -c configs/recipes/llama3_1/inference/8b_sglang_infer.yaml -i
Cloud APIs#
While local inference offers control and flexibility, cloud APIs provide access to state-of-the-art models and scalable infrastructure without the need to manage your own hardware.
Anthropic#
Claude is Anthropic’s advanced language model, available through their API.
Basic Usage
from oumi.inference import AnthropicInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = AnthropicInferenceEngine(
    model_params=ModelParams(
        model_name="claude-3-5-sonnet-20240620"
    ),
    remote_params=RemoteParams(
        api_key_env_varname="ANTHROPIC_API_KEY",
    )
)
Supported Models
The Anthropic models available via this API as of late Jan’2025 are listed below. For an up-to-date list, please visit this page.
| Anthropic Model | API Model Name |
|---|---|
| Claude 3.5 Sonnet (most intelligent) | claude-3-5-sonnet-latest |
| Claude 3.5 Haiku (fastest) | claude-3-5-haiku-latest |
| Claude 3.0 Opus | claude-3-opus-latest |
| Claude 3.0 Sonnet | claude-3-sonnet-20240229 |
| Claude 3.0 Haiku | claude-3-haiku-20240307 |
Google Cloud#
Google Cloud provides multiple pathways for accessing their AI models, either through the Vertex AI platform or directly via the Gemini API.
Vertex AI#
Installation
pip install "oumi[gcp]"
Basic Usage
from oumi.inference import GoogleVertexInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = GoogleVertexInferenceEngine(
    model_params=ModelParams(
        model_name="google/gemini-1.5-pro"
    ),
    remote_params=RemoteParams(
        api_url="https://{region}-aiplatform.googleapis.com/v1beta1/projects/{project_id}/locations/{region}/endpoints/openapi/chat/completions",
    )
)
Supported Models
The most popular Google Vertex AI models available via this API (as of late Jan’2025) are listed below. For a full list, including specialized and 3rd party models, please visit this page.
| Gemini Model | API Model Name |
|---|---|
| Gemini 2.0 Flash Thinking Mode | google/gemini-2.0-flash-thinking-exp-01-21 |
| Gemini 2.0 Flash | google/gemini-2.0-flash-exp |
| Gemini 1.5 Flash | google/gemini-1.5-flash-002 |
| Gemini 1.5 Pro | google/gemini-1.5-pro-002 |
| Gemini 1.0 Pro Vision | google/gemini-1.0-pro-vision-001 |

| Gemma Model | API Model Name |
|---|---|
| Gemma 2 2B IT | google/gemma2-2b-it |
| Gemma 2 9B IT | google/gemma2-9b-it |
| Gemma 2 27B IT | google/gemma2-27b-it |
| Code Gemma 2B | google/codegemma-2b |
| Code Gemma 7B | google/codegemma-7b |
| Code Gemma 7B IT | google/codegemma-7b-it |
Resources
Vertex AI Documentation for Google Cloud AI services
Gemini API#
Basic Usage
from oumi.inference import GoogleGeminiInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = GoogleGeminiInferenceEngine(
    model_params=ModelParams(
        model_name="gemini-1.5-flash"
    ),
    remote_params=RemoteParams(
        api_key_env_varname="GEMINI_API_KEY",
    )
)
Supported Models
The Gemini models available via this API as of late Jan’2025 are listed below. For an up-to-date list, please visit this page.
| Model Name | API Model Name |
|---|---|
| Gemini 2.0 Flash (experimental) | gemini-2.0-flash-exp |
| Gemini 1.5 Flash | gemini-1.5-flash |
| Gemini 1.5 Flash-8B | gemini-1.5-flash-8b |
| Gemini 1.5 Pro | gemini-1.5-pro |
| Gemini 1.0 Pro (deprecated) | gemini-1.0-pro |
| AQA | aqa |
Resources
Gemini API Documentation for Gemini API details
OpenAI#
OpenAI’s models, including GPT-4, represent some of the most widely used and capable AI systems available.
Basic Usage
from oumi.inference import OpenAIInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = OpenAIInferenceEngine(
    model_params=ModelParams(
        model_name="gpt-4o-mini"
    ),
    remote_params=RemoteParams(
        api_key_env_varname="OPENAI_API_KEY",
    )
)
Supported Models
The most popular models available via the OpenAI API as of late Jan’2025 are listed below. For a full list, please visit this page.
| OpenAI Model | API Model Name |
|---|---|
| GPT 4o (flagship model) | gpt-4o |
| GPT 4o mini (fast and affordable) | gpt-4o-mini |
| o1 (reasoning model) | o1 |
| o1 mini (reasoning and affordable) | o1-mini |
| GPT-4 Turbo | gpt-4-turbo |
| GPT-4 | gpt-4 |
Resources
OpenAI API Documentation for OpenAI API details
Together#
Together offers remote inference for 100+ models through serverless endpoints.
Basic Usage
from oumi.inference import TogetherInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = TogetherInferenceEngine(
    model_params=ModelParams(
        model_name="meta-llama/Llama-3.2-1B-Instruct"
    ),
    remote_params=RemoteParams(
        api_key_env_varname="TOGETHER_API_KEY",
    )
)
The models available via this API can be found at together.ai.
DeepSeek#
The DeepSeek engine provides access to DeepSeek's models (Chat, Code, and Reasoning) through the DeepSeek AI Platform.
Basic Usage
from oumi.inference import DeepSeekInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = DeepSeekInferenceEngine(
    model_params=ModelParams(
        model_name="deepseek-chat"
    ),
    remote_params=RemoteParams(
        api_key_env_varname="DEEPSEEK_API_KEY",
    )
)
Supported Models
The DeepSeek models available via this API as of late Jan’2025 are listed below. For an up-to-date list, please visit this page.
| DeepSeek Model | API Model Name |
|---|---|
| DeepSeek-V3 | deepseek-chat |
| DeepSeek-R1 (reasoning with CoT) | deepseek-reasoner |
SambaNova#
SambaNova offers an extremely fast cloud inference platform with a wide variety of models.
This service is particularly useful when you need to run open source models in a managed environment.
Basic Usage
from oumi.inference import SambanovaInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = SambanovaInferenceEngine(
    model_params=ModelParams(
        model_name="Meta-Llama-3.1-405B-Instruct"
    ),
    remote_params=RemoteParams(
        api_key_env_varname="SAMBANOVA_API_KEY",
    )
)
Parasail.io#
Parasail.io offers a cloud-native inference platform that combines the flexibility of self-hosted models with the convenience of cloud infrastructure.
This service is particularly useful when you need to run open source models in a managed environment.
Basic Usage
Here’s how to configure Oumi for Parasail.io:
from oumi.inference import ParasailInferenceEngine
from oumi.core.configs import ModelParams, RemoteParams
engine = ParasailInferenceEngine(
    model_params=ModelParams(
        model_name="meta-llama/Llama-3.2-1B-Instruct"
    ),
    remote_params=RemoteParams(
        api_key_env_varname="PARASAIL_API_KEY",
    )
)
The models available via this API can be found at docs.parasail.io.
See Also#
Configuration Guide for detailed config options
Common Workflows for usage examples