Inference CLI#
Overview#
The Oumi CLI provides a simple interface for running inference tasks. The main command is oumi infer,
which supports both interactive chat and batch processing modes. Interactive mode lets you send text inputs
directly from your terminal, while batch mode lets you submit a JSONL file of conversations for batch processing.
To use the CLI you need an InferenceConfig. This config specifies which model and inference engine
you're using, as well as any relevant inference-time parameters; see Inference Configuration for more details.
See also
Check out our Infer CLI definition for the full list of command-line options.
Basic Usage#
# Interactive chat
oumi infer -i -c config.yaml
# Process input file
oumi infer -c config.yaml --input_path input.jsonl --output_path output.jsonl
Command Options#
| Option | Description | Default | Example |
|---|---|---|---|
| `-c, --config` | Configuration file path | Required | `-c config.yaml` |
| `-i, --interactive` | Enable interactive mode | False | `-i` |
| `--input_path` | Input JSONL file path | None | `--input_path input.jsonl` |
| `--output_path` | Output JSONL file path | None | `--output_path output.jsonl` |
| `--model.device_map` | GPU device(s) | "cuda" | `--model.device_map cuda:0` |
| `--model.model_name` | Model name | None | `--model.model_name meta-llama/Llama-3.1-8B-Instruct` |
| `--generation.seed` | Random seed | None | `--generation.seed 42` |
| `--log-level` | Logging level | INFO | `--log-level DEBUG` |
Configuration File#
Example config.yaml:
model:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"
  model_kwargs:
    device_map: "auto"
    torch_dtype: "float16"

generation:
  max_new_tokens: 100
  temperature: 0.7
  top_p: 0.9
  batch_size: 1

engine: "VLLM"
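Config values can also be overridden directly on the command line using dotted argument paths, as the usage patterns below illustrate. For example, a quick sketch of adjusting generation settings at launch time (the override values here are illustrative):
# Override generation parameters from the command line
oumi infer -i -c config.yaml \
  --generation.max_new_tokens 256 \
  --generation.temperature 0.2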
Common Usage Patterns#
Interactive Chat#
# Basic chat
oumi infer -i -c configs/chat.yaml
# Chat with specific GPU
oumi infer -i -c configs/chat.yaml --model.device_map cuda:0
Batch Processing#
# Process dataset
oumi infer -c configs/batch.yaml \
  --input_path dataset.jsonl \
  --output_path results.jsonl \
  --generation.batch_size 32
Multi-GPU Inference#
# Use specific GPUs
oumi infer -c configs/multi_gpu.yaml \
  --model.device_map "cuda:0,cuda:1"
# Tensor parallel inference
oumi infer -c configs/multi_gpu.yaml \
  --model.model_kwargs.tensor_parallel_size 4
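For reference, a minimal sketch of what a multi-GPU config such as configs/multi_gpu.yaml might contain when using the vLLM engine (the model name and parallelism degree here are illustrative assumptions, not required values):
model:
  # Assumed example model; substitute the model you intend to serve
  model_name: "meta-llama/Llama-3.1-70B-Instruct"
  model_kwargs:
    tensor_parallel_size: 4

generation:
  max_new_tokens: 100
  batch_size: 1

engine: "VLLM"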
Input/Output Formats#
Input JSONL#
{"messages": [{"role": "user", "content": "Hello!"}]}
{"messages": [{"role": "user", "content": "How are you?"}]}
Output JSONL#
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi!"}]}
{"messages": [{"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I'm good!"}]}
See Also#
Inference Configuration for config file options
Common Workflows for usage examples