Inference CLI#
Overview#
The Oumi CLI provides a simple interface for running inference tasks. The main command is oumi infer,
which supports both interactive chat and batch processing modes. Interactive mode lets you send text inputs
directly from your terminal, while batch mode processes a JSONL file of conversations.
To use the CLI you need an InferenceConfig. This config specifies which model and inference engine you're using,
as well as any relevant inference-time parameters; see Inference Configuration for more details.
See also
Check out our Infer CLI definition to see the full list of command line options.
Basic Usage#
# Interactive chat
oumi infer -i -c config.yaml
# Process input file
oumi infer -c config.yaml --input_path input.jsonl --output_path output.jsonl
Command Options#
Option | Description | Default | Example
---|---|---|---
-c, --config | Configuration file path | Required | -c config.yaml
-i, --interactive | Enable interactive mode | False | -i
--input_path | Input JSONL file path | None | --input_path input.jsonl
--output_path | Output JSONL file path | None | --output_path output.jsonl
 | GPU device(s) | "cuda" |
 | Model name | None |
 | Random seed | None |
 | Logging level | INFO |
Configuration File#
Example config.yaml:
model:
  model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct"
  model_kwargs:
    device_map: "auto"
    torch_dtype: "float16"

generation:
  max_new_tokens: 100
  temperature: 0.7
  top_p: 0.9
  batch_size: 1

engine: "VLLM"
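The same config can also be driven from Python rather than the shell. The sketch below is illustrative only: it assumes the InferenceConfig class mentioned above exposes a from_yaml loader and that the package exports a top-level infer entry point, so check the Oumi API reference for the exact names.
# Minimal sketch of programmatic inference with the config above.
# Assumption: InferenceConfig.from_yaml() and a top-level oumi.infer() exist
# as shown here; verify the exact import paths in the Oumi API reference.
from oumi import infer
from oumi.core.configs import InferenceConfig

config = InferenceConfig.from_yaml("config.yaml")
conversations = infer(config)  # input/output behavior is taken from the config
for conversation in conversations:
    print(conversation)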
Common Usage Patterns#
Interactive Chat#
# Basic chat
oumi infer -i -c configs/chat.yaml
# Chat with specific GPU
oumi infer -i -c configs/chat.yaml --model.device_map cuda:0
Batch Processing#
# Process dataset
oumi infer -c configs/batch.yaml \
--input_path dataset.jsonl \
--output_path results.jsonl \
--generation.batch_size 32
Multi-GPU Inference#
# Use specific GPUs
oumi infer -c configs/multi_gpu.yaml \
--model.device_map "cuda:0,cuda:1"
# Tensor parallel inference
oumi infer -c configs/multi_gpu.yaml \
--model.model_kwargs.tensor_parallel_size 4
Input/Output Formats#
Input JSONL#
{"messages": [{"role": "user", "content": "Hello!"}]}
{"messages": [{"role": "user", "content": "How are you?"}]}
Output JSONL#
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi!"}]}
{"messages": [{"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I'm good!"}]}
See Also#
Inference Configuration for config file options
Common Workflows for usage examples