Inference CLI#

Overview#

The Oumi CLI provides a simple interface for running inference tasks. The main command is oumi infer, which supports both interactive chat and batch processing. Interactive mode lets you send text inputs directly from your terminal, while batch mode processes a JSONL file of conversations.

To use the CLI you need an InferenceConfig. This config specifies which model and inference engine to use, as well as any relevant inference-time parameters; see Inference Configuration for more details.

See also

Check out our Infer CLI definition to see the full list of command line options.
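
The same list should also be printed in your terminal via the standard help flag:

# Show all options for the infer command
oumi infer --help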

Basic Usage#

# Interactive chat
oumi infer -i -c config.yaml

# Process input file
oumi infer -c config.yaml --input_path input.jsonl --output_path output.jsonl

Command Options#

| Option | Description | Default | Example |
| --- | --- | --- | --- |
| -c, --config | Configuration file path | Required | -c config.yaml |
| -i, --interactive | Enable interactive mode | False | -i |
| --input_path | Input JSONL file path | None | --input_path data.jsonl |
| --output_path | Output JSONL file path | None | --output_path results.jsonl |
| --model.device_map | GPU device(s) | "cuda" | --model.device_map "cuda:0" |
| --model.model_name | Model name | None | --model.model_name "HuggingFaceTB/SmolLM2-135M-Instruct" |
| --generation.seed | Random seed | None | --generation.seed 42 |
| --log-level | Logging level | INFO | --log-level DEBUG |
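
The dotted options (--model.*, --generation.*) map onto fields of the InferenceConfig and take precedence over the values in the config file. Several overrides can be combined in one call; the model name here is illustrative:

# Override the model, seed, and log level from the command line
oumi infer -i -c config.yaml \
  --model.model_name "HuggingFaceTB/SmolLM2-135M-Instruct" \
  --generation.seed 42 \
  --log-level DEBUG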

Configuration File#

Example config.yaml:

model:
  model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct"
  model_kwargs:
    device_map: "auto"
    torch_dtype: "float16"

generation:
  max_new_tokens: 100
  temperature: 0.7
  top_p: 0.9
  batch_size: 1

engine: "VLLM"
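
With this file saved as config.yaml, starting an interactive session is a single command. Any field in the file can still be adjusted at the command line via the dotted options above; the temperature override below follows that pattern but is an assumed path, not one listed in the table:

# Chat with the configured Llama model, lowering the sampling temperature
oumi infer -i -c config.yaml --generation.temperature 0.2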

Common Usage Patterns#

Interactive Chat#

# Basic chat
oumi infer -i -c configs/chat.yaml

# Chat with specific GPU
oumi infer -i -c configs/chat.yaml --model.device_map cuda:0

Batch Processing#

# Process dataset
oumi infer -c configs/batch.yaml \
  --input_path dataset.jsonl \
  --output_path results.jsonl \
  --generation.batch_size 32

Multi-GPU Inference#

# Use specific GPUs
oumi infer -c configs/multi_gpu.yaml \
  --model.device_map "cuda:0,cuda:1"

# Tensor parallel inference
oumi infer -c configs/multi_gpu.yaml \
  --model.model_kwargs.tensor_parallel_size 4

Input/Output Formats#

Input JSONL#

{"messages": [{"role": "user", "content": "Hello!"}]}
{"messages": [{"role": "user", "content": "How are you?"}]}

Output JSONL#

{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi!"}]}
{"messages": [{"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I'm good!"}]}

See Also#