Evaluation#
Overview#
Oumi offers a flexible and unified framework designed to assess and benchmark Large Language Models (LLMs) and Vision Language Models (VLMs). The framework allows researchers, developers, and organizations to easily evaluate the performance of their models across a variety of benchmarks, compare results, and track progress in a standardized and reproducible way.
Key features include:
- Seamless Setup: Single-step installation of all packages and dependencies, ensuring a quick and conflict-free setup.
- Consistency: The platform ensures deterministic execution and reproducible results by automatically logging and versioning all environment parameters and experimental configurations.
- Diversity: Oumi offers a wide range of benchmarks across domains, enabling comprehensive evaluation of LLMs on tasks ranging from natural language understanding to creative text generation and providing a holistic assessment across real-world applications.
- Scalability: Supports multi-GPU and multi-node evaluations, along with the ability to shard large models across multiple GPUs/nodes. Incorporates batch-processing optimizations to manage memory constraints and ensure efficient resource utilization.
- Multimodality: Designed with multiple modalities in mind, Oumi already supports evaluation on joint image-text inputs, assessing VLMs on cross-modal reasoning tasks where visual and linguistic data are inherently linked.
Oumi seamlessly integrates with leading evaluation platforms such as LM Evaluation Harness and AlpacaEval, with MT-Bench integration in progress.
Benchmark Types#
Type | Description | When to Use
---|---|---
Standardized Benchmarks | Assess model knowledge and reasoning capabilities through structured questions with predefined answers | Ideal for measuring factual knowledge, reasoning capabilities, and performance on established text-based and multi-modal benchmarks
Open-Ended Generation | Evaluate a model's ability to respond effectively to open-ended questions | Best for assessing instruction-following capabilities and response quality
LLM as Judge | Automated assessment using LLMs | Suitable for automated evaluation of response quality against predefined criteria (helpfulness, honesty, safety) or custom criteria
Quick Start#
Using the CLI#
The simplest way to evaluate a model is to author a YAML configuration and call the Oumi CLI:
configs/recipes/phi3/evaluation/eval.yaml
# Eval config for Phi3.
#
# Requirements:
# - Log into WandB (`wandb login`) or disable `enable_wandb`
#
# Usage:
# oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html
# - Config class: oumi.core.configs.EvaluationConfig
# - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/evaluation_config.py
# - Other eval configs: configs/**/evaluation/
model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True
  shard_for_eval: True

# HuggingFace Leaderboard V1
tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: arc_challenge
    eval_kwargs:
      num_fewshot: 25
  - evaluation_platform: lm_harness
    task_name: winogrande
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: hellaswag
    eval_kwargs:
      num_fewshot: 10
  - evaluation_platform: lm_harness
    task_name: truthfulqa_mc2
    eval_kwargs:
      num_fewshot: 0
  - evaluation_platform: lm_harness
    task_name: gsm8k
    eval_kwargs:
      num_fewshot: 5

generation:
  batch_size: 1

enable_wandb: False
oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml
To run evaluation with multiple GPUs, see Multi-GPU Evaluation.
Using the Python API#
For more programmatic control, you can use the Python API to load an EvaluationConfig and run the evaluation:
from oumi import evaluate
from oumi.core.configs import EvaluationConfig
# Load configuration from YAML
config = EvaluationConfig.from_yaml("configs/recipes/phi3/evaluation/eval.yaml")
# Run evaluation
evaluate(config)
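Because the loaded configuration is a regular Python object, you can also adjust individual fields before running. The snippet below is a minimal sketch; it assumes the config object exposes an output_dir attribute mirroring the YAML key used elsewhere in this guide.

import os

from oumi import evaluate
from oumi.core.configs import EvaluationConfig

# Load the base configuration from YAML.
config = EvaluationConfig.from_yaml("configs/recipes/phi3/evaluation/eval.yaml")

# Override a field before running (assumes the attribute mirrors the
# `output_dir` YAML key shown in the Configuration File section below).
config.output_dir = os.path.join("my_evaluation_results", "phi3")

# Run the evaluation with the modified configuration.
evaluate(config)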
Configuration File#
A minimal evaluation configuration file looks as follows. The model_name can be a HuggingFace model name or a local path to a model. For more details on configuration settings, please visit the evaluation configuration page.
model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu

output_dir: "my_evaluation_results"
Multi-GPU Evaluation#
Multiple GPUs can be used to speed up evaluation and to evaluate larger models that do not fit on a single GPU. Parallelization is enabled with the shard_for_eval: True configuration parameter.
model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True
  shard_for_eval: True

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu

output_dir: "my_evaluation_results"
With shard_for_eval: True, it is recommended to launch the evaluation with accelerate:
oumi distributed accelerate launch -m oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml
Note
Only single-node, multi-GPU machine configurations are currently supported; multi-node evaluation is not yet available.
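Before launching a sharded run, it can be useful to confirm that more than one GPU is actually visible on the node. The short check below uses PyTorch directly and is independent of Oumi; it is only a convenience sketch.

import torch

# shard_for_eval only adds parallelism when more than one GPU is visible.
num_gpus = torch.cuda.device_count()
print(f"Visible GPUs: {num_gpus}")
if num_gpus < 2:
    print("Only one GPU detected; sharded evaluation will not add parallelism.")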
Results and Logging#
The evaluation outputs are saved under the specified output_dir, in a folder named <platform>_<timestamp>. This folder includes the evaluation results and all metadata required to reproduce the results.
Evaluation Results#
The results file contains a dictionary with all evaluation metrics relevant to the benchmark, together with the execution duration and the date/time of execution.
Schema
{
  "results": {
    <benchmark_name>: {
      <metric_1>: <metric_1_value>,
      <metric_2>: <metric_2_value>,
      ...
    },
  },
  "duration_sec": <execution_duration>,
  "start_time": <start_date_and_time>,
}
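Once a run has finished, the metrics can be loaded back into Python for further analysis. The snippet below is only a sketch: it assumes the results are written as JSON files inside the <platform>_<timestamp> folder described above (the exact file name is not specified here), and it simply scans the most recent run folder for any file with a top-level "results" key.

import glob
import json
import os

# Folder layout described above: <output_dir>/<platform>_<timestamp>/
output_dir = "my_evaluation_results"
run_dirs = sorted(glob.glob(os.path.join(output_dir, "*_*")))
latest_run = run_dirs[-1]  # assumes timestamps sort lexicographically

# Scan the run folder for JSON files that follow the schema above.
for json_path in glob.glob(os.path.join(latest_run, "*.json")):
    with open(json_path) as f:
        data = json.load(f)
    if "results" in data:
        for benchmark, metrics in data["results"].items():
            print(benchmark, metrics)
        print("duration_sec:", data.get("duration_sec"))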
Reproducibility Metadata#
To ensure that evaluations are fully reproducible, Oumi automatically logs all input configurations and environment parameters, as listed below. These files provide a complete and traceable record of each evaluation, enabling users to reliably replicate results and ensuring consistency and transparency throughout the evaluation lifecycle.
- Evaluation task parameters
- Model parameters
- Generation parameters
- Inference configuration (for generative benchmarks)
- Package version information (a flat dictionary of all installed packages and their versions)
Weights & Biases#
To enhance experiment tracking and result visualization, Oumi integrates with Weights and Biases (Wandb), a leading tool for managing machine learning workflows. Wandb enables users to monitor and log metrics, hyperparameters, and model outputs in real time, providing detailed insights into model performance throughout the evaluation process. When enable_wandb is set, Wandb results are automatically logged, allowing users to track experiments with greater transparency and easily visualize trends across multiple runs. This integration streamlines comparing models, identifying optimal configurations, and maintaining an organized, collaborative record of all evaluation activities.
To ensure Wandb results are logged:

1. Enable Wandb in the configuration file:

enable_wandb: true

2. Ensure the environment variable WANDB_PROJECT points to your project name:

os.environ["WANDB_PROJECT"] = "my-evaluation-project"
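Putting the two steps together, a minimal sketch using the from_yaml and evaluate calls shown earlier; it assumes the YAML already sets enable_wandb: true.

import os

from oumi import evaluate
from oumi.core.configs import EvaluationConfig

# Point Wandb at the right project before launching the evaluation.
os.environ["WANDB_PROJECT"] = "my-evaluation-project"

# The YAML is expected to contain `enable_wandb: true` so that metrics are
# logged to Wandb automatically during evaluation.
config = EvaluationConfig.from_yaml("configs/recipes/phi3/evaluation/eval.yaml")
evaluate(config)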