Evaluation#

Overview#

Oumi offers a flexible and unified framework designed to assess and benchmark Large Language Models (LLMs) and Vision Language Models (VLMs). The framework allows researchers, developers, and organizations to easily evaluate the performance of their models across a variety of benchmarks, compare results, and track progress in a standardized and reproducible way.

Key features include:

  • Seamless Setup: Single-step installation of all packages and dependencies for a quick, conflict-free setup.

  • Consistency: The platform ensures deterministic execution and reproducible results by automatically logging and versioning all environment parameters and experimental configurations.

  • Diversity: Offers a wide range of benchmarks across domains, from natural language understanding to creative text generation, enabling a comprehensive and holistic assessment of LLMs across real-world applications.

  • Scalability: Supports multi-GPU evaluation, including sharding large models across multiple GPUs, and incorporates batch-processing optimizations to manage memory constraints and ensure efficient resource utilization.

  • Multimodality: Designed with multiple modalities in mind, Oumi already supports evaluation on joint image-text inputs, assessing VLMs on cross-modal reasoning tasks where visual and linguistic data are inherently linked.

Oumi seamlessly integrates with leading platforms such as LM Evaluation Harness, AlpacaEval, and MT-Bench (work in progress).

Benchmark Types#

| Type | Description | When to Use | Get Started |
|------|-------------|-------------|-------------|
| Standardized Benchmarks | Assess model knowledge and reasoning capability through structured questions with predefined answers | Ideal for measuring factual knowledge, reasoning capabilities, and performance on established text-based and multi-modal benchmarks | See the Standardized benchmarks page |
| Open-Ended Generation | Evaluate the model's ability to effectively respond to open-ended questions | Best for assessing instruction-following capabilities and response quality | See the Generative benchmarks page |
| LLM as Judge | Automated assessment using LLMs | Suitable for automated evaluation of response quality against predefined (helpfulness, honesty, safety) or custom criteria | See the Judge documentation |

Quick Start#

Using the CLI#

The simplest way to evaluate a model is to author a YAML configuration and call the Oumi CLI:

configs/recipes/phi3/evaluation/eval.yaml
# Eval config for Phi3.
#
# Requirements:
#   - Log into WandB (`wandb login`) or disable `enable_wandb`
#
# Usage:
#   oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml
#
# See Also:
#   - Documentation: https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html
#   - Config class: oumi.core.configs.EvaluationConfig
#   - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/evaluation_config.py
#   - Other eval configs: configs/**/evaluation/

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True
  shard_for_eval: True

# HuggingFace Leaderboard V1
tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: arc_challenge
    eval_kwargs:
      num_fewshot: 25
  - evaluation_platform: lm_harness
    task_name: winogrande
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: hellaswag
    eval_kwargs:
      num_fewshot: 10
  - evaluation_platform: lm_harness
    task_name: truthfulqa_mc2
    eval_kwargs:
      num_fewshot: 0
  - evaluation_platform: lm_harness
    task_name: gsm8k
    eval_kwargs:
      num_fewshot: 5

generation:
  batch_size: 1

enable_wandb: False
Then launch the evaluation:

oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml

To run evaluation with multiple GPUs, see Multi-GPU Evaluation.

Using the Python API#

For more programmatic control, you can use the Python API to load an EvaluationConfig from YAML and run the evaluation:

from oumi import evaluate
from oumi.core.configs import EvaluationConfig

# Load configuration from YAML
config = EvaluationConfig.from_yaml("configs/recipes/phi3/evaluation/eval.yaml")

# Run evaluation
evaluate(config)
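
You can also build the configuration in code rather than loading it from YAML. The sketch below instantiates the config classes referenced later in this guide (ModelParams, EvaluationTaskParams); it assumes their fields mirror the YAML keys shown above, so treat it as illustrative rather than definitive:

from oumi import evaluate
from oumi.core.configs import EvaluationConfig, EvaluationTaskParams, ModelParams

# A sketch of building the evaluation config programmatically; field names
# are assumed to mirror the YAML keys used elsewhere in this guide.
config = EvaluationConfig(
    model=ModelParams(
        model_name="microsoft/Phi-3-mini-4k-instruct",
        trust_remote_code=True,
    ),
    tasks=[
        EvaluationTaskParams(
            evaluation_platform="lm_harness",
            task_name="mmlu",
            eval_kwargs={"num_fewshot": 5},
        )
    ],
    output_dir="my_evaluation_results",
)

# Run evaluation with the in-code config
evaluate(config)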

Configuration File#

A minimal evaluation configuration file looks as follows. The model_name can be a HuggingFace model name or a local path to a model. For more details on configuration settings, please visit the evaluation configuration page.

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu

output_dir: "my_evaluation_results"

Multi-GPU Evaluation#

Multiple GPUs can be used to speed up evaluation and to evaluate larger models that do not fit on a single GPU. Parallelization is enabled with the shard_for_eval: True configuration parameter.

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True
  shard_for_eval: True

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu

output_dir: "my_evaluation_results"

With shard_for_eval: True, it is recommended to launch the evaluation with accelerate:

oumi distributed accelerate launch -m oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml

Note

Only single-node, multi-GPU configurations are currently supported, i.e., multi-node evaluation is not yet available.

Results and Logging#

The evaluation outputs are saved under the specified output_dir, in a folder named <platform>_<timestamp>. This folder includes the evaluation results and all metadata required to reproduce the results.

Evaluation Results#

| File | Description |
|------|-------------|
| platform_results.json | A dictionary that contains all evaluation metrics relevant to the benchmark, together with the execution duration and the date/time of execution |

Schema

{
  "results": {
    <benchmark_name>: {
      <metric_1>: <metric_1_value>,
      <metric_2>: <metric_2_value>,
      etc.
    },
  },
  "duration_sec": <execution_duration>,
  "start_time": <start_date_and_time>,
}
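
For illustration, the results file can be read back with standard JSON tooling. The snippet below is a minimal sketch that assumes the output_dir value and the <platform>_<timestamp> folder naming described above:

import glob
import json
import os

# Locate the most recent <platform>_<timestamp> folder under output_dir
# (sorted by folder name, i.e. by timestamp suffix) and read its metrics.
output_dir = "my_evaluation_results"
run_dirs = sorted(glob.glob(os.path.join(output_dir, "*_*")))
latest_run = run_dirs[-1]

with open(os.path.join(latest_run, "platform_results.json")) as f:
    results = json.load(f)

print(f"Started: {results['start_time']}, took {results['duration_sec']} sec")
for benchmark, metrics in results["results"].items():
    print(benchmark, metrics)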

Reproducibility Metadata#

To ensure that evaluations are fully reproducible, Oumi automatically logs all input configurations and environment parameters, as shown below. These files provide a complete and traceable record of each evaluation, enabling users to reliably replicate results and ensuring consistency and transparency throughout the evaluation lifecycle.

| File | Description | Reference |
|------|-------------|-----------|
| task_params.json | Evaluation task parameters | EvaluationTaskParams |
| model_params.json | Model parameters | ModelParams |
| generation_params.json | Generation parameters | GenerationParams |
| inference_config.json | Inference configuration (for generative benchmarks) | InferenceConfig |
| package_versions.json | Package version information | N/A (flat dictionary of all installed packages and their versions) |

Weights & Biases#

To enhance experiment tracking and result visualization, Oumi integrates with Weights and Biases (Wandb), a leading tool for managing machine learning workflows. Wandb enables users to monitor and log metrics, hyperparameters, and model outputs in real time, providing detailed insights into model performance throughout the evaluation process. When enable_wandb is set, Wandb results are automatically logged, empowering users to track experiments with greater transparency, and easily visualize trends across multiple runs. This integration streamlines the process of comparing models, identifying optimal configurations, and maintaining an organized, collaborative record of all evaluation activities.

To ensure Wandb results are logged:

  • Set enable_wandb in your evaluation config:

enable_wandb: true

  • Ensure the environment variable WANDB_PROJECT points to your project name:

os.environ["WANDB_PROJECT"] = "my-evaluation-project"
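
Putting both steps together in a Python workflow (a sketch; the project name and config path are illustrative):

import os

from oumi import evaluate
from oumi.core.configs import EvaluationConfig

# Point Wandb at your project before running the evaluation.
os.environ["WANDB_PROJECT"] = "my-evaluation-project"

# The loaded config must also set `enable_wandb: True`.
config = EvaluationConfig.from_yaml("configs/recipes/phi3/evaluation/eval.yaml")
evaluate(config)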