Evaluation#

Overview#

Oumi offers a flexible and unified framework designed to assess and benchmark Large Language Models (LLMs) and Vision Language Models (VLMs). The framework allows researchers, developers, and organizations to easily evaluate the performance of their models across a variety of benchmarks, compare results, and track progress in a standardized and reproducible way.

Key features include:

  • Seamless Setup: Single-step installation of all packages and dependencies for a quick, conflict-free setup.

  • Consistency: The platform ensures deterministic execution and reproducible results by automatically logging and versioning all environment parameters and experimental configurations.

  • Diversity: Offers a wide range of benchmarks across domains, from natural language understanding to creative text generation, enabling a comprehensive and holistic assessment of LLMs across real-world applications.

  • Scalability: Supports multi-GPU evaluation, including sharding large models across multiple GPUs, and incorporates batch-processing optimizations to manage memory constraints and ensure efficient resource utilization.

  • Multimodality: Designed with multiple modalities in mind, Oumi already supports evaluation on joint image-text inputs, assessing VLMs on cross-modal reasoning tasks where visual and linguistic data are inherently linked.

Oumi seamlessly integrates with leading platforms such as LM Evaluation Harness, AlpacaEval, and MT-Bench (work in progress).

Benchmark Types#

| Type | Description | When to Use | Get Started |
|------|-------------|-------------|-------------|
| Standardized Benchmarks | Assess model knowledge and reasoning capability through structured questions with predefined answers | Ideal for measuring factual knowledge, reasoning capabilities, and performance on established text-based and multi-modal benchmarks | See the Standardized benchmarks page |
| Open-Ended Generation | Evaluate the model's ability to effectively respond to open-ended questions | Best for assessing instruction-following capabilities and response quality | See the Generative benchmarks page |
| LLM as Judge | Automated assessment using LLMs | Suitable for automated evaluation of response quality against predefined (helpfulness, honesty, safety) or custom criteria | See the Judge documentation |

Quick Start#

Using the CLI#

The simplest way to evaluate a model is to author a YAML configuration and call the Oumi CLI:

configs/recipes/phi3/evaluation/eval.yaml
# Eval config for Phi3.
#
# Requirements:
#   - Log into WandB (`wandb login`) or disable `enable_wandb`
#
# Usage:
#   oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml
#
# See Also:
#   - Documentation: https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html
#   - Config class: oumi.core.configs.EvaluationConfig
#   - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/evaluation_config.py
#   - Other eval configs: configs/**/evaluation/

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True
  shard_for_eval: True

# HuggingFace Leaderboard V1
tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: arc_challenge
    eval_kwargs:
      num_fewshot: 25
  - evaluation_platform: lm_harness
    task_name: winogrande
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: hellaswag
    eval_kwargs:
      num_fewshot: 10
  - evaluation_platform: lm_harness
    task_name: truthfulqa_mc2
    eval_kwargs:
      num_fewshot: 0
  - evaluation_platform: lm_harness
    task_name: gsm8k
    eval_kwargs:
      num_fewshot: 5

generation:
  batch_size: 1

enable_wandb: False
Then launch the evaluation:

oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml

To run evaluation with multiple GPUs, see Multi-GPU Evaluation.

Using the Python API#

For more programmatic control, you can use the Python API to load an EvaluationConfig from YAML and run the evaluation:

from oumi import evaluate
from oumi.core.configs import EvaluationConfig

# Load configuration from YAML
config = EvaluationConfig.from_yaml("configs/recipes/phi3/evaluation/eval.yaml")

# Run evaluation
evaluate(config)
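
You can also build the configuration in code rather than loading it from YAML. The sketch below instantiates the config classes referenced later in this guide (ModelParams, EvaluationTaskParams); it assumes their fields mirror the YAML keys shown above, so treat it as illustrative rather than definitive:

from oumi import evaluate
from oumi.core.configs import EvaluationConfig, EvaluationTaskParams, ModelParams

# A sketch of building the evaluation config programmatically; field names
# are assumed to mirror the YAML keys used elsewhere in this guide.
config = EvaluationConfig(
    model=ModelParams(
        model_name="microsoft/Phi-3-mini-4k-instruct",
        trust_remote_code=True,
    ),
    tasks=[
        EvaluationTaskParams(
            evaluation_platform="lm_harness",
            task_name="mmlu",
            eval_kwargs={"num_fewshot": 5},
        )
    ],
    output_dir="my_evaluation_results",
)

# Run evaluation with the in-code config
evaluate(config)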

Configuration File#

A minimal evaluation configuration file looks as follows. The model_name can be a HuggingFace model name or a local path to a model. For more details on configuration settings, please visit the evaluation configuration page.

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu

output_dir: "my_evaluation_results"

Multi-GPU Evaluation#

Multiple GPUs can be used to speed up evaluation and to evaluate larger models that do not fit on a single GPU. Parallelization is enabled with the shard_for_eval: True configuration parameter.

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True
  shard_for_eval: True

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu

output_dir: "my_evaluation_results"

With shard_for_eval: True, it is recommended to launch the evaluation with accelerate:

oumi distributed accelerate launch -m oumi evaluate -c configs/recipes/phi3/evaluation/eval.yaml

Note

Only single-node, multi-GPU configurations are currently supported, i.e., multi-node evaluation is not yet available.

Results and Logging#

The evaluation outputs are saved under the specified output_dir, in a folder named <platform>_<timestamp>. This folder includes the evaluation results and all metadata required to reproduce the results.

Evaluation Results#

| File | Description |
|------|-------------|
| platform_results.json | A dictionary that contains all evaluation metrics relevant to the benchmark, together with the execution duration and the date/time of execution |

Schema

{
  "results": {
    <benchmark_name>: {
      <metric_1>: <metric_1_value>,
      <metric_2>: <metric_2_value>,
      etc.
    },
  },
  "duration_sec": <execution_duration>,
  "start_time": <start_date_and_time>,
}
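
For illustration, the results file can be read back with standard JSON tooling. The snippet below is a minimal sketch that assumes the output_dir value and the <platform>_<timestamp> folder naming described above:

import glob
import json
import os

# Locate the most recent <platform>_<timestamp> folder under output_dir
# (sorted by folder name, i.e. by timestamp suffix) and read its metrics.
output_dir = "my_evaluation_results"
run_dirs = sorted(glob.glob(os.path.join(output_dir, "*_*")))
latest_run = run_dirs[-1]

with open(os.path.join(latest_run, "platform_results.json")) as f:
    results = json.load(f)

print(f"Started: {results['start_time']}, took {results['duration_sec']} sec")
for benchmark, metrics in results["results"].items():
    print(benchmark, metrics)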

Reproducibility Metadata#

To ensure that evaluations are fully reproducible, Oumi automatically logs all input configurations and environment parameters, as shown below. These files provide a complete and traceable record of each evaluation, enabling users to reliably replicate results and ensuring consistency and transparency throughout the evaluation lifecycle.

| File | Description | Reference |
|------|-------------|-----------|
| task_params.json | Evaluation task parameters | EvaluationTaskParams |
| model_params.json | Model parameters | ModelParams |
| generation_params.json | Generation parameters | GenerationParams |
| inference_config.json | Inference configuration (for generative benchmarks) | InferenceConfig |
| package_versions.json | Package version information | N/A (flat dictionary of all installed packages and their versions) |

Weights & Biases#

To enhance experiment tracking and result visualization, Oumi integrates with Weights and Biases (Wandb), a leading tool for managing machine learning workflows. Wandb enables users to monitor and log metrics, hyperparameters, and model outputs in real time, providing detailed insights into model performance throughout the evaluation process. When enable_wandb is set, Wandb results are automatically logged, empowering users to track experiments with greater transparency, and easily visualize trends across multiple runs. This integration streamlines the process of comparing models, identifying optimal configurations, and maintaining an organized, collaborative record of all evaluation activities.

To ensure Wandb results are logged:

  • Set enable_wandb in your evaluation config:

enable_wandb: true

  • Ensure the environment variable WANDB_PROJECT points to your project name:

os.environ["WANDB_PROJECT"] = "my-evaluation-project"
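
Putting both steps together in a Python workflow (a sketch; the project name and config path are illustrative):

import os

from oumi import evaluate
from oumi.core.configs import EvaluationConfig

# Point Wandb at your project before running the evaluation.
os.environ["WANDB_PROJECT"] = "my-evaluation-project"

# The loaded config must also set `enable_wandb: True`.
config = EvaluationConfig.from_yaml("configs/recipes/phi3/evaluation/eval.yaml")
evaluate(config)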