Evaluation Configuration

Oumi allows users to define their evaluation configurations through a YAML file, providing a flexible, human-readable, and easily customizable format for setting up experiments. A YAML file lets users configure model and generation parameters and define the list of tasks to evaluate on. This approach streamlines evaluation setup and ensures that configurations are easily versioned, shared, and reproduced across different environments and teams.
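For example, a minimal configuration needs only a model and a list of tasks. Here is a small sketch, assuming all remaining fields fall back to their defaults (the model and task names are illustrative):

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu

output_dir: "my_evaluation_results"

The file can then be passed to the evaluation entry point, e.g. oumi evaluate -c eval_config.yaml.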

Configuration Structure

The configuration YAML file is loaded into the EvaluationConfig class and consists of model parameters (ModelParams), a list of tasks (EvaluationTaskParams), and generation parameters (GenerationParams). If an evaluation benchmark is generative, meaning that model responses must first be generated (inferred) and then evaluated by a judge, you can also set inference_engine (InferenceEngineType) for local inference or inference_remote_params (RemoteParams) for remote inference.
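For example, a generative benchmark such as AlpacaEval can run its inference step against a remote endpoint rather than locally. The snippet below is a sketch of such a setup; the endpoint URL is a placeholder, and the exact field names under inference_remote_params (api_url, api_key_env_varname) are assumptions that should be verified against the RemoteParams reference:

inference_engine: REMOTE
inference_remote_params:
  api_url: "https://my-endpoint.example.com/v1/chat/completions"  # placeholder endpoint
  api_key_env_varname: "MY_API_KEY"  # env variable holding the API key (assumed field name)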

Here’s an advanced configuration example, showing many of the available parameters:

model:
  model_name: "microsoft/Phi-3-mini-4k-instruct"
  trust_remote_code: True
  adapter_model: "path/to/adapter"  # Optional: For adapter-based models

tasks:
  # LM Harness Tasks
  - evaluation_platform: lm_harness
    task_name: mmlu
    num_samples: 100
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: arc_challenge
    eval_kwargs:
      num_fewshot: 25
  - evaluation_platform: lm_harness
    task_name: hellaswag
    eval_kwargs:
      num_fewshot: 10

  # AlpacaEval Task
  - evaluation_platform: alpaca_eval
    version: 2.0  # or 1.0
    num_samples: 805

generation:
  batch_size: 16         # Number of prompts processed per inference batch
  max_new_tokens: 512    # Cap on generated tokens per response
  temperature: 0.0       # 0.0 yields greedy (deterministic) decoding

inference_engine: NATIVE  # Local engine used to generate responses for generative tasks

output_dir: "my_evaluation_results"
enable_wandb: true
run_name: "phi3-evaluation"

Configuration Options

  • model: Model-specific configuration (ModelParams)

    • model_name: HuggingFace model identifier or local path

    • trust_remote_code: Whether to trust remote code (for custom models)

    • adapter_model: Path to adapter weights (optional)

    • adapter_type: Type of adapter (“lora” or “qlora”)

    • shard_for_eval: Enable multi-GPU parallelization on a single node (see the sketch after this list)

  • tasks: List of evaluation tasks (EvaluationTaskParams)

    • LM Harness Task Parameters: (LMHarnessTaskParams)

      • evaluation_platform: “lm_harness”

      • task_name: Name of the LM Harness task

      • num_fewshot: Number of few-shot examples (0 for zero-shot)

      • num_samples: Number of samples to evaluate

      • eval_kwargs: Additional task-specific parameters

    • AlpacaEval Task Parameters: (AlpacaEvalTaskParams)

      • evaluation_platform: “alpaca_eval”

      • version: AlpacaEval version (1.0 or 2.0)

      • num_samples: Number of samples to evaluate

      • eval_kwargs: Additional task-specific parameters

  • generation: Generation parameters (GenerationParams)

    • batch_size: Batch size for inference (“auto” for automatic selection)

    • max_new_tokens: Maximum number of tokens to generate

    • temperature: Sampling temperature

  • inference_engine: Inference engine for local inference (InferenceEngineType)

  • inference_remote_params: Inference parameters for remote inference (RemoteParams)

  • enable_wandb: Enable Weights & Biases logging

  • output_dir: Directory for saving results

  • run_name: Name of the evaluation run
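
As noted under shard_for_eval above, adapter evaluation and single-node multi-GPU sharding can be combined in the model section. A brief sketch (the base model name and adapter path are placeholders):

model:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
  adapter_model: "path/to/lora/adapter"           # placeholder adapter path
  adapter_type: "lora"
  shard_for_eval: True                            # shard the model across the node's GPUs

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu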