oumi.core.evaluation#

Core evaluator module for the Oumi library.

This module provides an evaluator for assessing models with popular evaluation libraries, such as LM Harness and AlpacaEval. It also allows users to define their own custom evaluation functions. The evaluator is designed to be modular and to provide a consistent interface for evaluation across different tasks.

Example

>>> from oumi.core.configs import EvaluationConfig
>>> from oumi.core.evaluation import Evaluator
>>> config = EvaluationConfig.from_yaml("evaluation_config.yaml")
>>> evaluator = Evaluator()
>>> results = evaluator.evaluate(config)
class oumi.core.evaluation.EvaluationResult(task_name: str | None = None, task_result: dict[str, Any] | None = None, backend_config: dict[str, Any] | None = None, start_time: str | None = None, elapsed_time_sec: int | None = None)[source]#

Bases: object

Class that retains the evaluation results generated by all backends.

Variables:
  • task_name – The name of the task on which the model was evaluated.

  • task_result – The result of evaluating on the task. This is a dictionary where the keys are metric names and the values are their corresponding values. The captured metrics vary based on the backend and the specific task.

  • backend_config – The configuration of the backend. This is a dictionary including configuration parameters that are specific to the backend used to evaluate on the task. They are retained for reproducibility.

  • start_time – A human-friendly string (recommended format is: YYYYMMDD_HHMMSS) that indicates the date and time when the evaluation started.

  • elapsed_time_sec – The duration to complete the evaluation (in seconds).

to_dict() dict[str, Any][source]#

Convert the EvaluationResult to a dictionary.
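
A minimal sketch of constructing an EvaluationResult by hand and serializing it with to_dict(); the task name and metric values below are illustrative placeholders, not output from a real run.

>>> from oumi.core.evaluation import EvaluationResult
>>> result = EvaluationResult(
...     task_name="mmlu",
...     task_result={"acc": 0.71, "acc_stderr": 0.01},
...     backend_config={"backend": "lm_harness"},
...     start_time="20240101_120000",
...     elapsed_time_sec=342,
... )
>>> result.to_dict()  # plain dict, convenient for JSON serialization or logging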

class oumi.core.evaluation.Evaluator[source]#

Bases: object

A class for evaluating language models on various tasks.

Currently, the evaluator supports a wide range of tasks that are handled by three separate backends: LM Harness, Alpaca Eval, and Custom.

  • LM Harness: A framework by EleutherAI for evaluating language models, mostly on standardized benchmarks (multiple-choice, word match, etc.). The backend supports a large number of popular benchmarks, which can be found at EleutherAI/lm-evaluation-harness.

  • Alpaca Eval: A framework for evaluating the instruction-following capabilities of language models, as well as whether their responses are helpful, accurate, and relevant. The instruction set consists of 805 open-ended questions, while the evaluation is based on “LLM-as-judge” and prioritizes human alignment, aiming to assess whether the model responses meet the expectations of human evaluators.

  • Custom: Users can register their own evaluation functions using the decorator @register_evaluation_function and run custom evaluations based on these functions; see the sketch after this list. Note that task_name should be the registry key of the custom evaluation function to be used.
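
A minimal sketch of registering a custom evaluation function. The import path, the decorator argument, and the function signature are assumptions for illustration; only the decorator name and the requirement that task_name match the registry key come from this reference.

>>> from oumi.core.registry import register_evaluation_function  # assumed import path
>>> @register_evaluation_function("my_custom_eval")  # assumed to accept the registry key
... def my_custom_eval(task_params, config, **kwargs):
...     # Assumed signature; return a dict mapping metric names to values.
...     return {"accuracy": 0.5}

The custom evaluation can then be selected by setting task_name to "my_custom_eval" in the evaluation configuration.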

evaluate(config: EvaluationConfig, **kwargs) list[EvaluationResult][source]#

Evaluates a model using the provided evaluation configuration.

Parameters:
  • config – The desired configuration for evaluation.

  • kwargs – Additional keyword arguments required by evaluator backends.

Returns:

List of evaluation results (one per task, in the same order as the tasks).
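
Continuing the module-level example above, the returned list can be iterated to inspect per-task metrics:

>>> results = evaluator.evaluate(config)
>>> for result in results:
...     print(result.task_name, result.task_result)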

evaluate_task(task_params: EvaluationTaskParams, config: EvaluationConfig, **kwargs) EvaluationResult[source]#

Evaluates a model using the provided configuration on a specific task.

Parameters:
  • task_params – The task parameters for evaluation.

  • config – The desired configuration for evaluation.

  • kwargs – Additional keyword arguments required by evaluator backends.

Returns:

The results for evaluating on the task.
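
A minimal sketch of evaluating a single task. The import path of EvaluationTaskParams and the constructor argument shown are assumptions; the task name is a placeholder.

>>> from oumi.core.configs import EvaluationTaskParams  # assumed import path
>>> task_params = EvaluationTaskParams(task_name="mmlu")  # assumed constructor argument
>>> result = evaluator.evaluate_task(task_params, config)
>>> result.task_result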

save_output(task_params: EvaluationTaskParams, evaluation_result: EvaluationResult, base_output_dir: str, config: EvaluationConfig | None) None[source]#

Saves the evaluation’s output to the specified output directory.

Parameters:
  • task_params – The task parameters used for this evaluation.

  • evaluation_result – The evaluation result.

  • base_output_dir – The directory where the evaluation results will be saved.

  • config – The evaluation configuration.

Returns:

None
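
For illustration, persisting a result obtained from evaluate_task; the output directory is a placeholder.

>>> evaluator.save_output(
...     task_params=task_params,
...     evaluation_result=result,
...     base_output_dir="output/evals",
...     config=config,
... )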

Subpackages#