oumi.core.evaluation.backends#

Submodules#

oumi.core.evaluation.backends.alpaca_eval module#

oumi.core.evaluation.backends.alpaca_eval.evaluate(task_params: AlpacaEvalTaskParams, config: EvaluationConfig) EvaluationResult[source]#

Evaluates a model using the Alpaca Eval framework.

For detailed documentation on the AlpacaEval framework, see the tatsu-lab/alpaca_eval README.

Parameters:
  • task_params – The AlpacaEval parameters to use for evaluation.

  • config – The desired configuration for evaluation.

Returns:

The evaluation result (including metrics and their values).
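
A minimal usage sketch for this backend is shown below. The import paths for AlpacaEvalTaskParams and EvaluationConfig, and the default constructions used, are assumptions; in practice these objects are usually populated from an Oumi evaluation config rather than built by hand.

```python
# Sketch only: import locations and constructor arguments are assumptions
# about the oumi API and may differ between versions.
from oumi.core.configs import EvaluationConfig
from oumi.core.configs.params.evaluation_params import AlpacaEvalTaskParams

from oumi.core.evaluation.backends.alpaca_eval import evaluate

# Hypothetical, default-constructed objects for illustration; a real run
# would specify the model under evaluation, the judge, and generation settings.
task_params = AlpacaEvalTaskParams()
config = EvaluationConfig()

result = evaluate(task_params=task_params, config=config)
# `result` is an EvaluationResult containing the computed metrics and their values.
```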

oumi.core.evaluation.backends.lm_harness module#

oumi.core.evaluation.backends.lm_harness.evaluate(task_params: LMHarnessTaskParams, config: EvaluationConfig, random_seed: int | None = 0, numpy_random_seed: int | None = 1234, torch_random_seed: int | None = 1234) EvaluationResult[source]#

Evaluates a model using the LM Evaluation Harness framework (EleutherAI).

For detailed documentation, see the EleutherAI/lm-evaluation-harness README.

Parameters:
  • task_params – The LM Harness parameters to use for evaluation.

  • config – The evaluation configuration.

  • random_seed – The random seed to use for Python's random module.

  • numpy_random_seed – The numpy random seed to use for reproducibility.

  • torch_random_seed – The torch random seed to use for reproducibility.

Note on the random seeds (random_seed, numpy_random_seed, torch_random_seed):

The default values are chosen to be consistent with LM Harness' simple_evaluate(). See: lm-evaluation-harness/blob/main/lm_eval/evaluator.py

Returns:

The evaluation result (including metric names and their corresponding values).
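
A comparable sketch for the LM Harness backend follows. The task parameter field and import paths shown are assumptions; only the evaluate() signature and its documented seed defaults are taken from the reference above.

```python
# Sketch only: import locations and the task_name field are assumptions about
# the oumi API and may differ between versions.
from oumi.core.configs import EvaluationConfig
from oumi.core.configs.params.evaluation_params import LMHarnessTaskParams

from oumi.core.evaluation.backends.lm_harness import evaluate

# Hypothetical task selection; a real config also specifies the model to evaluate.
task_params = LMHarnessTaskParams(task_name="mmlu")
config = EvaluationConfig()

# The seed arguments below repeat the documented defaults, which mirror
# LM Harness' simple_evaluate(); override them only if you need a different
# reproducibility setup.
result = evaluate(
    task_params=task_params,
    config=config,
    random_seed=0,
    numpy_random_seed=1234,
    torch_random_seed=1234,
)
# `result` is an EvaluationResult with metric names and their corresponding values.
```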