oumi.core.evaluation.backends#
Submodules#
oumi.core.evaluation.backends.alpaca_eval module#
- oumi.core.evaluation.backends.alpaca_eval.evaluate(task_params: AlpacaEvalTaskParams, config: EvaluationConfig) EvaluationResult [source]#
Evaluates a model using the AlpacaEval framework.
For detailed documentation on the AlpacaEval framework, see the tatsu-lab/alpaca_eval README.
- Parameters:
task_params – The AlpacaEval parameters to use for evaluation.
config – The evaluation configuration.
- Returns:
The evaluation result (including metrics and their values).
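A minimal usage sketch follows. Only the evaluate() signature above is documented here, so the import paths for the params classes and the constructor fields shown (e.g. num_samples, model, model_name) are illustrative assumptions rather than a verified recipe.

```python
# Sketch: invoking the AlpacaEval backend directly.
# NOTE: the params-class import paths and constructor fields below are assumptions;
# only evaluate(task_params, config) -> EvaluationResult is documented above.
from oumi.core.configs import EvaluationConfig, ModelParams
from oumi.core.configs.params.evaluation_params import AlpacaEvalTaskParams
from oumi.core.evaluation.backends.alpaca_eval import evaluate

task_params = AlpacaEvalTaskParams(
    num_samples=5,  # assumed field: limit the number of evaluated prompts
)
config = EvaluationConfig(
    model=ModelParams(model_name="microsoft/Phi-3-mini-4k-instruct"),  # assumed fields
)

# Returns an EvaluationResult containing the computed metrics and their values.
result = evaluate(task_params=task_params, config=config)
print(result)
```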
oumi.core.evaluation.backends.lm_harness module#
- oumi.core.evaluation.backends.lm_harness.evaluate(task_params: LMHarnessTaskParams, config: EvaluationConfig, random_seed: int | None = 0, numpy_random_seed: int | None = 1234, torch_random_seed: int | None = 1234) EvaluationResult [source]#
Evaluates a model using the LM Evaluation Harness framework (EleutherAI).
For detailed documentation, see the EleutherAI/lm-evaluation-harness README.
- Parameters:
task_params – The LM Harness parameters to use for evaluation.
config – The evaluation configuration.
random_seed – The random seed to use for Python’s random module.
numpy_random_seed – The numpy random seed to use for reproducibility.
torch_random_seed – The torch random seed to use for reproducibility.
- Note on random seeds (random_seed, numpy_random_seed, torch_random_seed):
The default values match those used by LM Harness’ simple_evaluate(). See: lm-evaluation-harness/blob/main/lm_eval/evaluator.py
- Returns:
The evaluation result (including metric names and their corresponding values).
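A minimal usage sketch follows. As above, the params-class import paths and constructor fields (e.g. task_name, model, model_name) are assumptions; only the evaluate() signature and the seed defaults are taken from the documentation above.

```python
# Sketch: invoking the LM Harness backend directly with explicit seeds.
# NOTE: the params-class import paths and constructor fields below are assumptions;
# only evaluate()'s signature and its seed defaults come from the docs above.
from oumi.core.configs import EvaluationConfig, ModelParams
from oumi.core.configs.params.evaluation_params import LMHarnessTaskParams
from oumi.core.evaluation.backends.lm_harness import evaluate

task_params = LMHarnessTaskParams(
    task_name="mmlu",  # assumed field: the LM Harness task to run
)
config = EvaluationConfig(
    model=ModelParams(model_name="gpt2"),  # assumed fields
)

# The seed values below are the documented defaults (consistent with
# LM Harness' simple_evaluate()); override them only if you need a
# different reproducibility setup.
result = evaluate(
    task_params=task_params,
    config=config,
    random_seed=0,
    numpy_random_seed=1234,
    torch_random_seed=1234,
)
print(result)
```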