Custom Evaluations#
With Oumi, custom evaluations are effortless and powerful, giving you complete control over how model performance is assessed. Whether you’re working with open- or closed-source models, setup is simple: just configure a few settings, no code changes required. Provide your dataset, select your models, and register an evaluation function tailored to the metrics that matter most to you, from accuracy and consistency to bias or domain-specific goals. Oumi handles the rest, including running inference, so you can focus on gaining insights, not managing infrastructure.
Custom Evaluations Step-by-Step#
Running a custom evaluation involves three simple steps. First, define the evaluation configuration using a YAML file. Next, register your custom evaluation function to compute the metrics that matter to you. Finally, execute the evaluation using Oumi’s Evaluator, which orchestrates the entire process.
Step 1: Defining Evaluation Configuration#
The evaluation configuration is defined in a YAML file and parsed into an EvaluationConfig object. Below is a simple example for evaluating GPT-4o. You can evaluate most open models (Llama, DeepSeek, Qwen, Phi, and others), closed models (Gemini, Claude, OpenAI), and cloud-hosted models (Vertex AI, Together, SambaNova, etc.) by simply updating the model_name and inference_engine fields. Example configurations for popular APIs are available in Oumi’s repo.
For custom evaluations, always set evaluation_backend to custom, and assign task_name to the name of your registered custom evaluation function (see Step 2). For more details on setting up the configuration file for evaluations, including evaluating custom models, refer to our documentation.
model:
  model_name: "gpt-4o"
  inference_engine: OPENAI

generation:
  max_new_tokens: 8192
  temperature: 0.0

tasks:
  - evaluation_backend: custom
    task_name: my_custom_evaluation
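For example, evaluating a hosted Claude model would only require changing those two fields. The exact model_name and the ANTHROPIC engine value below are assumptions for illustration; check the example configurations in Oumi’s repo for the values that match your setup.
model:
  model_name: "claude-3-5-sonnet-20240620"  # assumed model identifier
  inference_engine: ANTHROPIC               # assumed engine name; see Oumi’s example configs

generation:
  max_new_tokens: 8192
  temperature: 0.0

tasks:
  - evaluation_backend: custom
    task_name: my_custom_evaluation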
Step 2: Defining Custom Evaluation Function#
To define a custom evaluation function, simply register a Python function using the @register_evaluation_function decorator. Your function can optionally accept any of the reserved parameters below, depending on your needs:
- config (EvaluationConfig): The full evaluation configuration defined in Step 1. Include this if you need access to platform-level settings or variables.
- task_params (EvaluationTaskParams): Represents a specific evaluation task from the YAML file. If your configuration defines multiple tasks under tasks, this parameter will contain the metadata for the one currently being evaluated.
- inference_engine (BaseInferenceEngine): An automatically generated engine for the model specified in the evaluation configuration (by model_name). Use its infer() method to run inference on a list of examples formatted as Conversation.
- User-defined inputs (e.g., my_input): You may also include any number of additional parameters of any type. These are passed in during execution (see Step 3).
Your custom evaluation function is expected to return a dictionary where each key is a metric name and each value is the corresponding computed result.
from typing import Any

from oumi.core.configs import EvaluationConfig, EvaluationTaskParams
from oumi.core.inference import BaseInferenceEngine
from oumi.core.registry import register_evaluation_function


@register_evaluation_function("my_custom_evaluation")
def my_custom_evaluation(
    config: EvaluationConfig,
    task_params: EvaluationTaskParams,
    inference_engine: BaseInferenceEngine,
    my_input,
) -> dict[str, Any]:
    # Compute and return your metrics here; my_input is supplied at execution time (see Step 3).
    ...
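As a hedged illustration of how these parameters can work together, here is a toy function; the registered name response_count_evaluation, the dataset parameter, and the num_responses metric are all hypothetical.
# A toy sketch: assumes "dataset" is a list of Conversation objects supplied by
# the caller at execution time (Step 3); the metric name is illustrative only.
@register_evaluation_function("response_count_evaluation")
def response_count_evaluation(config, task_params, inference_engine, dataset):
    # config and task_params expose the YAML settings from Step 1
    # (e.g., config.model.model_name, task_params.task_name), should a metric need them.
    responses = inference_engine.infer(dataset)
    return {"num_responses": len(responses)}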
Step 3: Executing the Evaluation#
Once you have defined your YAML configuration and registered the custom evaluation function (as specified by the task_name in your configuration), you can run the evaluation using the code snippet below.
The Evaluator’s evaluate method requires the evaluation configuration (config of type EvaluationConfig) to be passed in. It also supports any number of user-defined variables passed as keyword arguments (e.g., my_input in the example below). These variable names must exactly match the parameters defined in your custom evaluation function’s signature. Otherwise, a runtime error will occur.
The evaluate method returns a list of EvaluationResult objects, one for each task defined in the tasks section of your YAML file. Each result includes the dictionary returned by the custom evaluation function (result.task_result), along with useful metadata such as result.start_time, result.elapsed_time_sec, and more.
from oumi.core.configs import EvaluationConfig
from oumi.core.evaluation import Evaluator
config = EvaluationConfig.from_yaml(<path/to/yaml/file>)
results = Evaluator().evaluate(config, my_input=<user_input>)
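As a quick sanity check, the sketch below prints these fields for each returned result; the loop and formatting are illustrative, while the field names come from the description above.
# Inspect each EvaluationResult returned by evaluate() (one per task in the YAML file).
for result in results:
    print(f"Started at:  {result.start_time}")
    print(f"Elapsed (s): {result.elapsed_time_sec}")
    print(f"Metrics:     {result.task_result}")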
Walk-through Example#
This section walks through a simple example to demonstrate how to use custom evaluations in practice. If you are interested in a more realistic walk-through, see our hallucination classifier notebook.
Suppose you want to assess response verbosity (i.e., the average length of model responses, measured in number of characters) across multiple models. To do this, assume you’ve prepared a dataset of user queries. A toy dataset (my_conversations) with two examples is shown below, formatted as a list of Conversation objects.
from oumi.core.types.conversation import Conversation, Message, Role

my_conversations = [
    Conversation(
        messages=[
            Message(role=Role.USER, content="Hello there!"),
        ]
    ),
    Conversation(
        messages=[
            Message(role=Role.USER, content="How are you?"),
        ]
    ),
]
Step 1: Defining the Evaluation Configuration#
Start by defining a YAML configuration for each model you want to evaluate. The configuration specifies the model, the inference engine, and a link to the custom evaluation function via the task_name.
gpt_4o_config = """
model:
  model_name: "gpt-4o"
  inference_engine: OPENAI

tasks:
  - evaluation_backend: custom
    task_name: model_verboseness_evaluation
"""
Step 2: Defining Custom Evaluation Function#
Next, define the evaluation function. Start by using the provided inference_engine to run inference and generate model responses. During inference, the engine appends a response (i.e., a Message with role Role.ASSISTANT) at the end of each conversation (type: Conversation) in the list conversations.
You can retrieve the model response from each Conversation using the last_message() method, then compute the average character length across all responses, as shown in the example below.
from oumi.core.registry import register_evaluation_function


@register_evaluation_function("model_verboseness_evaluation")
def model_verboseness_evaluation(inference_engine, conversations):
    # Run inference to generate the model responses.
    conversations = inference_engine.infer(conversations)

    aggregate_response_length = 0
    for conversation in conversations:
        # Extract the assistant's (model's) response from the conversation.
        response: str = conversation.last_message().content

        # Update the sum of lengths for all model responses.
        aggregate_response_length += len(response)

    return {"average_response_length": aggregate_response_length / len(conversations)}
Step 3: Executing the Evaluation#
Finally, run the evaluation using the code snippet below. This will execute inference and compute the verbosity metric based on your custom evaluation function. Note that conversations is a user-defined variable, intended to pass the dataset into the evaluation function.
from oumi.core.configs import EvaluationConfig
from oumi.core.evaluation import Evaluator
config = EvaluationConfig.from_str(gpt_4o_config)
results = Evaluator().evaluate(config, conversations=my_conversations)
The average response length can be retrieved from results as shown below. Since this walkthrough assumes a single task (defined in the tasks section of the YAML config), we only examine the first ([0]) item in the results list.
result_dict = results[0].get_results()
print(f"Average length: {result_dict['average_response_length']}")