oumi.judges_v2

This module provides access to various judge configurations for the Oumi project.

Judges evaluate the quality of AI-generated responses against criteria such as helpfulness, honesty, and safety.

class oumi.judges_v2.BaseJudge(prompt_template: str, system_instruction: str | None, example_field_values: list[dict[str, str]], response_format: JudgeResponseFormat, output_fields: list[JudgeOutputField], inference_engine: BaseInferenceEngine)

Bases: object

Base class for implementing judges that evaluate model outputs.

A judge takes structured inputs, formats them using a prompt template, runs inference to get judgments, and parses the results into structured outputs.

judge(inputs: list[dict[str, str]]) → list[JudgeOutput]

Evaluate a batch of inputs and return structured judgments.

Parameters:

inputs – List of dictionaries containing input data for evaluation. Each dict must contain values for all prompt_template placeholders.

Returns:

List of structured judge outputs with parsed results

Raises:

ValueError – If inference returns an unexpected number of conversations
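
The flow described for judge (fill the prompt template from each input, run inference, check the result count) can be illustrated without Oumi itself. The sketch below is not the library's implementation: it assumes `{name}`-style placeholders filled with `str.format`, and `placeholders`, `build_prompts`, `fake_inference`, and `judge_batch` are hypothetical helpers introduced only for this example.

```python
import string


def placeholders(template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def build_prompts(template: str, inputs: list[dict[str, str]]) -> list[str]:
    """Format one prompt per input, rejecting inputs that lack a placeholder."""
    required = placeholders(template)
    prompts = []
    for index, item in enumerate(inputs):
        missing = required - set(item)
        if missing:
            raise ValueError(f"Input {index} is missing: {sorted(missing)}")
        prompts.append(template.format(**item))
    return prompts


def fake_inference(prompts: list[str]) -> list[str]:
    """Hypothetical stand-in for an inference engine: one reply per prompt."""
    return ["<judgment>Yes</judgment>" for _ in prompts]


def judge_batch(template: str, inputs: list[dict[str, str]]) -> list[str]:
    """Format the inputs, run (fake) inference, and check the reply count."""
    prompts = build_prompts(template, inputs)
    replies = fake_inference(prompts)
    if len(replies) != len(prompts):
        raise ValueError("Inference returned an unexpected number of replies")
    return replies
```

In this sketch, an input that omits a placeholder key fails before any inference call is made, which mirrors the requirement that each dict supply values for all prompt_template placeholders.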

class oumi.judges_v2.JudgeOutput(*, raw_output: str, parsed_output: dict[str, str] = {}, output_fields: list[JudgeOutputField] | None = None, field_values: dict[str, float | int | str | bool | None] = {}, field_scores: dict[str, float | None] = {}, response_format: JudgeResponseFormat | None = None)

Bases: BaseModel

Represents the output from a judge evaluation.

Variables:
  • raw_output (str) – The original unprocessed output from the judge

  • parsed_output (dict[str, str]) – Structured data (fields & their values) extracted from raw output

  • output_fields (list[oumi.judges_v2.base_judge.JudgeOutputField] | None) – List of expected output fields for this judge

  • field_values (dict[str, float | int | str | bool | None]) – Typed values for each expected output field

  • field_scores (dict[str, float | None]) – Numeric scores for each expected output field (if applicable)

  • response_format (oumi.core.configs.params.judge_params.JudgeResponseFormat | None) – Format used for generating output (XML, JSON, or RAW)

field_scores: dict[str, float | None]
field_values: dict[str, float | int | str | bool | None]
classmethod from_raw_output(raw_output: str, response_format: JudgeResponseFormat, output_fields: list[JudgeOutputField]) → Self

Generate a structured judge output from a raw model output.
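
As a rough illustration of what extracting fields from a raw model output can involve, the sketch below parses XML-tagged, JSON, and raw text into a `dict[str, str]`. It is an assumption-laden stand-in, not Oumi's parser: `parse_raw_output` is a hypothetical function, the formats are passed as plain strings rather than JudgeResponseFormat members, and treating RAW as "the whole text belongs to the single expected field" is a guess made for the example.

```python
import json
import re


def parse_raw_output(
    raw_output: str, response_format: str, field_keys: list[str]
) -> dict[str, str]:
    """Extract the expected fields from raw text (hypothetical helper)."""
    if response_format == "XML":
        parsed = {}
        for key in field_keys:
            tag = re.escape(key)
            # Non-greedy match of the text between <key> and </key>.
            match = re.search(rf"<{tag}>(.*?)</{tag}>", raw_output, re.DOTALL)
            if match:
                parsed[key] = match.group(1).strip()
        return parsed
    if response_format == "JSON":
        try:
            data = json.loads(raw_output)
        except json.JSONDecodeError:
            return {}
        if not isinstance(data, dict):
            return {}
        # Keep only expected keys; values are stored as strings.
        return {k: str(v) for k, v in data.items() if k in field_keys}
    if response_format == "RAW":
        # Assumption: raw text is only meaningful with exactly one field.
        if len(field_keys) == 1:
            return {field_keys[0]: raw_output.strip()}
        return {}
    raise ValueError(f"Unsupported response format: {response_format}")
```

Fields that cannot be found are simply absent from the result here; how the real classmethod treats missing or malformed fields is not stated on this page.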

generate_raw_output(field_values: dict[str, str]) → str

Generate raw output string from field values in the specified format.

Parameters:

field_values – Dictionary mapping field keys to their string values. Must contain values for all required output fields.

Returns:

Formatted raw output string ready for use as assistant response.

Raises:

ValueError – If required output fields are missing from field_values, if response_format/output_fields are not set, or if response_format is not supported.
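
The contract above (all required fields present, a supported format, a formatted string returned) can be sketched as follows. This is an illustrative approximation rather than the method's actual code: `render_raw_output` is a hypothetical function, formats are plain strings instead of JudgeResponseFormat members, and the exact XML, JSON, and RAW layouts are assumptions.

```python
import json


def render_raw_output(
    field_values: dict[str, str], response_format: str, required_keys: list[str]
) -> str:
    """Render field values as a raw output string (hypothetical helper)."""
    missing = [key for key in required_keys if key not in field_values]
    if missing:
        raise ValueError(f"Missing required output fields: {missing}")

    # Emit fields in the order the judge declares them.
    ordered = {key: field_values[key] for key in required_keys}

    if response_format == "XML":
        return "\n".join(f"<{key}>{value}</{key}>" for key, value in ordered.items())
    if response_format == "JSON":
        return json.dumps(ordered, indent=2)
    if response_format == "RAW":
        return "\n".join(ordered.values())
    raise ValueError(f"Unsupported response format: {response_format}")
```

A string produced this way could serve as the assistant turn in a few-shot example, which is the use the Returns note suggests.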

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'field_scores': FieldInfo(annotation=dict[str, Union[float, NoneType]], required=False, default={}), 'field_values': FieldInfo(annotation=dict[str, Union[float, int, str, bool, NoneType]], required=False, default={}), 'output_fields': FieldInfo(annotation=Union[list[JudgeOutputField], NoneType], required=False, default=None), 'parsed_output': FieldInfo(annotation=dict[str, str], required=False, default={}), 'raw_output': FieldInfo(annotation=str, required=True), 'response_format': FieldInfo(annotation=Union[JudgeResponseFormat, NoneType], required=False, default=None)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

output_fields: list[JudgeOutputField] | None
parsed_output: dict[str, str]
raw_output: str
response_format: JudgeResponseFormat | None
to_json() → str

Convert the JudgeOutput to a JSON string.

Returns:

JSON string representation of the JudgeOutput data.

class oumi.judges_v2.JudgeOutputField(*, field_key: str, field_type: JudgeOutputType, field_scores: dict[str, float] | None)

Bases: BaseModel

Represents a single output field that a judge can produce.

Variables:
  • field_key (str) – The key/name for this field in the judge’s output

  • field_type (oumi.core.configs.params.judge_params.JudgeOutputType) – The data type expected for this field’s value

  • field_scores (dict[str, float] | None) – Optional mapping from categorical values to numeric scores
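
To make the role of field_scores concrete, the following sketch looks up a numeric score for a categorical value. It is a minimal illustration, not library code: `score_for` is a hypothetical helper, and returning None for an unmapped value is an assumption chosen to match the `float | None` values of JudgeOutput.field_scores.

```python
from typing import Optional


def score_for(value: str, field_scores: Optional[dict[str, float]]) -> Optional[float]:
    """Map a categorical value to its numeric score, if a mapping exists."""
    if field_scores is None:
        # No mapping configured for this field, so there is no score.
        return None
    # Values absent from the mapping also yield no score (assumption).
    return field_scores.get(value)


# Example mapping for a yes/no style field (illustrative values only).
yes_no_scores = {"Yes": 1.0, "No": 0.0}
```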

field_key: str
field_scores: dict[str, float] | None
field_type: JudgeOutputType
get_typed_value(raw_value: str) → float | int | str | bool | None

Convert the field’s raw string value to the appropriate type.

Parameters:

raw_value – The raw string value from the judge’s output

Returns:

The typed value, or None if conversion fails

Raises:

ValueError – If the field_type is not supported
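
The behaviour documented above (return a typed value, return None when conversion fails, raise ValueError for an unsupported type) can be sketched as below. This is not the library's code: `OutputType` is a hypothetical stand-in for JudgeOutputType, `to_typed_value` is a hypothetical function, and the set of strings accepted as booleans is an assumption.

```python
from enum import Enum
from typing import Optional, Union


class OutputType(Enum):
    """Hypothetical stand-in for JudgeOutputType."""

    BOOL = "bool"
    INT = "int"
    FLOAT = "float"
    TEXT = "text"


def to_typed_value(
    raw_value: str, field_type: object
) -> Optional[Union[float, int, str, bool]]:
    """Convert a raw string to the field's type, or None if it cannot be."""
    text = raw_value.strip()
    if field_type is OutputType.BOOL:
        lowered = text.lower()
        if lowered in ("yes", "true", "1"):
            return True
        if lowered in ("no", "false", "0"):
            return False
        return None  # Not recognisable as a boolean.
    if field_type is OutputType.INT:
        try:
            return int(text)
        except ValueError:
            return None
    if field_type is OutputType.FLOAT:
        try:
            return float(text)
        except ValueError:
            return None
    if field_type is OutputType.TEXT:
        return text
    # Anything else is a configuration error, not a conversion failure.
    raise ValueError(f"Unsupported field type: {field_type!r}")
```

Separating the two failure modes keeps a malformed model reply (None) distinct from a misconfigured field (ValueError), which is the distinction the Returns and Raises notes draw.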

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'field_key': FieldInfo(annotation=str, required=True), 'field_scores': FieldInfo(annotation=Union[dict[str, float], NoneType], required=True), 'field_type': FieldInfo(annotation=JudgeOutputType, required=True)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

class oumi.judges_v2.SimpleJudge(judge_config: JudgeConfig | str)

Bases: BaseJudge

Judge class for evaluating outputs based on a given configuration.