<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Oumi Judge.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models, from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Simple Judge

To enable LLM judgments, Oumi offers [Simple Judge](https://oumi.ai/docs/en/latest/user_guides/judge/judge.html#quick-start), a framework that lets users define their own evaluation criteria, judgment prompts, and output format, and choose any open- or closed-source hosted model as the underlying judge.

## Why Use LLM Judges?

As LLMs continue to evolve, traditional evaluation benchmarks, which focus primarily on task-specific metrics, are increasingly inadequate for capturing the full scope of a model's generative potential. In real-world applications, LLM capabilities such as creativity, coherence, and the ability to effectively handle nuanced and open-ended queries are critical and cannot be fully assessed through standardized metrics alone. While human raters are often employed to evaluate these aspects, the process is costly and time-consuming. As a result, the use of LLM-based evaluation systems, or "LLM judges", has gained traction as a more scalable and efficient alternative.

## Prerequisites

### Oumi Installation

First, let's install Oumi. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [None]:
%pip install oumi

### Tutorial Directory Setup

Next, we will create a directory for the tutorial, to store the evaluation configuration and the experimental results.

In [2]:
from pathlib import Path

tutorial_dir = "judge_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)

### OpenAI Access

In this notebook, we use GPT-4o as the underlying judge model. Accessing OpenAI models requires an OpenAI API key. You can find instructions for creating an OpenAI account and generating an API key on [OpenAI's quickstart webpage](https://platform.openai.com/docs/).

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<MY_OPENAI_TOKEN>"  # Specify your OpenAI API key here
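Pasting a key directly into a notebook cell makes it easy to leak when the notebook is shared. The following is a minimal sketch of an alternative, assuming the key may already be exported in your shell. `resolve_api_key` is a hypothetical helper written for this tutorial, not part of Oumi or the OpenAI SDK.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_api_key(
    env: Mapping[str, str],
    prompt: Callable[[str], str] = getpass,
    name: str = "OPENAI_API_KEY",
) -> str:
    """Return the key from `env` if it is set, otherwise ask for it interactively."""
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so the key never appears in cell output.
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is empty")
    return key


# In the notebook, this would replace the hard-coded assignment:
# os.environ["OPENAI_API_KEY"] = resolve_api_key(os.environ)
```

Passing the environment and the prompt function as arguments keeps the helper easy to exercise without a real key.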

## Judge Config Definition

The `JudgeConfig` fully defines the judge's behavior. In the example below, we create a judge config for a simple truthfulness judge and save it locally as `./judge_tutorial/judge_config.yaml`.

To do so, we set the following:
- We set a `prompt_template` that clearly states that the model is acting as a truthfulness judge. The judge is boolean, so the prompt explains that responding "Yes" indicates a truthful response and "No" indicates a dishonest response. It also contains a placeholder (`{request}` and `{response}`) for each input field expected in the dataset. These are replaced at runtime with each example's actual inputs.
- We set the `response_format` to `JSON`. `JSON` and `XML` are the most reliable output formats for judges.
- We set the `judgment_type` to `BOOL`, since this is a boolean (Yes/No) judge.
- We set `include_explanation` to `True`, so that the judge model also provides a rationale for its judgment.
- We set the `model_name` to `gpt-4o` and the `engine` to `OPENAI` to use GPT-4o as our underlying model. The judge framework supports most open- and closed-source models.
- We set the `generation` params for our model (`max_new_tokens` to 8192 and `temperature` to 1.0).

In [4]:
yaml_content = """
judge_params:
  prompt_template: |
    You are a truthfulness judge.
    You must assess whether the response below is factually accurate.
    If the response is truthful, respond with 'Yes'. Otherwise, respond with 'No'.

    ***
    [request]:
    {request}
    ***
    [response]:
    {response}
    ***

  response_format: JSON
  judgment_type: BOOL
  include_explanation: True

inference_config:
  model:
    model_name: "gpt-4o"

  engine: OPENAI

  generation:
    max_new_tokens: 8192
    temperature: 1.0
"""

with open(f"{tutorial_dir}/judge_config.yaml", "w") as f:
    f.write(yaml_content)
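To make the placeholder substitution concrete, here is a minimal sketch that fills the same template with Python's built-in `str.format`. This only illustrates the idea; it is an assumption that the result resembles what the judge sends to the model, and Oumi's own rendering may differ. `render_prompt` is a hypothetical helper.

```python
# Copy of the prompt_template from the config above.
PROMPT_TEMPLATE = """You are a truthfulness judge.
You must assess whether the response below is factually accurate.
If the response is truthful, respond with 'Yes'. Otherwise, respond with 'No'.

***
[request]:
{request}
***
[response]:
{response}
***
"""


def render_prompt(template: str, example: dict) -> str:
    """Fill each {placeholder} in the template with the matching example field."""
    return template.format(**example)


example = {
    "request": "What's the capital of France?",
    "response": "The capital of France is Paris.",
}
print(render_prompt(PROMPT_TEMPLATE, example))
```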

## Dataset Definition

Every example in our dataset must include a `request` and a `response` field, matching the `{request}` and `{response}` placeholders in the `prompt_template` of our judge config.
Here, we include one truthful and one dishonest example.

In [5]:
dataset = [
    {
        "request": "What's the capital of France?",
        "response": "The capital of France is Paris.",  # Truthful answer
    },
    {
        "request": "What is the sum of 1 and 1 in binary?",
        "response": "The sum is 11 in binary.",  # Dishonest answer
    },
]
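Because every placeholder in the template needs a matching field in each example, it can help to check the dataset before spending API calls. The sketch below uses only the standard library; `placeholders` and `missing_fields` are hypothetical helpers, not Oumi functions, and Oumi may perform its own validation.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Collect the {name} placeholders that appear in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_fields(template: str, dataset: list[dict]) -> dict[int, set[str]]:
    """Map each example index to the placeholder names it does not provide."""
    required = placeholders(template)
    problems = {}
    for index, example in enumerate(dataset):
        missing = required - set(example)
        if missing:
            problems[index] = missing
    return problems


template = "[request]:\n{request}\n[response]:\n{response}\n"
sample = [
    {"request": "What's the capital of France?", "response": "Paris."},
    {"request": "What is the sum of 1 and 1 in binary?"},  # no "response" field
]
print(missing_fields(template, sample))
```

An empty result means every example supplies all fields the template expects.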

## Judgment

After defining the judge config and the dataset, we are ready to instantiate `SimpleJudge` and run the judgment.

In [6]:
from oumi.judges.simple_judge import SimpleJudge

truthfulness_judge = SimpleJudge(f"{tutorial_dir}/judge_config.yaml")
outputs = truthfulness_judge.judge(dataset)



INFO 07-22 12:04:03 [__init__.py:256] Automatically detected platform cpu.


100%|██████████| 2/2 [01:07<00:00, 33.91s/it]


## Inspect Results

Finally, we can inspect the judgments and their corresponding explanations as follows.

In [7]:
# Use `example` rather than `input` to avoid shadowing the Python builtin.
for example, output in zip(dataset, outputs):
    judgment = output.field_values["judgment"]
    explanation = output.field_values["explanation"]
    request = example["request"]
    response = example["response"]

    print(f"Request: {request}")
    print(f"Response: {response}")
    print(f"Judgment: {judgment}")
    print(f"Explanation: {explanation}")
    print("-" * 100)

Request: What's the capital of France?
Response: The capital of France is Paris.
Judgment: True
Explanation: The response accurately states that the capital of France is Paris. Paris has been the capital of France for centuries and is widely recognized as such internationally.
----------------------------------------------------------------------------------------------------
Request: What is the sum of 1 and 1 in binary?
Response: The sum is 11 in binary.
Judgment: False
Explanation: In binary addition, 1 plus 1 is equal to 10, not 11. Therefore, the response provided is not factually accurate.
----------------------------------------------------------------------------------------------------
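For larger datasets, reading each judgment individually does not scale, so a small summary is often more useful. The sketch below assumes the judgments have been collected into a plain list, for example with `[o.field_values["judgment"] for o in outputs]`. `summarize_judgments` is a hypothetical helper, and the handling of non-boolean values (such as `None`) is an assumption about what a failed parse might look like.

```python
from typing import Optional, Sequence


def summarize_judgments(judgments: Sequence[Optional[bool]]) -> dict:
    """Count True/False judgments and compute the share judged True."""
    passed = sum(1 for j in judgments if j is True)
    failed = sum(1 for j in judgments if j is False)
    other = len(judgments) - passed - failed  # anything that is not a bool
    decided = passed + failed
    return {
        "passed": passed,
        "failed": failed,
        "other": other,
        "pass_rate": passed / decided if decided else None,
    }


# Stand-in values copied from the printed results above.
judgments = [True, False]
print(summarize_judgments(judgments))
```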
