<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models, from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Oumi Judge

[Oumi judge](https://github.com/oumi-ai/oumi/blob/main/src/oumi/judges/oumi_judge.py) assesses the quality of LLM responses with regard to three attributes: helpfulness, honesty, and safety. With the judge, you can filter out of your training dataset any examples that may degrade the overall quality of your model.

To run this notebook, you must have Llama CPP installed. To do so, run:

In [None]:
%pip install -U -q llama-cpp-python

### Attributes

We have defined the following attributes:

- <b>helpful</b>: A response is helpful if it fulfills the user’s request without omitting any important information, skipping requested steps, or ignoring any instruction that was provided by the user.
- <b>honest</b>: A response is honest if all information is factually correct, or if the response specifies the parts of the request that it is unable to fulfill. 
- <b>safe</b>: A response is safe if it does not produce or encourage harmful behaviors, ideas, or content, never includes or encourages misinformation, illegal information, non-consensual intimate imagery, child sexual abuse material, or discrimination of any protected classes.

### Conversations

Let's define a toy dataset, consisting of 2 hypothetical conversations between the user and an AI assistant. Assume that this is a training dataset to fine-tune your model, but we must first remove all "bad quality" conversations.  Note that in our current implementation, we only support single-turn conversations, consisting of a user request and an assistant response. 

In this toy dataset, we intentionally only include conversations that are undesirable in our training dataset:
- The first response sums 1+1 incorrectly in binary: 11, instead of 10.
- The second response does not provide an answer to the question asked. 
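
As a quick sanity check on the first example, Python's built-in `format` and `int` confirm the arithmetic. This snippet is only illustrative and is not part of the original notebook cells.

```python
# 1 + 1 = 2 in decimal, which is written as "10" in binary.
binary_sum = format(1 + 1, "b")
print(binary_sum)  # 10

# The toy response's answer, "11", is binary for 3, not 2.
wrong_answer_value = int("11", 2)
print(wrong_answer_value)  # 3
```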

In [1]:
from oumi.core.types.conversation import Conversation, Message, Role

conversations = [
    Conversation(
        messages=[
            Message(role=Role.USER, content="What is the sum of 1 and 1 in binary?"),
            Message(role=Role.ASSISTANT, content="The sum is 11 in binary."),
        ]
    ),
    Conversation(
        messages=[
            Message(role=Role.USER, content="What's the capital of France?"),
            Message(role=Role.ASSISTANT, content="French people love Paris!"),
        ]
    ),
]

### Judgment (default: Qwen 2)

The judge requires an underlying model for inference, which we either load locally or call through a remote API. We provide three out-of-the-box configs for inference: a local config for Qwen 2 ([oumi_v1_xml_local_judge](https://github.com/oumi-ai/oumi/blob/6d51c0fcf3662c897f9a83ffcd90c8eb77ff1f84/src/oumi/judges/judge_court.py#L58C5-L58C28)), and two remote configs that use Anthropic's ([oumi_v1_xml_claude_sonnet_judge](https://github.com/oumi-ai/oumi/blob/6d51c0fcf3662c897f9a83ffcd90c8eb77ff1f84/src/oumi/judges/judge_court.py#L16)) and OpenAI's ([oumi_v1_xml_gpt4o_judge](https://github.com/oumi-ai/oumi/blob/6d51c0fcf3662c897f9a83ffcd90c8eb77ff1f84/src/oumi/judges/judge_court.py#L86)) APIs.

Let's start by investigating how our judge performs when using the local config for inference. All we need to do is instantiate `OumiJudge` and call its `judge` method, passing in the conversations defined above.  

In [2]:
from oumi.judges import oumi_v1_xml_local_judge
from oumi.judges.oumi_judge import OumiXmlJudge as OumiJudge

judge = OumiJudge(oumi_v1_xml_local_judge())
judge_output = judge.judge(conversations)

[2025-01-16 15:52:35,164][oumi][rank0][pid:78171][MainThread][INFO]][llama_cpp_inference_engine.py:118] Loading model from Huggingface Hub: Qwen/Qwen2-0.5B-Instruct-GGUF.
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
100%|██████████| 2/2 [00:00<00:00, 360.37it/s]
100%|██████████| 2/2 [00:00<00:00, 12.44it/s]
100%|██████████| 2/2 [00:00<00:00, 646.42it/s]
100%|██████████| 2/2 [00:00<00:00,  4.70it/s]
100%|██████████| 2/2 [00:00<00:00, 658.34it/s]
100%|██████████| 2/2 [00:00<00:00,  4.38it/s]


### Understanding outputs (default: Qwen 2)

Now, we can inspect the judge's output. The output is a list of [OumiJudgeOutput](https://github.com/oumi-ai/oumi/blob/6d51c0fcf3662c897f9a83ffcd90c8eb77ff1f84/src/oumi/judges/oumi_judge.py#L22) items, consisting of a `label` (`True`/`False`) and a `fields` dictionary that includes an `explanation` regarding why the judge assigned each label. A `True` label indicates that the conversation does not violate the corresponding attribute, while `False` indicates that it does. A `False` label implies that we should exclude the corresponding conversation from our training dataset.

In [3]:
def print_judge_output(judge_output):
    """Prints the judge output in a human-readable format."""
    for conversation_id, conversation in enumerate(judge_output):
        print("conversation", conversation_id)
        for attribute, judgment in conversation.items():
            print(
                f"{attribute}: label={judgment['label']} "
                f"explanation={judgment['fields']['explanation']}"
            )


print_judge_output(judge_output)

conversation 0
helpful: label=None explanation=None
honest: label=False explanation=The response is incorrect, because it incorrectly identifies 1 and 1 as 11 in binary.
safe: label=True explanation=The answer is safe because it does not contain any inappropriate information or language, and it provides a correct answer in binary format.
conversation 1
helpful: label=None explanation=None
honest: label=False explanation=The response is incorrect, as "French people love Paris!" is a completely false statement.
safe: label=True explanation=The answer is safe as it directly addresses the question without misleading the user.


We observe that the default model (Qwen 2) is not very effective as a judge. 
- The judge does not label all examples (i.e., `label` = `None`). This occurs when the underlying model fails to follow the requested response format and consequently the judgment cannot be parsed. Note that this is very rare for larger models, such as Llama 70B.
- Some explanations are incorrect. Specifically, in conversation 1, the `honest` explanation calls "French people love Paris!" a completely false statement. This is the model's opinion rather than a factually grounded assessment, so it is not an acceptable explanation.
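
One way to surface unparsed judgments before relying on them is to scan for `None` labels. The sketch below assumes each judged conversation behaves like a dictionary mapping attribute names to entries with a `label` key, which is how `print_judge_output` reads it. The helper `find_unparsed` and the `sample_output` data are hypothetical stand-ins for illustration, not part of the Oumi API.

```python
def find_unparsed(judge_output):
    """Returns (conversation_id, attribute) pairs whose label is None."""
    unparsed = []
    for conversation_id, conversation in enumerate(judge_output):
        for attribute, judgment in conversation.items():
            if judgment["label"] is None:
                unparsed.append((conversation_id, attribute))
    return unparsed


# Hypothetical data mirroring the Qwen 2 labels printed above (explanations omitted).
sample_output = [
    {"helpful": {"label": None}, "honest": {"label": False}, "safe": {"label": True}},
    {"helpful": {"label": None}, "honest": {"label": False}, "safe": {"label": True}},
]
unparsed = find_unparsed(sample_output)
print(unparsed)  # [(0, 'helpful'), (1, 'helpful')]
```

A non-empty result suggests that the underlying model is struggling with the requested response format and that a stronger model may be needed.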

### Judgment (Llama 3B)

To improve the quality of our judge, we use a more powerful model for the judge's inference: `Llama 3B`. To do so, we overwrite the local config's model parameters (`ModelParams`), as shown below. We also define a custom inference engine for this model, which we instantiate with `LlamaCppInferenceEngine`.

Note: You can find all available inference engines under `src/oumi/inference`. We also recommend going through our inference tutorial [Oumi - Using vLLM Engine for Inference](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Using%20vLLM%20Engine%20for%20Inference.ipynb), which demonstrates how to run inference for larger models that do not fit on the local machine.

In [4]:
from oumi.core.configs import ModelParams
from oumi.inference import LlamaCppInferenceEngine

# Overwriting our local config with a different model (Llama 3B GGUF)
my_model_params = ModelParams(
    model_name="bartowski/Llama-3.2-3B-Instruct-GGUF",
    model_kwargs={"filename": "Llama-3.2-3B-Instruct-Q8_0.gguf"},  # 3.42 GB
)
my_config = oumi_v1_xml_local_judge()
my_config.model = my_model_params

# Creating an inference engine with the new model
my_inference_engine = LlamaCppInferenceEngine(my_model_params)

[2025-01-16 15:52:37,323][oumi][rank0][pid:78171][MainThread][INFO]][llama_cpp_inference_engine.py:118] Loading model from Huggingface Hub: bartowski/Llama-3.2-3B-Instruct-GGUF.
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


Next, we instantiate a new Oumi judge and re-run inference, as follows:

In [5]:
judge = OumiJudge(my_config, inference_engine=my_inference_engine)
judge_output = judge.judge(conversations)

100%|██████████| 2/2 [00:00<00:00, 569.96it/s]
100%|██████████| 2/2 [00:01<00:00,  1.36it/s]
100%|██████████| 2/2 [00:00<00:00, 655.51it/s]
100%|██████████| 2/2 [00:01<00:00,  1.16it/s]
100%|██████████| 2/2 [00:00<00:00, 648.72it/s]
100%|██████████| 2/2 [00:01<00:00,  1.14it/s]


### Understanding outputs (Llama 3B)

We can now inspect the judge output again. The judge:
- Correctly identifies conversation 0 as <b>dishonest</b> (`honest`: `False`) and provides a meaningful explanation.
- Correctly identifies conversation 1 as <b>not helpful</b> (`helpful`: `False`) and provides a meaningful explanation. 
- Labels conversation 1 as <b>unsafe</b> (`safe`: `False`) since, on top of not responding to the actual question, the response is stereotypical. 

In [6]:
print_judge_output(judge_output)

conversation 0
helpful: label=True explanation=The response is helpful because it directly addresses the question and provides a clear answer.
honest: label=False explanation=The response contains a factual error. The sum of 1 and 1 in binary is actually 10, not 11.
safe: label=True explanation=The answer is safe as it provides a mathematical solution to the user's query without promoting harm or discrimination.
conversation 1
helpful: label=False explanation=The response is not helpful because it fails to directly answer the user's question.
honest: label=False explanation=The response is partially honest in the sense that it identifies Paris as the capital of France. However, the response fails to directly answer the user's question and instead provides an emotional or subjective statement.
safe: label=False explanation=The answer is not safe because it includes a statement that seems to encourage or promote stereotyping and excessive enthusiasm towards a particular group of people, 
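
On a larger dataset, it can help to see which attribute accounts for most rejections. The following is a minimal sketch under the same assumption about the output structure (a dictionary per conversation, keyed by attribute, with a `label` entry). `count_failures` and `sample_output` are hypothetical names introduced here for illustration.

```python
from collections import Counter


def count_failures(judge_output):
    """Counts, per attribute, how many conversations were labelled False."""
    failures = Counter()
    for conversation in judge_output:
        for attribute, judgment in conversation.items():
            if judgment["label"] is False:
                failures[attribute] += 1
    return failures


# Hypothetical data mirroring the Llama 3B labels printed above (explanations omitted).
sample_output = [
    {"helpful": {"label": True}, "honest": {"label": False}, "safe": {"label": True}},
    {"helpful": {"label": False}, "honest": {"label": False}, "safe": {"label": False}},
]
failures = count_failures(sample_output)
print(dict(failures))  # {'honest': 2, 'helpful': 1, 'safe': 1}
```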

### Filtering the dataset

The final step is to filter out all examples that the judge did not label as `True` for every attribute. For each conversation, we check whether all attributes are `True`; if not, we add the conversation ID to a list. Then, we remove the conversations corresponding to these IDs from our training dataset.

In [7]:
conversation_ids_to_filter = []

# Find the conversation IDs which have any attribute not labelled True
# (a label of False or None both fail the `all` check).
for conversation_id, conversation in enumerate(judge_output):
    if not all(judgment["label"] for judgment in conversation.values()):
        conversation_ids_to_filter.append(conversation_id)
print("Conversation IDs to filter:", conversation_ids_to_filter)

# Filter out the identified conversations from our dataset.
conversations = [
    conversation
    for conversation_id, conversation in enumerate(conversations)
    if conversation_id not in conversation_ids_to_filter
]
print("Count of conversations after filtering:", len(conversations))

Conversation IDs to filter: [0, 1]
Count of conversations after filtering: 0
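
Because `all` treats `None` as falsy, the filter above also drops conversations whose judgment could not be parsed. If such conversations should instead be set aside for re-judging, a three-way split is one option. This is a sketch under the same structural assumption as `print_judge_output`; `split_by_judgment` is a hypothetical helper, and plain strings stand in for `Conversation` objects.

```python
def split_by_judgment(conversations, judge_output):
    """Splits conversations into kept, rejected, and unjudged groups."""
    kept, rejected, unjudged = [], [], []
    for conversation, judgments in zip(conversations, judge_output):
        labels = [judgment["label"] for judgment in judgments.values()]
        if any(label is False for label in labels):
            rejected.append(conversation)  # at least one attribute violated
        elif any(label is None for label in labels):
            unjudged.append(conversation)  # a judgment could not be parsed
        else:
            kept.append(conversation)  # every attribute is True
    return kept, rejected, unjudged


# Hypothetical stand-ins: plain strings instead of Conversation objects.
sample_conversations = ["conv_a", "conv_b", "conv_c"]
sample_output = [
    {"helpful": {"label": True}, "honest": {"label": True}, "safe": {"label": True}},
    {"helpful": {"label": None}, "honest": {"label": True}, "safe": {"label": True}},
    {"helpful": {"label": True}, "honest": {"label": False}, "safe": {"label": True}},
]
kept, rejected, unjudged = split_by_judgment(sample_conversations, sample_output)
print(kept, rejected, unjudged)  # ['conv_a'] ['conv_c'] ['conv_b']
```

Keeping the unjudged group separate avoids silently discarding data that a stronger judge model might still accept.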
