<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models, from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Evaluation with Alpaca Eval 2.0

This notebook shows how to run end-to-end (E2E) evaluations of your trained model, using Oumi inference to generate the responses and [Alpaca Eval 2.0](https://github.com/tatsu-lab/alpaca_eval) to automatically calculate win rates against GPT-4 Turbo (or another reference model of your choice).

## Prerequisites and Configuration

First, start by installing the [Alpaca Eval package](https://pypi.org/project/alpaca-eval/) and LlamaCPP (for inference):


In [1]:
%pip install -U -q alpaca_eval llama-cpp-python

When comparing your model's responses with the reference responses to calculate win rates, an annotator (judge) is needed. By default, the annotator is GPT-4 Turbo (annotator config: [weighted_alpaca_eval_gpt4_turbo](https://github.com/tatsu-lab/alpaca_eval?tab=readme-ov-file#alpacaeval-20)). To access the latest GPT-4 models, including GPT-4 Turbo, an OpenAI API key is required. Details on creating an OpenAI account and generating a key can be found on [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart).

In [2]:
import os

os.environ["OPENAI_API_KEY"] = ""  # Set your OpenAI API key here
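Pasting a key into a notebook cell makes it easy to leak when the notebook is shared. As an alternative, here is a minimal sketch of a fail-fast check, assuming the key has already been exported in the shell that launched the notebook. `require_env` is a hypothetical helper written for this tutorial; it is not part of Oumi or Alpaca Eval.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the evaluation.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before Step 3 surfaces a missing key immediately, rather than partway through annotation.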

<b>⚠️ Cost considerations</b>: Running a standard Alpaca Eval 2.0 evaluation (with the [weighted_alpaca_eval_gpt4_turbo](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/README.md) config), which annotates 805 examples with GPT-4 Turbo, costs <b>$3.5</b>. However, the sample code in this notebook annotates only 3 of the 805 examples and costs less than <b>0.5¢</b>.

In [3]:
NUM_EXAMPLES = 3  # Replace with 805 for full dataset evaluation.

Define your model and the maximum number of tokens it supports (used during generation). You can point to any model on Hugging Face, provide a path to a local folder that contains your model, or use any other model format that Oumi inference supports. Also provide a human-friendly display name for your model, to be used in leaderboards.


In [4]:
MODEL_NAME = "bartowski/Llama-3.2-1B-Instruct-GGUF"
MODEL_DISPLAY_NAME = "MyLlamaTestModel"
MODEL_MAX_TOKENS = 8192

Finally, we'll create a tutorial directory to store our results.

In [5]:
from pathlib import Path

tutorial_dir = "alpaca_eval_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)

## Step 1: Retrieve Alpaca dataset

Alpaca Eval 2.0 requires model responses for the [tatsu-lab/alpaca_eval](https://huggingface.co/datasets/tatsu-lab/alpaca_eval) dataset.

In [6]:
from oumi.datasets.evaluation import AlpacaEvalDataset

alpaca_dataset = AlpacaEvalDataset(dataset_name="tatsu-lab/alpaca_eval").conversations()

[2025-01-16 16:02:59,880][oumi][rank0][pid:79891][MainThread][INFO]][base_map_dataset.py:68] Creating map dataset (type: AlpacaEvalDataset)...
[2025-01-16 16:03:00,656][oumi][rank0][pid:79891][MainThread][INFO]][base_map_dataset.py:470] Dataset Info:
	Split: eval
	Version: 1.0.0
	Dataset size: 554496
	Download size: 620778
	Size: 1175274 bytes
	Rows: 805
	Columns: ['instruction', 'output', 'generator', 'dataset']
[2025-01-16 16:03:00,737][oumi][rank0][pid:79891][MainThread][INFO]][base_map_dataset.py:408] Loaded DataFrame with shape: (805, 4). Columns:
instruction    object
output         object
generator      object
dataset        object
dtype: object


Since this notebook contains sample code, we will only run inference for the first `NUM_EXAMPLES` (of 805) from the dataset. 

In [7]:
alpaca_dataset = alpaca_dataset[:NUM_EXAMPLES]  # For testing purposes, reduce examples.

for index, conversation in enumerate(alpaca_dataset):
    print(index, conversation.messages)

0 [USER: What are the names of some famous actors that started their careers on Broadway?]
1 [USER: How did US states get their names?]
2 [USER: Hi, my sister and her girlfriends want me to play kickball with them. Can you explain how the game is played, so they don't take advantage of me?]


## Step 2: Run inference

First, define all the relevant parameters and configs required for inference.

In [8]:
from oumi.core.configs import GenerationParams, InferenceConfig, ModelParams

generation_params = GenerationParams(max_new_tokens=MODEL_MAX_TOKENS)
model_params = ModelParams(model_name=MODEL_NAME, model_max_length=MODEL_MAX_TOKENS)
inference_config = InferenceConfig(model=model_params, generation=generation_params)

Then, choose an inference engine that your model is compatible with. For more information on this, see Oumi's [inference documentation](https://oumi.ai/docs/en/latest/user_guides/infer/infer.html). 

In [9]:
from oumi.inference import LlamaCppInferenceEngine

inference_engine = LlamaCppInferenceEngine(model_params)

[2025-01-16 16:03:01,819][oumi][rank0][pid:79891][MainThread][INFO]][llama_cpp_inference_engine.py:118] Loading model from Huggingface Hub: bartowski/Llama-3.2-1B-Instruct-GGUF.
llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


Next, run inference to get responses from your model for the prompts contained in the `alpaca_dataset`.

In [10]:
responses = inference_engine.infer(alpaca_dataset, inference_config)

  0%|          | 0/3 [00:00<?, ?it/s]

Then, convert the responses from Oumi format (list of `Conversation`s) to Alpaca format (list of `dict`s, where the data is contained under the keys `instruction` and `output`). Create a DataFrame from the data and add a new column "`generator`", which captures the human-readable name of the model the responses were produced with. 

In [11]:
import pandas as pd

from oumi.datasets.evaluation import utils

responses_json = utils.conversations_to_alpaca_format(responses)
responses_df = pd.DataFrame(responses_json)
responses_df["generator"] = MODEL_DISPLAY_NAME
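To make the target format concrete, here is a rough sketch of what such a conversion amounts to. It uses a hypothetical `Message` dataclass as a stand-in for Oumi's conversation types and assumes single-turn conversations (one user prompt, one assistant reply). It is an illustration only, not the implementation of `conversations_to_alpaca_format`.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a single message in an Oumi conversation."""

    role: str  # "user" or "assistant"
    content: str


def to_alpaca_format(conversations: list[list[Message]]) -> list[dict[str, str]]:
    """Map each single-turn conversation to a dict with `instruction` and `output` keys."""
    records = []
    for messages in conversations:
        # Assumption: exactly one user prompt and one assistant reply per conversation.
        instruction = next(m.content for m in messages if m.role == "user")
        output = next(m.content for m in messages if m.role == "assistant")
        records.append({"instruction": instruction, "output": output})
    return records


example = [
    [
        Message("user", "How did US states get their names?"),
        Message("assistant", "(model response)"),
    ]
]
print(to_alpaca_format(example))
```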

Your DataFrame should look as follows.

In [12]:
responses_df

| | instruction | output | generator |
|---|---|---|---|
| 0 | What are the names of some famous actors that ... | There are many famous actors who started their... | MyLlamaTestModel |
| 1 | How did US states get their names? | The origin of US state names is a fascinating ... | MyLlamaTestModel |
| 2 | Hi, my sister and her girlfriends want me to p... | Kickball is a fun team sport that's easy to le... | MyLlamaTestModel |


## Step 3: Run Alpaca Eval 2.0

You can kick off evaluations as shown below. 

The default annotator for Alpaca Eval 2.0 is <b>GPT-4 Turbo</b>. While Alpaca Eval 1.0 used a binary preference, Alpaca Eval 2.0 uses the annotator's logprobs to compute a continuous preference, resulting in a <b>weighted</b> win rate. The default annotator config of Alpaca Eval 2.0 is thus `weighted_alpaca_eval_gpt4_turbo`. Other annotators (judges) can be used as well; see the [Annotators configs](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/README.md) page for details and relevant costs. However, the Alpaca Eval 2.0 leaderboard is established with GPT-4 Turbo as the reference annotator, so results from other annotators are less informative if you want to compare against it.
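The difference between a binary and a continuous preference can be sketched as follows. This is a simplified illustration and not Alpaca Eval's own code: it assumes the judge returns one log-probability for "the model's answer is better" and one for "the baseline's answer is better", and the numbers in `judgments` are made up.

```python
import math


def preference_for_model(logprob_model: float, logprob_baseline: float) -> float:
    """Probability the judge assigns to the model's answer, renormalized over both options."""
    p_model = math.exp(logprob_model)
    p_baseline = math.exp(logprob_baseline)
    return p_model / (p_model + p_baseline)


def win_rates(judgments: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (weighted, discrete) win rates from (logprob_model, logprob_baseline) pairs."""
    prefs = [preference_for_model(m, b) for m, b in judgments]
    # Weighted: average the continuous preferences directly.
    weighted = sum(prefs) / len(prefs)
    # Discrete: a win counts 1, an exact tie counts 0.5, a loss counts 0.
    discrete = sum(1.0 if p > 0.5 else 0.5 if p == 0.5 else 0.0 for p in prefs) / len(prefs)
    return weighted, discrete


# Hypothetical judge outputs for two prompts.
judgments = [(math.log(0.8), math.log(0.2)), (math.log(0.3), math.log(0.7))]
weighted, discrete = win_rates(judgments)
print(f"weighted={weighted:.3f} discrete={discrete:.3f}")
```

The weighted figure keeps information that the discrete one discards: a narrow loss still contributes some preference mass, which is why a model can have zero discrete wins and a small but non-zero weighted win rate.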

In [13]:
from alpaca_eval import evaluate

ANNOTATORS_CONFIG = "weighted_alpaca_eval_gpt4_turbo"

df_leaderboard, annotations = evaluate(  # type: ignore
    model_outputs=responses_df,
    annotators_config=ANNOTATORS_CONFIG,
    is_return_instead_of_print=True,
    output_path=tutorial_dir,
)

INFO:root:Evaluating the MyLlamaTestModel outputs.
INFO:root:Creating the annotator from `weighted_alpaca_eval_gpt4_turbo`.
INFO:root:Saving annotations to `/opt/miniconda3/envs/oumi/lib/python3.11/site-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json`.
Annotation chunk:   0%|          | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 3 examples with weighted_alpaca_eval_gpt4_turbo
INFO:root:Using `openai_completions` on 3 prompts using gpt-4-1106-preview.
INFO:root:Kwargs to completion: {'model': 'gpt-4-1106-preview', 'temperature': 1, 'logprobs': True, 'top_logprobs': 5, 'is_chat': True}. num_procs=5
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:root:Using OAI client number 1 out of 1.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request

## Step 4: Inspect the metrics

Once the evaluation process completes, you can inspect the metrics produced, as shown below.

In [14]:
metrics = df_leaderboard.loc[MODEL_DISPLAY_NAME]

print(f"Metrics for `{MODEL_DISPLAY_NAME}`")
for metric, value in metrics.items():
    print(f" - {metric}={value}")

Metrics for `MyLlamaTestModel`
 - win_rate=0.0016009055697541186
 - standard_error=0.0007292077882424806
 - n_wins=0
 - n_wins_base=3
 - n_draws=0
 - n_total=3
 - discrete_win_rate=0.0
 - mode=community
 - avg_length=2192
 - length_controlled_winrate=0.051919831641191114
 - lc_standard_error=0.013651061718406078
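When reading these numbers, a quick sanity check helps. The sketch below assumes every example is counted as exactly one of win, loss (`n_wins_base`), or draw, as the sample output suggests, and derives a rough 95% interval from the standard error using a normal approximation, which is crude for a 3-example run. `summarize` is a hypothetical helper, not part of Alpaca Eval.

```python
def summarize(metrics: dict) -> dict:
    """Check that the outcome counts add up and derive a rough 95% interval for the win rate."""
    counts_consistent = (
        metrics["n_wins"] + metrics["n_wins_base"] + metrics["n_draws"]
        == metrics["n_total"]
    )
    # Normal approximation: win_rate +/- 1.96 * standard_error, clipped at zero.
    half_width = 1.96 * metrics["standard_error"]
    return {
        "counts_consistent": counts_consistent,
        "win_rate_low": max(0.0, metrics["win_rate"] - half_width),
        "win_rate_high": metrics["win_rate"] + half_width,
    }


# Values copied from the sample output of this notebook.
sample = {
    "win_rate": 0.0016009055697541186,
    "standard_error": 0.0007292077882424806,
    "n_wins": 0,
    "n_wins_base": 3,
    "n_draws": 0,
    "n_total": 3,
}
print(summarize(sample))
```

With only a handful of examples the interval is not a reliable estimate; it is mainly useful once `NUM_EXAMPLES` is raised to the full dataset.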


## [Optional] Retain your configuration for reproducibility

To be able to reproduce your evaluation run in the future, save the configuration of your evaluation together with your evaluation metrics.

In [15]:
import json
from importlib.metadata import version

evaluation_config_dict = {
    "packages": {
        "alpaca_eval": version("alpaca_eval"),
        "oumi": version("oumi"),
    },
    "configs": {
        "inference_config": str(inference_config),
        "annotators_config": ANNOTATORS_CONFIG,
    },
    "eval_metrics": metrics.to_dict(),
}

evaluation_config_json = json.dumps(evaluation_config_dict, indent=2)
with open(f"{tutorial_dir}/evaluation_config.json", "w") as output_file:
    output_file.write(evaluation_config_json)
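When revisiting a run later, the saved file can be read back and its package versions compared with the current environment. The following is a minimal sketch that assumes the JSON layout written above (a top-level `packages` mapping). `load_evaluation_config` and `version_mismatches` are hypothetical helpers, and the version strings in the demo are made up.

```python
import json
import tempfile
from pathlib import Path


def load_evaluation_config(path: str) -> dict:
    """Read a saved evaluation config back into a dict."""
    with open(path) as config_file:
        return json.load(config_file)


def version_mismatches(saved: dict, current: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {package: (saved_version, current_version)} for packages whose versions differ."""
    return {
        package: (saved_version, current.get(package, "not installed"))
        for package, saved_version in saved["packages"].items()
        if current.get(package) != saved_version
    }


# Demo with a temporary file and made-up version strings.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "evaluation_config.json"
    path.write_text(json.dumps({"packages": {"alpaca_eval": "0.0.1", "oumi": "0.0.1"}}))
    saved = load_evaluation_config(str(path))

print(version_mismatches(saved, {"alpaca_eval": "0.0.1", "oumi": "0.0.2"}))
```

In a real environment, the `current` mapping could be built with `importlib.metadata.version`, the same call used when saving the config.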