Leaderboards#
Leaderboards provide a structured, transparent, and competitive environment for evaluating Large Language Models (LLMs), helping to guide the development of more powerful, reliable, and useful models while fostering collaboration and innovation within the field. This page discusses how to evaluate models on popular leaderboards.
HuggingFace Leaderboard V2#
As of early 2025, the most popular standardized benchmarks used across academia and industry are those featured in HuggingFace's latest (V2) leaderboard. HuggingFace has published a blog post explaining why these benchmarks were selected, while EleutherAI provides a comprehensive README discussing each benchmark's evaluation goals, coverage, and applicability.
MMLU-Pro (Massive Multitask Language Understanding) [paper]
GPQA (Google-Proof Q&A Benchmark) [paper]
MuSR (Multistep Soft Reasoning) [paper]
MATH (Mathematics Aptitude Test of Heuristics, Level 5) [paper]
IFEval (Instruction Following Evaluation) [paper]
BBH (Big Bench Hard) [paper]
You can evaluate a model on HuggingFace's latest leaderboard by creating a YAML file like the one below and invoking the CLI with the command that follows it:
configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v2_eval.yaml
# Class: oumi.core.configs.EvaluationConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/evaluation_config.py

# NOTE: You must first request access to the GPQA dataset here:
# https://huggingface.co/datasets/Idavidrein/gpqa

model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  load_pretrained_weights: True
  trust_remote_code: True

generation:
  batch_size: 4
############################## HuggingFace Leaderboard V2 ##############################
# https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard #
# #
# Benchmarks: #
# - BBH (Big Bench Hard), 3 shots: leaderboard_bbh #
# - GPQA (Google-Proof Q&A Benchmark), 0 shots: leaderboard_gpqa #
# - MMLU-Pro (Massive Multitask Language Understanding), 5 shots: leaderboard_mmlu_pro #
# - MuSR (Multistep Soft Reasoning), 0 shots: leaderboard_musr #
# - IFEval (Instruction Following Evaluation), 0 shots: leaderboard_ifeval #
# - MATH L5 (Mathematics Aptitude Test of Heuristics), 4 shots: leaderboard_math_hard #
########################################################################################
tasks:
  - evaluation_platform: lm_harness
    task_name: leaderboard_bbh
  - evaluation_platform: lm_harness
    task_name: leaderboard_gpqa
  - evaluation_platform: lm_harness
    task_name: leaderboard_mmlu_pro
  - evaluation_platform: lm_harness
    task_name: leaderboard_musr
  - evaluation_platform: lm_harness
    task_name: leaderboard_ifeval
  # Temporarily disabled due to packaging conflicts
  # - evaluation_platform: lm_harness
  #   task_name: leaderboard_math_hard
# NOTE: If you are running this in a remote machine, which is not accessible after the
# evaluation completes, you need to re-direct your output to persistent storage.
# For GCP nodes, you can store your output into a mounted GCS Bucket.
# For example: `output_dir: "/my-gcs-bucket/huggingface_leaderboard_v2"`,
# assuming that `/my-gcs-bucket` is mounted to `gs://my-gcs-bucket`.
output_dir: "./huggingface_leaderboard_v2"
oumi evaluate -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v2_eval.yaml
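If you want a quick sanity check before committing to the full leaderboard run, you can keep only a subset of the tasks in your config. A minimal sketch (same EvaluationConfig class and fields as above, with an illustrative output directory) might look like this:

# Minimal sketch: run only two of the V2 leaderboard benchmarks as a smoke test.
# (Illustrative subset; the full task list is shown in the config above.)
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  trust_remote_code: True

generation:
  batch_size: 4

tasks:
  - evaluation_platform: lm_harness
    task_name: leaderboard_ifeval
  - evaluation_platform: lm_harness
    task_name: leaderboard_bbh

output_dir: "./huggingface_leaderboard_v2_subset"  # illustrative path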
A few things to pay attention to:
GPQA Gating. Access to GPQA is restricted through gating mechanisms, to minimize the risk of data contamination. Before running the leaderboard evaluation, you must first log in to HuggingFace and accept the terms of use for GPQA. In addition, you need to authenticate on the Hub using HuggingFace's User Access Token when launching the evaluation job. You can do so either by setting the environment variable `HF_TOKEN` or by storing the token in the file pointed to by `HF_TOKEN_PATH` (default location is `~/.cache/huggingface/token`); see the terminal sketch after this list.
Dependencies. This leaderboard (specifically the `IFEval` and `MATH` benchmarks) requires additional packages to be installed in order to function correctly. You can either install all Oumi evaluation packages with `pip install oumi[evaluation]`, or explore the required packages for each benchmark at oumi-ai/oumi and only install the packages needed for your specific case.
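For reference, the corresponding terminal setup might look like the sketch below; `huggingface-cli login` is HuggingFace's standard CLI command and caches the token at `~/.cache/huggingface/token` by default.

# Authenticate with the HuggingFace Hub (required for the gated GPQA dataset);
# this caches a User Access Token at ~/.cache/huggingface/token by default.
huggingface-cli login
# Alternatively, export an existing token for the current shell session:
# export HF_TOKEN="<your-hf-access-token>"

# Install the Oumi evaluation extras (needed for the IFEval and MATH benchmarks).
pip install oumi[evaluation]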
HuggingFace Leaderboard V1#
Before HuggingFace's leaderboard V2 was introduced, the most popular benchmarks were captured in the V1 leaderboard. Note that due to the fast advancement of AI models, many of these benchmarks have become saturated (i.e., too easy to measure meaningful improvements in recent models), while newer models have also shown signs of contamination, indicating that data very similar to these benchmarks may exist in their training sets.
ARC (AI2 Reasoning Challenge) [paper]
MMLU (Massive Multitask Language Understanding) [paper]
Winogrande (Adversarial Winograd Schema Challenge at Scale) [paper]
HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) [paper]
GSM 8K (Grade School Math) [paper]
TruthfulQA (Measuring How Models Mimic Human Falsehoods) [paper]
You can evaluate a model on HuggingFace's V1 leaderboard by creating a YAML file like the one below and invoking the CLI with the command that follows it:
configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v1_eval.yaml
# Class: oumi.core.configs.EvaluationConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/evaluation_config.py
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  load_pretrained_weights: True
  trust_remote_code: True

generation:
  batch_size: 4
############################## HuggingFace Leaderboard V1 ##############################
# https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard #
# #
# Benchmarks: #
# - MMLU (Massive Multitask Language Understanding): mmlu #
# - ARC (AI2 Reasoning Challenge): arc_challenge #
# - Winogrande (Adversarial Winograd Schema Challenge at Scale): winogrande #
# - HellaSwag: hellaswag #
# - TruthfulQA (Measuring How Models Mimic Human Falsehoods): truthfulqa_mc2 #
# - GSM 8K (Grade School Math): gsm8k #
########################################################################################
tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: arc_challenge
    eval_kwargs:
      num_fewshot: 25
  - evaluation_platform: lm_harness
    task_name: winogrande
    eval_kwargs:
      num_fewshot: 5
  - evaluation_platform: lm_harness
    task_name: hellaswag
    eval_kwargs:
      num_fewshot: 10
  - evaluation_platform: lm_harness
    task_name: truthfulqa_mc2
    eval_kwargs:
      num_fewshot: 0
  - evaluation_platform: lm_harness
    task_name: gsm8k
    eval_kwargs:
      num_fewshot: 5
# NOTE: If you are running this in a remote machine, which is not accessible after the
# evaluation completes, you need to re-direct your output to persistent storage.
# For GCP nodes, you can store your output into a mounted GCS Bucket.
# For example: `output_dir: "/my-gcs-bucket/huggingface_leaderboard_v1"`,
# assuming that `/my-gcs-bucket` is mounted to `gs://my-gcs-bucket`.
output_dir: "./huggingface_leaderboard_v1"
oumi evaluate -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v1_eval.yaml
Running Remotely#
Running leaderboard evaluations can be resource-intensive, particularly when working with large models that require GPU acceleration. As such, you may need to execute them on remote machines with the necessary hardware resources. Provisioning and running leaderboard evaluations on a remote GCP machine can be achieved with the following sample YAML configs.
HuggingFace Leaderboard V2:
configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v2_gcp_job.yaml
# Class: oumi.core.configs.JobConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/job_config.py

# NOTE: You must first request access to the GPQA dataset here:
# https://huggingface.co/datasets/Idavidrein/gpqa

# Config to evaluate smollm 135M on HuggingFace's Leaderboard V2 (1 GCP node).
# Example command:
# oumi launch up -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v2_gcp_job.yaml --cluster smollm-135m-lb-v2-eval

name: smollm-135m-lb-v2-eval

resources:
  cloud: gcp
  accelerators: "A100:1"
  use_spot: false
  disk_size: 100  # Disk size in GBs

working_dir: .

# You can take advantage of `file_mounts` to mount important files and access them in
# the remote node, such as HuggingFace's access token. Caching this in the machine that
# executes the evaluation allows you to authenticate and verify your identity, in order
# to access non-public (or gated) models and datasets in the HuggingFace Hub.
# Specifically, for HuggingFace's Leaderboard V2 evaluation, access to GPQA is
# restricted through gating mechanisms to minimize the risk of data contamination.
# In order to evaluate with GPQA, you will have to accept the terms of use at
# https://huggingface.co/datasets/Idavidrein/gpqa, and authenticate with the HuggingFace
# token when launching the evaluation job.
file_mounts:
  ~/.cache/huggingface/token: ~/.cache/huggingface/token
  ~/.netrc: ~/.netrc  # WandB credentials

# If the remote machine is not accessible after evaluation completes, which is the
# common case when provisioning a GCP node and setting an autostop timer, you need
# to mount your output directory to persistent storage. In this case, we are using a
# GCS Bucket (`my-gcs-bucket`) to store and later retrieve the evaluation results.
# Note: Autostop is a feature that allows you to set a timer to ensure that the machine
# automatically stops after a certain period of inactivity. This is useful to save costs
# and resources when the machine is not being actively used.
# storage_mounts:
#   /my-gcs-bucket:
#     source: gs://my-gcs-bucket
#     store: gcs

envs:
  OUMI_RUN_NAME: smollm135m.eval
  # https://github.com/huggingface/tokenizers/issues/899#issuecomment-1027739758
  TOKENIZERS_PARALLELISM: false

setup: |
  set -e
  pip install uv && uv pip install oumi[gpu,evaluation]

run: |
  set -e  # Exit if any command failed.
  source ./configs/examples/misc/sky_init.sh
  set -x
  oumi evaluate -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v2_eval.yaml
  echo "Evaluation with HuggingFace's Leaderboard V2 is complete!"
oumi launch up -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v2_gcp_job.yaml
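After the job is submitted, you will typically want to check on it and shut the cluster down once the evaluation finishes, so the node does not keep accruing costs. A sketch of that workflow is shown below; it assumes the `status` and `down` subcommands of `oumi launch` and the `--cluster` flag, so double-check them against your installed Oumi version.

# Check the status of launched clusters and jobs (assumed subcommand).
oumi launch status

# Tear the cluster down once the evaluation results are safely in persistent storage.
oumi launch down --cluster smollm-135m-lb-v2-eval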
HuggingFace Leaderboard V1:
configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v1_gcp_job.yaml
# Class: oumi.core.configs.JobConfig
# https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/job_config.py

# Config to evaluate smollm 135M on HuggingFace's Leaderboard V1 (1 GCP node).
# Example command:
# oumi launch up -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v1_gcp_job.yaml --cluster smollm-135m-lb-v1-eval

name: smollm-135m-lb-v1-eval

resources:
  cloud: gcp
  accelerators: "A100:1"
  use_spot: false
  disk_size: 100  # Disk size in GBs

working_dir: .

# You can take advantage of `file_mounts` to mount important files and access them in
# the remote node, such as HuggingFace's access token. Caching this in the machine that
# executes the evaluation allows you to authenticate and verify your identity, in order
# to access non-public (or gated) models and datasets in the HuggingFace Hub.
file_mounts:
  ~/.cache/huggingface/token: ~/.cache/huggingface/token
  ~/.netrc: ~/.netrc  # WandB credentials

# If the remote machine is not accessible after evaluation completes, which is the
# common case when provisioning a GCP node and setting an autostop timer, you need
# to mount your output directory to persistent storage. In this case, we are using a
# GCS Bucket (`my-gcs-bucket`) to store and later retrieve the evaluation results.
# Note: Autostop is a feature that allows you to set a timer to ensure that the machine
# automatically stops after a certain period of inactivity. This is useful to save costs
# and resources when the machine is not being actively used.
# storage_mounts:
#   /my-gcs-bucket:
#     source: gs://my-gcs-bucket
#     store: gcs

envs:
  OUMI_RUN_NAME: smollm135m.eval
  # https://github.com/huggingface/tokenizers/issues/899#issuecomment-1027739758
  TOKENIZERS_PARALLELISM: false

setup: |
  set -e
  pip install uv && uv pip install oumi[gpu,evaluation]

run: |
  set -e  # Exit if any command failed.
  source ./configs/examples/misc/sky_init.sh
  set -x
  oumi evaluate -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v1_eval.yaml
  echo "Evaluation with HuggingFace's Leaderboard V1 is complete!"
oumi launch up -c configs/recipes/smollm/evaluation/135m/leaderboards/huggingface_leaderboard_v1_gcp_job.yaml
Tip
In addition to GCP, Oumi supports various cloud providers out of the box (including AWS, Azure, Runpod, and Lambda), as well as your own custom cluster. To explore these options, visit the running code on clusters page.
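For example, switching the sample jobs above to a different provider is typically just a matter of editing the `resources` section of the job config; a minimal sketch for an AWS node (the accelerator and disk values below are illustrative) could look like this:

resources:
  cloud: aws              # e.g. aws, azure, runpod, lambda, or a custom cluster
  accelerators: "A100:1"  # illustrative; choose an accelerator your provider offers
  use_spot: false
  disk_size: 100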
A few things to pay attention to:
Output folder. When executing on a remote machine that is not accessible after the evaluation completes, you need to redirect your output to persistent storage. For GCP, you can store your output in a mounted GCS Bucket. For example, assuming your bucket is `gs://my-gcs-bucket`, mount it and set `output_dir` as shown below.
storage_mounts:
  /my-gcs-bucket:
    source: gs://my-gcs-bucket
    store: gcs

output_dir: "/my-gcs-bucket/huggingface_leaderboard"
HuggingFace Access Token. If you need to authenticate on the HuggingFace Hub to access private or gated models, datasets, or other resources that require authorization, you need to cache HuggingFace's User Access Token on the remote machine. This token acts as a HuggingFace login credential for interacting with the platform beyond publicly available content. To do so, mount the locally cached token file (by default `~/.cache/huggingface/token`) to the remote machine, as shown below.
file_mounts:
  ~/.cache/huggingface/token: ~/.cache/huggingface/token
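Alternatively, the `HF_TOKEN` environment variable mentioned earlier can be passed to the remote machine through the job's `envs` section; a sketch with a placeholder value is shown below (the file mount above is generally preferable, since it keeps the secret out of your configs).

envs:
  # Assumption: the HuggingFace libraries on the remote node read HF_TOKEN from the environment.
  HF_TOKEN: "<your-hf-access-token>"  # placeholder value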
W&B Credentials. If you are using Weights & Biases for experiment tracking, make sure you mount the locally cached credentials file (by default `~/.netrc`) to the remote machine, as shown below.
file_mounts:
  ~/.netrc: ~/.netrc
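If you have never logged in to W&B on your local machine, this file will not exist yet; a one-time `wandb login` creates it by writing your API key to `~/.netrc`.

# One-time local login; stores the W&B API key in ~/.netrc so it can be mounted remotely.
wandb login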
Dependencies. If you need to install packages on the remote machine, such as Oumi's evaluation packages, make sure they are installed in the setup script, which is executed before the job starts (typically during cluster creation).
setup: |
  pip install oumi[evaluation]
Tip
To learn more on running jobs remotely, including attaching to various storage systems and mounting local files, visit the running code on clusters page.