Supervised Fine-Tuning#

Supervised Fine-Tuning (SFT) is the most common approach for adapting a pre-trained language model to specific downstream tasks. This involves fine-tuning the model’s parameters on a labeled dataset of input-output pairs, effectively teaching the model to perform the desired task.

This guide covers datasets used for using SFT datasets in Oumi.

SFT Datasets#

Out-of-the box, we support multiple popular SFT datasets:

Name

Description

Reference

AlpacaDataset

In-memory dataset for SFT data.

AlpacaDataset

ArgillaDollyDataset

Dataset class for the Databricks Dolly 15k curated dataset.

ArgillaDollyDataset

ArgillaMagpieUltraDataset

Dataset class for the argilla/magpie-ultra-v0.1 dataset.

ArgillaMagpieUltraDataset

AyaDataset

Dataset class for the CohereForAI/aya_dataset dataset.

AyaDataset

ChatRAGBenchDataset

In-memory dataset for SFT data.

ChatRAGBenchDataset

ChatqaDataset

In-memory dataset for SFT data.

ChatqaDataset

ChatqaTatqaDataset

ChatQA Subclass to handle tatqa subsets.

ChatqaTatqaDataset

HuggingFaceDataset

Converts HuggingFace Datasets with messages to Oumi Message format.

HuggingFaceDataset

MagpieProDataset

Dataset class for the Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1 dataset.

MagpieProDataset

OpenO1SFTDataset

Synthetic reasoning SFT dataset.

OpenO1SFTDataset

PromptResponseDataset

Converts HuggingFace Datasets with input/output columns to Message format.

PromptResponseDataset

TextSftJsonLinesDataset

TextSftJsonLinesDataset for loading SFT data in oumi and alpaca formats.

TextSftJsonLinesDataset

Tulu3MixtureDataset

In-memory dataset for SFT data.

Tulu3MixtureDataset

UltrachatH4Dataset

Dataset class for the HuggingFaceH4/ultrachat_200k dataset.

UltrachatH4Dataset

WildChatDataset

Dataset class for the allenai/WildChat-1M dataset.

WildChatDataset

Usage#

Configuration#

To use a specific SFT dataset in your Oumi configuration, specify it in the TrainingConfig.

Here’s an example:

training:
  data:
    train:
      datasets:
        - dataset_name: your_sft_dataset_name
          split: train
          stream: false
      collator_name: text_with_padding

In this configuration:

  • dataset_name specifies the name of your SFT dataset

  • split selects a specific dataset split (e.g., train, validation, test)

  • stream enables streaming mode for large datasets

  • collator_name specifies the collator to use for batching

Python API#

To use a specific SFT dataset in your code, you can use the build_dataset() function:

from oumi.builders import build_dataset
from oumi.core.configs import DatasetSplit
from torch.utils.data import DataLoader

# Assume you have your tokenizer initialized
tokenizer = ...

# Build the dataset
dataset = build_dataset(
    dataset_name="your_sft_dataset_name",
    tokenizer=tokenizer,
    dataset_split=DatasetSplit.TRAIN
)

loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Now you can use the dataset in your training loop
for batch in loader:
    # Process your batch
    ...

Adding a New SFT Dataset#

All SFT datasets in Oumi are subclasses of BaseSftDataset.

To add a new SFT dataset:

  1. Subclass BaseSftDataset

  2. Implement the transform_conversation() method to define the dataset-specific transformation logic.

  3. Register your new dataset to the dataset class by adding it to py and py.

For example:

from oumi.core.datasets import BaseSftDataset
from oumi.core.types.conversation import Conversation, Message, Role
from oumi.core.registry import register_dataset

@register_dataset("custom_sft_dataset")
class CustomSftDataset(BaseSftDataset):
    def __init__(self, config: TrainingConfig,
                 tokenizer: BaseTokenizer,
                 dataset_split: DatasetSplit):
        super().__init__(config, tokenizer, dataset_split)
        # Initialize your dataset here

    def transform_conversation(self, example: Dict[str, Any]) -> Conversation:
        # Transform the raw example into a Conversation object
        # 'example' represents one row of the raw dataset
        # Structure of 'example':
        # {
        #     'input': str,  # The user's input or question
        #     'output': str  # The assistant's response
        # }
        conversation = Conversation(
            messages=[
                Message(role=Role.USER, content=example['input']),
                Message(role=Role.ASSISTANT, content=example['output'])
            ]
        )

        return conversation

Tip

For more advanced SFT dataset implementations, explore the oumi.datasets module, which contains implementations of several open source datasets.

Using an Unregistered Dataset Whose Format is Identical to a Registered Dataset#

Many datasets on HuggingFace share the same format as Oumi registered datasets. It is not necessary to register each dataset explicitly to use it. Instead, you can override the dataset_name parameter using a keyword argument; see the code snippet below for an example of how to do this.

- dataset_name: registered_hf_dataset_with_compatible_class
  dataset_kwargs:
  - dataset_name_override: hf_dataset_with_data_to_use

NOTE: This feature is experimental, and we expect it to change in a future release.

Using Custom Datasets via the CLI#

See Customizing Oumi to quickly enable your dataset when using the CLI.