Supervised Fine-Tuning#
Supervised Fine-Tuning (SFT) is the most common approach for adapting a pre-trained language model to specific downstream tasks. This involves fine-tuning the model’s parameters on a labeled dataset of input-output pairs, effectively teaching the model to perform the desired task.
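For instance, a single SFT example pairs a user prompt with a target response. In Oumi, such a pair is represented as a Conversation, the type used throughout this guide (a minimal sketch; the prompt and response strings are illustrative):

```python
# A single SFT input-output pair expressed as an Oumi Conversation.
# The prompt and response strings here are illustrative.
from oumi.core.types.conversation import Conversation, Message, Role

example = Conversation(
    messages=[
        Message(role=Role.USER, content="What is the capital of France?"),
        Message(role=Role.ASSISTANT, content="The capital of France is Paris."),
    ]
)
```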
This guide covers how to use SFT datasets in Oumi.
SFT Datasets#
Out of the box, we support multiple popular SFT datasets:
| Name | Description |
|---|---|
| AlpacaDataset | In-memory dataset for SFT data. |
| ArgillaDollyDataset | Dataset class for the Databricks Dolly 15k curated dataset. |
| ArgillaMagpieUltraDataset | Dataset class for the argilla/magpie-ultra-v0.1 dataset. |
| AyaDataset | Dataset class for the CohereForAI/aya_dataset dataset. |
| ChatRAGBenchDataset | In-memory dataset for SFT data. |
| ChatqaDataset | In-memory dataset for SFT data. |
| ChatqaTatqaDataset | ChatQA subclass to handle tatqa subsets. |
| MagpieProDataset | Dataset class for the Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1 dataset. |
| OpenO1SFTDataset | Synthetic reasoning SFT dataset. |
| PromptResponseDataset | Converts HuggingFace Datasets with input/output columns to Message format. |
| TextSftJsonLinesDataset | Dataset class for loading SFT data in Oumi and Alpaca formats. |
| UltrachatH4Dataset | Dataset class for the HuggingFaceH4/ultrachat_200k dataset. |
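Each of these classes lives in the oumi.datasets module, so they can also be imported directly (a minimal sketch; it assumes the class names in the table above are exported from oumi.datasets):

```python
# Direct imports of two dataset classes from the table above
# (assumed to be exported from the oumi.datasets module).
from oumi.datasets import AlpacaDataset, UltrachatH4Dataset
```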
Usage#
Configuration#
To use a specific SFT dataset in your Oumi configuration, specify it in the TrainingConfig. Here's an example:
```yaml
training:
  data:
    train:
      datasets:
        - dataset_name: your_sft_dataset_name
          split: train
      stream: false
      collator_name: text_with_padding
```
In this configuration:

- dataset_name specifies the name of your SFT dataset
- split selects a specific dataset split (e.g., train, validation, test)
- stream enables streaming mode for large datasets
- collator_name specifies the collator to use for batching
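The same configuration can also be constructed programmatically with the config dataclasses (a sketch; the DataParams, DatasetSplitParams, and DatasetParams names follow oumi.core.configs and are assumed to match your Oumi version):

```python
# Programmatic equivalent of the YAML above (sketch; class names
# are assumed to be exported from oumi.core.configs).
from oumi.core.configs import DataParams, DatasetParams, DatasetSplitParams

data_params = DataParams(
    train=DatasetSplitParams(
        datasets=[
            DatasetParams(
                dataset_name="your_sft_dataset_name",
                split="train",
            )
        ],
        stream=False,
        collator_name="text_with_padding",
    )
)
```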
Python API#
To use a specific SFT dataset in your code, you can use the build_dataset() function:
```python
from oumi.builders import build_dataset
from oumi.core.configs import DatasetSplit
from torch.utils.data import DataLoader

# Assume you have your tokenizer initialized
tokenizer = ...

# Build the dataset
dataset = build_dataset(
    dataset_name="your_sft_dataset_name",
    tokenizer=tokenizer,
    dataset_split=DatasetSplit.TRAIN,
)

loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Now you can use the dataset in your training loop
for batch in loader:
    # Process your batch
    ...
```
Adding a New SFT Dataset#
All SFT datasets in Oumi are subclasses of BaseSftDataset.

To add a new SFT dataset:

1. Subclass BaseSftDataset.
2. Implement the transform_conversation() method to define the dataset-specific transformation logic.
For example:
```python
from typing import Any, Dict

from oumi.core.configs import DatasetSplit, TrainingConfig
from oumi.core.datasets import BaseSftDataset
from oumi.core.registry import register_dataset
from oumi.core.tokenizers import BaseTokenizer
from oumi.core.types.conversation import Conversation, Message, Role


@register_dataset("custom_sft_dataset")
class CustomSftDataset(BaseSftDataset):
    def __init__(
        self,
        config: TrainingConfig,
        tokenizer: BaseTokenizer,
        dataset_split: DatasetSplit,
    ):
        super().__init__(config, tokenizer, dataset_split)
        # Initialize your dataset here

    def transform_conversation(self, example: Dict[str, Any]) -> Conversation:
        # Transform the raw example into a Conversation object.
        # 'example' represents one row of the raw dataset.
        # Structure of 'example':
        # {
        #     'input': str,   # The user's input or question
        #     'output': str,  # The assistant's response
        # }
        conversation = Conversation(
            messages=[
                Message(role=Role.USER, content=example["input"]),
                Message(role=Role.ASSISTANT, content=example["output"]),
            ]
        )
        return conversation
```
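Once registered, the dataset can be built by its registered name just like the built-in ones (a sketch reusing build_dataset from the Python API section; the tokenizer is assumed to be initialized as shown earlier):

```python
# Build the newly registered dataset by its registered name
# (assumes a tokenizer initialized as in the Python API section).
from oumi.builders import build_dataset
from oumi.core.configs import DatasetSplit

dataset = build_dataset(
    dataset_name="custom_sft_dataset",
    tokenizer=tokenizer,
    dataset_split=DatasetSplit.TRAIN,
)
```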
Tip
For more advanced SFT dataset implementations, explore the oumi.datasets module, which contains implementations of several open source datasets.
Using Custom Datasets via the CLI#
See Customizing Oumi to quickly enable your dataset when using the CLI.