Pre-training#
Pre-training is the process of training a language model from scratch, or continuing training on a pre-trained model, using large amounts of unlabeled text data. The most common pre-training method is Causal Language Modeling (CLM), where the model predicts the next token in a sequence, given the preceding tokens.
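For intuition, here is a minimal sketch (plain PyTorch, not Oumi code) of the next-token objective: the logits at each position are scored against the token that follows it. The tensor shapes and vocabulary size are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Illustrative only: one tokenized sequence and a stand-in for the model's output logits.
token_ids = torch.tensor([[5, 17, 42, 8, 99]])   # shape: (batch=1, seq_len=5)
vocab_size = 1000
logits = torch.randn(1, 5, vocab_size)           # in practice: model(token_ids).logits

# Positions 0..n-2 predict tokens 1..n-1 (the "next token").
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
```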
This guide covers the pre-training datasets available in Oumi, whether you are training a language model from scratch or continuing pre-training on an existing model.
Supported Datasets#
Out of the box, we support multiple popular pre-training datasets:
| Name | Description |
|---|---|
| `C4Dataset` | A dataset for pretraining on the Colossal Clean Crawled Corpus (C4). |
| `DolmaDataset` | Dolma: a dataset of 3 trillion tokens from diverse web content. |
| `FalconRefinedWebDataset` | A massive English web dataset built by TII for pretraining large language models. |
| `FineWebEduDataset` | FineWeb-Edu: a high-quality educational dataset filtered from web content. |
| `PileV1Dataset` | The Pile: an 825 GiB diverse, open-source language modeling dataset. |
| `RedPajamaDataV1Dataset` | RedPajama: a clean-room, fully open-source reproduction of the LLaMA training dataset. |
| `RedPajamaDataV2Dataset` | RedPajama V2: a web dataset for training large language models. |
| `SlimPajamaDataset` | SlimPajama-627B: a cleaned and deduplicated version of RedPajama. |
| `StarCoderDataset` | The StarCoder training dataset, used to train the StarCoder and StarCoderBase models. |
| `TheStackDataset` | A dataset containing over 6 TB of permissively licensed source code files. |
| `TinyStoriesDataset` | TinyStories: a dataset of short, simple stories for training small language models. |
| `TinyTextbooksDataset` | A dataset of textbook-like content for training small language models. |
| `WikiTextDataset` | The WikiText language modeling dataset. |
| `WikipediaDataset` | Cleaned Wikipedia articles in multiple languages. |
| `YouTubeCommonsDataset` | YouTube-Commons: transcripts of CC-BY-licensed YouTube videos. |
Usage#
Configuration#
To use a specific pre-training dataset in your Oumi configuration, you need to specify it in the `TrainingConfig`. Here's an example of how to configure a pre-training dataset:
```yaml
training:
  data:
    train:
      datasets:
        - dataset_name: your_pretraining_dataset
          subset: optional_subset
          split: train
          stream: true  # Recommended for large datasets
          pack: true    # Enable sequence packing
          dataset_kwargs:
            seq_length: 4096  # Packing sequence length
```
In this configuration:

- `dataset_name` specifies the name of your dataset
- `subset` and `split` allow you to select a specific dataset subset (if defined by the dataset) and a dataset split (e.g. train, validation, test)
- `stream` enables streaming mode, which is essential for large datasets
- `pack` activates sequence packing (see the sketch after this list)
- `dataset_kwargs` allows you to pass additional parameters specific to your dataset
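Sequence packing concatenates many (often short) tokenized documents into one stream and slices it into fixed-length blocks of `seq_length` tokens, so no compute is wasted on padding. The following is a minimal conceptual sketch of the idea, not Oumi's actual packing implementation:

```python
from typing import Iterable, Iterator, List

def pack_sequences(
    tokenized_docs: Iterable[List[int]], seq_length: int, eos_token_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents and yield fixed-length blocks."""
    buffer: List[int] = []
    for doc in tokenized_docs:
        buffer.extend(doc + [eos_token_id])  # separate documents with an EOS token
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any leftover tokens shorter than seq_length are dropped in this sketch.

# Example: pack three short "documents" into blocks of 8 tokens.
docs = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10, 11, 12, 13]]
print(list(pack_sequences(docs, seq_length=8, eos_token_id=0)))
# -> [[1, 2, 3, 0, 4, 5, 6, 7], [8, 9, 0, 10, 11, 12, 13, 0]]
```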
Python API#
To use a specific pre-training dataset in your code, you can leverage the `build_dataset_mixture()` function. Here's an example:
```python
from oumi.builders import build_dataset_mixture
from oumi.core.configs import TrainingConfig, DatasetSplit
from oumi.core.tokenizers import BaseTokenizer

# Assume you have your config and tokenizer initialized
config: TrainingConfig = ...
tokenizer: BaseTokenizer = ...

# Build the dataset
dataset = build_dataset_mixture(
    config=config,
    tokenizer=tokenizer,
    dataset_split=DatasetSplit.TRAIN,
)

# Now you can use the dataset in your training loop
for batch in dataset:
    # Process your batch
    ...
```
The `build_dataset_mixture()` function takes care of creating the appropriate dataset based on your configuration. It handles the complexities of dataset initialization, including:

- Applying the correct tokenizer
- Setting up streaming if enabled
- Configuring sequence packing if specified
- Handling dataset mixtures if multiple datasets are defined
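As a quick, illustrative sanity check (assuming packing is enabled with the configuration shown earlier, and that examples expose the fields produced by `transform()`), you can peek at the first example the dataset yields:

```python
# Illustrative only: inspect the first example produced by the built dataset.
first_example = next(iter(dataset))
print(sorted(first_example.keys()))      # e.g. ['attention_mask', 'input_ids', 'labels']
print(len(first_example["input_ids"]))   # with packing enabled, this should match seq_length (4096 above)
```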
Adding a New Pre-training Dataset#
All pre-training datasets in Oumi are subclasses of `BasePretrainingDataset`. This class extends `BaseIterableDataset` to offer functionality specific to pre-training tasks.

Note: `BasePretrainingDataset` is an abstract base class. You should implement your specific dataset by subclassing it and overriding the `transform()` method.

To add a new pre-training dataset, you need to:

1. Subclass `BasePretrainingDataset`.
2. Implement the `transform()` method to define the dataset-specific transformation logic.
For example:
```python
from typing import Any, Dict

from oumi.core.datasets import BasePretrainingDataset
from oumi.core.registry import register_dataset


@register_dataset("custom_pretraining_dataset")
class CustomPretrainingDataset(BasePretrainingDataset):
    """A custom pretraining dataset."""

    default_dataset = "custom_pretraining_name"

    def transform(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Tokenize the raw text for pretraining
        tokens = self.tokenizer(
            data["text"],
            max_length=self.max_length,
            truncation=True,
        )
        return {
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            # For causal language modeling, the labels are the input ids;
            # the next-token shift happens inside the loss computation.
            "labels": tokens["input_ids"].copy(),
        }
```
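To sanity-check the `transform()` logic in isolation, without going through the Oumi dataset constructor (whose exact arguments may differ across versions), you can fake the two attributes the method reads, `self.tokenizer` and `self.max_length`. This is an illustrative smoke test only:

```python
from types import SimpleNamespace

from transformers import AutoTokenizer

# Illustrative smoke test: bypasses the real constructor and fakes the
# attributes that transform() uses (tokenizer and max_length).
fake_self = SimpleNamespace(
    tokenizer=AutoTokenizer.from_pretrained("gpt2"),
    max_length=128,
)
example = CustomPretrainingDataset.transform(
    fake_self, {"text": "Hello pretraining world."}
)
assert set(example) == {"input_ids", "attention_mask", "labels"}
assert example["labels"] == example["input_ids"]
```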
Using Custom Datasets via the CLI#
See Customizing Oumi to quickly enable your dataset when using the CLI.