Pre-training#

Pre-training is the process of training a language model from scratch, or continuing training on a pre-trained model, using large amounts of unlabeled text data. The most common pre-training method is Causal Language Modeling (CLM), where the model predicts the next token in a sequence, given the preceding tokens.
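To make the CLM objective concrete, here is a minimal, self-contained PyTorch sketch (toy code, not Oumi's implementation): an embedding layer and a linear head stand in for a transformer, and the labels are simply the input tokens shifted by one position so that each position predicts the next token.

import torch
import torch.nn.functional as F

vocab_size, hidden_size = 100, 32
token_ids = torch.randint(0, vocab_size, (1, 16))  # one toy "document" of 16 tokens

# Toy stand-ins for a transformer language model.
embedding = torch.nn.Embedding(vocab_size, hidden_size)
lm_head = torch.nn.Linear(hidden_size, vocab_size)

hidden_states = embedding(token_ids)  # (batch, seq_len, hidden_size)
logits = lm_head(hidden_states)       # (batch, seq_len, vocab_size)

# Position t predicts token t + 1: drop the last logit and the first token.
shift_logits = logits[:, :-1, :]
shift_labels = token_ids[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)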

This guide covers the pre-training datasets available in Oumi, whether you are training a language model from scratch or continuing pre-training of an existing model.

Supported Datasets#

Out of the box, we support multiple popular pre-training datasets:

| Name | Description |
| --- | --- |
| C4Dataset | A dataset for pretraining on the Colossal Clean Crawled Corpus (C4). |
| DolmaDataset | Dolma: a dataset of 3 trillion tokens from diverse web content. |
| FalconRefinedWebDataset | A massive English web dataset built by TII for pretraining large language models. |
| FineWebEduDataset | FineWeb-Edu: a high-quality educational dataset filtered from web content. |
| PileV1Dataset | The Pile: an 825 GiB diverse, open-source language modeling dataset. |
| RedPajamaDataV1Dataset | RedPajama: a clean-room, fully open-source reproduction of the LLaMA training dataset. |
| RedPajamaDataV2Dataset | RedPajama V2 dataset for training large language models. |
| SlimPajamaDataset | SlimPajama-627B: a cleaned and deduplicated version of RedPajama. |
| StarCoderDataset | The StarCoder training dataset used to train the StarCoder and StarCoderBase models. |
| TheStackDataset | A dataset containing over 6 TB of permissively licensed source code files. |
| TinyStoriesDataset | TinyStories: a dataset of short, simple stories for training small language models. |
| TinyTextbooksDataset | A dataset of textbook-like content for training small language models. |
| WikiTextDataset | The WikiText language modeling dataset. |
| WikipediaDataset | A dataset of cleaned Wikipedia articles in multiple languages. |
| YouTubeCommonsDataset | The YouTube-Commons dataset of YouTube video transcripts. |

Usage#

Configuration#

To use a specific pre-training dataset in your Oumi configuration, you need to specify it in the TrainingConfig. Here’s an example of how to configure a pre-training dataset:

data:
  train:
    datasets:
      - dataset_name: your_pretraining_dataset
        subset: optional_subset
        split: train
        dataset_kwargs:
          seq_length: 4096  # Packing sequence length
    stream: true  # Recommended for large datasets
    pack: true    # Enable sequence packing

In this configuration:

  • dataset_name specifies the registered name of the dataset

  • subset and split select a dataset subset (if the dataset defines any) and a specific split (e.g. train, validation, test)

  • stream enables streaming mode, which is essential for large datasets

  • pack activates sequence packing, which concatenates documents into fixed-length sequences

  • dataset_kwargs passes additional dataset-specific parameters, such as the packing sequence length (seq_length) above

Note that stream and pack are set at the split level (under train), so they apply to every dataset in that split.
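As a concrete illustration, the configuration below uses the FineWeb-Edu dataset from the table above. The Hugging Face Hub identifier (HuggingFaceFW/fineweb-edu) and the sample-10BT subset are given as examples only; check them against the dataset card and the FineWebEduDataset reference before use.

data:
  train:
    datasets:
      - dataset_name: HuggingFaceFW/fineweb-edu  # Example Hub identifier
        subset: sample-10BT                      # Example subset
        split: train
        dataset_kwargs:
          seq_length: 2048  # Packing sequence length
    stream: true
    pack: true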

Python API#

To use a specific pre-training dataset in your code, you can leverage the build_dataset_mixture() function. Here’s an example:

from oumi.builders import build_dataset_mixture
from oumi.core.configs import TrainingConfig, DatasetSplit
from oumi.core.tokenizers import BaseTokenizer

# Assume you have your config and tokenizer initialized
config: TrainingConfig = ...
tokenizer: BaseTokenizer = ...

# Build the dataset
dataset = build_dataset_mixture(
    config=config,
    tokenizer=tokenizer,
    dataset_split=DatasetSplit.TRAIN
)

# Now you can use the dataset in your training loop
for batch in dataset:
    # Process your batch
    ...

The build_dataset_mixture() function takes care of creating the appropriate dataset based on your configuration. It handles the complexities of dataset initialization, including:

  • Applying the correct tokenizer

  • Setting up streaming if enabled

  • Configuring sequence packing if specified

  • Handling dataset mixtures if multiple datasets are defined
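To make the config and tokenizer placeholders above concrete, here is a minimal sketch that builds them programmatically instead of loading a YAML file. It assumes the ModelParams, DataParams, DatasetSplitParams, and DatasetParams classes exported by oumi.core.configs and the build_tokenizer() builder from oumi.builders; the model name and dataset identifier are illustrative, so verify them against your Oumi version.

from oumi.builders import build_dataset_mixture, build_tokenizer
from oumi.core.configs import (
    DataParams,
    DatasetParams,
    DatasetSplit,
    DatasetSplitParams,
    ModelParams,
    TrainingConfig,
)

# Illustrative model; any causal LM with a matching tokenizer works here.
model_params = ModelParams(model_name="gpt2")

config = TrainingConfig(
    model=model_params,
    data=DataParams(
        train=DatasetSplitParams(
            datasets=[
                # Illustrative dataset identifier; see the table above for
                # the supported dataset classes.
                DatasetParams(
                    dataset_name="HuggingFaceFW/fineweb-edu",
                    subset="sample-10BT",
                ),
            ],
            stream=True,  # Stream rather than download the full corpus
            pack=True,    # Pack documents into fixed-length sequences
        )
    ),
)

tokenizer = build_tokenizer(model_params)
dataset = build_dataset_mixture(
    config=config,
    tokenizer=tokenizer,
    dataset_split=DatasetSplit.TRAIN,
)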

Adding a New Pre-training Dataset#

All pre-training datasets in Oumi are subclasses of BasePretrainingDataset.

This class extends BaseIterableDataset to offer functionality specific to pre-training tasks.

Note

BasePretrainingDataset is an abstract base class. Implement your specific dataset by subclassing it and overriding the transform() method.

To add a new pretraining dataset, you have to:

  1. Subclass BasePretrainingDataset.

  2. Implement the transform() method to define the dataset-specific transformation logic.

For example:

from typing import Any, Dict

from oumi.core.datasets import BasePretrainingDataset
from oumi.core.registry import register_dataset

@register_dataset("custom_pretraining_dataset")
class CustomPretrainingDataset(BasePretrainingDataset):
    """A custom pretraining dataset."""

    default_dataset = "custom_pretraining_name"

    def transform(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Tokenize the raw text and build the fields expected for causal LM training.
        tokens = self.tokenizer(
            data["text"],
            max_length=self.max_length,
            truncation=True,
        )
        return {
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            # For CLM, labels are the input ids themselves; the one-position
            # shift is applied by the model/trainer when computing the loss.
            "labels": tokens["input_ids"].copy(),
        }
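Once registered via @register_dataset, the custom dataset can be referenced from a training config by its registered name (dataset_name: custom_pretraining_dataset), just like the built-in datasets listed above.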

Using Custom Datasets via the CLI#

See Customizing Oumi to quickly enable your dataset when using the CLI.