oumi.datasets#

Datasets module for the Oumi (Open Universal Machine Intelligence) library.

This module provides various dataset implementations for use in the Oumi framework. These datasets are designed for different machine learning tasks and can be used with the models and training pipelines provided by Oumi.

Each dataset is implemented as a separate class, inheriting from appropriate base classes in the oumi.core.datasets module. For usage examples and more detailed information on each dataset, please refer to the respective class documentation below.
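
As a sketch of this pattern (the import paths for Conversation, Message, and Role are assumptions and should be checked against oumi.core.types), a custom SFT dataset only needs to subclass BaseSftDataset and implement transform_conversation:
>>> from oumi.core.datasets import BaseSftDataset
>>> from oumi.core.types import Conversation, Message, Role
>>> class MyQaDataset(BaseSftDataset):
...     default_dataset = "my-org/my-qa-data"  # hypothetical dataset id
...
...     def transform_conversation(self, example):
...         # Map one raw record (assumed to carry "question"/"answer" fields)
...         # to an Oumi Conversation.
...         return Conversation(messages=[
...             Message(role=Role.USER, content=example["question"]),
...             Message(role=Role.ASSISTANT, content=example["answer"]),
...         ])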

Example

>>> from oumi.datasets import AlpacaDataset
>>> from torch.utils.data import DataLoader
>>> dataset = AlpacaDataset()
>>> train_loader = DataLoader(dataset, batch_size=32)
class oumi.datasets.AlpacaDataset(*, include_system_prompt: bool = True, **kwargs)[source]#

Bases: BaseSftDataset

dataset_name: str#
default_dataset: str | None = 'tatsu-lab/alpaca'#
system_prompt_with_context = 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.'#
system_prompt_without_context = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'#
transform_conversation(example: dict | Series) → Conversation[source]#

Preprocesses the inputs of the example and returns an Oumi Conversation.

Parameters:

example (dict or Pandas Series) – An example containing input (optional), instruction, and output entries.

Returns:

The input example converted to an Oumi Conversation in Alpaca format.

Return type:

Conversation
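
For illustration, a minimal sketch that applies the transform to a single hand-written Alpaca-style record (the record values are made up; constructing the dataset downloads tatsu-lab/alpaca):
>>> from oumi.datasets import AlpacaDataset
>>> dataset = AlpacaDataset(include_system_prompt=False)
>>> conversation = dataset.transform_conversation(
...     {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
... )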

trust_remote_code: bool#
class oumi.datasets.AlpacaEvalDataset(*, include_system_prompt: bool = False, unused_entries_to_metadata: bool = False, trust_remote_code: bool = True, **kwargs)[source]#

Bases: BaseSftDataset

dataset_name: str#
default_dataset: str | None = 'tatsu-lab/alpaca_eval'#
system_prompt_with_context = 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.'#
system_prompt_without_context = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'#
transform_conversation(example: dict | Series) → Conversation[source]#

Preprocesses the inputs of the example and returns an Oumi Conversation.

Parameters:

example (dict or Pandas Series) – An example containing instruction and (optional) input entries.

Returns:

The input example converted to an Oumi Conversation in Alpaca format.

Return type:

Conversation

Note

If unused_entries_to_metadata is set, all entries of the example other than the expected ones (i.e., input and instruction) are saved as metadata.
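
A hedged sketch of that behavior (the extra "dataset" key below is a hypothetical entry, and the metadata location is inferred from the note above rather than verified):
>>> from oumi.datasets import AlpacaEvalDataset
>>> dataset = AlpacaEvalDataset(unused_entries_to_metadata=True)
>>> conversation = dataset.transform_conversation(
...     {"instruction": "Summarize the text.", "input": "", "dataset": "helpful_base"}
... )
>>> # The unexpected "dataset" entry should be stored as conversation metadata.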

trust_remote_code: bool#
class oumi.datasets.ArgillaDollyDataset(*, use_new_fields: bool = True, **kwargs)[source]#

Bases: BaseSftDataset

Dataset class for the Databricks Dolly 15k curated dataset.

dataset_name: str#
default_dataset: str | None = 'argilla/databricks-dolly-15k-curated-en'#
transform_conversation(example: dict | Series) → Conversation[source]#

Transform a dataset example into a Conversation object.

Parameters:

example – A single example from the dataset.

Returns:

A Conversation object containing the transformed messages.

Return type:

Conversation

trust_remote_code: bool#
class oumi.datasets.ArgillaMagpieUltraDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#

Bases: BaseSftDataset

Dataset class for the argilla/magpie-ultra-v0.1 dataset.

dataset_name: str#
default_dataset: str | None = 'argilla/magpie-ultra-v0.1'#
transform_conversation(example: dict | Series) → Conversation[source]#

Transform a dataset example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.AyaDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#

Bases: BaseSftDataset

Dataset class for the CohereForAI/aya_dataset dataset.

dataset_name: str#
default_dataset: str | None = 'CohereForAI/aya_dataset'#
transform_conversation(example: dict | Series) → Conversation[source]#

Transform a dataset example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.C4Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

A dataset for pretraining on the Colossal Clean Crawled Corpus (C4).

The C4 dataset is based on the Common Crawl dataset and is available in multiple variants: ‘en’, ‘en.noclean’, ‘en.noblocklist’, ‘realnewslike’, and ‘multilingual’ (mC4). It is intended for pretraining language models and word representations.

For more details and download instructions, visit: https://huggingface.co/datasets/allenai/c4

References

Paper: https://arxiv.org/abs/1910.10683

Data Fields:
  • url – URL of the source as a string

  • text – Text content as a string

  • timestamp – Timestamp as a string

Dataset Variants:
  • en: 305GB

  • en.noclean: 2.3TB

  • en.noblocklist: 380GB

  • realnewslike: 15GB

  • multilingual (mC4): 9.7TB (108 subsets, one per language)

The dataset is released under the ODC-BY license and is subject to the Common Crawl terms of use.
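
A minimal construction sketch for pretraining use, assuming the variant and split are selected via keyword arguments forwarded through **kwargs (an assumption, not a documented guarantee):
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets import C4Dataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> dataset = C4Dataset(
...     tokenizer=tokenizer,
...     seq_length=512,
...     subset="en",    # assumed kwarg selecting the C4 variant
...     split="train",  # assumed kwarg
... )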

dataset_name: str#
default_dataset: str | None = 'allenai/c4'#
class oumi.datasets.COCOCaptionsDataset(*, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, **kwargs)[source]#

Bases: VisionLanguageSftDataset

Dataset class for the HuggingFaceM4/COCO dataset.

dataset_name: str#
default_dataset: str | None = 'HuggingFaceM4/COCO'#
default_prompt = 'Describe this image:'#
transform_conversation(example: dict) → Conversation[source]#

Transform a single conversation example into a Conversation object.

trust_remote_code: bool#
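
A construction sketch mirroring the VLJsonlinesDataset example later on this page; the tokenizer and processor choices are illustrative only:
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets import COCOCaptionsDataset
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = COCOCaptionsDataset(
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
...     limit=100,  # optionally cap the number of examples
... )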
class oumi.datasets.ChatRAGBenchDataset(*, split: str = 'test', task: str = 'generation', subset: str | None = None, num_context_docs: int = 5, **kwargs)[source]#

Bases: BaseSftDataset

default_dataset: str = 'nvidia/ChatRAG-Bench'#
default_subset: str = 'doc2dial'#
default_system_message: str = "This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."#
transform_conversation(example: dict | Series) → Conversation[source]#

Transforms a given example into a Conversation object.

Parameters:

example (Union[dict, pd.Series]) – The example to transform.

Returns:

The transformed Conversation object.

Return type:

Conversation
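
A brief usage sketch, keeping the default test split and selecting a subset and context-document count explicitly (the values shown are only an example):
>>> from oumi.datasets import ChatRAGBenchDataset
>>> dataset = ChatRAGBenchDataset(subset="doc2dial", num_context_docs=3)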

class oumi.datasets.ChatqaDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#

Bases: BaseSftDataset

dataset_name: str#
default_dataset: str | None = 'nvidia/ChatQA-Training-Data'#
default_subset: str | None = 'sft'#
transform_conversation(raw_conversation: dict | Series) → Conversation[source]#

Preprocesses the inputs of the example and returns an Oumi Conversation.

ChatQA is a conversational question answering dataset. It contains 10 subsets. Some subsets contain grounding documents.

See the dataset page for more information: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data

Parameters:

raw_conversation – The raw conversation example.

Returns:

The preprocessed inputs as an Oumi conversation.

Return type:

Conversation

trust_remote_code: bool#
class oumi.datasets.ChatqaTatqaDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#

Bases: ChatqaDataset

ChatQA Subclass to handle tatqa subsets.

The tatqa subsets require loading a specific file from the dataset repository, thus requiring us to override the default loading behavior.

dataset_name: str#
default_subset: str | None = 'tatqa-arithmetic'#
trust_remote_code: bool#
class oumi.datasets.DebugClassificationDataset(dataset_size: int = 1000, feature_dim: int = 128, data_type: str = 'float32', num_classes: int = 10, preprocessing_time_ms: float = 0, **kwargs)[source]#

Bases: Dataset

__getitem__(idx)[source]#

Return the data and label at the given index.

__len__()[source]#

Return the size of the dataset.
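
A quick sketch for smoke-testing a data pipeline without downloading anything (the exact structure of each returned item is not asserted here):
>>> from oumi.datasets import DebugClassificationDataset
>>> dataset = DebugClassificationDataset(dataset_size=8, feature_dim=4, num_classes=2)
>>> len(dataset)
8
>>> item = dataset[0]  # data/label pair at index 0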

class oumi.datasets.DebugPretrainingDataset(dataset_size: int = 1000, **kwargs)[source]#

Bases: BasePretrainingDataset

dataset_name: str#
default_dataset: str | None = 'debug_pretraining'#
class oumi.datasets.DebugSftDataset(dataset_size: int = 5, **kwargs)[source]#

Bases: BaseSftDataset

dataset_name: str#
default_dataset: str | None = 'debug_sft'#
transform_conversation(example: dict | Series) → Conversation[source]#

Transforms the example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.DolmaDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

Dolma: A dataset of 3 trillion tokens from diverse web content.

Dolma [1] is a large-scale dataset containing approximately 3 trillion tokens sourced from various web content, academic publications, code, books, and encyclopedic materials. It is designed for language modeling tasks and causal language model training.

The dataset is available in multiple versions, with v1.7 being the latest release used to train OLMo 7B-v1.7. It includes data from sources such as Common Crawl, Refined Web, StarCoder, C4, Reddit, Semantic Scholar, arXiv, StackExchange, and more.

Data Fields:
  • id (str) – Unique identifier for the data entry.

  • text (str) – The main content of the data entry.

  • added (str, optional) – Timestamp indicating when the entry was added to the dataset.

  • created (str, optional) – Timestamp indicating when the original content was created.

  • source (str, optional) – Information about the origin or source of the data.

Note

The dataset is released under the ODC-BY license. Users are bound by the license agreements and terms of use of the original data sources.

dataset_name: str#
default_dataset: str | None = 'allenai/dolma'#
class oumi.datasets.FalconRefinedWebDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

A massive English web dataset built by TII for pretraining large language models.

The Falcon RefinedWeb dataset is created through stringent filtering and large-scale deduplication of CommonCrawl. It contains about 1B instances (968M individual web pages) for a total of 2.8TB of clean text data.

This dataset is intended primarily for pretraining large language models and can be used on its own or augmented with curated sources.

Dataset Link:

https://huggingface.co/datasets/tiiuae/falcon-refinedweb

Paper:

https://arxiv.org/abs/2306.01116

Features:
  • content (str): The processed and cleaned text contained in the page.

  • url (str): The URL of the webpage crawled to produce the sample.

  • timestamp (timestamp[s]): Timestamp of when the webpage was crawled by CommonCrawl.

  • dump (str): The CommonCrawl dump the sample is a part of.

  • segment (str): The CommonCrawl segment the sample is a part of.

  • image_urls (List[List[str]]): A list of [image_url, image_alt_text] pairs for all images found in the content.

Usage:

from datasets import load_dataset
rw = load_dataset("tiiuae/falcon-refinedweb")

License:

ODC-By 1.0

Note

  • This public extract is about 500GB to download, requiring 2.8TB of local storage once unpacked.

  • The dataset may contain sensitive information and biased content.

  • No canonical splits are provided for this dataset.

dataset_name: str#
default_dataset: str | None = 'tiiuae/falcon-refinedweb'#
class oumi.datasets.FineWebEduDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

FineWeb-Edu: A high-quality educational dataset filtered from web content.

This dataset contains 1.3 trillion tokens of educational web pages filtered from the FineWeb dataset using an educational quality classifier. It aims to provide the finest collection of educational content from the web [2].

The dataset is available in multiple configurations:
  • Full dataset (default)

  • Individual CommonCrawl dumps (e.g. CC-MAIN-2024-10)

  • Sample subsets (10BT, 100BT, 350BT tokens)

Key Features:
  • 1.3 trillion tokens of educational content

  • Filtered using a classifier trained on Llama3-70B-Instruct annotations

  • Outperforms other web datasets on educational benchmarks

Note

The dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0.

dataset_name: str#
default_dataset: str | None = 'HuggingFaceFW/fineweb-edu'#
class oumi.datasets.Flickr30kDataset(*, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, **kwargs)[source]#

Bases: VisionLanguageSftDataset

Dataset class for the nlphuji/flickr30k dataset.

dataset_name: str#
default_dataset: str | None = 'nlphuji/flickr30k'#
transform_conversation(example: dict) → Conversation[source]#

Transform a single conversation example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.LlavaInstructMixVsftDataset(*, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, **kwargs)[source]#

Bases: VisionLanguageSftDataset

Dataset class for the HuggingFaceH4/llava-instruct-mix-vsft dataset.

dataset_name: str#
default_dataset: str | None = 'HuggingFaceH4/llava-instruct-mix-vsft'#
transform_conversation(example: dict) → Conversation[source]#

Transform a dataset example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.MagpieProDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#

Bases: BaseSftDataset

Dataset class for the Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1 dataset.

dataset_name: str#
default_dataset: str | None = 'Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1'#
transform_conversation(example: dict | Series) → Conversation[source]#

Transform a dataset example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.OpenO1SFTDataset(**kwargs)[source]#

Bases: PromptResponseDataset

Synthetic reasoning SFT dataset.

dataset_name: str#
default_dataset: str | None = 'O1-OPEN/OpenO1-SFT'#
trust_remote_code: bool#
class oumi.datasets.OrpoDpoMix40kDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, **kwargs)[source]#

Bases: BaseExperimentalDpoDataset

Preprocess the ORPO dataset for DPO.

A dataset designed for ORPO (Odds Ratio Preference Optimization) or DPO (Direct Preference Optimization) training.

This dataset is a combination of high-quality DPO datasets, including:
  • Capybara-Preferences

  • distilabel-intel-orca-dpo-pairs

  • ultrafeedback-binarized-preferences-cleaned

  • distilabel-math-preference-dpo

  • toxic-dpo-v0.2

  • prm_dpo_pairs_cleaned

  • truthy-dpo-v0.1

Rule-based filtering was applied to remove ‘gptisms’ in the chosen answers.

Data Fields:
  • source – string

  • chosen – list of dictionaries with ‘content’ and ‘role’ fields

  • rejected – list of dictionaries with ‘content’ and ‘role’ fields

  • prompt – string

  • question – string

See also

For more information on how to use this dataset, refer to:
  • Blog post: https://huggingface.co/blog/mlabonne/orpo-llama-3

  • Hugging Face Hub: https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k

dataset_name: str#
default_dataset: str | None = 'mlabonne/orpo-dpo-mix-40k'#
trust_remote_code: bool#
class oumi.datasets.PileV1Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

The Pile: An 825 GiB diverse, open source language modeling dataset.

The Pile is a large-scale English language dataset consisting of 22 smaller, high-quality datasets combined together. It is designed for training large language models and supports various natural language processing tasks [3][4].

Data Fields:
  • text (str) – The main text content.

  • meta (dict) – Metadata about the instance, including ‘pile_set_name’.

Key Features:
  • 825 GiB of diverse text data

  • Primarily in English

  • Supports text generation and fill-mask tasks

  • Includes various subsets like enron_emails, europarl, free_law, etc.

Subsets:
  • all

  • enron_emails

  • europarl

  • free_law

  • hacker_news

  • nih_exporter

  • pubmed

  • pubmed_central

  • ubuntu_irc

  • uspto

  • github

Splits:
  • train

  • validation

  • test

Warning

This dataset contains text from various sources and may include personal or sensitive information. Users should consider potential biases and limitations when using this dataset.

dataset_name: str#
default_dataset: str | None = 'EleutherAI/pile'#
class oumi.datasets.PromptResponseDataset(*, hf_dataset_path: str = 'O1-OPEN/OpenO1-SFT', prompt_column: str = 'instruction', response_column: str = 'output', **kwargs)[source]#

Bases: BaseSftDataset

Converts HuggingFace Datasets with input/output columns to Message format.

Example

>>> from oumi.datasets import PromptResponseDataset
>>> dataset = PromptResponseDataset(
...     hf_dataset_path="O1-OPEN/OpenO1-SFT",
...     prompt_column="instruction",
...     response_column="output",
... )
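
The same wrapper can point at other prompt/response-style datasets. A hedged sketch (the databricks-dolly-15k column names are believed to be instruction and response, but verify them against the dataset card):
>>> dataset = PromptResponseDataset(
...     hf_dataset_path="databricks/databricks-dolly-15k",
...     prompt_column="instruction",
...     response_column="response",
... )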

dataset_name: str#
default_dataset: str | None = 'O1-OPEN/OpenO1-SFT'#
transform_conversation(example: dict | Series) → Conversation[source]#

Preprocesses the inputs of the example and returns an Oumi Conversation.

Parameters:

example (dict or Pandas Series) – An example containing input (optional), instruction, and output entries.

Returns:

The input example converted to an Oumi Conversation.

Return type:

Conversation

trust_remote_code: bool#
class oumi.datasets.RedPajamaDataV1Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.

This dataset contains approximately 1.2 trillion tokens from various sources: Commoncrawl (878B), C4 (175B), GitHub (59B), ArXiv (28B), Wikipedia (24B), and StackExchange (20B) [5].

The dataset is primarily in English, though the Wikipedia slice contains multiple languages.

Dataset Structure:
{
    "text": str,
    "meta": {
        "url": str,
        "timestamp": str,
        "source": str,
        "language": str,
        ...
    },
    "red_pajama_subset": str
}
Subsets:
  • common_crawl

  • c4

  • github

  • arxiv

  • wikipedia

  • stackexchange

Note

The ‘book’ config is defunct and no longer accessible due to reported copyright infringement for the Books3 dataset contained in this config.

Note

Please refer to the licenses of the data subsets you use. Links to the respective licenses can be found in the README.

dataset_name: str#
default_dataset: str | None = 'togethercomputer/RedPajama-Data-1T'#
class oumi.datasets.RedPajamaDataV2Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

RedPajama V2 Dataset for training large language models.

This dataset includes over 100B text documents from 84 CommonCrawl snapshots, processed using the CCNet pipeline. It contains 30B documents with quality signals and 20B deduplicated documents [5].

The dataset is available in English, German, French, Italian, and Spanish.

Key Features:
  • Over 100B text documents

  • 30B documents with quality annotations

  • 20B unique documents after deduplication

  • Estimated 50.6T tokens in total (30.4T after deduplication)

  • Quality signals for filtering and analysis

  • Minhash signatures for fuzzy deduplication

dataset_name: str#
default_dataset: str | None = 'togethercomputer/RedPajama-Data-V2'#
class oumi.datasets.SlimPajamaDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

SlimPajama-627B: A cleaned and deduplicated version of RedPajama.

SlimPajama is the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. It was created by cleaning and deduplicating the 1.2T token RedPajama dataset, resulting in a 627B token dataset.

The dataset consists of 59166 jsonl files and is ~895GB compressed. It includes training, validation, and test splits [6].

Key Features:
  • 627B tokens

  • Open-source

  • Curated data sources

  • Extensive deduplication

  • Primarily English language

Data Sources and Proportions:
  • Commoncrawl: 52.2%

  • C4: 26.7%

  • GitHub: 5.2%

  • Books: 4.2%

  • ArXiv: 4.6%

  • Wikipedia: 3.8%

  • StackExchange: 3.3%

Dataset Structure:

Each example is a JSON object with the following structure:

{
    "text": str,
    "meta": {
        "redpajama_set_name": str  # One of the data source names
    }
}

dataset_name: str#
default_dataset: str | None = 'cerebras/SlimPajama-627B'#
class oumi.datasets.StarCoderDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

StarCoder Training Dataset used for training StarCoder and StarCoderBase models.

This dataset contains 783GB of code in 86 programming languages, including 54GB of GitHub Issues, 13GB of Jupyter notebooks in scripts and text-code pairs, and 32GB of GitHub commits, totaling approximately 250 billion tokens.

The dataset is a cleaned, decontaminated, and near-deduplicated version of The Stack dataset, with PII removed. It includes various programming languages, GitHub issues, Jupyter Notebooks, and GitHub commits.

Data Fields:
  • id – str

  • content – str

  • max_stars_repo_path – str

  • max_stars_repo_name – str

  • max_stars_count – int

Note

GitHub issues, GitHub commits, and Jupyter notebooks subsets have different columns from the rest. It’s recommended to load programming languages separately from these categories (see the sketch below):
  • jupyter-scripts-dedup-filtered

  • jupyter-structured-clean-dedup

  • github-issues-filtered-structured

  • git-commits-cleaned
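
For instance, a sketch that loads a single programming-language subset, assuming the subset keyword is forwarded to the underlying loader (an assumption, not verified here):
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets import StarCoderDataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> dataset = StarCoderDataset(
...     tokenizer=tokenizer,
...     seq_length=2048,
...     subset="python",  # assumed kwarg selecting a language subset
... )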

Subsets (See dataset for full list):
  • python

  • javascript

  • assembly

  • awk

  • git-commits-cleaned

  • github-issues-filtered-structured

Warning

Not all subsets have the same format; in particular:
  • jupyter-scripts-dedup-filtered

  • jupyter-structured-clean-dedup

  • github-issues-filtered-structured

  • git-commits-cleaned

dataset_name: str#
default_dataset: str | None = 'bigcode/starcoderdata'#
class oumi.datasets.TextSftJsonLinesDataset(dataset_path: str | Path | None = None, data: list[dict[str, Any]] | None = None, format: str | None = None, **kwargs)[source]#

Bases: BaseSftDataset

TextSftJsonLinesDataset for loading SFT data in oumi and alpaca formats.

This dataset class is designed to work with JSON Lines (.jsonl) or JSON (.json) files containing text-based supervised fine-tuning (SFT) data. It supports loading data either from a file or from a provided list of data samples in oumi and alpaca formats.

Supported formats:
  1. JSONL or JSON of conversations (Oumi format)

  2. JSONL or JSON of Alpaca-style turns (instruction, input, output)

Parameters:
  • dataset_path (Optional[Union[str, Path]]) – Path to the dataset file (.jsonl or .json).

  • data (Optional[List[Dict[str, Any]]]) – List of conversation dicts if not loading from a file.

  • format (Optional[str]) – The format of the data. Either “conversations” or “alpaca”. If not provided, the format will be auto-detected.

  • **kwargs – Additional arguments to pass to the parent class.

Examples

Loading conversations from a JSONL file with auto-detection:
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> dataset = TextSftJsonLinesDataset( 
...     dataset_path="/path/to/your/dataset.jsonl"
... )
Loading Alpaca-style data from a JSON file:
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> dataset = TextSftJsonLinesDataset( 
...     dataset_path="/path/to/your/dataset.json",
...     format="alpaca"
... )
Loading from a list of data samples:
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> data_samples = [
...     {"messages": [{"role": "user", "content": "Hello"},
...                   {"role": "assistant", "content": "Hi there!"}]},
...     {"messages": [{"role": "user", "content": "How are you?"},
...                   {"role": "assistant", "content": "great!"}]}
... ]
>>> dataset = TextSftJsonLinesDataset(
...     data=data_samples,
... )
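Loading Alpaca-style records directly from an in-memory list (a small sketch with made-up samples):
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> alpaca_samples = [
...     {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
...     {"instruction": "Name a prime number.", "input": "", "output": "7"},
... ]
>>> dataset = TextSftJsonLinesDataset(data=alpaca_samples, format="alpaca")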
dataset_name: str#
default_dataset: str | None = 'custom'#
transform_conversation(example: dict) → Conversation[source]#

Transform a single conversation example into a Conversation object.

Parameters:

example – The input example containing the messages or Alpaca-style turn.

Returns:

A Conversation object containing the messages.

Return type:

Conversation

trust_remote_code: bool#
class oumi.datasets.TheStackDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

A dataset containing over 6TB of permissively-licensed source code files.

The Stack was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs). It serves as a pre-training dataset for Code LLMs, enabling the synthesis of programs from natural language descriptions and other code snippets, and covers 358 programming languages.

The dataset contains code in multiple natural languages, primarily found in comments and docstrings. It supports tasks such as code completion, documentation generation, and auto-completion of code snippets.

Data Fields:
  • content (string) – The content of the file.

  • size (integer) – Size of the uncompressed file.

  • lang (string) – The programming language.

  • ext (string) – File extension.

  • avg_line_length (float) – The average line-length of the file.

  • max_line_length (integer) – The maximum line-length of the file.

  • alphanum_fraction (float) – The fraction of alphanumeric characters.

  • hexsha (string) – Unique git hash of file.

  • max_{stars|forks|issues}_repo_path (string) – Path to file in repo.

  • max_{stars|forks|issues}_repo_name (string) – Name of repo.

  • max_{stars|forks|issues}_repo_head_hexsha (string) – Hexsha of repo head.

  • max_{stars|forks|issues}_repo_licenses (string) – Licenses in repository.

  • max_{stars|forks|issues}_count (integer) – Number of stars/forks/issues.

  • max_{stars|forks|issues}_repo_{stars|forks|issues}_min_datetime (string) – First timestamp of a stars/forks/issues event.

  • max_{stars|forks|issues}_repo_{stars|forks|issues}_max_datetime (string) – Last timestamp of a stars/forks/issues event.

dataset_name: str#
default_dataset: str | None = 'bigcode/the-stack'#
class oumi.datasets.TinyStoriesDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

TinyStoriesDataset class for loading and processing the TinyStories dataset.

This dataset contains synthetically generated short stories with a small vocabulary, created by GPT-3.5 and GPT-4. It is designed for text generation tasks and is available in English.

Note

The dataset is available under the CDLA-Sharing-1.0 license.

dataset_name: str#
default_dataset: str | None = 'roneneldan/TinyStories'#
class oumi.datasets.TinyTextbooksDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

A dataset of textbook-like content for training small language models.

This dataset contains 420,000 textbook documents covering a wide range of topics and concepts. It provides a comprehensive and diverse learning resource for causal language models, focusing on quality over quantity.

The dataset was synthesized using the Nous-Hermes-Llama2-13b model, combining the best of the falcon-refinedweb and minipile datasets to ensure diversity and quality while maintaining a small size.

See also

https://arxiv.org/abs/2304.08442

dataset_name: str#
default_dataset: str | None = 'nampdn-ai/tiny-textbooks'#
class oumi.datasets.UltrachatH4Dataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#

Bases: BaseSftDataset

Dataset class for the HuggingFaceH4/ultrachat_200k dataset.

dataset_name: str#
default_dataset: str | None = 'HuggingFaceH4/ultrachat_200k'#
transform_conversation(example: dict | Series) → Conversation[source]#

Transform a dataset example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.VLJsonlinesDataset(dataset_path: str | Path | None = None, data: list | None = None, **kwargs)[source]#

Bases: VisionLanguageSftDataset

VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format.

This dataset class is designed to work with JSON Lines (.jsonl) files containing Vision-Language supervised fine-tuning (SFT) data. It supports loading data either from a file or from a provided list of data samples.

Examples

Loading from a file:
>>> from oumi.datasets import VLJsonlinesDataset
>>> dataset = VLJsonlinesDataset( 
...     dataset_path="/path/to/your/dataset.jsonl",
... )
Loading from a list of data samples:
>>> from oumi.builders import build_processor, build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets import VLJsonlinesDataset
>>> data_samples = [
...     {
...         "messages": [
...             {
...                 "role": "user",
...                 "content": "Describe this image:",
...                 "type": "text"
...             },
...             {
...                 "role": "user",
...                 "content": "path/to/image.jpg",
...                 "type": "image_path"
...             },
...             {
...                 "role": "assistant",
...                 "content": "A scenic view of the puget sound.",
...                 "type": "text",
...             },
...         ]
...     }
... ]
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = VLJsonlinesDataset(
...     data=data_samples,
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
... )
dataset_name: str#
default_dataset: str | None = 'custom'#
transform_conversation(example: dict) → Conversation[source]#

Transform a single conversation example into a Conversation object.

trust_remote_code: bool#
class oumi.datasets.WikiTextDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

WikiText language modeling dataset.

The WikiText dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. It is available in two sizes: WikiText-2 (2 million tokens) and WikiText-103 (103 million tokens). Each size comes in two variants: raw (for character-level work) and processed (for word-level work) [7].

The dataset is well-suited for models that can take advantage of long-term dependencies, as it is composed of full articles and retains original case, punctuation, and numbers.

Data Fields:

text (str): The text content of the dataset.

Note

The dataset is licensed under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).

dataset_name: str#
default_dataset: str | None = 'Salesforce/wikitext'#
class oumi.datasets.WikipediaDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

Dataset containing cleaned Wikipedia articles in multiple languages.

This dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/) with one subset per language, each containing a single train split. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

Data Fields:
  • id (str) – ID of the article.

  • url (str) – URL of the article.

  • title (str) – Title of the article.

  • text (str) – Text content of the article.

Note

All configurations contain a single ‘train’ split.

Languages:

The dataset supports numerous languages. For a full list, see: https://meta.wikimedia.org/wiki/List_of_Wikipedias

Note

The dataset is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.

dataset_name: str#
default_dataset: str | None = 'wikimedia/wikipedia'#
class oumi.datasets.YouTubeCommonsDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BasePretrainingDataset

YouTube-Commons Dataset.

This dataset is a collection of audio transcripts from 2,063,066 videos shared on YouTube under a CC-By license. It contains 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels), representing nearly 45 billion words.

The corpus is multilingual, with a majority of English-speaking content (71%) for original languages. Automated translations are provided for nearly all videos in English, French, Spanish, German, Russian, Italian, and Dutch.

This dataset aims to expand the availability of conversational data for research in AI, computational social science, and digital humanities.

Data Fields:
  • video_id – string

  • video_link – string

  • title – string

  • text – string

  • channel – string

  • channel_id – string

  • date – string

  • license – string

  • original_language – string

  • source_language – string

  • transcription_language – string

  • word_count – int64

  • character_count – int64

Note

The text can be used for training models and republished for reproducibility purposes. In accordance with the CC-By license, every YouTube channel is fully credited.

Note

This dataset is licensed under CC-BY-4.0.

dataset_name: str#
default_dataset: str | None = 'PleIAs/YouTube-Commons'#

Subpackages#