Datasets module for the Oumi (Open Universal Machine Intelligence) library.
This module provides various dataset implementations for use in the Oumi framework.
These datasets are designed for different machine learning tasks and can be used
with the models and training pipelines provided by Oumi.
For more information on the available datasets and their usage, see the
oumi.datasets documentation.
Each dataset is implemented as a separate class, inheriting from appropriate base
classes in the oumi.core.datasets module. For usage examples and more detailed
information on each dataset, please refer to their respective class documentation.
See also
oumi.models: Compatible models for use with these datasets.
A dataset for pretraining on the Colossal Clean Crawled Corpus (C4).
The C4 dataset is based on the Common Crawl dataset and is available in
multiple variants: 'en', 'en.noclean', 'en.noblocklist', 'realnewslike',
and 'multilingual' (mC4). It is intended for pretraining language models
and word representations.
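For illustration, a variant can be selected by configuration name when loading the underlying Hugging Face dataset directly; this sketch uses the datasets library rather than the Oumi wrapper class:

    from datasets import load_dataset

    # Stream the English variant; pass "en.noclean", "realnewslike", etc.
    # to select another variant without downloading the full corpus.
    c4_en = load_dataset("allenai/c4", "en", split="train", streaming=True)
    print(next(iter(c4_en))["text"][:200])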
Dolma: A dataset of 3 trillion tokens from diverse web content.
Dolma [1] is a large-scale dataset containing
approximately 3 trillion tokens sourced from various web content, academic
publications, code, books, and encyclopedic materials. It is designed for
language modeling tasks and causal language model training.
The dataset is available in multiple versions, with v1.7 being the latest
release used to train OLMo 7B-v1.7. It includes data from sources such as
Common Crawl, Refined Web, StarCoder, C4, Reddit, Semantic Scholar, arXiv,
StackExchange, and more.
Data Fields:
id (str) – Unique identifier for the data entry.
text (str) – The main content of the data entry.
added (str, optional) – Timestamp indicating when the entry was added
to the dataset.
created (str, optional) – Timestamp indicating when the original content
was created.
source (str, optional) – Information about the origin or source of the
data.
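A minimal sketch of consuming one record with the fields above; the record shown is illustrative, not drawn from the dataset:

    # Illustrative Dolma-style record; values are made up.
    record = {
        "id": "example-0001",
        "text": "Dolma is a three-trillion-token corpus ...",
        "added": "2023-04-01T00:00:00Z",
        "created": "2019-06-15T12:30:00Z",
        "source": "common-crawl",
    }

    # The timestamp and source fields are optional, so read them defensively.
    print(record["id"], record.get("source", "unknown"))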
A massive English web dataset built by TII for pretraining large language models.
The Falcon RefinedWeb dataset is created through stringent filtering and
large-scale deduplication of CommonCrawl. It contains about 1B instances
(968M individual web pages) for a total of 2.8TB of clean text data.
This dataset is intended primarily for pretraining large language models and
can be used on its own or augmented with curated sources.
FineWeb-Edu: A high-quality educational dataset filtered from web content.
This dataset contains 1.3 trillion tokens of educational web pages filtered
from the FineWeb dataset using an educational quality classifier. It aims to
provide the finest collection of educational content from the web
[2].
The dataset is available in multiple configurations (see the dataset for the full list).
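For illustration, a configuration is selected by name when loading the underlying Hugging Face dataset directly; the sample-10BT configuration name comes from the upstream dataset card and is an assumption here, not part of the Oumi API:

    from datasets import load_dataset

    # Stream a subsampled configuration instead of the full 1.3T-token corpus.
    fineweb_edu = load_dataset(
        "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
    )
    print(next(iter(fineweb_edu))["text"][:200])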
The Pile: An 825 GiB diverse, open source language modeling dataset.
The Pile is a large-scale English language dataset consisting of 22 smaller,
high-quality datasets combined together. It is designed for training large
language models and supports various natural language processing tasks
[3][4].
Data Fields:
text (str) – The main text content.
meta (dict) – Metadata about the instance, including 'pile_set_name'.
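A brief sketch of filtering instances by subset via the meta field; the records and subset names are illustrative:

    # Illustrative Pile-style instances; values are made up.
    instances = [
        {"text": "From: trader@example.com ...", "meta": {"pile_set_name": "Enron Emails"}},
        {"text": "The sitting is resumed ...", "meta": {"pile_set_name": "EuroParl"}},
    ]

    # Keep only instances drawn from a single Pile subset.
    enron_only = [ex for ex in instances if ex["meta"]["pile_set_name"] == "Enron Emails"]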
Key Features:
825 GiB of diverse text data
Primarily in English
Supports text generation and fill-mask tasks
Includes various subsets like enron_emails, europarl, free_law, etc.
This dataset contains text from various sources and may include
personal or sensitive information. Users should consider potential biases
and limitations when using this dataset.
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
This dataset contains approximately 1.2 trillion tokens from various sources:
Commoncrawl (878B), C4 (175B), GitHub (59B), ArXiv (28B), Wikipedia (24B),
and StackExchange (20B) [5].
The dataset is primarily in English, though the Wikipedia slice contains
multiple languages.
RedPajama V2 Dataset for training large language models.
This dataset includes over 100B text documents from 84 CommonCrawl snapshots,
processed using the CCNet pipeline. It contains 30B documents with quality
signals and 20B deduplicated documents [5].
The dataset is available in English, German, French, Italian, and Spanish.
Key Features:
Over 100B text documents
30B documents with quality annotations
20B unique documents after deduplication
Estimated 50.6T tokens in total (30.4T after deduplication)
SlimPajama-627B: A cleaned and deduplicated version of RedPajama.
SlimPajama is the largest extensively deduplicated, multi-corpora, open-source
dataset for training large language models. It was created by cleaning and
deduplicating the 1.2T token RedPajama dataset, resulting in a 627B token dataset.
The dataset consists of 59,166 jsonl files and is ~895GB compressed. It includes
training, validation, and test splits [6].
StarCoder Training Dataset used for training StarCoder and StarCoderBase models.
This dataset contains 783GB of code in 86 programming languages, including 54GB
of GitHub Issues, 13GB of Jupyter notebooks in scripts and text-code pairs, and
32GB of GitHub commits, totaling approximately 250 billion tokens.
The dataset is a cleaned, decontaminated, and near-deduplicated version of
The Stack dataset, with PII removed. It includes various programming languages,
GitHub issues, Jupyter Notebooks, and GitHub commits.
GitHub issues, GitHub commits, and Jupyter notebooks subsets have different
columns from the rest. It is recommended to load programming languages separately
from these categories (see the loading sketch after the subset list below):
- jupyter-scripts-dedup-filtered
- jupyter-structured-clean-dedup
- github-issues-filtered-structured
- git-commits-cleaned
Subsets (See dataset for full list):
python
javascript
assembly
awk
git-commits-cleaned
github-issues-filtered-structured
…
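A sketch of loading a single programming-language subset on its own, using the Hugging Face datasets library directly; the bigcode/starcoderdata repository id, the data_dir argument, and the content column name follow the upstream dataset card and are assumptions here rather than the Oumi wrapper API:

    from datasets import load_dataset

    # Load only the Python subset; the special categories listed above
    # (git-commits-cleaned, github-issues-filtered-structured, ...) have
    # different columns and should be loaded separately in the same way.
    python_code = load_dataset(
        "bigcode/starcoderdata", data_dir="python", split="train", streaming=True
    )
    print(next(iter(python_code))["content"][:200])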
TextSftJsonLinesDataset for loading SFT data in oumi and alpaca formats.
This dataset class is designed to work with JSON Lines (.jsonl) or
JSON (.json) files containing text-based supervised fine-tuning (SFT) data.
It supports loading data either from a file or from a provided list of data
samples in oumi and alpaca formats.
Supported formats:
1. JSONL or JSON of conversations (Oumi format)
2. JSONL or JSON of Alpaca-style turns (instruction, input, output)
Parameters:
dataset_path (Optional[Union[str, Path]]) – Path to the dataset file
(.jsonl or .json).
data (Optional[List[Dict[str, Any]]]) – List of conversation dicts if not
loading from a file.
format (Optional[str]) – The format of the data. Either "conversations" or
"alpaca". If not provided, the format will be auto-detected.
**kwargs – Additional arguments to pass to the parent class.
Examples
Loading conversations from a JSONL file with auto-detection:
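A minimal sketch using the parameters documented above (the file path is a placeholder):

    from oumi.datasets import TextSftJsonLinesDataset

    # The format ("conversations" or "alpaca") is auto-detected when not given.
    dataset = TextSftJsonLinesDataset(dataset_path="path/to/sft_data.jsonl")

    # Alternatively, build the dataset from in-memory Alpaca-style samples.
    alpaca_samples = [
        {
            "instruction": "Summarize the following text.",
            "input": "Oumi provides datasets for training and evaluation.",
            "output": "Oumi ships ready-to-use training and evaluation datasets.",
        },
    ]
    dataset = TextSftJsonLinesDataset(data=alpaca_samples, format="alpaca")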
A dataset containing over 6TB of permissively-licensed source code files.
The Stack was created as part of the BigCode Project, an open scientific
collaboration working on the responsible development of Large Language Models
for Code (Code LLMs). It serves as a pre-training dataset for Code LLMs,
enabling the synthesis of programs from natural language descriptions and
other code snippets, and covers 358 programming languages.
The dataset contains code in multiple natural languages, primarily found in
comments and docstrings. It supports tasks such as code completion,
documentation generation, and auto-completion of code snippets.
TinyStoriesDataset class for loading and processing the TinyStories dataset.
This dataset contains synthetically generated short stories with a small
vocabulary, created by GPT-3.5 and GPT-4. It is designed for text generation
tasks and is available in English.
A dataset of textbook-like content for training small language models.
This dataset contains 420,000 textbook documents covering a wide range of topics
and concepts. It provides a comprehensive and diverse learning resource for
causal language models, focusing on quality over quantity.
The dataset was synthesized using the Nous-Hermes-Llama2-13b model, combining
the best of the falcon-refinedweb and minipile datasets to ensure diversity and
quality while maintaining a small size.
VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format.
This dataset class is designed to work with JSON Lines (.jsonl) files containing
Vision-Language supervised fine-tuning (SFT) data. It supports loading data either
from a file or from a provided list of data samples.
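By analogy with TextSftJsonLinesDataset above, a minimal sketch; the dataset_path parameter name is assumed to match and the file path is a placeholder:

    from oumi.datasets import VLJsonlinesDataset

    # Load Vision-Language SFT conversations from a JSON Lines file.
    vl_dataset = VLJsonlinesDataset(dataset_path="path/to/vl_sft_data.jsonl")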
The WikiText dataset is a collection of over 100 million tokens extracted from
verified Good and Featured articles on Wikipedia. It is available in two sizes:
WikiText-2 (2 million tokens) and WikiText-103 (103 million tokens). Each size
comes in two variants: raw (for character-level work) and processed (for
word-level work) [7].
The dataset is well-suited for models that can take advantage of long-term
dependencies, as it is composed of full articles and retains original case,
punctuation, and numbers.
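For illustration, a size and variant are selected by configuration name when loading the underlying Hugging Face dataset directly; the Salesforce/wikitext repository id and wikitext-103-raw-v1 configuration name follow the upstream hub and are assumptions here:

    from datasets import load_dataset

    # Raw variants are intended for character-level work, processed variants
    # for word-level work; use "wikitext-2-raw-v1" for the 2M-token size.
    wikitext = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")
    print(wikitext[0]["text"])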
Dataset containing cleaned Wikipedia articles in multiple languages.
This dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/)
with one subset per language, each containing a single train split.
Each example contains the content of one full Wikipedia article
with cleaning to strip markdown and unwanted sections (references, etc.).
Data Fields:
id (str) – ID of the article.
url (str) – URL of the article.
title (str) – Title of the article.
text (str) – Text content of the article.
Note
All configurations contain a single 'train' split.
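For illustration, one language subset can be loaded directly from the Hugging Face hub; the wikimedia/wikipedia repository id and the 20231101.en dump name are assumptions based on the upstream card:

    from datasets import load_dataset

    # One subset per language; each provides only a "train" split.
    wiki_en = load_dataset(
        "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
    )
    article = next(iter(wiki_en))
    print(article["title"], article["url"])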
This dataset is a collection of audio transcripts from 2,063,066 videos shared on
YouTube under a CC-By license. It contains 22,709,724 original and automatically
translated transcripts from 3,156,703 videos (721,136 individual channels),
representing nearly 45 billion words.
The corpus is multilingual, with English accounting for the majority (71%) of the
original-language content. Automated translations are provided for nearly all videos in
English, French, Spanish, German, Russian, Italian, and Dutch.
This dataset aims to expand the availability of conversational data for research
in AI, computational social science, and digital humanities.
The text can be used for training models and republished for reproducibility
purposes. In accordance with the CC-By license, every YouTube channel is fully
credited.