oumi.datasets#
Datasets module for the Oumi (Open Universal Machine Intelligence) library.
This module provides various dataset implementations for use in the Oumi framework. These datasets are designed for different machine learning tasks and can be used with the models and training pipelines provided by Oumi.
For more information on the available datasets and their usage, see the oumi.datasets documentation.
Each dataset is implemented as a separate class, inheriting from appropriate base classes in the oumi.core.datasets module. For usage examples and more detailed information on each dataset, please refer to their respective class documentation.
See also
oumi.models: Compatible models for use with these datasets.
oumi.core.datasets: Base classes for dataset implementations.
Example
>>> from oumi.datasets import AlpacaDataset
>>> from torch.utils.data import DataLoader
>>> dataset = AlpacaDataset()
>>> train_loader = DataLoader(dataset, batch_size=32)
- class oumi.datasets.AlpacaDataset(*, include_system_prompt: bool = True, **kwargs)[source]#
Bases:
BaseSftDataset
- dataset_name: str#
- default_dataset: str | None = 'tatsu-lab/alpaca'#
- system_prompt_with_context = 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.'#
- system_prompt_without_context = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'#
- transform_conversation(example: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns a Conversation.
- Parameters:
example (dict or Pandas Series) – An example containing input (optional), instruction, and output entries.
- Returns:
The input example converted to a Conversation.
- Return type:
Conversation
- trust_remote_code: bool#
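Example
A usage sketch, assuming network access to the Hugging Face Hub: constructing the dataset fetches tatsu-lab/alpaca, and transform_conversation accepts a dict (or pandas Series) with instruction, optional input, and output entries, as documented above.
>>> from oumi.datasets import AlpacaDataset
>>> dataset = AlpacaDataset(include_system_prompt=True)  # downloads tatsu-lab/alpaca
>>> conversation = dataset.transform_conversation({
...     "instruction": "Summarize the following text.",
...     "input": "Oumi is an open framework for training foundation models.",
...     "output": "Oumi is an open foundation-model training framework.",
... })  # returns an Oumi Conversation, per the signature above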
- class oumi.datasets.AlpacaEvalDataset(*, include_system_prompt: bool = False, unused_entries_to_metadata: bool = False, trust_remote_code: bool = True, **kwargs)[source]#
Bases:
BaseSftDataset
- dataset_name: str#
- default_dataset: str | None = 'tatsu-lab/alpaca_eval'#
- system_prompt_with_context = 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.'#
- system_prompt_without_context = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'#
- transform_conversation(example: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns a Conversation.
- Parameters:
example (dict or Pandas Series) – An example containing input (optional) and instruction entries.
- Returns:
The input example converted to a Conversation.
- Return type:
Conversation
Note
If unused_entries_to_metadata is set, all of the example's entries other than the expected ones (i.e., input and instruction) are saved as metadata.
- trust_remote_code: bool#
- class oumi.datasets.ArgillaDollyDataset(*, use_new_fields: bool = True, **kwargs)[source]#
Bases:
BaseSftDataset
Dataset class for the Databricks Dolly 15k curated dataset.
- dataset_name: str#
- default_dataset: str | None = 'argilla/databricks-dolly-15k-curated-en'#
- transform_conversation(example: dict | Series) Conversation [source]#
Transform a dataset example into a Conversation object.
- Parameters:
example – A single example from the dataset.
- Returns:
A Conversation object containing the transformed messages.
- Return type:
Conversation
- trust_remote_code: bool#
- class oumi.datasets.ArgillaMagpieUltraDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
Dataset class for the argilla/magpie-ultra-v0.1 dataset.
- dataset_name: str#
- default_dataset: str | None = 'argilla/magpie-ultra-v0.1'#
- transform_conversation(example: dict | Series) Conversation [source]#
Transform a dataset example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.AyaDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
Dataset class for the CohereForAI/aya_dataset dataset.
- dataset_name: str#
- default_dataset: str | None = 'CohereForAI/aya_dataset'#
- transform_conversation(example: dict | Series) Conversation [source]#
Transform a dataset example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.C4Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
A dataset for pretraining on the Colossal Clean Crawled Corpus (C4).
The C4 dataset is based on the Common Crawl dataset and is available in multiple variants: ‘en’, ‘en.noclean’, ‘en.noblocklist’, ‘realnewslike’, and ‘multilingual’ (mC4). It is intended for pretraining language models and word representations.
For more details and download instructions, visit: https://huggingface.co/datasets/allenai/c4
References
Paper: https://arxiv.org/abs/1910.10683
- Data Fields:
- url – URL of the source as a string
- text – Text content as a string
- timestamp – Timestamp as a string
- Dataset Variants:
en: 305GB
en.noclean: 2.3TB
en.noblocklist: 380GB
realnewslike: 15GB
multilingual (mC4): 9.7TB (108 subsets, one per language)
The dataset is released under the ODC-BY license and is subject to the Common Crawl terms of use.
- dataset_name: str#
- default_dataset: str | None = 'allenai/c4'#
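Example
A minimal construction sketch, assuming a Hugging Face tokenizer is available; "gpt2" is only an illustrative choice, and selecting a C4 variant (e.g. 'en') may require additional keyword arguments depending on your configuration.
>>> from transformers import AutoTokenizer
>>> from oumi.datasets import C4Dataset
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer
>>> dataset = C4Dataset(tokenizer=tokenizer, seq_length=512)
>>> # Per the constructor arguments above, examples are packed into
>>> # sequences of seq_length tokens for causal language model pretraining.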
- class oumi.datasets.COCOCaptionsDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
VisionLanguageSftDataset
Dataset class for the HuggingFaceM4/COCO dataset.
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceM4/COCO'#
- default_prompt = 'Describe this image:'#
- transform_conversation(example: dict) Conversation [source]#
Transform a single conversation example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.ChatRAGBenchDataset(*, split: str = 'test', task: str = 'generation', subset: str | None = None, num_context_docs: int = 5, **kwargs)[source]#
Bases:
BaseSftDataset
- default_dataset: str = 'nvidia/ChatRAG-Bench'#
- default_subset: str = 'doc2dial'#
- default_system_message: str = "This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."#
- transform_conversation(example: dict | Series) Conversation [source]#
Transforms a given example into a Conversation object.
- Parameters:
example (Union[dict, pd.Series]) – The example to transform.
- Returns:
The transformed Conversation object.
- Return type:
Conversation
- class oumi.datasets.ChatqaDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
- dataset_name: str#
- default_dataset: str | None = 'nvidia/ChatQA-Training-Data'#
- default_subset: str | None = 'sft'#
- transform_conversation(raw_conversation: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns a Conversation.
ChatQA is a conversational question answering dataset. It contains 10 subsets. Some subsets contain grounding documents.
See the dataset page for more information: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data
- Parameters:
raw_conversation – The raw conversation example.
- Returns:
The preprocessed inputs as an Oumi conversation.
- Return type:
Conversation
- trust_remote_code: bool#
- class oumi.datasets.ChatqaTatqaDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
ChatqaDataset
ChatQA Subclass to handle tatqa subsets.
The tatqa subsets require loading a specific file from the dataset repository, thus requiring us to override the default loading behavior.
- dataset_name: str#
- default_subset: str | None = 'tatqa-arithmetic'#
- trust_remote_code: bool#
- class oumi.datasets.DebugClassificationDataset(dataset_size: int = 1000, feature_dim: int = 128, data_type: str = 'float32', num_classes: int = 10, preprocessing_time_ms: float = 0, **kwargs)[source]#
Bases:
Dataset
- class oumi.datasets.DebugPretrainingDataset(dataset_size: int = 1000, **kwargs)[source]#
Bases:
BasePretrainingDataset
- dataset_name: str#
- default_dataset: str | None = 'debug_pretraining'#
- class oumi.datasets.DebugSftDataset(dataset_size: int = 5, **kwargs)[source]#
Bases:
BaseSftDataset
- dataset_name: str#
- default_dataset: str | None = 'debug_sft'#
- transform_conversation(example: dict | Series) Conversation [source]#
Transforms the example into a Conversation object.
- trust_remote_code: bool#
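Example
The debug datasets are generated synthetically and require no downloads, which makes them convenient for smoke-testing a training pipeline; a small sketch, assuming both classes behave as standard map-style datasets with __len__.
>>> from oumi.datasets import DebugClassificationDataset, DebugSftDataset
>>> clf_dataset = DebugClassificationDataset(dataset_size=100, feature_dim=16, num_classes=4)
>>> sft_dataset = DebugSftDataset(dataset_size=5)
>>> print(len(clf_dataset), len(sft_dataset))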
- class oumi.datasets.DolmaDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
Dolma: A dataset of 3 trillion tokens from diverse web content.
Dolma [1] is a large-scale dataset containing approximately 3 trillion tokens sourced from various web content, academic publications, code, books, and encyclopedic materials. It is designed for language modeling tasks and causal language model training.
The dataset is available in multiple versions, with v1.7 being the latest release used to train OLMo 7B-v1.7. It includes data from sources such as Common Crawl, Refined Web, StarCoder, C4, Reddit, Semantic Scholar, arXiv, StackExchange, and more.
- Data Fields:
id (str) – Unique identifier for the data entry.
text (str) – The main content of the data entry.
added (str, optional) – Timestamp indicating when the entry was added to the dataset.
created (str, optional) – Timestamp indicating when the original content was created.
source (str, optional) – Information about the origin or source of the data.
See also
GitHub project: allenai/dolma
Hugging Face Hub: https://huggingface.co/datasets/allenai/dolma
Note
The dataset is released under the ODC-BY license. Users are bound by the license agreements and terms of use of the original data sources.
Citations
- dataset_name: str#
- default_dataset: str | None = 'allenai/dolma'#
- class oumi.datasets.FalconRefinedWebDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
A massive English web dataset built by TII for pretraining large language models.
The Falcon RefinedWeb dataset is created through stringent filtering and large-scale deduplication of CommonCrawl. It contains about 1B instances (968M individual web pages) for a total of 2.8TB of clean text data.
This dataset is intended primarily for pretraining large language models and can be used on its own or augmented with curated sources.
- Dataset Link:
https://huggingface.co/datasets/tiiuae/falcon-refinedweb
- Paper:
https://arxiv.org/abs/2306.01116
- Features:
content (str): The processed and cleaned text contained in the page.
url (str): The URL of the webpage crawled to produce the sample.
timestamp (timestamp[s]): Timestamp of when the webpage was crawled by CommonCrawl.
dump (str): The CommonCrawl dump the sample is a part of.
segment (str): The CommonCrawl segment the sample is a part of.
image_urls (List[List[str]]): A list of [image_url, image_alt_text] pairs for all images found in the content.
- Usage:
from datasets import load_dataset
rw = load_dataset("tiiuae/falcon-refinedweb")
Notes
License: ODC-By 1.0
Note
This public extract is about 500GB to download and requires 2.8TB of local storage once unpacked.
The dataset may contain sensitive information and biased content.
No canonical splits are provided for this dataset.
- dataset_name: str#
- default_dataset: str | None = 'tiiuae/falcon-refinedweb'#
- class oumi.datasets.FineWebEduDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
FineWeb-Edu: A high-quality educational dataset filtered from web content.
This dataset contains 1.3 trillion tokens of educational web pages filtered from the FineWeb dataset using an educational quality classifier. It aims to provide the finest collection of educational content from the web [2].
- The dataset is available in multiple configurations:
Full dataset (default)
Individual CommonCrawl dumps (e.g. CC-MAIN-2024-10)
Sample subsets (10BT, 100BT, 350BT tokens)
- Key Features:
1.3 trillion tokens of educational content
Filtered using a classifier trained on Llama3-70B-Instruct annotations
Outperforms other web datasets on educational benchmarks
See also
Huggingface hub page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
Note
The dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0.
Citations
[2] Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu. May 2024. URL: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu, doi:10.57967/hf/2497.
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceFW/fineweb-edu'#
- class oumi.datasets.Flickr30kDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
VisionLanguageSftDataset
Dataset class for the nlphuji/flickr30k dataset.
- dataset_name: str#
- default_dataset: str | None = 'nlphuji/flickr30k'#
- transform_conversation(example: dict) Conversation [source]#
Transform a single conversation example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.HuggingFaceDataset(*, hf_dataset_path: str = '', messages_column: str = 'messages', exclude_final_assistant_message: bool = False, **kwargs)[source]#
Bases:
BaseSftDataset
Converts HuggingFace Datasets with messages to Oumi Message format.
Example
dataset = HuggingFaceDataset(
    hf_dataset_path="oumi-ai/oumi-synthetic-document-claims",
    messages_column="messages",
)
- dataset_name: str#
- transform_conversation(example: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns a Conversation.
- Parameters:
example – An example containing messages entries.
- Returns:
A Conversation object containing the messages.
- Return type:
Conversation
- trust_remote_code: bool#
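Example
A hedged sketch of converting a Hub dataset that stores chat turns in a messages column; constructing the class downloads the referenced dataset, and the sample passed to transform_conversation mirrors the documented messages format.
>>> from oumi.datasets import HuggingFaceDataset
>>> dataset = HuggingFaceDataset(
...     hf_dataset_path="oumi-ai/oumi-synthetic-document-claims",
...     messages_column="messages",
... )
>>> conversation = dataset.transform_conversation(
...     {"messages": [{"role": "user", "content": "Hello"},
...                   {"role": "assistant", "content": "Hi there!"}]}
... )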
- class oumi.datasets.LetterCountGrpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, **kwargs)[source]#
Bases:
BaseExperimentalGrpoDataset
Dataset class for the oumi-ai/oumi-letter-count dataset.
A sample from the dataset:
{
    "conversation_id": "oumi_letter_count_0",
    "messages": [
        {
            "content": "Can you let me know how many 'r's are in 'pandered'?",
            "role": "user",
        }
    ],
    "metadata": {
        "letter": "r",
        "letter_count_integer": 1,
        "letter_count_string": "one",
        "unformatted_prompt": "Can you let me know how many {letter}s are in {word}?",
        "word": "pandered",
    },
}
- dataset_name: str#
- default_dataset: str | None = 'oumi-ai/oumi-letter-count'#
- transform_conversation(sample: Series) Conversation [source]#
Converts the input sample to a Conversation.
- Parameters:
sample (pd.Series) – The input sample.
- Returns:
The resulting conversation.
- Return type:
Conversation
- trust_remote_code: bool#
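Example
A sketch built from the sample record shown above; transform_conversation takes a pandas Series per its signature, and constructing the class fetches oumi-ai/oumi-letter-count from the Hugging Face Hub.
>>> import pandas as pd
>>> from oumi.datasets import LetterCountGrpoDataset
>>> dataset = LetterCountGrpoDataset()  # downloads oumi-ai/oumi-letter-count
>>> sample = pd.Series({
...     "conversation_id": "oumi_letter_count_0",
...     "messages": [{"role": "user",
...                   "content": "Can you let me know how many 'r's are in 'pandered'?"}],
...     "metadata": {"letter": "r", "letter_count_integer": 1, "word": "pandered"},
... })
>>> conversation = dataset.transform_conversation(sample)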
- class oumi.datasets.LlavaInstructMixVsftDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
VisionLanguageSftDataset
Dataset class for the HuggingFaceH4/llava-instruct-mix-vsft dataset.
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceH4/llava-instruct-mix-vsft'#
- transform_conversation(example: dict) Conversation [source]#
Transform a dataset example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.MagpieProDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
Dataset class for the Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1 dataset.
- dataset_name: str#
- default_dataset: str | None = 'Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1'#
- transform_conversation(example: dict | Series) Conversation [source]#
Transform a dataset example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.OpenO1SFTDataset(**kwargs)[source]#
Bases:
PromptResponseDataset
Synthetic reasoning SFT dataset.
- dataset_name: str#
- default_dataset: str | None = 'O1-OPEN/OpenO1-SFT'#
- trust_remote_code: bool#
- class oumi.datasets.OrpoDpoMix40kDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, **kwargs)[source]#
Bases:
BaseExperimentalDpoDataset
Preprocess the ORPO dataset for DPO.
A dataset designed for ORPO (Odds Ratio Preference Optimization) or DPO (Direct Preference Optimization) training.
This dataset is a combination of high-quality DPO datasets, including:
- Capybara-Preferences
- distilabel-intel-orca-dpo-pairs
- ultrafeedback-binarized-preferences-cleaned
- distilabel-math-preference-dpo
- toxic-dpo-v0.2
- prm_dpo_pairs_cleaned
- truthy-dpo-v0.1
Rule-based filtering was applied to remove ‘gptisms’ in the chosen answers.
- Data Fields:
- source – string
- chosen – list of dictionaries with ‘content’ and ‘role’ fields
- rejected – list of dictionaries with ‘content’ and ‘role’ fields
- prompt – string
- question – string
See also
For more information on how to use this dataset, refer to:
- Blog post: https://huggingface.co/blog/mlabonne/orpo-llama-3
- Hugging Face Hub: https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k
- dataset_name: str#
- default_dataset: str | None = 'mlabonne/orpo-dpo-mix-40k'#
- trust_remote_code: bool#
- class oumi.datasets.PileV1Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
The Pile: An 825 GiB diverse, open source language modeling dataset.
The Pile is a large-scale English language dataset consisting of 22 smaller, high-quality datasets combined together. It is designed for training large language models and supports various natural language processing tasks [3][4].
- Data Fields:
text (str) – The main text content.
meta (dict) – Metadata about the instance, including ‘pile_set_name’.
- Key Features:
825 GiB of diverse text data
Primarily in English
Supports text generation and fill-mask tasks
Includes various subsets like enron_emails, europarl, free_law, etc.
- Subsets:
all
enron_emails
europarl
free_law
hacker_news
nih_exporter
pubmed
pubmed_central
ubuntu_irc
uspto
github
- Splits:
train
validation
test
See also
Homepage: https://pile.eleuther.ai/
HuggingFace hub: https://huggingface.co/datasets/EleutherAI/pile
Warning
This dataset contains text from various sources and may include personal or sensitive information. Users should consider potential biases and limitations when using this dataset.
Citations
[3] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, and others. The pile: an 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[4] Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the pile. arXiv preprint arXiv:2201.07311, 2022.
- dataset_name: str#
- default_dataset: str | None = 'EleutherAI/pile'#
- class oumi.datasets.PixmoAskModelAnythingDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
VisionLanguageSftDataset
Dataset class for the allenai/pixmo-ask-model-anything dataset.
The dataset is affected by some image URLs having a 404 issue.
- dataset_name: str#
- default_dataset: str | None = 'allenai/pixmo-ask-model-anything'#
- transform_conversation(example: dict) Conversation [source]#
Transform the example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.PixmoCapDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
VisionLanguageSftDataset
Dataset class for the allenai/pixmo-cap dataset.
The dataset is affected by some image URLs having a 404 issue.
- dataset_name: str#
- default_dataset: str | None = 'allenai/pixmo-cap'#
- transform_conversation(example: dict) Conversation [source]#
Transform the example into a Conversation object.
A “transcripts” column is also available but not used yet.
- trust_remote_code: bool#
- class oumi.datasets.PixmoCapQADataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
VisionLanguageSftDataset
Dataset class for the allenai/pixmo-cap-qa dataset.
The dataset is affected by some image URLs having a 404 issue.
- dataset_name: str#
- default_dataset: str | None = 'allenai/pixmo-cap-qa'#
- transform_conversation(example: dict) Conversation [source]#
Transform the example into a Conversation object.
Sample “question”: “[USER] Can you come up with a joke? [ASSISTANT]”. Each question starts with a [USER] role tag and ends with an [ASSISTANT] role tag; the assistant response appears in the “answer” field.
- trust_remote_code: bool#
- class oumi.datasets.PromptResponseDataset(*, hf_dataset_path: str = 'O1-OPEN/OpenO1-SFT', prompt_column: str = 'instruction', response_column: str = 'output', **kwargs)[source]#
Bases:
BaseSftDataset
Converts HuggingFace Datasets with input/output columns to Message format.
Example
dataset = PromptResponseDataset(
    hf_dataset_path="O1-OPEN/OpenO1-SFT",
    prompt_column="instruction",
    response_column="output",
)
- dataset_name: str#
- default_dataset: str | None = 'O1-OPEN/OpenO1-SFT'#
- transform_conversation(example: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns a Conversation.
- Parameters:
example (dict or Pandas Series) – An example containing input (optional), instruction, and output entries.
- Returns:
The input example converted to a Conversation.
- Return type:
Conversation
- trust_remote_code: bool#
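Example
A complementary sketch for pointing the class at a different prompt/response dataset; "your-org/your-dataset" and its column names are placeholders rather than a real Hub repository.
>>> from oumi.datasets import PromptResponseDataset
>>> dataset = PromptResponseDataset(
...     hf_dataset_path="your-org/your-dataset",  # placeholder repository id
...     prompt_column="prompt",                   # placeholder column names
...     response_column="response",
... )
>>> # Each row is then mapped to a user/assistant Conversation by
>>> # transform_conversation, as documented above.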
- class oumi.datasets.RedPajamaDataV1Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.
This dataset contains approximately 1.2 trillion tokens from various sources: Commoncrawl (878B), C4 (175B), GitHub (59B), ArXiv (28B), Wikipedia (24B), and StackExchange (20B) [5].
The dataset is primarily in English, though the Wikipedia slice contains multiple languages.
- Dataset Structure:
{ "text": str, "meta": { "url": str, "timestamp": str, "source": str, "language": str, ... }, "red_pajama_subset": str }
- Subsets:
common_crawl
c4
github
arxiv
wikipedia
stackexchange
See also
For more information on dataset creation and source data, please refer to the RedPajama GitHub repository: togethercomputer/RedPajama-Data
Hugging Face dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
Note
The ‘book’ config is defunct and no longer accessible due to reported copyright infringement for the Books3 dataset contained in this config.
Note
Please refer to the licenses of the data subsets you use. Links to the respective licenses can be found in the README.
Citations
[5] Together Computer. Redpajama: an open source recipe to reproduce llama training dataset. April 2023. URL: togethercomputer/RedPajama-Data.
- dataset_name: str#
- default_dataset: str | None = 'togethercomputer/RedPajama-Data-1T'#
- class oumi.datasets.RedPajamaDataV2Dataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
RedPajama V2 Dataset for training large language models.
This dataset includes over 100B text documents from 84 CommonCrawl snapshots, processed using the CCNet pipeline. It contains 30B documents with quality signals and 20B deduplicated documents [5].
The dataset is available in English, German, French, Italian, and Spanish.
- Key Features:
Over 100B text documents
30B documents with quality annotations
20B unique documents after deduplication
Estimated 50.6T tokens in total (30.4T after deduplication)
Quality signals for filtering and analysis
Minhash signatures for fuzzy deduplication
See also
Hugging Face dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
Blog post: https://together.ai/blog/redpajama-data-v2
GitHub repo: togethercomputer/RedPajama-Data
Note
License: Common Crawl Foundation Terms of Use: https://commoncrawl.org/terms-of-use
Code: Apache 2.0 license
Citations
- dataset_name: str#
- default_dataset: str | None = 'togethercomputer/RedPajama-Data-V2'#
- class oumi.datasets.SlimPajamaDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
SlimPajama-627B: A cleaned and deduplicated version of RedPajama.
SlimPajama is the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. It was created by cleaning and deduplicating the 1.2T token RedPajama dataset, resulting in a 627B token dataset.
The dataset consists of 59166 jsonl files and is ~895GB compressed. It includes training, validation, and test splits [6].
- Key Features:
627B tokens
Open-source
Curated data sources
Extensive deduplication
Primarily English language
- Data Sources and Proportions:
Commoncrawl: 52.2%
C4: 26.7%
GitHub: 5.2%
Books: 4.2%
ArXiv: 4.6%
Wikipedia: 3.8%
StackExchange: 3.3%
See also
Hugging Face Hub: https://huggingface.co/datasets/cerebras/SlimPajama-627B
- Dataset Structure:
Each example is a JSON object with the following structure:
{ "text": str, "meta": { "redpajama_set_name": str # One of the data source names } }
Citations
[6] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, June 2023. URL: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
- dataset_name: str#
- default_dataset: str | None = 'cerebras/SlimPajama-627B'#
- class oumi.datasets.StarCoderDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
StarCoder Training Dataset used for training StarCoder and StarCoderBase models.
This dataset contains 783GB of code in 86 programming languages, including 54GB of GitHub Issues, 13GB of Jupyter notebooks in scripts and text-code pairs, and 32GB of GitHub commits, totaling approximately 250 Billion tokens.
The dataset is a cleaned, decontaminated, and near-deduplicated version of The Stack dataset, with PII removed. It includes various programming languages, GitHub issues, Jupyter Notebooks, and GitHub commits.
- Data Fields:
id – str
content – str
max_stars_repo_path – str
max_stars_repo_name – str
max_stars_count – int
See also
Huggingface hub: https://huggingface.co/datasets/bigcode/starcoderdata
Note
GitHub issues, GitHub commits, and Jupyter notebooks subsets have different columns from the rest. It’s recommended to load programming languages separately from these categories:
- jupyter-scripts-dedup-filtered
- jupyter-structured-clean-dedup
- github-issues-filtered-structured
- git-commits-cleaned
- Subsets (See dataset for full list):
python
javascript
assembly
awk
git-commits-cleaned
github-issues-filtered-structured
…
Warning
Not all subsets have the same format, in particular:
- jupyter-scripts-dedup-filtered
- jupyter-structured-clean-dedup
- github-issues-filtered-structured
- git-commits-cleaned
- dataset_name: str#
- default_dataset: str | None = 'bigcode/starcoderdata'#
- class oumi.datasets.TextSftJsonLinesDataset(dataset_path: str | Path | None = None, data: list[dict[str, Any]] | None = None, format: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
TextSftJsonLinesDataset for loading SFT data in oumi and alpaca formats.
This dataset class is designed to work with JSON Lines (.jsonl) or JSON (.json) files containing text-based supervised fine-tuning (SFT) data. It supports loading data either from a file or from a provided list of data samples in oumi and alpaca formats.
Supported formats:
1. JSONL or JSON of conversations (Oumi format)
2. JSONL or JSON of Alpaca-style turns (instruction, input, output)
- Parameters:
dataset_path (Optional[Union[str, Path]]) – Path to the dataset file (.jsonl or .json).
data (Optional[List[Dict[str, Any]]]) – List of conversation dicts if not loading from a file.
format (Optional[str]) – The format of the data. Either “conversations” or “alpaca”. If not provided, the format will be auto-detected.
**kwargs – Additional arguments to pass to the parent class.
Examples
- Loading conversations from a JSONL file with auto-detection:
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> dataset = TextSftJsonLinesDataset(
...     dataset_path="/path/to/your/dataset.jsonl"
... )
- Loading Alpaca-style data from a JSON file:
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> dataset = TextSftJsonLinesDataset(
...     dataset_path="/path/to/your/dataset.json",
...     format="alpaca"
... )
- Loading from a list of data samples:
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> data_samples = [
...     {"messages": [{"role": "user", "content": "Hello"},
...                   {"role": "assistant", "content": "Hi there!"}]},
...     {"messages": [{"role": "user", "content": "How are you?"},
...                   {"role": "assistant", "content": "great!"}]}
... ]
>>> dataset = TextSftJsonLinesDataset(
...     data=data_samples,
... )
- dataset_name: str#
- default_dataset: str | None = 'custom'#
- transform_conversation(example: dict) Conversation [source]#
Transform a single conversation example into a Conversation object.
- Parameters:
example – The input example containing the messages or Alpaca-style turn.
- Returns:
A Conversation object containing the messages.
- Return type:
Conversation
- trust_remote_code: bool#
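Example
The examples above cover file loading and oumi-format lists; a hedged complement, assuming the documented "alpaca" format accepts instruction/input/output dicts passed via data.
>>> from oumi.datasets import TextSftJsonLinesDataset
>>> alpaca_samples = [
...     {"instruction": "Translate to French.",
...      "input": "Good morning",
...      "output": "Bonjour"},
... ]
>>> dataset = TextSftJsonLinesDataset(data=alpaca_samples, format="alpaca")
>>> conversation = dataset.transform_conversation(alpaca_samples[0])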
- class oumi.datasets.TheStackDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
A dataset containing over 6TB of permissively-licensed source code files.
The Stack was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs). It serves as a pre-training dataset for Code LLMs, enabling the synthesis of programs from natural language descriptions and other code snippets, and covers 358 programming languages.
The dataset contains code in multiple natural languages, primarily found in comments and docstrings. It supports tasks such as code completion, documentation generation, and auto-completion of code snippets.
See also
Huggingface hub: https://huggingface.co/datasets/bigcode/the-stack
Homepage: https://www.bigcode-project.org/
Repository: bigcode-project
- Data Fields:
- content (string) – The content of the file.
- size (integer) – Size of the uncompressed file.
- lang (string) – The programming language.
- ext (string) – File extension.
- avg_line_length (float) – The average line-length of the file.
- max_line_length (integer) – The maximum line-length of the file.
- alphanum_fraction (float) – The fraction of alphanumeric characters.
- hexsha (string) – Unique git hash of file.
- max_{stars|forks|issues}_repo_path (string) – Path to file in repo.
- max_{stars|forks|issues}_repo_name (string) – Name of repo.
- max_{stars|forks|issues}_repo_head_hexsha (string) – Hexsha of repo head.
- max_{stars|forks|issues}_repo_licenses (string) – Licenses in repository.
- max_{stars|forks|issues}_count (integer) – Number of stars/forks/issues.
- max_{stars|forks|issues}_repo_{stars|forks|issues}_min_datetime (string) – First timestamp of a stars/forks/issues event.
- max_{stars|forks|issues}_repo_{stars|forks|issues}_max_datetime (string) – Last timestamp of a stars/forks/issues event.
- dataset_name: str#
- default_dataset: str | None = 'bigcode/the-stack'#
- class oumi.datasets.TinyStoriesDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
TinyStoriesDataset class for loading and processing the TinyStories dataset.
This dataset contains synthetically generated short stories with a small vocabulary, created by GPT-3.5 and GPT-4. It is designed for text generation tasks and is available in English.
See also
Huggingface hub: https://huggingface.co/datasets/roneneldan/TinyStories
Note
The dataset is available under the CDLA-Sharing-1.0 license.
- dataset_name: str#
- default_dataset: str | None = 'roneneldan/TinyStories'#
- class oumi.datasets.TinyTextbooksDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
A dataset of textbook-like content for training small language models.
This dataset contains 420,000 textbook documents covering a wide range of topics and concepts. It provides a comprehensive and diverse learning resource for causal language models, focusing on quality over quantity.
The dataset was synthesized using the Nous-Hermes-Llama2-13b model, combining the best of the falcon-refinedweb and minipile datasets to ensure diversity and quality while maintaining a small size.
See also
Huggingface hub: https://huggingface.co/datasets/nampdn-ai/tiny-textbooks
Textbooks Are All You Need II: phi-1.5 technical report (https://arxiv.org/abs/2309.05463)
The RefinedWeb Dataset for Falcon LLM (https://arxiv.org/abs/2306.01116)
The MiniPile Challenge for Data-Efficient Language Models
- dataset_name: str#
- default_dataset: str | None = 'nampdn-ai/tiny-textbooks'#
- class oumi.datasets.TldrGrpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, **kwargs)[source]#
Bases:
BaseExperimentalGrpoDataset
Dataset class for the trl-lib/tldr dataset.
- dataset_name: str#
- default_dataset: str | None = 'trl-lib/tldr'#
- trust_remote_code: bool#
- class oumi.datasets.Tulu3MixtureDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
- dataset_name: str#
- default_dataset: str | None = 'allenai/tulu-3-sft-mixture'#
- transform_conversation(example: dict | Series) Conversation [source]#
Convert the example into a Conversation.
- Parameters:
example (dict or Pandas Series) – An example containing a messages field, which is a list of dicts with content and role string fields.
- trust_remote_code: bool#
- class oumi.datasets.UltrachatH4Dataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
Dataset class for the HuggingFaceH4/ultrachat_200k dataset.
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceH4/ultrachat_200k'#
- transform_conversation(example: dict | Series) Conversation [source]#
Transform a dataset example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.VLJsonlinesDataset(dataset_path: str | Path | None = None, data: list | None = None, **kwargs)[source]#
Bases:
VisionLanguageSftDataset
VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format.
This dataset class is designed to work with JSON Lines (.jsonl) files containing Vision-Language supervised fine-tuning (SFT) data. It supports loading data either from a file or from a provided list of data samples.
- Examples:
- Loading from a file:
>>> from oumi.datasets import VLJsonlinesDataset
>>> dataset = VLJsonlinesDataset(
...     dataset_path="/path/to/your/dataset.jsonl",
... )
- Loading from a list of data samples:
>>> from oumi.builders import build_processor, build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets import VLJsonlinesDataset
>>> data_samples = [
...     {
...         "messages": [
...             {
...                 "role": "user",
...                 "content": "Describe this image:",
...                 "type": "text"
...             },
...             {
...                 "role": "user",
...                 "content": "path/to/image.jpg",
...                 "type": "image_path"
...             },
...             {
...                 "role": "assistant",
...                 "content": "A scenic view of the puget sound.",
...                 "type": "text",
...             },
...         ]
...     }
... ]
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = VLJsonlinesDataset(
...     data=data_samples,
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
... )
- dataset_name: str#
- default_dataset: str | None = 'custom'#
- transform_conversation(example: dict) Conversation [source]#
Transform a single conversation example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.WikiTextDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
WikiText language modeling dataset.
The WikiText dataset is a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. It is available in two sizes: WikiText-2 (2 million tokens) and WikiText-103 (103 million tokens). Each size comes in two variants: raw (for character-level work) and processed (for word-level work) [7].
The dataset is well-suited for models that can take advantage of long-term dependencies, as it is composed of full articles and retains original case, punctuation, and numbers.
- Data Fields:
text (str): The text content of the dataset.
See also
Hugging Face Hub: https://huggingface.co/datasets/Salesforce/wikitext
Note
The dataset is licensed under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).
Citations
[7] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. 2016. arXiv:1609.07843.
- dataset_name: str#
- default_dataset: str | None = 'Salesforce/wikitext'#
- class oumi.datasets.WikipediaDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
Dataset containing cleaned Wikipedia articles in multiple languages.
This dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/) with one subset per language, each containing a single train split. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
- Data Fields:
id (str) – ID of the article.
url (str) – URL of the article.
title (str) – Title of the article.
text (str) – Text content of the article.
Note
All configurations contain a single ‘train’ split.
- Languages:
The dataset supports numerous languages. For a full list, see: https://meta.wikimedia.org/wiki/List_of_Wikipedias
Note
The dataset is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.
See also
Homepage: https://dumps.wikimedia.org
Hugging Face Hub: https://huggingface.co/datasets/wikimedia/wikipedia
- dataset_name: str#
- default_dataset: str | None = 'wikimedia/wikipedia'#
- class oumi.datasets.WildChatDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases:
BaseSftDataset
Dataset class for the allenai/WildChat-1M dataset.
- dataset_name: str#
- default_dataset: str | None = 'allenai/WildChat-1M'#
- transform_conversation(example: dict | Series) Conversation [source]#
Transform a dataset example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.YouTubeCommonsDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BasePretrainingDataset
YouTube-Commons Dataset.
This dataset is a collection of audio transcripts from 2,063,066 videos shared on YouTube under a CC-By license. It contains 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels), representing nearly 45 billion words.
The corpus is multilingual, with a majority of English-speaking content (71%) for original languages. Automated translations are provided for nearly all videos in English, French, Spanish, German, Russian, Italian, and Dutch.
This dataset aims to expand the availability of conversational data for research in AI, computational social science, and digital humanities.
See also
Hugging Face Hub: https://huggingface.co/datasets/PleIAs/YouTube-Commons
- Data Fields:
- video_id – string
- video_link – string
- title – string
- text – string
- channel – string
- channel_id – string
- date – string
- license – string
- original_language – string
- source_language – string
- transcription_language – string
- word_count – int64
- character_count – int64
Note
The text can be used for training models and republished for reproducibility purposes. In accordance with the CC-By license, every YouTube channel is fully credited.
Note
This dataset is licensed under CC-BY-4.0.
- dataset_name: str#
- default_dataset: str | None = 'PleIAs/YouTube-Commons'#