Datasets module for the Oumi (Open Universal Machine Intelligence) library.
This module provides various dataset implementations for use in the Oumi framework.
These datasets are designed for different machine learning tasks and can be used
with the models and training pipelines provided by Oumi.
For more information on the available datasets and their usage, see the
oumi.datasets documentation.
Each dataset is implemented as a separate class, inheriting from appropriate base
classes in the oumi.core.datasets module. For usage examples and more detailed
information on each dataset, please refer to their respective class documentation.
See also
oumi.models: Compatible models for use with these datasets.
A dataset for pretraining on the Colossal Clean Crawled Corpus (C4).
The C4 dataset is based on the Common Crawl dataset and is available in
multiple variants: 'en', 'en.noclean', 'en.noblocklist', 'realnewslike',
and 'multilingual' (mC4). It is intended for pretraining language models
and word representations.
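For illustration, a variant can be selected by configuration name when loading the underlying Hugging Face dataset directly; this sketch uses the datasets library rather than the Oumi wrapper class:

    from datasets import load_dataset

    # Stream the English variant; pass "en.noclean", "realnewslike", etc.
    # to select another variant without downloading the full corpus.
    c4_en = load_dataset("allenai/c4", "en", split="train", streaming=True)
    print(next(iter(c4_en))["text"][:200])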
Dolma: A dataset of 3 trillion tokens from diverse web content.
Dolma [1] is a large-scale dataset containing
approximately 3 trillion tokens sourced from various web content, academic
publications, code, books, and encyclopedic materials. It is designed for
language modeling tasks and causal language model training.
The dataset is available in multiple versions, with v1.7 being the latest
release used to train OLMo 7B-v1.7. It includes data from sources such as
Common Crawl, Refined Web, StarCoder, C4, Reddit, Semantic Scholar, arXiv,
StackExchange, and more.
Data Fields:
id (str) – Unique identifier for the data entry.
text (str) – The main content of the data entry.
added (str, optional) – Timestamp indicating when the entry was added
to the dataset.
created (str, optional) – Timestamp indicating when the original content
was created.
source (str, optional) – Information about the origin or source of the
data.
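A minimal sketch of consuming one record with the fields above; the record shown is illustrative, not drawn from the dataset:

    # Illustrative Dolma-style record; values are made up.
    record = {
        "id": "example-0001",
        "text": "Dolma is a three-trillion-token corpus ...",
        "added": "2023-04-01T00:00:00Z",
        "created": "2019-06-15T12:30:00Z",
        "source": "common-crawl",
    }

    # The timestamp and source fields are optional, so read them defensively.
    print(record["id"], record.get("source", "unknown"))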
A massive English web dataset built by TII for pretraining large language models.
The Falcon RefinedWeb dataset is created through stringent filtering and
large-scale deduplication of CommonCrawl. It contains about 1B instances
(968M individual web pages) for a total of 2.8TB of clean text data.
This dataset is intended primarily for pretraining large language models and
can be used on its own or augmented with curated sources.
FineWeb-Edu: A high-quality educational dataset filtered from web content.
This dataset contains 1.3 trillion tokens of educational web pages filtered
from the FineWeb dataset using an educational quality classifier. It aims to
provide the finest collection of educational content from the web
[2].
The dataset is available in multiple configurations (see the dataset for the full list).
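For illustration, a configuration is selected by name when loading the underlying Hugging Face dataset directly; the sample-10BT configuration name comes from the upstream dataset card and is an assumption here, not part of the Oumi API:

    from datasets import load_dataset

    # Stream a subsampled configuration instead of the full 1.3T-token corpus.
    fineweb_edu = load_dataset(
        "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
    )
    print(next(iter(fineweb_edu))["text"][:200])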
The Pile: An 825 GiB diverse, open source language modeling dataset.
The Pile is a large-scale English language dataset consisting of 22 smaller,
high-quality datasets combined together. It is designed for training large
language models and supports various natural language processing tasks
[3][4].
Data Fields:
text (str) – The main text content.
meta (dict) – Metadata about the instance, including 'pile_set_name'.
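A brief sketch of filtering instances by subset via the meta field; the records and subset names are illustrative:

    # Illustrative Pile-style instances; values are made up.
    instances = [
        {"text": "From: trader@example.com ...", "meta": {"pile_set_name": "Enron Emails"}},
        {"text": "The sitting is resumed ...", "meta": {"pile_set_name": "EuroParl"}},
    ]

    # Keep only instances drawn from a single Pile subset.
    enron_only = [ex for ex in instances if ex["meta"]["pile_set_name"] == "Enron Emails"]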
Key Features:
825 GiB of diverse text data
Primarily in English
Supports text generation and fill-mask tasks
Includes various subsets like enron_emails, europarl, free_law, etc.
This dataset contains text from various sources and may include
personal or sensitive information. Users should consider potential biases
and limitations when using this dataset.
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
This dataset contains approximately 1.2 trillion tokens from various sources:
Commoncrawl (878B), C4 (175B), GitHub (59B), ArXiv (28B), Wikipedia (24B),
and StackExchange (20B) [5].
The dataset is primarily in English, though the Wikipedia slice contains
multiple languages.
RedPajama V2 Dataset for training large language models.
This dataset includes over 100B text documents from 84 CommonCrawl snapshots,
processed using the CCNet pipeline. It contains 30B documents with quality
signals and 20B deduplicated documents [5].
The dataset is available in English, German, French, Italian, and Spanish.
Key Features:
Over 100B text documents
30B documents with quality annotations
20B unique documents after deduplication
Estimated 50.6T tokens in total (30.4T after deduplication)
SlimPajama-627B: A cleaned and deduplicated version of RedPajama.
SlimPajama is the largest extensively deduplicated, multi-corpora, open-source
dataset for training large language models. It was created by cleaning and
deduplicating the 1.2T token RedPajama dataset, resulting in a 627B token dataset.
The dataset consists of 59,166 jsonl files and is ~895GB compressed. It includes
training, validation, and test splits [6].
StarCoder Training Dataset used for training StarCoder and StarCoderBase models.
This dataset contains 783GB of code in 86 programming languages, including 54GB
of GitHub Issues, 13GB of Jupyter notebooks in scripts and text-code pairs, and
32GB of GitHub commits, totaling approximately 250 billion tokens.
The dataset is a cleaned, decontaminated, and near-deduplicated version of
The Stack dataset, with PII removed. It includes various programming languages,
GitHub issues, Jupyter Notebooks, and GitHub commits.
GitHub issues, GitHub commits, and Jupyter notebooks subsets have different
columns from the rest. It is recommended to load programming languages separately
from these categories (see the loading sketch after the subset list below):
- jupyter-scripts-dedup-filtered
- jupyter-structured-clean-dedup
- github-issues-filtered-structured
- git-commits-cleaned
Subsets (See dataset for full list):
python
javascript
assembly
awk
git-commits-cleaned
github-issues-filtered-structured
…
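A sketch of loading a single programming-language subset on its own, using the Hugging Face datasets library directly; the bigcode/starcoderdata repository id, the data_dir argument, and the content column name follow the upstream dataset card and are assumptions here rather than the Oumi wrapper API:

    from datasets import load_dataset

    # Load only the Python subset; the special categories listed above
    # (git-commits-cleaned, github-issues-filtered-structured, ...) have
    # different columns and should be loaded separately in the same way.
    python_code = load_dataset(
        "bigcode/starcoderdata", data_dir="python", split="train", streaming=True
    )
    print(next(iter(python_code))["content"][:200])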
TextSftJsonLinesDataset for loading SFT data in oumi and alpaca formats.
This dataset class is designed to work with JSON Lines (.jsonl) or
JSON (.json) files containing text-based supervised fine-tuning (SFT) data.
It supports loading data either from a file or from a provided list of data
samples in oumi and alpaca formats.
Supported formats:
1. JSONL or JSON of conversations (Oumi format)
2. JSONL or JSON of Alpaca-style turns (instruction, input, output)
Parameters:
dataset_path (Optional[Union[str, Path]]) – Path to the dataset file
(.jsonl or .json).
data (Optional[List[Dict[str, Any]]]) – List of conversation dicts if not
loading from a file.
format (Optional[str]) – The format of the data. Either "conversations" or
"alpaca". If not provided, the format will be auto-detected.
**kwargs – Additional arguments to pass to the parent class.
Examples
Loading conversations from a JSONL file with auto-detection:
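A minimal sketch using the parameters documented above (the file path is a placeholder):

    from oumi.datasets import TextSftJsonLinesDataset

    # The format ("conversations" or "alpaca") is auto-detected when not given.
    dataset = TextSftJsonLinesDataset(dataset_path="path/to/sft_data.jsonl")

    # Alternatively, build the dataset from in-memory Alpaca-style samples.
    alpaca_samples = [
        {
            "instruction": "Summarize the following text.",
            "input": "Oumi provides datasets for training and evaluation.",
            "output": "Oumi ships ready-to-use training and evaluation datasets.",
        },
    ]
    dataset = TextSftJsonLinesDataset(data=alpaca_samples, format="alpaca")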
A dataset containing over 6TB of permissively-licensed source code files.
The Stack was created as part of the BigCode Project, an open scientific
collaboration working on the responsible development of Large Language Models
for Code (Code LLMs). It serves as a pre-training dataset for Code LLMs,
enabling the synthesis of programs from natural language descriptions and
other code snippets, and covers 358 programming languages.
The dataset contains code in multiple natural languages, primarily found in
comments and docstrings. It supports tasks such as code completion,
documentation generation, and auto-completion of code snippets.
TinyStoriesDataset class for loading and processing the TinyStories dataset.
This dataset contains synthetically generated short stories with a small
vocabulary, created by GPT-3.5 and GPT-4. It is designed for text generation
tasks and is available in English.
A dataset of textbook-like content for training small language models.
This dataset contains 420,000 textbook documents covering a wide range of topics
and concepts. It provides a comprehensive and diverse learning resource for
causal language models, focusing on quality over quantity.
The dataset was synthesized using the Nous-Hermes-Llama2-13b model, combining
the best of the falcon-refinedweb and minipile datasets to ensure diversity and
quality while maintaining a small size.
VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format.
This dataset class is designed to work with JSON Lines (.jsonl) files containing
Vision-Language supervised fine-tuning (SFT) data. It supports loading data either
from a file or from a provided list of data samples.
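By analogy with TextSftJsonLinesDataset above, a minimal sketch; the dataset_path parameter name is assumed to match and the file path is a placeholder:

    from oumi.datasets import VLJsonlinesDataset

    # Load Vision-Language SFT conversations from a JSON Lines file.
    vl_dataset = VLJsonlinesDataset(dataset_path="path/to/vl_sft_data.jsonl")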
The WikiText dataset is a collection of over 100 million tokens extracted from
verified Good and Featured articles on Wikipedia. It is available in two sizes:
WikiText-2 (2 million tokens) and WikiText-103 (103 million tokens). Each size
comes in two variants: raw (for character-level work) and processed (for
word-level work) [7].
The dataset is well-suited for models that can take advantage of long-term
dependencies, as it is composed of full articles and retains original case,
punctuation, and numbers.
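For illustration, a size and variant are selected by configuration name when loading the underlying Hugging Face dataset directly; the Salesforce/wikitext repository id and wikitext-103-raw-v1 configuration name follow the upstream hub and are assumptions here:

    from datasets import load_dataset

    # Raw variants are intended for character-level work, processed variants
    # for word-level work; use "wikitext-2-raw-v1" for the 2M-token size.
    wikitext = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")
    print(wikitext[0]["text"])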
Dataset containing cleaned Wikipedia articles in multiple languages.
This dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/)
with one subset per language, each containing a single train split.
Each example contains the content of one full Wikipedia article
with cleaning to strip markdown and unwanted sections (references, etc.).
Data Fields:
id (str) – ID of the article.
url (str) – URL of the article.
title (str) – Title of the article.
text (str) – Text content of the article.
Note
All configurations contain a single 'train' split.
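For illustration, one language subset can be loaded directly from the Hugging Face hub; the wikimedia/wikipedia repository id and the 20231101.en dump name are assumptions based on the upstream card:

    from datasets import load_dataset

    # One subset per language; each provides only a "train" split.
    wiki_en = load_dataset(
        "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
    )
    article = next(iter(wiki_en))
    print(article["title"], article["url"])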
This dataset is a collection of audio transcripts from 2,063,066 videos shared on
YouTube under a CC-By license. It contains 22,709,724 original and automatically
translated transcripts from 3,156,703 videos (721,136 individual channels),
representing nearly 45 billion words.
The corpus is multilingual, with English accounting for the majority (71%) of the
original-language content. Automated translations are provided for nearly all videos in
English, French, Spanish, German, Russian, Italian, and Dutch.
This dataset aims to expand the availability of conversational data for research
in AI, computational social science, and digital humanities.
The text can be used for training models and republished for reproducibility
purposes. In accordance with the CC-By license, every YouTube channel is fully
credited.