oumi.core.datasets#

Core datasets module for the Oumi (Open Universal Machine Intelligence) library.

This module provides base classes for the different types of datasets used in the Oumi framework. These base classes serve as foundations that can be extended to create custom datasets tailored to specific machine learning tasks.

For more detailed information on each class, please refer to their respective documentation.

class oumi.core.datasets.BaseExperimentalDpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, **kwargs)[source]#

Bases: BaseMapDataset

Base class for experimental preference (DPO) datasets that preprocess samples to the Oumi format.

Warning

This class is experimental and subject to change.
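
Example

A minimal sketch of a subclass, assuming (as the method list below suggests) that concrete DPO datasets override transform_preference. The dataset identifier and the question/chosen_response/rejected_response column names are hypothetical.

>>> from oumi.core.datasets import BaseExperimentalDpoDataset
>>> class MyPreferenceDataset(BaseExperimentalDpoDataset):
...     default_dataset = "my_org/my_preference_data"  # hypothetical name
...     def transform_preference(self, samples: dict) -> dict:
...         # Map the assumed raw columns to prompt/chosen/rejected fields.
...         return {
...             "prompt": samples["question"],
...             "chosen": samples["chosen_response"],
...             "rejected": samples["rejected_response"],
...         }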

dataset_name: str#
transform(sample: dict) dict[source]#

Transform the sample to the Oumi format.

transform_preference(samples: dict) dict[source]#

Transform the samples to the Oumi format.

trust_remote_code: bool#
class oumi.core.datasets.BaseIterableDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, stream: bool = True, **kwargs)[source]#

Bases: IterDataPipe, ABC

Abstract base class for iterable datasets.
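
Example

A minimal sketch of a streaming dataset, assuming only that the abstract transform method (documented below) must be implemented. The dataset identifier and the raw "document" field name are hypothetical.

>>> from typing import Any
>>> from oumi.core.datasets import BaseIterableDataset
>>> class MyStreamingDataset(BaseIterableDataset):
...     default_dataset = "my_org/my_streaming_corpus"  # hypothetical name
...     def transform(self, sample: Any) -> dict[str, Any]:
...         # Keep only the assumed "document" field from each raw sample.
...         return {"text": sample["document"]}

Iterating over an instance yields the transformed samples one at a time, and to_hf() wraps the same stream in a datasets.IterableDataset.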

__iter__()[source]#

Iterates over the dataset.

property data: Iterable[Any]#

Returns the underlying dataset data.

dataset_name: str#
dataset_path: str | None = None#
default_dataset: str | None = None#
default_subset: str | None = None#
iter_raw()[source]#

Iterates over the raw dataset.

to_hf(return_iterable: bool = True) IterableDataset[source]#

Converts the dataset to a Hugging Face dataset.

abstract transform(sample: Any) dict[str, Any][source]#

Preprocesses the inputs in the given sample.

Parameters:

sample (Any) – A sample from the dataset.

Returns:

A dictionary containing the preprocessed input data.

Return type:

dict

trust_remote_code: bool = False#
class oumi.core.datasets.BaseMapDataset(*, dataset_name: str | None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, transform_num_workers: str | int | None = None, **kwargs)[source]#

Bases: MapDataPipe, ABC

Abstract base class for map datasets.
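
Example

A minimal sketch of a map-style dataset, assuming only the abstract transform method documented below. The dataset identifier and the question/answer column names are hypothetical.

>>> import pandas as pd
>>> from oumi.core.datasets import BaseMapDataset
>>> class MyMapDataset(BaseMapDataset):
...     default_dataset = "my_org/my_qa_data"  # hypothetical name
...     def transform(self, sample: pd.Series) -> dict:
...         # "question" and "answer" are assumed raw column names.
...         return {"text": f"Q: {sample['question']}\nA: {sample['answer']}"}

Assuming the base class handles loading from the dataset_name/dataset_path constructor arguments shown above, an instance can then be indexed like a list, converted with to_hf(), or iterated with as_generator().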

__getitem__(idx: int) dict[source]#

Gets the item at the specified index.

Parameters:

idx (int) – The index of the item to retrieve.

Returns:

The item at the specified index.

Return type:

dict

__len__() int[source]#

Gets the number of items in the dataset.

Returns:

The number of items in the dataset.

Return type:

int

as_generator() Generator[dict[str, Any], None, None][source]#

Returns a generator for the dataset.

property data: DataFrame#

Returns the underlying dataset data.

dataset_name: str#
dataset_path: str | None = None#
default_dataset: str | None = None#
default_subset: str | None = None#
raw(idx: int) Series[source]#

Returns the raw data at the specified index.

Parameters:

idx (int) – The index of the data to retrieve.

Returns:

The raw data at the specified index.

Return type:

pd.Series

to_hf(return_iterable: bool = False) Dataset | IterableDataset[source]#

Converts the dataset to a Hugging Face dataset.

Parameters:

return_iterable – Whether to return an iterable dataset. Iterable datasets aren’t cached to disk, which can sometimes be advantageous; for example, when transformed examples are very large (e.g., when pixel_values are large for multimodal data), or when you don’t want to process the whole dataset before training starts.

Returns:

A HuggingFace dataset. Can be datasets.Dataset or datasets.IterableDataset depending on the value of return_iterable.

abstract transform(sample: Series) dict[source]#

Preprocesses the inputs in the given sample.

Parameters:

sample (pd.Series) – A series containing the input data for a single sample.

Returns:

A dictionary containing the preprocessed input data.

Return type:

dict

transform_num_workers: str | int | None = None#
trust_remote_code: bool#
class oumi.core.datasets.BasePretrainingDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#

Bases: BaseIterableDataset

Base class for pretraining iterable datasets.

This class extends BaseIterableDataset to provide functionality specific to pretraining tasks.

Variables:
  • tokenizer (BaseTokenizer) – The tokenizer used for text encoding.

  • seq_length (int) – The desired sequence length for model inputs.

  • concat_token_id (int) – The ID of the token used to concatenate documents.

Example

>>> from transformers import AutoTokenizer
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.datasets import BasePretrainingDataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> dataset = BasePretrainingDataset(
...     dataset_name="wikimedia/wikipedia",
...     subset="20231101.en",
...     split="train",
...     tokenizer=tokenizer,
...     seq_length=512
... )
>>> example = next(iter(dataset))
__iter__()[source]#

Iterates over the dataset and yields samples of a specified sequence length.

The underlying dataset is a stream of documents. Each document is expected to contain a text field self._dataset_text_field that will be tokenized. Training samples are then yielded in sequences of length self.seq_length.

Because a yielded sample may contain tokens from more than one document, self.concat_token_id is optionally used to separate the sequences that come from different documents.
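
A schematic of this packing step (not the class’s actual implementation; the helper name pack_tokens is hypothetical):

>>> from collections.abc import Iterable, Iterator
>>> def pack_tokens(
...     documents: Iterable[list[int]], seq_length: int, concat_token_id: int
... ) -> Iterator[list[int]]:
...     # Concatenate tokenized documents, separated by the concat token,
...     # then emit fixed-length chunks of seq_length tokens.
...     buffer: list[int] = []
...     for tokens in documents:
...         buffer.extend(tokens)
...         buffer.append(concat_token_id)
...         while len(buffer) >= seq_length:
...             yield buffer[:seq_length]
...             buffer = buffer[seq_length:]
>>> list(pack_tokens([[1, 2, 3], [4, 5]], seq_length=4, concat_token_id=0))
[[1, 2, 3, 0]]

The trailing partial chunk ([4, 5, 0] here) is dropped, mirroring the skip_last=True default.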

dataset_name: str#
tokenize(text: str) list[int][source]#

Tokenizes the given text.

Should not apply any padding/truncation to allow for packing.

transform(sample: Any) list[int][source]#

Preprocesses the inputs in the given sample.

class oumi.core.datasets.BaseSftDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#

Bases: BaseMapDataset, ABC

In-memory dataset for SFT data.
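
Example

A minimal sketch of a concrete SFT dataset: only transform_conversation (documented below) needs to be implemented. The dataset identifier and the instruction/response column names are hypothetical, and Message and Role are assumed to be importable from oumi.core.types.conversation alongside Conversation.

>>> from oumi.core.datasets import BaseSftDataset
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>> class MyInstructionDataset(BaseSftDataset):
...     default_dataset = "my_org/my_instructions"  # hypothetical name
...     def transform_conversation(self, example: dict) -> Conversation:
...         # "instruction" and "response" are assumed raw column names.
...         return Conversation(
...             messages=[
...                 Message(role=Role.USER, content=example["instruction"]),
...                 Message(role=Role.ASSISTANT, content=example["response"]),
...             ]
...         )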

property assistant_only: bool#

Gets whether the dataset is set to train only on assistant turns.

conversation(idx: int) Conversation[source]#

Returns the conversation at the specified index.

Parameters:

idx (int) – The index of the conversation to retrieve.

Returns:

The conversation at the specified index.

Return type:

Conversation

conversations() list[Conversation][source]#

Returns a list of all conversations.

dataset_name: str#
default_dataset: str | None = None#
prompt(idx: int) str[source]#

Returns the prompt at the specified index.

Parameters:

idx (int) – The index of the prompt to retrieve.

Returns:

The prompt at the specified index.

Return type:

str

property task: str#

Gets the task mode for the dataset.

The generated prompt is often different for generation vs SFT tasks.

property text_col: str#

Gets the text target column.

The generated text will be stored in this column.

tokenize(sample: dict | Series | Conversation, tokenize: bool = True) dict[source]#

Applies the chat template carried by the tokenizer to the input example.

Parameters:
  • sample (dict) – A dict whose “messages” key maps to the (ordered) list of messages exchanged within a single chat dialogue. Each item of example[“messages”] is a dict containing the content of the message and the role of the participant who sent it, e.g. role == ‘user’ or role == ‘assistant’.

  • tokenize (bool) – Whether to tokenize the messages or not.

Raises:
  • NotImplementedError – Currently only the sft task mode is supported.

  • ValueError – If the requested task is not one of “sft” or “generation”.

Returns:

The input example dictionary with an added text key, mapped to a string containing the messages rendered in the tokenizer’s chat format.

Return type:

Dict
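
A sketch of the expected “messages” structure described above (the message contents are illustrative):

>>> sample = {
...     "messages": [
...         {"role": "user", "content": "What is the capital of France?"},
...         {"role": "assistant", "content": "Paris."},
...     ]
... }

Calling tokenize(sample, tokenize=False) then adds a “text” key containing these messages rendered with the tokenizer’s chat template; the exact string depends on the tokenizer.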

transform(sample: Series) dict[source]#

Preprocesses the inputs in the given sample.

abstract transform_conversation(example: dict | Series) Conversation[source]#

Preprocesses the inputs of the example and returns an Oumi Conversation object.

Parameters:

example (dict) – The example containing the input and instruction.

Returns:

The preprocessed inputs as a Conversation object.

Return type:

Conversation

trust_remote_code: bool#
class oumi.core.datasets.PackedSftDataset(base_dataset: BaseSftDataset, max_seq_len: int, show_progress: bool = True, split_samples: bool = False, concat_token_id: int | None = None, pad_token_id: int | None = None, enable_padding: bool = True, **kwargs)[source]#

Bases: BaseMapDataset

A dataset that packs samples from a base SFT dataset to maximize efficiency.
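
Example

A brief usage sketch that packs the hypothetical MyInstructionDataset from the BaseSftDataset example above. The local dataset_path, the assumption that it can be loaded directly from a JSONL file, and the choice to reuse the EOS token for padding and concatenation are all assumptions, not documented behavior.

>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.datasets import PackedSftDataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> base_sft = MyInstructionDataset(
...     dataset_path="data/my_instructions.jsonl",  # hypothetical local file
...     tokenizer=tokenizer,
... )
>>> packed = PackedSftDataset(
...     base_dataset=base_sft,
...     max_seq_len=1024,
...     concat_token_id=tokenizer.eos_token_id,
...     pad_token_id=tokenizer.eos_token_id,  # reuse EOS if no pad token is set
... )
>>> pack = packed[0]  # a dict of tensors, per __getitem__ above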

__getitem__(idx: int) dict[str, Tensor][source]#

Get a pack from the dataset by index.

dataset_name: str#
transform(example: dict) dict[source]#

No-op transform.

trust_remote_code: bool#
class oumi.core.datasets.PretrainingAsyncTextDataset(tokenizer: PreTrainedTokenizerBase | None, dataset: Dataset, dataset_text_field: str | None = None, formatting_func: Callable | None = None, infinite: bool = False, seq_length: int = 1024, sequence_buffer_size: int = 1024, eos_token_id: int = 0, shuffle: bool = False, append_concat_token: bool = True, add_special_tokens: bool = True, pretokenized: bool = True)[source]#

Bases: IterableDataset

Iterable dataset that returns constant-length chunks of tokens.

Prefetches, formats, and tokenizes asynchronously from the main thread.

Based on TRL’s ConstantLengthDataset class.
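
Example

A brief usage sketch, assuming a standard Hugging Face Dataset with a raw “text” column is passed; the parameter names come from the signature above, and pretokenized=False is used because the text field holds raw strings.

>>> from datasets import load_dataset
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.datasets import PretrainingAsyncTextDataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> raw = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
>>> dataset = PretrainingAsyncTextDataset(
...     tokenizer=tokenizer,
...     dataset=raw,
...     dataset_text_field="text",
...     seq_length=1024,
...     eos_token_id=tokenizer.eos_token_id,
...     pretokenized=False,  # the "text" column holds raw strings, not token ids
... )
>>> example = next(iter(dataset))  # a constant-length chunk of token ids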

__iter__()[source]#

Iterates through the dataset, performing most of the work on a separate background thread.

class oumi.core.datasets.VisionLanguageSftDataset(*, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, **kwargs)[source]#

Bases: BaseSftDataset, ABC

Abstract dataset for vision-language models.

This class extends BaseSftDataset to provide functionality specific to vision-language tasks. It handles the processing of both image and text data.

Note

This dataset is designed to work with models that can process both image and text inputs simultaneously, such as CLIP, BLIP, or other multimodal architectures.

Example

>>> from oumi.builders import build_processor, build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.types.conversation import Conversation
>>> from oumi.core.datasets import VisionLanguageSftDataset
>>> class MyVisionLanguageSftDataset(VisionLanguageSftDataset):
...     def transform_conversation(self, example: dict):
...         # Implement the abstract method
...         # Convert the raw example into a Conversation object
...         pass
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = MyVisionLanguageSftDataset( 
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
...     dataset_name="coco_captions",
...     split="train"
... )
>>> sample = next(iter(dataset))  
>>> print(sample.keys()) 
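
A hedged sketch of one possible body for transform_conversation in the subclass above. ContentItem, Message, Role, and Type are assumed to be importable from oumi.core.types.conversation (alongside Conversation, imported in the example), and the image_path/caption column names are hypothetical.

>>> from oumi.core.types.conversation import ContentItem, Message, Role, Type
>>> def transform_conversation(self, example: dict) -> Conversation:
...     # Possible body for MyVisionLanguageSftDataset.transform_conversation;
...     # "image_path" and "caption" are assumed raw column names.
...     return Conversation(
...         messages=[
...             Message(
...                 role=Role.USER,
...                 content=[
...                     ContentItem(type=Type.IMAGE_PATH, content=example["image_path"]),
...                     ContentItem(type=Type.TEXT, content="Describe this image."),
...                 ],
...             ),
...             Message(role=Role.ASSISTANT, content=example["caption"]),
...         ]
...     )
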
dataset_name: str#
transform(sample: dict) dict[source]#

Transforms an Oumi conversation into a dictionary of inputs for a model.

Parameters:

sample (dict) – A dictionary representing a single conversation example.

Returns:

A dictionary of inputs for a model.

Return type:

dict

abstract transform_conversation(example: dict) Conversation[source]#

Transforms a raw example into an Oumi Conversation object.

Parameters:

example (dict) – A dictionary representing a single conversation example.

Returns:

A Conversation object representing the conversation.

Return type:

Conversation

trust_remote_code: bool#