oumi.core.datasets#
Core datasets module for the Oumi (Open Universal Machine Intelligence) library.
This module provides base classes for the different types of datasets used in the Oumi framework. These base classes can be extended to create custom datasets tailored to specific machine learning tasks.
For more detailed information on each class, please refer to their respective documentation.
- class oumi.core.datasets.BaseExperimentalDpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, **kwargs)[source]#
Bases: BaseMapDataset
Preprocesses the samples into the Oumi format.
Warning
This class is experimental and subject to change.
- dataset_name: str#
- trust_remote_code: bool#
- class oumi.core.datasets.BaseIterableDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, stream: bool = True, **kwargs)[source]#
Bases: IterDataPipe, ABC
Abstract base class for iterable datasets.
- property data: Iterable[Any]#
Returns the underlying dataset data.
- dataset_name: str#
- dataset_path: str | None = None#
- default_dataset: str | None = None#
- default_subset: str | None = None#
- to_hf(return_iterable: bool = True) IterableDataset [source]#
Converts the dataset to a Hugging Face dataset.
- abstract transform(sample: Any) dict[str, Any] [source]#
Preprocesses the inputs in the given sample.
- Parameters:
sample (Any) – A sample from the dataset.
- Returns:
A dictionary containing the preprocessed input data.
- Return type:
dict
- trust_remote_code: bool = False#
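For illustration, a concrete subclass only needs to implement transform; the dataset name and the "text" field below are hypothetical:
>>> from typing import Any
>>> from oumi.core.datasets import BaseIterableDataset
>>> class MyIterableDataset(BaseIterableDataset):
...     default_dataset = "my_org/my_streaming_dataset"  # hypothetical name
...     def transform(self, sample: Any) -> dict[str, Any]:
...         # Keep only the field the downstream pipeline needs.
...         return {"text": sample["text"]}
>>> dataset = MyIterableDataset(
...     dataset_name="my_org/my_streaming_dataset",
...     split="train",
...     stream=True,
... )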
- class oumi.core.datasets.BaseMapDataset(*, dataset_name: str | None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, transform_num_workers: str | int | None = None, **kwargs)[source]#
Bases: MapDataPipe, ABC
Abstract base class for map datasets.
- __getitem__(idx: int) dict [source]#
Gets the item at the specified index.
- Parameters:
idx (int) – The index of the item to retrieve.
- Returns:
The item at the specified index.
- Return type:
dict
- __len__() int [source]#
Gets the number of items in the dataset.
- Returns:
The number of items in the dataset.
- Return type:
int
- property data: DataFrame#
Returns the underlying dataset data.
- dataset_name: str#
- dataset_path: str | None = None#
- default_dataset: str | None = None#
- default_subset: str | None = None#
- raw(idx: int) Series [source]#
Returns the raw data at the specified index.
- Parameters:
idx (int) – The index of the data to retrieve.
- Returns:
The raw data at the specified index.
- Return type:
pd.Series
- to_hf(return_iterable: bool = False) Dataset | IterableDataset [source]#
Converts the dataset to a Hugging Face dataset.
- Parameters:
return_iterable – Whether to return an iterable dataset. Iterable datasets aren’t cached to disk, which can sometimes be advantageous. For example, if transformed examples are very large (e.g., if pixel_values are large for multimodal data), or if you don’t want to post-process the whole dataset before training starts.
- Returns:
A HuggingFace dataset. Can be datasets.Dataset or datasets.IterableDataset depending on the value of return_iterable.
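For illustration (my_dataset stands in for an instance of any concrete BaseMapDataset subclass):
>>> hf_dataset = my_dataset.to_hf()  # datasets.Dataset, materialized and cached
>>> hf_iterable = my_dataset.to_hf(return_iterable=True)  # datasets.IterableDataset, not cached to disk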
- abstract transform(sample: Series) dict [source]#
Preprocesses the inputs in the given sample.
- Parameters:
sample (dict) – A dictionary containing the input data.
- Returns:
A dictionary containing the preprocessed input data.
- Return type:
dict
- transform_num_workers: str | int | None = None#
- trust_remote_code: bool#
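A minimal subclass sketch for illustration; the dataset name and the "prompt"/"response" column names are hypothetical, and the explicit call to self._load_data() reflects an assumption about how concrete Oumi map datasets populate their data:
>>> import pandas as pd
>>> from oumi.core.datasets import BaseMapDataset
>>> class MyMapDataset(BaseMapDataset):
...     default_dataset = "my_org/my_dataset"  # hypothetical name
...     def __init__(self, **kwargs):
...         super().__init__(**kwargs)
...         self._data = self._load_data()  # assumption: data is loaded eagerly here
...     def transform(self, sample: pd.Series) -> dict:
...         # "prompt" and "response" are assumed column names in the raw data.
...         return {"text": f"{sample['prompt']}\n{sample['response']}"}
>>> dataset = MyMapDataset(dataset_name="my_org/my_dataset", split="train")
>>> example = dataset[0]  # __getitem__ returns transform(raw row)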
- class oumi.core.datasets.BasePretrainingDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases: BaseIterableDataset
Base class for pretraining iterable datasets.
This class extends BaseIterableDataset to provide functionality specific to pretraining tasks.
- Variables:
tokenizer (BaseTokenizer) – The tokenizer used for text encoding.
seq_length (int) – The desired sequence length for model inputs.
concat_token_id (int) – The ID of the token used to concatenate documents.
Example
>>> from transformers import AutoTokenizer
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.datasets import BasePretrainingDataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> dataset = BasePretrainingDataset(
...     dataset_name="wikimedia/wikipedia",
...     subset="20231101.en",
...     split="train",
...     tokenizer=tokenizer,
...     seq_length=512
... )
>>> example = next(iter(dataset))
- __iter__()[source]#
Iterates over the dataset and yields samples of a specified sequence length.
The underlying dataset is a stream of documents. Each document is expected to contain a text field self._dataset_text_field that will be tokenized. Training samples are then yielded in sequences of length self.seq_length.
Since this iterator may yield samples that span different documents, self.concat_token_id is optionally used to separate the sequences coming from different documents.
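A rough continuation of the example above showing this packing behavior; the "input_ids" key of each yielded sample is an assumption about the output format:
>>> for i, example in enumerate(dataset):
...     assert len(example["input_ids"]) == 512  # matches seq_length above
...     if i >= 2:  # only inspect the first few packed samples
...         break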
- dataset_name: str#
- class oumi.core.datasets.BaseSftDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, **kwargs)[source]#
Bases: BaseMapDataset, ABC
In-memory dataset for SFT data.
- property assistant_only: bool#
Gets whether the dataset is set to train only on assistant turns.
- conversation(idx: int) Conversation [source]#
Returns the conversation at the specified index.
- Parameters:
idx (int) – The index of the conversation to retrieve.
- Returns:
The conversation at the specified index.
- Return type:
Conversation
- conversations() list[Conversation] [source]#
Returns a list of all conversations.
- dataset_name: str#
- default_dataset: str | None = None#
- prompt(idx: int) str [source]#
Returns the prompt at the specified index.
- Parameters:
idx (int) – The index of the prompt to retrieve.
- Returns:
The prompt at the specified index.
- Return type:
str
- property task: str#
Gets the task mode for the dataset.
The generated prompt is often different for generation vs SFT tasks.
- property text_col: str#
Gets the text target column.
The generated text will be stored in this column.
- tokenize(sample: dict | Series | Conversation, tokenize: bool = True) dict [source]#
Applies the chat template carried by the tokenizer to the input example.
- Parameters:
sample (dict) – A dict mapping "messages" to the (ordered) list of messages exchanged within a single chat dialogue. Each item of sample["messages"] is a dict containing the content of the message and the role of the participant who relayed it, e.g., role == 'user' or role == 'assistant'.
tokenize (bool) – Whether to tokenize the messages or not.
- Raises:
NotImplementedError – Currently only the "sft" task mode is supported.
ValueError – If the requested task is not one of "sft" or "generation".
- Returns:
The input sample with an added "text" key, mapped to a string containing the messages rendered in the tokenizer's chat format.
- Return type:
Dict
- abstract transform_conversation(example: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns a Conversation object.
- Parameters:
example (dict) – The example containing the input and instruction.
- Returns:
A Conversation object representing the conversation.
- Return type:
Conversation
- trust_remote_code: bool#
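For illustration, a concrete SFT dataset mainly needs to implement transform_conversation; the dataset name and the "question"/"answer" column names below are hypothetical:
>>> import pandas as pd
>>> from oumi.core.datasets import BaseSftDataset
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>> class MySftDataset(BaseSftDataset):
...     default_dataset = "my_org/my_sft_dataset"  # hypothetical name
...     def transform_conversation(self, example: dict | pd.Series) -> Conversation:
...         # "question" and "answer" are assumed column names in the raw data.
...         return Conversation(
...             messages=[
...                 Message(role=Role.USER, content=example["question"]),
...                 Message(role=Role.ASSISTANT, content=example["answer"]),
...             ]
...         )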
- class oumi.core.datasets.PackedSftDataset(base_dataset: BaseSftDataset, max_seq_len: int, show_progress: bool = True, split_samples: bool = False, concat_token_id: int | None = None, pad_token_id: int | None = None, enable_padding: bool = True, **kwargs)[source]#
Bases: BaseMapDataset
A dataset that packs samples from a base SFT dataset to maximize efficiency.
- dataset_name: str#
- trust_remote_code: bool#
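A usage sketch based on the constructor signature above, assuming base_dataset is an existing BaseSftDataset instance (e.g., MySftDataset from the previous sketch) and tokenizer is its tokenizer:
>>> from oumi.core.datasets import PackedSftDataset
>>> packed = PackedSftDataset(
...     base_dataset=base_dataset,
...     max_seq_len=2048,
...     pad_token_id=tokenizer.pad_token_id,
... )
>>> example = packed[0]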
- class oumi.core.datasets.PretrainingAsyncTextDataset(tokenizer: PreTrainedTokenizerBase | None, dataset: Dataset, dataset_text_field: str | None = None, formatting_func: Callable | None = None, infinite: bool = False, seq_length: int = 1024, sequence_buffer_size: int = 1024, eos_token_id: int = 0, shuffle: bool = False, append_concat_token: bool = True, add_special_tokens: bool = True, pretokenized: bool = True)[source]#
Bases: IterableDataset
Iterable dataset that returns constant-length chunks of tokens.
Prefetches, formats, and tokenizes asynchronously from the main thread.
Based on TRL’s ConstantLengthDataset class.
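A usage sketch based on the constructor signature above; the dataset name is illustrative, the raw text field is assumed to be "text", and tokenizer is assumed to be a PreTrainedTokenizerBase instance:
>>> from datasets import load_dataset
>>> from oumi.core.datasets import PretrainingAsyncTextDataset
>>> raw = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
>>> dataset = PretrainingAsyncTextDataset(
...     tokenizer=tokenizer,
...     dataset=raw,
...     dataset_text_field="text",
...     seq_length=1024,
...     pretokenized=False,  # tokenize the raw text field on the fly
... )
>>> batch = next(iter(dataset))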
- class oumi.core.datasets.VisionLanguageSftDataset(*, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, **kwargs)[source]#
Bases: BaseSftDataset, ABC
Abstract dataset for vision-language models.
This class extends BaseSftDataset to provide functionality specific to vision-language tasks. It handles the processing of both image and text data.
Note
This dataset is designed to work with models that can process both image and text inputs simultaneously, such as CLIP, BLIP, or other multimodal architectures.
Example
>>> from oumi.builders import build_processor, build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.types.conversation import Conversation
>>> from oumi.core.datasets import VisionLanguageSftDataset
>>> class MyVisionLanguageSftDataset(VisionLanguageSftDataset):
...     def transform_conversation(self, example: dict):
...         # Implement the abstract method
...         # Convert the raw example into a Conversation object
...         pass
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = MyVisionLanguageSftDataset(
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
...     dataset_name="coco_captions",
...     split="train"
... )
>>> sample = next(iter(dataset))
>>> print(sample.keys())
- dataset_name: str#
- transform(sample: dict) dict [source]#
Transforms an Oumi conversation into a dictionary of inputs for a model.
- Parameters:
sample (dict) – A dictionary representing a single conversation example.
- Returns:
A dictionary of inputs for a model.
- Return type:
dict
- abstract transform_conversation(example: dict) Conversation [source]#
Transforms a raw example into an Oumi Conversation object.
- Parameters:
example (dict) – A dictionary representing a single conversation example.
- Returns:
A Conversation object representing the conversation.
- Return type:
Conversation
- trust_remote_code: bool#
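As a sketch of the abstract method for an image-captioning style dataset: the raw "image_url"/"caption" fields are hypothetical, and the ContentItem/Type multimodal message API from oumi.core.types.conversation is an assumption that may differ between Oumi versions:
>>> from oumi.core.datasets import VisionLanguageSftDataset
>>> from oumi.core.types.conversation import (
...     ContentItem, Conversation, Message, Role, Type
... )
>>> class MyCaptioningDataset(VisionLanguageSftDataset):
...     default_dataset = "my_org/my_captioning_dataset"  # hypothetical name
...     def transform_conversation(self, example: dict) -> Conversation:
...         return Conversation(
...             messages=[
...                 Message(
...                     role=Role.USER,
...                     content=[
...                         ContentItem(type=Type.IMAGE_URL, content=example["image_url"]),
...                         ContentItem(type=Type.TEXT, content="Describe this image."),
...                     ],
...                 ),
...                 Message(role=Role.ASSISTANT, content=example["caption"]),
...             ]
...         )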