oumi.datasets.vision_language#
Vision-Language datasets module.
- class oumi.datasets.vision_language.COCOCaptionsDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases: VisionLanguageSftDataset
Dataset class for the HuggingFaceM4/COCO dataset.
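- Example:
A minimal construction sketch (not taken from the class docstring): the tokenizer and processor names below are illustrative choices borrowed from the VLJsonlinesDataset example further down this page, not requirements of this class.
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets.vision_language import COCOCaptionsDataset
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = COCOCaptionsDataset(  # uses default_dataset, i.e. HuggingFaceM4/COCO
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
... )
The other Hub-backed dataset classes on this page share the same constructor signature and can be instantiated in the same way.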
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceM4/COCO'#
- default_prompt = 'Describe this image:'#
- transform_conversation(example: dict) → Conversation [source]#
Transform a single conversation example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.vision_language.DocmatixDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases: TheCauldronDataset
Dataset class for the HuggingFaceM4/Docmatix dataset.
The dataset has the same data layout and format as HuggingFaceM4/the_cauldron (hence it is defined as a sub-class), but the underlying data is different: unlike HuggingFaceM4/the_cauldron, it contains many multi-image examples and fewer subsets.
Be aware that ‘HuggingFaceM4/Docmatix’ is a very large dataset (~0.5TB) that requires substantial Internet bandwidth to download and a lot of disk space to store, so only use it if you know what you’re doing.
Using the ‘Docmatix’ dataset in Oumi should become easier once streaming is supported for custom Oumi datasets (OPE-1021).
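- Example:
A hedged sketch for keeping a first experiment small. The limit and max_images keywords come from the constructor signature above; their exact semantics (capping the number of examples and the number of images kept per example, respectively) are assumptions based on the parameter names, and the full download may still be required.
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets.vision_language import DocmatixDataset
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = DocmatixDataset(
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
...     limit=32,      # assumption: use only the first 32 examples
...     max_images=4,  # assumption: cap images per (multi-image) example
... )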
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceM4/Docmatix'#
- trust_remote_code: bool#
- class oumi.datasets.vision_language.Flickr30kDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases: VisionLanguageSftDataset
Dataset class for the nlphuji/flickr30k dataset.
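- Example:
A minimal sketch. trust_remote_code=True is included on the assumption that the underlying Hub dataset ships a custom loading script; drop it if the download works without it.
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets.vision_language import Flickr30kDataset
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = Flickr30kDataset(
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
...     trust_remote_code=True,  # assumption: required by the Hub dataset's loading script
... )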
- dataset_name: str#
- default_dataset: str | None = 'nlphuji/flickr30k'#
- transform_conversation(example: dict) → Conversation [source]#
Transform a single conversation example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.vision_language.LlavaInstructMixVsftDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases: VisionLanguageSftDataset
Dataset class for the HuggingFaceH4/llava-instruct-mix-vsft dataset.
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceH4/llava-instruct-mix-vsft'#
- transform_conversation(example: dict) → Conversation [source]#
Transform a dataset example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.vision_language.MnistSftDataset(*, dataset_name: str | None = None, **kwargs)[source]#
Bases: VisionLanguageSftDataset
MNIST dataset formatted as SFT data.
MNIST is a well-known small dataset that can be useful for quick tests, prototyping, and debugging.
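- Example:
A quick smoke-test sketch. return_conversations and limit are parent-class keyword arguments assumed to be forwarded through **kwargs, and returning Conversation objects without a tokenizer or processor is likewise an assumption based on the parameter name.
>>> from oumi.datasets.vision_language import MnistSftDataset
>>> dataset = MnistSftDataset(
...     return_conversations=True,  # assumption: yield Conversation objects, skipping tokenization
...     limit=8,                    # assumption: keep only a handful of examples
... )
>>> num_examples = len(dataset)  # map-style dataset, so len() reports the example count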
- dataset_name: str#
- default_dataset: str | None = 'ylecun/mnist'#
- transform_conversation(example: dict) → Conversation [source]#
Transform a single MNIST example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.vision_language.TheCauldronDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases: VisionLanguageSftDataset
Dataset class for the HuggingFaceM4/the_cauldron dataset.
The HuggingFaceM4/the_cauldron dataset is a comprehensive collection of 50 vision-language datasets, primarily training sets, used for fine-tuning the Idefics2 vision-language model. The datasets cover various domains such as general visual question answering, captioning, OCR, document understanding, chart/figure understanding, table understanding, reasoning, logic, maths, textbook/academic questions, differences between images, and screenshot to code.
- dataset_name: str#
- default_dataset: str | None = 'HuggingFaceM4/the_cauldron'#
- transform_conversation(example: dict[str, Any]) → Conversation [source]#
Transform raw data into a conversation with images.
- trust_remote_code: bool#
- class oumi.datasets.vision_language.VLJsonlinesDataset(dataset_path: str | Path | None = None, data: list | None = None, **kwargs)[source]#
Bases: VisionLanguageSftDataset
VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format.
This dataset class is designed to work with JSON Lines (.jsonl) files containing Vision-Language supervised fine-tuning (SFT) data. It supports loading data either from a file or from a provided list of data samples.
- Examples:
- Loading from a file:
>>> from oumi.datasets import VLJsonlinesDataset
>>> dataset = VLJsonlinesDataset(
...     dataset_path="/path/to/your/dataset.jsonl",
... )
- Loading from a list of data samples:
>>> from oumi.builders import build_processor, build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.datasets import VLJsonlinesDataset
>>> data_samples = [
...     {
...         "messages": [
...             {
...                 "role": "user",
...                 "content": "Describe this image:",
...                 "type": "text"
...             },
...             {
...                 "role": "user",
...                 "content": "path/to/image.jpg",
...                 "type": "image_path"
...             },
...             {
...                 "role": "assistant",
...                 "content": "A scenic view of the puget sound.",
...                 "type": "text",
...             },
...         ]
...     }
... ]
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = VLJsonlinesDataset(
...     data=data_samples,
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
... )
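- Writing a compatible .jsonl file (illustrative sketch):
Continuing the example above, each line of the file holds one JSON object with a "messages" list; this layout is inferred from the in-memory sample, not documented separately here.
>>> import json
>>> with open("dataset.jsonl", "w") as f:
...     for sample in data_samples:
...         print(json.dumps(sample), file=f)
>>> dataset = VLJsonlinesDataset(
...     dataset_path="dataset.jsonl",
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
... )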
- dataset_name: str#
- default_dataset: str | None = 'custom'#
- transform_conversation(example: dict) → Conversation [source]#
Transform a single conversation example into a Conversation object.
- trust_remote_code: bool#
- class oumi.datasets.vision_language.Vqav2SmallDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases: VisionLanguageSftDataset
Dataset class for the merve/vqav2-small dataset.
- dataset_name: str#
- default_dataset: str | None = 'merve/vqav2-small'#
- transform_conversation(example: dict) → Conversation [source]#
Transform a single conversation example into a Conversation object.
- trust_remote_code: bool#