Vision-Language Datasets#

Vision-Language Supervised Fine-Tuning (VL-SFT) extends Supervised Fine-Tuning (SFT) to handle both images and text. This enables a model to understand and reason about visual information, opening up a wide range of multimodal applications.

This guide covers Vision-Language datasets used for instruction tuning and supervised learning in Oumi.

VL-SFT Datasets#

Name                         Description                                                                Reference
COCOCaptionsDataset          Dataset class for the HuggingFaceM4/COCO dataset.                         COCOCaptionsDataset
Flickr30kDataset             Dataset class for the nlphuji/flickr30k dataset.                          Flickr30kDataset
LlavaInstructMixVsftDataset  Dataset class for the HuggingFaceH4/llava-instruct-mix-vsft dataset.      LlavaInstructMixVsftDataset
MnistSftDataset              MNIST dataset formatted as SFT data.                                      MnistSftDataset
VLJsonlinesDataset           VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format.   VLJsonlinesDataset
Vqav2SmallDataset            Dataset class for the merve/vqav2-small dataset.                          Vqav2SmallDataset

Usage#

Configuration#

The configuration for VL-SFT datasets is similar to that of regular SFT datasets, with some additional parameters for image processing. Here's an example:

training:
  data:
    train:
      collator_name: vision_language_with_padding
      datasets:
        - dataset_name: "your_vl_sft_dataset_name"
          split: "train"
          trust_remote_code: False # Set to true if needed for model-specific processors
          transform_num_workers: "auto"
          dataset_kwargs:
            processor_name: "meta-llama/Llama-3.2-11B-Vision-Instruct" # Model-specific processor
            return_tensors: True

In this configuration:

  • dataset_name: Name of the vision-language dataset

  • trust_remote_code: Whether to allow running code downloaded from the Hub; enable only if a model-specific processor requires it

  • transform_num_workers: Number of workers used for dataset transforms such as image preprocessing ("auto" lets Oumi pick a suitable value)

  • processor_name: The model-specific processor used to prepare image and text inputs
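
The same train data configuration can also be constructed programmatically. The following is a minimal sketch, assuming the YAML keys above map one-to-one onto the DataParams, DatasetSplitParams, and DatasetParams classes in oumi.core.configs:

from oumi.core.configs import DataParams, DatasetParams, DatasetSplitParams

# Sketch: programmatic equivalent of the `training.data.train` block above.
# Field names are assumed to mirror the YAML keys shown in the example.
data_params = DataParams(
    train=DatasetSplitParams(
        collator_name="vision_language_with_padding",
        datasets=[
            DatasetParams(
                dataset_name="your_vl_sft_dataset_name",
                split="train",
                trust_remote_code=False,
                transform_num_workers="auto",
                dataset_kwargs={
                    "processor_name": "meta-llama/Llama-3.2-11B-Vision-Instruct",
                    "return_tensors": True,
                },
            )
        ],
    )
)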

Python API#

Using a VL-SFT dataset in code is similar to using a regular SFT dataset, with the main difference being in the batch contents:

from oumi.builders import build_dataset, build_processor, build_tokenizer
from oumi.core.configs import DatasetSplit, ModelParams
from oumi.core.processors.base_processor import BaseProcessor
from oumi.core.tokenizers import BaseTokenizer
from torch.utils.data import DataLoader

# Assume your model parameters are already defined
model_params: ModelParams = ...
trust_remote_code: bool = False  # `True` if the model-specific processor requires it

# Build the tokenizer and the model-specific processor
tokenizer: BaseTokenizer = build_tokenizer(model_params)
processor: BaseProcessor = build_processor(
    model_params.model_name, tokenizer, trust_remote_code=trust_remote_code
)

# Build the dataset
dataset = build_dataset(
    dataset_name="your_vl_sft_dataset_name",
    tokenizer=tokenizer,
    split=DatasetSplit.TRAIN,
    dataset_kwargs=dict(processor=processor),
    trust_remote_code=trust_remote_code,
)

# Create the dataloader.
# Note: Oumi's trainer applies the configured collator (e.g. `vision_language_with_padding`);
# with a plain DataLoader, a custom `collate_fn` may be needed if example sizes vary.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Now you can use the dataset in your training loop
for batch in loader:
    # Process your batch
    # Note: batch will contain both text and image data
    ...

Batch Contents#

Vision-language batches typically include:

  • input_ids: Text token IDs

  • attention_mask: Text attention mask

  • pixel_values: Processed image tensors

  • image_attention_mask: Image attention mask

  • Additional model-specific keys

Tip

VL-SFT batches typically include additional keys for image data, such as pixel_values or cross_attention_mask, depending on the specific dataset and model architecture.
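
To see exactly which keys a given dataset and processor produce, it can help to inspect a single batch. This is a generic PyTorch sketch that reuses the loader built in the Python API example above:

# Print each key in one batch along with its tensor shape (or type, for non-tensors)
batch = next(iter(loader))
for key, value in batch.items():
    print(key, tuple(value.shape) if hasattr(value, "shape") else type(value))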

Custom VL-SFT Datasets#

VisionLanguageSftDataset Base Class#

All VL-SFT datasets in Oumi are subclasses of VisionLanguageSftDataset. This class extends the functionality of BaseSftDataset to handle image data alongside text.

Adding a New VL-SFT Dataset#

To add a new VL-SFT dataset, follow these steps:

  1. Subclass VisionLanguageSftDataset

  2. Implement the transform_conversation() method to handle both text and image data.

Here's a basic example that loads data from the hypothetical example/foo HuggingFace dataset (image + text) and formats each row as an Oumi Conversation for SFT:

from typing import Any, Dict

from oumi.core.datasets import VisionLanguageSftDataset
from oumi.core.registry import register_dataset
from oumi.core.types.conversation import ContentItem, Conversation, Message, Role, Type

@register_dataset("your_vl_sft_dataset_name")
class CustomVLDataset(VisionLanguageSftDataset):
    """Dataset class for the `example/foo` dataset."""
    default_dataset = "example/foo" # Name of the original HuggingFace dataset (image + text)

    def transform_conversation(self, example: Dict[str, Any]) -> Conversation:
        """Transform raw data into a conversation with images."""
        # Transform the raw example into a Conversation object
        # 'example' represents one row of the raw dataset
        # Structure of 'example':
        # {
        #     'image_bytes': bytes,  # PNG bytes of the image
        #     'question': str,       # The user's question about the image
        #     'answer': str          # The assistant's response
        # }
        conversation = Conversation(
            messages=[
                Message(role=Role.USER, content=[
                    ContentItem(type=Type.IMAGE_BINARY, binary=example['image_bytes']),
                    ContentItem(type=Type.TEXT, content=example['question']),
                ]),
                Message(role=Role.ASSISTANT, content=example['answer'])
            ]
        )

        return conversation
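
Once registered, the custom dataset can be loaded by its registered name just like the built-in datasets. The sketch below reuses the tokenizer and processor built in the Python API example above:

# Sketch: load the custom dataset via the name passed to @register_dataset
dataset = build_dataset(
    dataset_name="your_vl_sft_dataset_name",
    tokenizer=tokenizer,
    split=DatasetSplit.TRAIN,
    dataset_kwargs=dict(processor=processor),
)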

Note

The key difference in VL-SFT datasets is the inclusion of image data, typically represented as an additional ContentItem with type set to Type.IMAGE_BINARY, Type.IMAGE_PATH, or Type.IMAGE_URL.
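
For illustration, the three image variants might look as follows; this is a sketch assuming that IMAGE_PATH and IMAGE_URL items carry their location in the content field (the bytes, path, and URL shown are placeholders):

# Sketch of the three image ContentItem variants (values are placeholders)
image_from_bytes = ContentItem(type=Type.IMAGE_BINARY, binary=b"...")  # raw image bytes
image_from_path = ContentItem(type=Type.IMAGE_PATH, content="/path/to/image.png")
image_from_url = ContentItem(type=Type.IMAGE_URL, content="https://example.com/image.png")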

For more advanced VL-SFT dataset implementations, explore the oumi.datasets.vision_language module.

Using Custom Datasets via the CLI#

See Customizing Oumi to quickly enable your dataset when using the CLI.