Vision-Language Datasets#
Vision-Language Supervised Fine-Tuning (VL-SFT) extends the concept of Supervised Fine-Tuning (SFT) to handle both images and text. This enables the model to understand and reason about visual information, opening up a wide range of multimodal applications.
This guide covers Vision-Language datasets used for instruction tuning and supervised learning in Oumi.
VL-SFT Datasets#
| Name | Description |
|---|---|
| COCOCaptionsDataset | Dataset class for the COCO Captions dataset. |
| Flickr30kDataset | Dataset class for the Flickr30k dataset. |
| LlavaInstructMixVsftDataset | Dataset class for the LLaVA Instruct Mix VSFT dataset. |
| MnistSftDataset | MNIST dataset formatted as SFT data. |
| VLJsonlinesDataset | VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format. |
| Vqav2SmallDataset | Dataset class for the VQAv2 (small) dataset. |
Usage#
Configuration#
The configuration for VL-SFT datasets is similar to regular SFT datasets, with some additional parameters for image processing. Here’s an example:
training:
  data:
    train:
      collator_name: vision_language_with_padding
      datasets:
        - dataset_name: "your_vl_sft_dataset_name"
          split: "train"
          trust_remote_code: False # Set to True if needed for model-specific processors
          transform_num_workers: "auto"
          dataset_kwargs:
            processor_name: "meta-llama/Llama-3.2-11B-Vision-Instruct" # Model-specific processor
            return_tensors: True
In this configuration:
- `dataset_name`: Name of the vision-language dataset
- `trust_remote_code`: Enable for model-specific processors that use downloaded scripts
- `transform_num_workers`: Number of workers for image processing
- `processor_name`: Vision model processor to use
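To catch typos in dataset names or `dataset_kwargs` before launching a run, you can load the YAML programmatically. This is a minimal sketch, assuming the config above is saved at the hypothetical path `configs/vl_sft_train.yaml` and that `TrainingConfig.from_yaml` is available in your Oumi version:

from oumi.core.configs import TrainingConfig

# NOTE: the path below is a hypothetical location for the YAML shown above.
config = TrainingConfig.from_yaml("configs/vl_sft_train.yaml")

# Inspect the resolved dataset parameters before launching a run.
for dataset_params in config.data.train.datasets:
    print(dataset_params.dataset_name, dataset_params.dataset_kwargs)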
Python API#
Using a VL-SFT dataset in code is similar to using a regular SFT dataset, with the main difference being in the batch contents:
from oumi.builders import build_dataset, build_processor, build_tokenizer
from oumi.core.configs import DatasetSplit, ModelParams
from oumi.core.processors.base_processor import BaseProcessor
from oumi.core.tokenizers import BaseTokenizer
from torch.utils.data import DataLoader

# Assume you have your model parameters initialized
model_params: ModelParams = ...
trust_remote_code: bool = False  # `True` if the model-specific processor requires it

tokenizer: BaseTokenizer = build_tokenizer(model_params)
processor: BaseProcessor = build_processor(
    model_params.model_name, tokenizer, trust_remote_code=trust_remote_code
)

# Build the dataset
dataset = build_dataset(
    dataset_name="your_vl_sft_dataset_name",
    tokenizer=tokenizer,
    split=DatasetSplit.TRAIN,
    dataset_kwargs=dict(processor=processor),
    trust_remote_code=trust_remote_code,
)

# Create the dataloader
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Now you can use the dataset in your training loop
for batch in loader:
    # Process your batch
    # Note: the batch will contain both text and image data
    ...
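In full training runs, batching is typically handled by the collator named in the config (`vision_language_with_padding`), which pads variable-length text fields alongside the image tensors. If you want similar behavior in a standalone loop, here is a hedged sketch that reuses the `tokenizer` and `dataset` from above; the exact `build_data_collator` signature is assumed and may differ in your Oumi version:

from oumi.builders import build_data_collator

# A sketch only: the collator name matches the YAML config above;
# the signature below is assumed, not guaranteed.
collator = build_data_collator(
    collator_name="vision_language_with_padding",
    tokenizer=tokenizer,
    max_length=None,
)
loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collator)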
Batch Contents#
Vision-language batches typically include:
- `input_ids`: Text token IDs
- `attention_mask`: Text attention mask
- `pixel_values`: Processed image tensors
- `image_attention_mask`: Image attention mask
- Additional model-specific keys
Tip
VL-SFT batches typically include additional keys for image data, such as `pixel_values` or `cross_attention_mask`, depending on the specific dataset and model architecture.
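Because the exact keys vary by dataset and processor, a quick way to check what your own pipeline produces (reusing the `loader` from the Python API example above) is a short inspection loop like this:

# Peek at one batch to see which keys your processor/model combination produces.
batch = next(iter(loader))
for key, value in batch.items():
    shape = getattr(value, "shape", None)  # non-tensor entries have no shape
    print(f"{key}: {shape if shape is not None else type(value)}")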
Custom VL-SFT Datasets#
VisionLanguageSftDataset Base Class#
All VL-SFT datasets in Oumi are subclasses of `VisionLanguageSftDataset`. This class extends the functionality of `BaseSftDataset` to handle image data alongside text.
Adding a New VL-SFT Dataset#
To add a new VL-SFT dataset, follow these steps:
1. Subclass `VisionLanguageSftDataset`.
2. Implement the `transform_conversation()` method to handle both text and image data.
Here’s a basic example, which loads data from the hypothetical `example/foo` HuggingFace dataset (image + text) and formats the data as Oumi `Conversation` objects for SFT tuning:
from typing import Any, Dict

from oumi.core.datasets import VisionLanguageSftDataset
from oumi.core.registry import register_dataset
from oumi.core.types.conversation import ContentItem, Conversation, Message, Role, Type


@register_dataset("your_vl_sft_dataset_name")
class CustomVLDataset(VisionLanguageSftDataset):
    """Dataset class for the `example/foo` dataset."""

    default_dataset = "example/foo"  # Name of the original HuggingFace dataset (image + text)

    def transform_conversation(self, example: Dict[str, Any]) -> Conversation:
        """Transform raw data into a conversation with images."""
        # Transform the raw example into a Conversation object.
        # 'example' represents one row of the raw dataset with the structure:
        # {
        #     'image_bytes': bytes,  # PNG bytes of the image
        #     'question': str,       # The user's question about the image
        #     'answer': str,         # The assistant's response
        # }
        conversation = Conversation(
            messages=[
                Message(
                    role=Role.USER,
                    content=[
                        ContentItem(type=Type.IMAGE_BINARY, binary=example['image_bytes']),
                        ContentItem(type=Type.TEXT, content=example['question']),
                    ],
                ),
                Message(role=Role.ASSISTANT, content=example['answer']),
            ]
        )
        return conversation
Note
The key difference in VL-SFT datasets is the inclusion of image data, typically represented as an additional `ContentItem` with `type=Type.IMAGE_BINARY`, `type=Type.IMAGE_PATH`, or `type=Type.IMAGE_URL`.
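As a quick illustration (not tied to any particular dataset), the same user turn can reference its image in any of the three ways. This sketch assumes that `IMAGE_URL` and `IMAGE_PATH` items carry their location in the `content` field, mirroring how `IMAGE_BINARY` carries bytes in `binary`; the URL, file path, and `png_bytes` below are made-up placeholders:

from oumi.core.types.conversation import ContentItem, Message, Role, Type

# Illustrative placeholders only: pick whichever variant matches your data source.
image_item = ContentItem(type=Type.IMAGE_URL, content="https://example.com/cat.png")
# image_item = ContentItem(type=Type.IMAGE_PATH, content="images/cat.png")
# image_item = ContentItem(type=Type.IMAGE_BINARY, binary=png_bytes)

message = Message(
    role=Role.USER,
    content=[image_item, ContentItem(type=Type.TEXT, content="What animal is this?")],
)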
For more advanced VL-SFT dataset implementations, explore the `oumi.datasets.vision_language` module.
Using Custom Datasets via the CLI#
See Customizing Oumi to quickly enable your dataset when using the CLI.