Other Datasets#
In addition to the common LLM dataset formats (e.g., Pretraining, SFT, VL-SFT), Oumi infrastructure also allows users to define arbitrary ad-hoc datasets, which can be used not just for text-centric LLMs, but also for alternative model architectures and applications such as vision models (e.g., convolutional networks), scientific computing, and more.
This can be accomplished by defining a subclass of `BaseMapDataset` or `BaseIterableDataset`. A `BaseIterableDataset` is well suited for data streamed online, or for large datasets (e.g., hundreds of GBs), thanks to its lazy loading behavior, while `BaseMapDataset` should be the default choice for everything else (e.g., datasets that can be fully loaded into memory).
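For illustration only, a streaming dataset might look roughly like the following minimal sketch. It assumes a `BaseIterableDataset` subclass can produce examples from `__iter__`; the class name and JSONL path are hypothetical, so check the base class in `oumi.core.datasets` for the exact abstract methods to implement:

```python
import json
from typing import Iterator

from oumi.core.datasets import BaseIterableDataset


class JsonlStreamDataset(BaseIterableDataset):
    """Hypothetical dataset that streams examples from a JSONL file."""

    def __init__(self, dataset_path: str, **kwargs) -> None:
        self._dataset_path = dataset_path
        super().__init__(**kwargs)

    def __iter__(self) -> Iterator[dict]:
        # Yield one example at a time: the file is never fully materialized
        # in memory, which is what makes iterable datasets suitable for
        # very large corpora.
        with open(self._dataset_path) as f:
            for line in f:
                yield json.loads(line)
```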
To give a concrete example, let's show how to add support for datasets stored in the NumPy `.npz` file format:
NumPy Dataset#
The popular `numpy` library defines the `.npy` and `.npz` file formats [details], which can be used to save arbitrary multi-dimensional arrays (`np.ndarray`):

- A `.npy` file contains a single `np.ndarray`.
- A `.npz` file is an archive that contains a collection of multiple `np.ndarray`-s, with optional support for data compression.
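As a quick illustration of the format (the file name and arrays below are illustrative):

```python
import numpy as np

# Save two named arrays into a single, compressed archive.
images = np.random.rand(100, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=(100,))
np.savez_compressed("toy_data.npz", images=images, labels=labels)

# Arrays are read back from the archive by name.
with np.load("toy_data.npz") as npzfile:
    print(npzfile.files)            # ['images', 'labels']
    print(npzfile["images"].shape)  # (100, 28, 28)
```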
Adding a New NumPy (.npz) Dataset#
To add a new Oumi dataset that can load data from `.npz` files, follow these steps:

1. Subclass `BaseMapDataset`.
2. Implement the methods to handle initialization, data loading, and data transformation.

Here's a basic example, which shows how to do that:
```python
from pathlib import Path
from typing import Optional, Union

from typing_extensions import override

import numpy as np
import pandas as pd

from oumi.core.datasets import BaseMapDataset
from oumi.core.registry import register_dataset


@register_dataset("npz_file")
class NpzDataset(BaseMapDataset):
    """Loads dataset from NumPy .npz archive."""

    default_dataset = "custom"

    def __init__(
        self,
        *,
        dataset_name: Optional[str] = None,
        dataset_path: Optional[Union[str, Path]] = None,
        split: Optional[str] = None,
        npz_split_col: Optional[str] = None,
        npz_allow_pickle: bool = False,
        **kwargs,
    ) -> None:
        """Initializes a new instance of the NpzDataset class.

        Args:
            dataset_name: Dataset name.
            dataset_path: Path to .npz file.
            split: Dataset split. If unspecified, then the whole dataset is loaded.
            npz_split_col: Name of '.npz' array containing dataset split info.
                If unspecified, then the name "split" is assumed by default.
            npz_allow_pickle: Whether pickle is allowed when loading data
                from the '.npz' archive.
            **kwargs: Additional arguments to pass to the parent class.

        Raises:
            ValueError: If dataset_path is not provided, or
                if .npz file contains data in unexpected format.
        """
        if not dataset_path:
            raise ValueError("`dataset_path` must be provided")
        super().__init__(
            dataset_name=dataset_name,
            dataset_path=(str(dataset_path) if dataset_path is not None else None),
            split=split,
            **kwargs,
        )
        self._npz_allow_pickle = npz_allow_pickle
        self._npz_split_col = npz_split_col

        dataset_path = Path(dataset_path)
        if not dataset_path.is_file():
            raise ValueError(f"Path is not a file! '{dataset_path}'")
        elif dataset_path.suffix.lower() != ".npz":
            raise ValueError(f"File extension is not '.npz'! '{dataset_path}'")

        self._data = self._load_data()

    @staticmethod
    def _to_list(x: np.ndarray) -> list:
        # `pd.DataFrame` expects Python lists for columns
        # (elements can still be `np.ndarray`).
        if len(x.shape) > 1:
            return [x[i, ...] for i in range(x.shape[0])]
        return x.tolist()

    @override
    def _load_data(self) -> pd.DataFrame:
        data_dict: dict[str, np.ndarray] = {}
        if not self.dataset_path:
            raise ValueError("dataset_path is empty!")
        with np.load(
            self.dataset_path, allow_pickle=self._npz_allow_pickle
        ) as npzfile:
            feature_names = sorted(npzfile.files)
            if len(feature_names) == 0:
                raise ValueError(
                    f"'.npz' archive contains no data! '{self.dataset_path}'"
                )
            num_examples = None
            for feature_name in feature_names:
                col_data = npzfile[feature_name]
                assert isinstance(col_data, np.ndarray)
                if num_examples is None:
                    num_examples = col_data.shape[0]
                elif num_examples != col_data.shape[0]:
                    raise ValueError(
                        "Inconsistent number of examples for features "
                        f"'{feature_name}' and '{feature_names[0]}': "
                        f"{col_data.shape[0]} vs {num_examples}!"
                    )
                data_dict[feature_name] = self._to_list(col_data)

        dataframe = pd.DataFrame(data_dict)

        # Optionally filter to the requested split, then drop the split column.
        split_feature_name = (self._npz_split_col or "split") if self.split else None
        if split_feature_name:
            if split_feature_name not in dataframe:
                raise ValueError(
                    f"'.npz' doesn't contain data split info: '{split_feature_name}'!"
                )
            dataframe = pd.DataFrame(
                dataframe[dataframe[split_feature_name] == self.split].drop(
                    split_feature_name, axis=1
                ),
                copy=True,
            )
        return dataframe

    @override
    def transform(self, sample: pd.Series) -> dict:
        """Preprocesses the inputs in the given sample."""
        # Just return the data as a `dict`.
        return sample.to_dict()
```
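To try the class end-to-end, you can first generate a small compatible archive. This sketch uses an illustrative path and toy data; the per-example split array uses the default column name `"split"` that `NpzDataset` filters on:

```python
import numpy as np

# Illustrative path and toy data; all arrays must share the same first dimension.
n = 100
np.savez(
    "/tmp/toy_dataset.npz",
    images=np.random.rand(n, 28, 28).astype(np.float32),
    labels=np.random.randint(0, 10, size=(n,)),
    # Per-example split tags; "split" is the default column `NpzDataset` expects.
    split=np.array(["train"] * 80 + ["test"] * 20),
)
```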
Note

The `.npz` file format can be used to store images, vector fields, financial data, medical/health data, and other data modalities.
To use the custom dataset, add the following section to your `TrainingConfig`:

```yaml
...
data:
  train:
    datasets:
      - dataset_name: "npz_file"  # Custom dataset type defined above for .npz archives
        dataset_path: "/your_dir/mnist.npz"  # Path to your `.npz` archive
        split: "train"
...
```

Note that `dataset_name` must match the name registered via the `@register_dataset("npz_file")` decorator above.
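Alternatively, the class can be instantiated directly in Python. A minimal sketch, assuming the standard map-style `__len__`/`__getitem__` behavior provided by `BaseMapDataset` and the toy archive created earlier:

```python
train_ds = NpzDataset(dataset_path="/tmp/toy_dataset.npz", split="train")
print(len(train_ds))  # Number of "train" examples (80 for the toy archive above).
print(train_ds[0])    # A dict produced by `transform()`, e.g. "images"/"labels" keys.
```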
You can review the ➿ Training CNN on Custom Dataset notebook for a complete example. Additional information is available in → Custom Models.
Using Custom Datasets via the CLI#
See Customizing Oumi to quickly enable your dataset when using the CLI.