Other Datasets#

In addition to the common LLM dataset formats (e.g., Pretraining, SFT, VL-SFT), Oumi infrastructure also allows users to define arbitrary ad-hoc datasets, which can be used not just for text-centric LLMs, but also for alternative model architectures and applications such as vision models (e.g., convolutional networks), scientific computing, etc.

This can be accomplished by defining a subclass of BaseMapDataset or BaseIterableDataset. BaseIterableDataset is a good fit for data that is streamed in (e.g., from a remote source) or for very large datasets (e.g., hundreds of GBs), thanks to its lazy loading behavior, while BaseMapDataset should be the default choice for everything else (e.g., datasets that can be fully loaded into memory).
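
For orientation, here's a rough skeleton of the iterable flavor. The `__iter__`-based interface shown below is an assumption that mirrors the standard Python/PyTorch iterable-dataset pattern; consult the Oumi API reference for BaseIterableDataset's exact abstract methods:

from oumi.core.datasets import BaseIterableDataset


class MyStreamingDataset(BaseIterableDataset):
    """Sketch of a streaming dataset (assumed `__iter__`-based interface)."""

    def __iter__(self):
        # Lazily yield one example at a time, e.g., records read from a
        # remote source or a huge file, without materializing everything
        # in memory.
        for i in range(1_000_000):  # placeholder data source
            yield {"input": i, "label": i % 2}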

To give a concrete example, let’s show how to add support for datasets stored in the NumPy .npz file format:

NumPy Dataset#

The popular numpy library defines the .npy and .npz file formats (see the NumPy documentation for details), which can be used to save arbitrary multi-dimensional arrays (np.ndarray):

  1. A .npy file contains a single np.ndarray.

  2. A .npz file is an archive containing a collection of named np.ndarray objects, with optional data compression (a short save/load example follows below).
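
For reference, here's a minimal save/load round trip for both formats (the file and array names below are arbitrary):

import numpy as np

# Save a single array to a `.npy` file.
arr = np.arange(12).reshape(3, 4)
np.save("single_array.npy", arr)
loaded = np.load("single_array.npy")

# Save multiple named arrays to a `.npz` archive
# (use `np.savez_compressed` to additionally enable compression).
np.savez("archive.npz", images=np.zeros((3, 28, 28)), labels=np.array([0, 1, 2]))
with np.load("archive.npz") as npzfile:
    print(npzfile.files)        # ['images', 'labels']
    images = npzfile["images"]  # arrays are accessed by name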

Adding a New NumPy (.npz) Dataset#

To add a new Oumi dataset that can load data from .npz files, follow these steps:

  1. Subclass BaseMapDataset

  2. Implement the methods to handle initialization, data loading, and data transformation.

Here’s a basic example that shows how to do this:

from pathlib import Path
from typing import Optional, Union
from typing_extensions import override
import numpy as np
import pandas as pd

from oumi.core.datasets import BaseMapDataset
from oumi.core.registry import register_dataset


@register_dataset("npz_file")
class NpzDataset(BaseMapDataset):
    """Loads dataset from Numpy .npz archive."""

    default_dataset = "custom"

    def __init__(
        self,
        *,
        dataset_name: Optional[str] = None,
        dataset_path: Optional[Union[str, Path]] = None,
        split: Optional[str] = None,
        npz_split_col: Optional[str] = None,
        npz_allow_pickle: bool = False,
        **kwargs,
    ) -> None:
        """Initializes a new instance of the NpzDataset class.

        Args:
            dataset_name: Dataset name.
            dataset_path: Path to .npz file.
            split: Dataset split. If unspecified, then the whole dataset is loaded.
            npz_split_col: Name of '.npz' array containing dataset split info.
                If unspecified, then the name "split" is assumed by default.
            npz_allow_pickle: Whether pickle is allowed when loading data from the '.npz' archive.
            **kwargs: Additional arguments to pass to the parent class.
        Raises:
            ValueError: If dataset_path is not provided, or
                if .npz file contains data in unexpected format.
        """
        if not dataset_path:
            raise ValueError("`dataset_path` must be provided")
        super().__init__(
            dataset_name=dataset_name,
            dataset_path=str(dataset_path),
            split=split,
            **kwargs,
        )
        self._npz_allow_pickle = npz_allow_pickle
        self._npz_split_col = npz_split_col

        dataset_path = Path(dataset_path)
        if not dataset_path.is_file():
            raise ValueError(f"Path is not a file! '{dataset_path}'")
        elif dataset_path.suffix.lower() != ".npz":
            raise ValueError(f"File extension is not '.npz'! '{dataset_path}'")

        self._data = self._load_data()

    @staticmethod
    def _to_list(x: np.ndarray) -> list:
        # `pd.DataFrame` expects Python lists for columns (elements can still be `ndarray`)
        if len(x.shape) > 1:
            return [x[i, ...] for i in range(x.shape[0])]
        return x.tolist()

    @override
    def _load_data(self) -> pd.DataFrame:
        data_dict: dict[str, np.ndarray] = {}
        if not self.dataset_path:
            raise ValueError("dataset_path is empty!")
        with np.load(self.dataset_path, allow_pickle=self._npz_allow_pickle) as npzfile:
            feature_names = sorted(npzfile.files)
            if len(feature_names) == 0:
                raise ValueError(
                    f"'.npz' archive contains no data! '{self.dataset_path}'"
                )
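            # Verify that all feature arrays contain the same number of
            # examples (the leading dimension) while building the columns.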
            num_examples = None
            for feature_name in feature_names:
                col_data = npzfile[feature_name]
                assert isinstance(col_data, np.ndarray)
                if num_examples is None:
                    num_examples = col_data.shape[0]
                elif num_examples != col_data.shape[0]:
                    raise ValueError(
                        "Inconsistent number of examples for features "
                        f"'{feature_name}' and '{feature_names[0]}': "
                        f"{col_data.shape[0]} vs {num_examples}!"
                    )
                data_dict[feature_name] = self._to_list(col_data)

        dataframe = pd.DataFrame(data_dict)

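        # If a specific split was requested, keep only its rows and drop the split column.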
        split_feature_name = (self._npz_split_col or "split") if self.split else None
        if split_feature_name:
            if split_feature_name not in dataframe:
                raise ValueError(
                    f"'.npz' doesn't contain data split info: '{split_feature_name}'!"
                )
            dataframe = pd.DataFrame(
                dataframe[dataframe[split_feature_name] == self.split].drop(
                    split_feature_name, axis=1
                ),
                copy=True,
            )
        return dataframe

    @override
    def transform(self, sample: pd.Series) -> dict:
        """Preprocesses the inputs in the given sample."""
        # Just return the data as a `dict`.
        return sample.to_dict()

Note

The .npz file format can be used to store images, vector fields, financial or medical/health data, and many other data types.
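
To try the dataset end to end, you can generate a small `.npz` archive with the layout NpzDataset expects: equal-length feature arrays, plus a "split" array if you want per-split filtering. The file and feature names below are purely illustrative:

import numpy as np

num_examples = 100
np.savez(
    "/your_dir/toy_dataset.npz",
    # Two equal-length feature arrays (they become DataFrame columns).
    images=np.random.rand(num_examples, 28, 28).astype(np.float32),
    labels=np.random.randint(0, 10, size=num_examples),
    # Per-example split labels; matches the default `npz_split_col` name.
    split=np.array(["train"] * 80 + ["test"] * 20),
)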

To use the custom dataset, add the following section to your TrainingConfig:

...
data:
  train:
    datasets:
      - dataset_name: "npz_file" # Custom dataset type defined above for .npz archives
        dataset_path: "/your_dir/mnist.npz" # Path to your `.npz` archive
        split: "train"
...
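
Because the dataset class is registered, the trainer can construct it from the config above. You can also instantiate it directly in Python to sanity-check an archive first (using the toy archive from the snippet above; this assumes BaseMapDataset exposes the standard map-style `__len__`/`__getitem__` interface):

dataset = NpzDataset(
    dataset_name="npz_file",
    dataset_path="/your_dir/toy_dataset.npz",
    split="train",
)
print(len(dataset))  # 80 "train" examples
print(dataset[0])    # first example, transformed into a `dict`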

You can review the ➿ Training CNN on Custom Dataset notebook for a complete example. Additional information is available in → Custom Models.

Using Custom Datasets via the CLI#

See Customizing Oumi to quickly enable your dataset when using the CLI.