oumi.core.synthesis

Submodules

oumi.core.synthesis.dataset_ingestion module

class oumi.core.synthesis.dataset_ingestion.DatasetPath(path: str)

Bases: object

Path to a dataset in some storage location.

get_file_extension() → str

Get the file extension.

get_path_str() → str

Get the path.

get_storage_type() → DatasetStorageType

Get the storage type.

class oumi.core.synthesis.dataset_ingestion.DatasetReader

Bases: object

Reads a dataset from some storage location.

Supports:

  • HuggingFace

  • Local files (JSONL, CSV, TSV, Parquet, JSON)

  • Glob patterns

read(data_source: DatasetSource) → list[dict]

Read the data from the given data source.
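As a rough illustration of what reading local files by extension and glob pattern can involve, here is a standalone sketch that uses only the Python standard library. It is not oumi's implementation: `read_local_records` is a hypothetical helper, Parquet is left out because it needs a third-party reader, and the real `read` method takes a `DatasetSource` rather than a string pattern.

```python
import csv
import glob
import json
from pathlib import Path


def read_local_records(pattern: str) -> list[dict]:
    """Hypothetical helper: load every file matching a glob pattern into a list of dicts."""
    records: list[dict] = []
    for file_name in sorted(glob.glob(pattern)):
        suffix = Path(file_name).suffix.lower()
        with open(file_name, newline="", encoding="utf-8") as f:
            if suffix == ".jsonl":
                # One JSON object per non-empty line.
                records.extend(json.loads(line) for line in f if line.strip())
            elif suffix == ".json":
                data = json.load(f)
                # Assumption: a JSON file holds either one record or a list of records.
                records.extend(data if isinstance(data, list) else [data])
            elif suffix in (".csv", ".tsv"):
                delimiter = "\t" if suffix == ".tsv" else ","
                records.extend(dict(row) for row in csv.DictReader(f, delimiter=delimiter))
            else:
                raise ValueError(f"Unsupported file extension: {suffix}")
    return records
```

Note that in this sketch CSV and TSV values come back as strings, while JSON and JSONL values keep their parsed types.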

class oumi.core.synthesis.dataset_ingestion.DatasetStorageType(value)

Bases: Enum

Storage location for a dataset (local, HuggingFace, etc.).

HF = 'hf'

HuggingFace

LOCAL = 'local'

Local files
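To show how an enum with these two members behaves, the sketch below redefines it as a standalone class instead of importing it from oumi. Only the member names and values come from this page; the rest is standard `enum` behavior.

```python
from enum import Enum


class DatasetStorageType(Enum):
    """Standalone mirror of the documented members, for illustration only."""

    HF = "hf"        # HuggingFace
    LOCAL = "local"  # Local files


# Look up a member from its string value, or read the value back.
storage = DatasetStorageType("hf")
print(storage is DatasetStorageType.HF)  # True
print(DatasetStorageType.LOCAL.value)    # local
```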

oumi.core.synthesis.planner module

class oumi.core.synthesis.planner.DatasetPlanner

Bases: object

plan(synthesis_params: GeneralSynthesisParams, sample_count: int) → list[dict]

Set up the dataset’s attributes for inference.

This function creates a list of dictionaries. Each dictionary represents one sample of the dataset and holds one value for each attribute.

  • Dataset sources are used to populate the dataset plan with values for the attributes, with each sample of a dataset source being used round-robin.

  • Permutable attributes have their values sampled from a distribution.

  • Combination sampling overrides the distribution for particular attribute value combinations.

The final list of dictionaries will be used to create a dataset.

Parameters:
  • synthesis_params – The synthesis parameters.

  • sample_count – The number of samples to plan.

Returns:

A list of dictionaries, each representing a sample of the dataset with the attribute values for each attribute.
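The round-robin and distribution-sampling behavior described for `plan` can be sketched in plain Python. This is a simplified illustration under stated assumptions, not oumi's implementation: `plan_samples`, its arguments, and the weight format are hypothetical, and combination sampling is not covered.

```python
import random


def plan_samples(
    source_rows: list[dict],
    attribute_weights: dict[str, dict[str, float]],
    sample_count: int,
    seed: int = 0,
) -> list[dict]:
    """Hypothetical sketch of a dataset plan: one dict of attribute values per sample."""
    rng = random.Random(seed)
    plan: list[dict] = []
    for i in range(sample_count):
        sample: dict = {}
        if source_rows:
            # Round-robin: wrap around once every source row has been used.
            sample.update(source_rows[i % len(source_rows)])
        for attribute, weights in attribute_weights.items():
            values = list(weights)
            # Draw one value per attribute according to its weights.
            sample[attribute] = rng.choices(values, weights=[weights[v] for v in values])[0]
        plan.append(sample)
    return plan


rows = [{"topic": "math"}, {"topic": "history"}]
weights = {"difficulty": {"easy": 0.7, "hard": 0.3}}
plan = plan_samples(rows, weights, sample_count=5)
print([s["topic"] for s in plan])  # ['math', 'history', 'math', 'history', 'math']
```

With two source rows and five planned samples, the rows repeat in order, while each sample's `difficulty` is drawn independently from the weighted values.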