oumi.builders#

Builders module for the Oumi (Open Universal Machine Intelligence) library.

This module provides builder functions to construct and configure different components of the Oumi framework, including datasets, models, optimizers, and trainers.

The builder functions encapsulate the complexity of creating these components, allowing for easier setup and configuration of machine learning experiments.

oumi.builders.build_chat_template(template_name: str) → str[source]#

Builds a chat template based on its code name.

Parameters:

template_name – the code name describing the chat template.

Raises:

FileNotFoundError – if the requested template file does not exist.

Returns:

a Jinja-based chat template.

Return type:

str
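
A minimal usage sketch (the template name "chat_ml" is illustrative and may not ship with your installation):

>>> from oumi.builders import build_chat_template
>>> template = build_chat_template("chat_ml")  # raises FileNotFoundError if the template file is missing
>>> tokenizer.chat_template = template  # attach to a previously built tokenizer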

oumi.builders.build_collator_from_config(config: TrainingConfig, tokenizer: PreTrainedTokenizerBase | None) → Callable | None[source]#

Creates a data collator if specified in the config.
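
A minimal sketch (assumes a TrainingConfig loaded elsewhere; accessing config.model follows Oumi's config schema):

>>> from oumi.builders import build_collator_from_config, build_tokenizer
>>> tokenizer = build_tokenizer(config.model)
>>> collator = build_collator_from_config(config, tokenizer)
>>> collator is None  # True when the config does not specify a collator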

oumi.builders.build_data_collator(collator_name: str, tokenizer: PreTrainedTokenizerBase, *, max_length: int | None, label_ignore_index: int | None = -100, **kwargs) → Callable[source]#

Builds a data collator based on the given collator name.

Parameters:
  • collator_name

    The name of the collator to build. Supported values are:

    • "text_with_padding": Uses TextCollatorWithPadding.

    • "vision_language_with_padding": Uses VisionLanguageCollatorWithPadding.

  • tokenizer – A tokenizer.

  • max_length – An optional maximum sequence length.

  • label_ignore_index – If set, then label values of tokens that shouldn’t contribute to the loss computation will be replaced by this special value. For example, this can be PAD or image tokens. The PyTorch convention is to use -100 as the ignore_index label. Refer to the ignore_index parameter of torch.nn.CrossEntropyLoss() for more details.

  • **kwargs – Additional keyword arguments to pass to the collator constructor.

Returns:

The data collator function or class.

Return type:

Callable

Raises:

ValueError – If an unsupported collator name is provided.
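
A minimal sketch (the max_length value is illustrative; assumes a tokenizer built earlier):

>>> from oumi.builders import build_data_collator
>>> collator = build_data_collator(
...     "text_with_padding",
...     tokenizer,
...     max_length=2048,
...     label_ignore_index=-100,
... )
>>> batch = collator(examples)  # pads a list of tokenized examples into a batch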

oumi.builders.build_dataset(dataset_name: str, tokenizer: PreTrainedTokenizerBase | None, seed: int | None = None, stream: bool = False, pack: bool = False, use_torchdata: bool | None = None, **kwargs) → ConstantLengthDataset | DatasetType | PretrainingAsyncTextDataset[source]#

Builds a dataset from a dataset name.

Please refer to DatasetParams & DatasetSplitParams for a description of all the arguments.
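
A minimal sketch (the dataset name is illustrative; assumes a tokenizer built earlier):

>>> from oumi.builders import build_dataset
>>> dataset = build_dataset("yahma/alpaca-cleaned", tokenizer, seed=42, stream=False)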

oumi.builders.build_dataset_from_params(dataset_params: DatasetParams, tokenizer: PreTrainedTokenizerBase | None, seed: int | None = None, stream: bool = False, pack: bool = False, use_torchdata: bool | None = None) → ConstantLengthDataset | DatasetType | PretrainingAsyncTextDataset[source]#

Builds a dataset from a dataset params object.

Please refer to DatasetParams & DatasetSplitParams for a description of all the arguments.
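
A minimal sketch (the import path for DatasetParams and the dataset name are assumptions):

>>> from oumi.core.configs import DatasetParams  # assumed import path
>>> params = DatasetParams(dataset_name="yahma/alpaca-cleaned")  # illustrative name
>>> dataset = build_dataset_from_params(params, tokenizer, seed=42)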

oumi.builders.build_dataset_mixture(config: TrainingConfig, tokenizer: PreTrainedTokenizerBase | None, dataset_split: DatasetSplit, seed: int | None = None) → ConstantLengthDataset | DatasetType | PretrainingAsyncTextDataset[source]#

Builds a dataset for the specified split.

Parameters:
  • config – The training config.

  • tokenizer – The tokenizer object to use for preprocessing.

  • dataset_split – The split of the dataset to load.

  • seed – If specified, a seed used for random sampling.

Returns:

The built dataset for dataset_split.

Return type:

dataset
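
A minimal sketch (the import path for DatasetSplit is an assumption):

>>> from oumi.core.configs import DatasetSplit  # assumed import path
>>> train_ds = build_dataset_mixture(config, tokenizer, DatasetSplit.TRAIN, seed=42)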

oumi.builders.build_metrics_function(config: TrainingParams) → Callable | None[source]#

Builds the metrics function.
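
A minimal sketch (assumes config.training is the TrainingParams instance of a TrainingConfig):

>>> metrics_fn = build_metrics_function(config.training)
>>> metrics_fn is None  # True if no metrics function is configured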

oumi.builders.build_model(model_params: ModelParams, peft_params: PeftParams | None = None, **kwargs) → Module[source]#

Builds and returns a model based on the provided Oumi configuration.

Parameters:
  • model_params – The model parameters.

  • peft_params – The PEFT parameters.

  • kwargs (dict, optional) – Additional keyword arguments for model loading.

Returns:

The built model.

Return type:

model
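
A minimal sketch (the model name and import path are illustrative):

>>> from oumi.core.configs import ModelParams  # assumed import path
>>> model_params = ModelParams(model_name="microsoft/Phi-3-mini-4k-instruct")  # illustrative name
>>> model = build_model(model_params)  # returns a torch.nn.Module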

oumi.builders.build_optimizer(model: Module, config: TrainingParams) → Optimizer[source]#

Builds and returns a PyTorch optimizer based on the provided configuration.

See the PyTorch documentation for more information on available optimizers: https://pytorch.org/docs/stable/optim.html

Parameters:
  • model – The model whose parameters will be optimized.

  • config – The configuration object containing optimizer parameters.

Returns:

The constructed PyTorch optimizer.

Return type:

Optimizer
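
A minimal sketch (assumes a model and a TrainingConfig built earlier; config.training is its TrainingParams):

>>> optimizer = build_optimizer(model, config.training)
>>> optimizer.step()  # standard torch.optim.Optimizer API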

oumi.builders.build_peft_model(base_model, use_gradient_checkpointing: bool, peft_params: PeftParams)[source]#

Builds a PEFT model based on the given base model and params.

Parameters:
  • base_model – The base model to build the PEFT model on.

  • use_gradient_checkpointing – Enable/disable gradient checkpointing.

  • peft_params – The desired params for LoRA.

Returns:

The built PEFT model.
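
A minimal sketch (assumes the TrainingConfig exposes a peft field holding PeftParams):

>>> peft_model = build_peft_model(
...     base_model,
...     use_gradient_checkpointing=True,
...     peft_params=config.peft,
... )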

oumi.builders.build_processor(processor_name: str, tokenizer: PreTrainedTokenizerBase, *, trust_remote_code: bool = False) → BaseProcessor[source]#

Builds a processor.

Parameters:
  • processor_name – The name of the processor (usually the same as the model name).

  • tokenizer – A tokenizer to use with the processor.

  • trust_remote_code – Whether to allow loading remote code for this processor. Some processors come with downloadable executable Python files, which can be a potential security risk unless they come from a trusted source.

Returns:

The newly created processor.

Return type:

BaseProcessor
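
A minimal sketch (the processor name is illustrative; assumes a tokenizer built earlier):

>>> processor = build_processor(
...     "llava-hf/llava-1.5-7b-hf",  # illustrative name
...     tokenizer,
...     trust_remote_code=False,
... )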

oumi.builders.build_tokenizer(model_params: ModelParams) → PreTrainedTokenizer | PreTrainedTokenizerFast[source]#

Builds and returns a tokenizer based on the provided Oumi configuration.

Parameters:

model_params (ModelParams) – The model parameters.

Returns:

The tokenizer object built from the configuration.

Return type:

tokenizer
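
A minimal sketch (the model name is illustrative):

>>> from oumi.core.configs import ModelParams  # assumed import path
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> tokenizer("Hello, world!")  # standard Hugging Face tokenizer API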

oumi.builders.build_trainer(trainer_type: TrainerType, processor: BaseProcessor | None) → Callable[[...], BaseTrainer][source]#

Builds a trainer creator functor based on the provided configuration.

Parameters:
  • trainer_type (TrainerType) – Enum indicating the type of training.

  • processor – An optional processor.

Returns:

A builder function that can create an appropriate trainer based on the trainer type specified in the configuration. All function arguments supplied by caller are forwarded to the trainer’s constructor.

Raises:

NotImplementedError – If the trainer type specified in the configuration is not supported.
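
A minimal sketch (the TrainerType value and the forwarded constructor arguments are illustrative):

>>> from oumi.core.configs import TrainerType  # assumed import path
>>> create_trainer = build_trainer(TrainerType.TRL_SFT, processor=None)
>>> trainer = create_trainer(model=model, args=training_args, train_dataset=train_ds)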

oumi.builders.build_training_callbacks(config: TrainingConfig, model: Module, profiler: Any | None) → list[TrainerCallback][source]#

Builds the training callbacks for the given training config and model.

This function creates a list of callback objects to be used during training. It includes callbacks for performance metrics, profiling, telemetry, and Model Flops Utilization (MFU) logging based on the provided configuration.

Parameters:
  • config – The training configuration object.

  • model – The PyTorch model being trained. This is needed to calculate the number of parameters for MFU (Model Flops Utilization) logging, and to determine the model’s data type for accurate MFU calculations.

  • profiler – The profiler object, if profiling is enabled.

Returns:

A list of callback objects to be used during training.

Return type:

list[TrainerCallback]

Note

  • MFU logging is only supported on GPU and is skipped for PEFT models or training with non-packed datasets.
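
A minimal sketch (assumes a TrainingConfig and model built earlier; profiling disabled):

>>> callbacks = build_training_callbacks(config, model, profiler=None)
>>> len(callbacks)  # the set of active callbacks depends on the config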

oumi.builders.is_image_text_llm(model_params: ModelParams) → bool[source]#

Determines whether the model is a basic image+text LLM.
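
A minimal sketch (assumes ModelParams built earlier; the expected result is an assumption):

>>> is_image_text_llm(model_params)  # e.g., expected True for LLaVA-style models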