oumi.core.configs.params#
Submodules#
oumi.core.configs.params.base_params module#
- class oumi.core.configs.params.base_params.BaseParams[source]#
Bases:
object
Base class for all parameter classes.
This class provides a common interface for all parameter classes, including a finalize_and_validate method that recursively validates the parameters.
Subclasses should implement the __finalize_and_validate__ method to perform custom validation logic.
- __finalize_and_validate__() None [source]#
Finalizes and validates the parameters of this object.
This method can be overridden by subclasses to implement custom validation logic.
In case of validation errors, this method should raise a ValueError or other appropriate exception.
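A minimal sketch (not from the Oumi docs) of how a subclass might implement this hook, assuming the same dataclass style used by the other param classes in this module; MyParams and its num_shards field are hypothetical.

from dataclasses import dataclass

from oumi.core.configs.params.base_params import BaseParams

@dataclass
class MyParams(BaseParams):
    # Hypothetical field, for illustration only.
    num_shards: int = 1

    def __finalize_and_validate__(self) -> None:
        # Raise a ValueError (or another appropriate exception) on invalid values.
        if self.num_shards <= 0:
            raise ValueError("num_shards must be positive.")

params = MyParams(num_shards=4)
params.finalize_and_validate()  # Recursively validates this object and any nested params.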
oumi.core.configs.params.data_params module#
- class oumi.core.configs.params.data_params.DataParams(train: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>, test: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>, validation: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>)[source]#
Bases:
BaseParams
- get_split(split: DatasetSplit) DatasetSplitParams [source]#
A public getter for individual dataset splits.
- test: DatasetSplitParams#
The input datasets used for testing.
- train: DatasetSplitParams#
The input datasets used for training.
- validation: DatasetSplitParams#
The input datasets used for validation.
- class oumi.core.configs.params.data_params.DatasetParams(dataset_name: str = '???', dataset_path: Optional[str] = None, subset: Optional[str] = None, split: str = 'train', dataset_kwargs: dict[str, typing.Any] = <factory>, sample_count: Optional[int] = None, mixture_proportion: Optional[float] = None, shuffle: bool = False, seed: Optional[int] = None, shuffle_buffer_size: int = 1000, trust_remote_code: bool = False, transform_num_workers: Union[int, str, NoneType] = None)[source]#
Bases:
BaseParams
- dataset_kwargs: dict[str, Any]#
Keyword arguments to pass to the dataset constructor.
These arguments will be passed directly to the dataset constructor.
- dataset_name: str = '???'#
The name of the dataset to load. Required.
This field is used to retrieve the appropriate class from the dataset registry that can be used to instantiate and preprocess the data.
If dataset_path is not specified, then the raw data will be automatically downloaded from the huggingface hub or oumi registry. Otherwise, the dataset will be loaded from the specified dataset_path.
- dataset_path: str | None = None#
The path to the dataset to load.
This can be used to load a dataset of type dataset_name from a custom path.
If dataset_path is not specified, then the raw data will be automatically downloaded from the huggingface hub or oumi registry.
- mixture_proportion: float | None = None#
The proportion of examples from this dataset relative to other datasets in the mixture.
If specified, all datasets must supply this value. Must be a float in the range [0, 1.0]. The mixture_proportion for all input datasets must sum to 1.
Examples are sampled after the dataset has been sampled using sample_count if specified.
- sample_count: int | None = None#
The number of examples to sample from the dataset.
Must be non-negative. If sample_count is larger than the size of the dataset, then the required additional examples are sampled by looping over the original dataset.
- seed: int | None = None#
The random seed used for shuffling the dataset before sampling.
If set to None, shuffling will be non-deterministic.
- shuffle: bool = False#
Whether to shuffle the dataset before any sampling occurs.
- shuffle_buffer_size: int = 1000#
The size of the shuffle buffer used for shuffling the dataset before sampling.
- split: str = 'train'#
The split of the dataset to load.
This is typically one of “train”, “test”, or “validation”. Defaults to “train”.
- subset: str | None = None#
The subset of the dataset to load.
This is usually a subfolder within the dataset root.
- transform_num_workers: int | str | None = None#
Number of subprocesses to use for dataset post-processing (ds.transform()).
Multiprocessing is disabled by default (None).
You can also use the special value “auto” to let oumi automatically select the number of subprocesses.
Using multiple processes can speed-up processing e.g., for large or multi-modal datasets.
The parameter is only supported for Map (non-iterable) datasets.
- trust_remote_code: bool = False#
Whether to trust remote code when loading the dataset.
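A minimal sketch of a single dataset configuration built from the fields above; the dataset name is an illustrative placeholder, and the sample count and seed are arbitrary.

from oumi.core.configs.params.data_params import DatasetParams

dataset = DatasetParams(
    dataset_name="HuggingFaceH4/ultrachat_200k",  # Placeholder dataset id.
    split="train",
    shuffle=True,
    seed=42,              # Makes the pre-sampling shuffle deterministic.
    sample_count=10_000,  # Loops over the data if the split is smaller than this.
)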
- class oumi.core.configs.params.data_params.DatasetSplit(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Enum representing the split for a dataset.
- TEST = 'test'#
- TRAIN = 'train'#
- VALIDATION = 'validation'#
- class oumi.core.configs.params.data_params.DatasetSplitParams(datasets: list[oumi.core.configs.params.data_params.DatasetParams] = <factory>, collator_name: Optional[str] = None, pack: bool = False, stream: bool = False, target_col: Optional[str] = None, mixture_strategy: str = 'first_exhausted', seed: Optional[int] = None, use_async_dataset: bool = False, use_torchdata: Optional[bool] = None)[source]#
Bases:
BaseParams
- collator_name: str | None = None#
Name of Oumi data collator.
Data collator controls how to form a mini-batch from individual dataset elements.
Valid options are:
- “text_with_padding”: Dynamically pads the inputs received to the longest length.
- “vision_language_with_padding”: Uses VisionLanguageCollator for image+text multi-modal data.
If None, then a default collator will be assigned.
- datasets: list[DatasetParams]#
The input datasets used for training.
This will later be split into train, test, and validation.
- mixture_strategy: str = 'first_exhausted'#
The strategy for mixing multiple datasets.
When multiple datasets are provided, this parameter determines how they are combined. Two strategies are available:
FIRST_EXHAUSTED: Samples from all datasets until one is fully represented in the mixture. This is the default strategy.
ALL_EXHAUSTED: Samples from all datasets until each one is fully represented in the mixture. This may lead to significant oversampling.
- pack: bool = False#
Whether to pack the text into constant-length chunks.
Each chunk will be the size of the model’s max input length. This will stream the dataset, and tokenize on the fly if the dataset isn’t already tokenized (i.e. has an input_ids column).
- seed: int | None = None#
The random seed used for mixing this dataset split, if specified.
If set to None, mixing will be non-deterministic.
- stream: bool = False#
Whether to stream the dataset.
- target_col: str | None = None#
The dataset column name containing the input for training/testing/validation.
- Deprecated:
This parameter is deprecated and will be removed in the future.
- use_async_dataset: bool = False#
Whether to use the PretrainingAsyncTextDataset instead of ConstantLengthDataset.
- Deprecated:
This parameter is deprecated and will be removed in the future.
- use_torchdata: bool | None = None#
Whether to use the torchdata library for dataset loading and processing.
If set to None, this setting may be auto-inferred.
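A minimal sketch of a two-dataset mixture using the fields documented above; the dataset names are placeholders, and the proportions are chosen to sum to 1 as required.

from oumi.core.configs.params.data_params import (
    DataParams,
    DatasetParams,
    DatasetSplitParams,
)

train_split = DatasetSplitParams(
    datasets=[
        DatasetParams(dataset_name="dataset_a", mixture_proportion=0.7),
        DatasetParams(dataset_name="dataset_b", mixture_proportion=0.3),
    ],
    mixture_strategy="all_exhausted",  # Sample until every dataset is fully used.
    seed=123,                          # Deterministic mixing.
)

data = DataParams(train=train_split)
data.finalize_and_validate()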
- class oumi.core.configs.params.data_params.MixtureStrategy(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
Enum representing the supported mixture strategies for datasets.
- ALL_EXHAUSTED = 'all_exhausted'#
- FIRST_EXHAUSTED = 'first_exhausted'#
oumi.core.configs.params.evaluation_params module#
- class oumi.core.configs.params.evaluation_params.AlpacaEvalTaskParams(evaluation_platform: str = '???', task_name: str | None = None, num_samples: int | None = None, eval_kwargs: dict[str, ~typing.Any] = <factory>, version: float | None = 2.0)[source]#
Bases:
EvaluationTaskParams
Parameters for the AlpacaEval evaluation framework.
AlpacaEval is an LLM-based automatic evaluation suite that is fast, cheap, replicable, and validated against 20K human annotations. The latest version (AlpacaEval 2.0) contains 805 prompts (tatsu-lab/alpaca_eval), which are open-ended questions. A model annotator (judge) evaluates the quality of the model’s responses to these questions and calculates win rates against reference responses. The default judge is GPT4 Turbo.
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- version: float | None = 2.0#
The version of AlpacaEval to use. Options: 1.0 or 2.0 (default).
- class oumi.core.configs.params.evaluation_params.CustomEvaluationParams(data: ~oumi.core.configs.params.data_params.DatasetSplitParams = <factory>)[source]#
Bases:
BaseParams
Parameters for running custom evaluations.
- data: DatasetSplitParams#
Parameters for the dataset split to be used in evaluation.
This includes specifications for train, validation, and test splits, as well as any data preprocessing parameters.
- class oumi.core.configs.params.evaluation_params.EvaluationPlatform(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Enum representing the evaluation platform to use.
- ALPACA_EVAL = 'alpaca_eval'#
- LM_HARNESS = 'lm_harness'#
- class oumi.core.configs.params.evaluation_params.EvaluationTaskParams(evaluation_platform: str = '???', task_name: str | None = None, num_samples: int | None = None, eval_kwargs: dict[str, ~typing.Any] = <factory>)[source]#
Bases:
BaseParams
Configuration parameters for model evaluation tasks.
Supported platforms:
LM Harness: Framework for evaluating language models on standard benchmarks. A list of all supported tasks can be found at: EleutherAI/lm-evaluation-harness.
Alpaca Eval: Framework for evaluating language models on instruction-following and quality of responses on open-ended questions.
Examples
# LM Harness evaluation on MMLU
params = EvaluationTaskParams(
    evaluation_platform="lm_harness",
    task_name="mmlu",
    eval_kwargs={"num_fewshot": 5},
)

# Alpaca Eval 2.0 evaluation
params = EvaluationTaskParams(
    evaluation_platform="alpaca_eval",
)
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- evaluation_platform: str = '???'#
The evaluation platform to use for the current task.
- get_evaluation_platform() EvaluationPlatform [source]#
Returns the evaluation platform as an Enum.
- get_evaluation_platform_task_params()[source]#
Returns the evaluation platform-specific task parameters.
- static list_evaluation_platforms() str [source]#
Returns a string listing all available evaluation platforms.
- num_samples: int | None = None#
Number of samples/examples to evaluate from this dataset.
Mostly for debugging, in order to reduce the runtime. If not set (None): the entire dataset is evaluated. If set, this must be a positive integer.
- task_name: str | None = None#
The task to evaluate.
- class oumi.core.configs.params.evaluation_params.LMHarnessTaskParams(evaluation_platform: str = '???', task_name: str | None = None, num_samples: int | None = None, eval_kwargs: dict[str, ~typing.Any] = <factory>, num_fewshot: int | None = None)[source]#
Bases:
EvaluationTaskParams
Parameters for the LM Harness evaluation framework.
LM Harness is a comprehensive benchmarking suite for evaluating language models across various tasks.
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- num_fewshot: int | None = None#
Number of few-shot examples (with responses) to add in the prompt, in order to teach the model how to respond to the specific dataset’s prompts.
If not set (None): LM Harness will decide the value. If set to 0: no few-shot examples will be added in the prompt.
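A minimal sketch of one task per evaluation platform, using only the fields documented above; the sample count is arbitrary and the task name follows LM Harness naming.

from oumi.core.configs.params.evaluation_params import (
    AlpacaEvalTaskParams,
    LMHarnessTaskParams,
)

# LM Harness: 5-shot MMLU on a 200-sample subset.
lm_harness_task = LMHarnessTaskParams(
    evaluation_platform="lm_harness",
    task_name="mmlu",
    num_samples=200,
    num_fewshot=5,
)

# AlpacaEval 2.0 (the default version) on the full prompt set.
alpaca_task = AlpacaEvalTaskParams(
    evaluation_platform="alpaca_eval",
    version=2.0,
)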
oumi.core.configs.params.fsdp_params module#
- class oumi.core.configs.params.fsdp_params.AutoWrapPolicy(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
The auto wrap policies for FullyShardedDataParallel (FSDP).
- NO_WRAP = 'NO_WRAP'#
No automatic wrapping is performed.
- SIZE_BASED_WRAP = 'SIZE_BASED_WRAP'#
Wraps layers based on parameter count.
- TRANSFORMER_BASED_WRAP = 'TRANSFORMER_BASED_WRAP'#
Wraps layers based on the transformer block layer.
- class oumi.core.configs.params.fsdp_params.BackwardPrefetch(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
The backward prefetch options for FullyShardedDataParallel (FSDP).
- BACKWARD_POST = 'BACKWARD_POST'#
Enables less overlap but requires less memory usage.
- BACKWARD_PRE = 'BACKWARD_PRE'#
Enables the most overlap but increases memory usage the most.
- NO_PREFETCH = 'NO_PREFETCH'#
Disables backward prefetching altogether.
- class oumi.core.configs.params.fsdp_params.FSDPParams(enable_fsdp: bool = False, sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD, cpu_offload: bool = False, mixed_precision: str | None = None, backward_prefetch: BackwardPrefetch = BackwardPrefetch.BACKWARD_PRE, forward_prefetch: bool = False, use_orig_params: bool | None = None, state_dict_type: StateDictType = StateDictType.FULL_STATE_DICT, auto_wrap_policy: AutoWrapPolicy = AutoWrapPolicy.NO_WRAP, min_num_params: int = 100000, transformer_layer_cls: str | None = None, sync_module_states: bool = True)[source]#
Bases:
BaseParams
Configuration options for Pytorch’s FullyShardedDataParallel (FSDP) training.
- auto_wrap_policy: AutoWrapPolicy = 'NO_WRAP'#
Policy for automatically wrapping layers in FSDP.
- backward_prefetch: BackwardPrefetch = 'BACKWARD_PRE'#
Determines when to prefetch the next set of parameters.
Improves throughput by enabling communication and computation overlap in the backward pass at the cost of slightly increased memory usage.
- Options:
- BACKWARD_PRE: Enables the most overlap but increases memory usage the most. This prefetches the next set of parameters before the current set of parameters’ gradient computation.
- BACKWARD_POST: Enables less overlap but requires less memory usage. This prefetches the next set of parameters after the current set of parameters’ gradient computation.
- NO_PREFETCH: Disables backward prefetching altogether. This has no overlap and does not increase memory usage. This may degrade throughput significantly.
- cpu_offload: bool = False#
If True, offloads parameters and gradients to CPU when not in use.
- enable_fsdp: bool = False#
If True, enables FullyShardedDataParallel training.
Allows training larger models by sharding models and gradients across multiple GPUs.
- forward_prefetch: bool = False#
If True, prefetches the forward pass results.
- min_num_params: int = 100000#
Minimum number of parameters for a layer to be wrapped when using size_based policy. This has no effect when using transformer_based policy.
- mixed_precision: str | None = None#
Enables mixed precision training.
Options: None, “fp16”, “bf16”.
- sharding_strategy: ShardingStrategy = 'FULL_SHARD'#
Determines how to shard model parameters across GPUs.
See torch.distributed.fsdp.api.ShardingStrategy for more details.
- Options:
- FULL_SHARD: Shards model parameters, gradients, and optimizer states. Provides the most memory efficiency but may impact performance.
- SHARD_GRAD_OP: Shards gradients and optimizer states, but not model parameters. Balances memory savings and performance.
- HYBRID_SHARD: Shards model parameters within a node and replicates them across nodes.
- NO_SHARD: No sharding is applied. Parameters, gradients, and optimizer states are kept in full on each GPU.
- HYBRID_SHARD_ZERO2: Apply SHARD_GRAD_OP within a node, and replicate parameters across nodes.
Warning
The NO_SHARD option is deprecated and will be removed in a future release. Please use DistributedDataParallel (DDP) instead.
- state_dict_type: StateDictType = 'FULL_STATE_DICT'#
Specifies the type of state dict to use for checkpointing.
- sync_module_states: bool = True#
If True, synchronizes module states across processes.
When enabled, each FSDP module broadcasts parameters and buffers from rank 0 to ensure replication across ranks.
- transformer_layer_cls: str | None = None#
Class name for transformer layers when using transformer_based policy.
This has no effect when using size_based policy.
- use_orig_params: bool | None = None#
If True, uses the PyTorch Module’s original parameters for FSDP.
For more information, see: https://pytorch.org/docs/stable/fsdp.html. If not specified, it will be automatically inferred based on other config values.
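A minimal sketch of an FSDP configuration built from the enums above; the transformer layer class name is a placeholder that depends on the model architecture.

from oumi.core.configs.params.fsdp_params import (
    AutoWrapPolicy,
    BackwardPrefetch,
    FSDPParams,
    ShardingStrategy,
)

fsdp = FSDPParams(
    enable_fsdp=True,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # Shard within a node, replicate across nodes.
    auto_wrap_policy=AutoWrapPolicy.TRANSFORMER_BASED_WRAP,
    transformer_layer_cls="LlamaDecoderLayer",        # Placeholder transformer block class.
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # Most overlap, most memory.
    mixed_precision="bf16",
    cpu_offload=False,
)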
- class oumi.core.configs.params.fsdp_params.ShardingStrategy(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
The sharding strategies for FullyShardedDataParallel (FSDP).
See torch.distributed.fsdp.ShardingStrategy for more details.
- FULL_SHARD = 'FULL_SHARD'#
Shards model parameters, gradients, and optimizer states. Provides the most memory efficiency but may impact performance.
- HYBRID_SHARD = 'HYBRID_SHARD'#
Shards model parameters within a node and replicates them across nodes.
- HYBRID_SHARD_ZERO2 = 'HYBRID_SHARD_ZERO2'#
Apply SHARD_GRAD_OP within a node, and replicate parameters across nodes.
- NO_SHARD = 'NO_SHARD'#
No sharding is applied. Parameters, gradients, and optimizer states are kept in full on each GPU.
- SHARD_GRAD_OP = 'SHARD_GRAD_OP'#
Shards gradients and optimizer states, but not model parameters. Balances memory savings and performance.
- class oumi.core.configs.params.fsdp_params.StateDictType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
The supported state dict types for FullyShardedDataParallel (FSDP).
This controls how the model’s state dict will be saved during checkpointing, and how it can be consumed afterwards.
- FULL_STATE_DICT = 'FULL_STATE_DICT'#
The state dict will be saved in a non-sharded, unflattened format.
This is similar to checkpointing without FSDP.
- LOCAL_STATE_DICT = 'LOCAL_STATE_DICT'#
The state dict will be saved in a sharded, flattened format.
Since it’s flattened, this can only be used by FSDP.
- SHARDED_STATE_DICT = 'SHARDED_STATE_DICT'#
The state dict will be saved in a sharded, unflattened format.
This can be used by other parallel schemes.
oumi.core.configs.params.generation_params module#
- class oumi.core.configs.params.generation_params.GenerationParams(max_new_tokens: int = 256, batch_size: Optional[int] = 1, exclude_prompt_from_response: bool = True, seed: Optional[int] = None, temperature: float = 0.0, top_p: float = 1.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, stop_strings: Optional[list[str]] = None, stop_token_ids: Optional[list[int]] = None, logit_bias: dict[typing.Any, float] = <factory>, min_p: float = 0.0, use_cache: bool = False, num_beams: int = 1, use_sampling: bool = False, guided_decoding: Optional[oumi.core.configs.params.guided_decoding_params.GuidedDecodingParams] = None)[source]#
Bases:
BaseParams
- batch_size: int | None = 1#
The number of sequences to generate in parallel.
Larger batch sizes can improve throughput but require more memory. Default is 1.
The value must either be positive or None. If None, the behavior depends on the downstream application; for example, LM Harness will automatically determine the largest batch size that will fit in memory.
For inference, this parameter is only used in NativeTextInferenceEngine.
- exclude_prompt_from_response: bool = True#
Whether to trim the model’s response and remove the prepended prompt.
- frequency_penalty: float = 0.0#
Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim.
- guided_decoding: GuidedDecodingParams | None = None#
Parameters for guided decoding.
- logit_bias: dict[Any, float]#
Modify the likelihood of specified tokens appearing in the completion.
Keys are tokens (specified by their token ID in the tokenizer), and values are the bias (-100 to 100). Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.
- max_new_tokens: int = 256#
The maximum number of new tokens to generate.
This limits the length of the generated text to prevent excessively long outputs. Default is 256 tokens.
- min_p: float = 0.0#
Sets a minimum probability threshold for token selection.
Tokens with probabilities below this threshold are filtered out before top-p or top-k sampling. This can help prevent the selection of highly improbable tokens. Default is 0.0 (no minimum threshold).
- num_beams: int = 1#
Number of beams for beam search. 1 means no beam search. Larger number of beams will make for a more thorough search for probable output token sequences, at the cost of increased computation time. Default is 1.
- presence_penalty: float = 0.0#
Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.
- seed: int | None = None#
Seed to use for random number determinism. If specified, APIs may use this parameter to make a best-effort at determinism.
- stop_strings: list[str] | None = None#
List of sequences where the API will stop generating further tokens.
- stop_token_ids: list[int] | None = None#
List of token ids for which the API will stop generating further tokens. This is only supported in VLLMInferenceEngine and NativeTextInferenceEngine.
- temperature: float = 0.0#
Controls randomness in the output.
Higher values (e.g., 1.0) make output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
- top_p: float = 1.0#
An alternative to temperature, called nucleus sampling.
It sets the cumulative probability threshold for token selection. For example, 0.9 means only considering the tokens comprising the top 90% probability mass.
- use_cache: bool = False#
Whether to use the model’s internal cache (key/value attentions) to speed up generation. Default is False.
- use_sampling: bool = False#
Whether to use sampling for next-token generation. If False, uses greedy decoding. Default is False.
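A minimal sketch of a sampling-based generation configuration; the stop string and seed are placeholders.

from oumi.core.configs.params.generation_params import GenerationParams

generation = GenerationParams(
    max_new_tokens=512,
    use_sampling=True,   # Sample next tokens; greedy decoding if False.
    temperature=0.7,
    top_p=0.9,
    seed=1234,           # Best-effort determinism where the backend supports it.
    stop_strings=["</answer>"],  # Placeholder stop sequence.
)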
oumi.core.configs.params.guided_decoding_params module#
- class oumi.core.configs.params.guided_decoding_params.GuidedDecodingParams(json: Any | None = None, regex: str | None = None, choice: list[str] | None = None)[source]#
Bases:
BaseParams
Parameters for guided decoding.
The parameters are mutually exclusive. Only one of the parameters can be specified at a time.
- choice: list[str] | None = None#
List of allowed choices for the output.
Restricts model output to one of the provided choices. Useful for forcing the model to select from a predefined set of options.
- json: Any | None = None#
JSON schema, Pydantic model, or string to guide the output format.
Can be a dict containing a JSON schema, a Pydantic model class, or a string containing JSON schema. Used to enforce structured output from the model.
- regex: str | None = None#
Regular expression pattern to guide the output format.
Pattern that the model output must match. Can be used to enforce specific text formats or patterns.
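A minimal sketch of the two most common guided-decoding modes, assuming a plain JSON-schema dict for the json option; remember that json, regex, and choice are mutually exclusive, so set only one per instance.

from oumi.core.configs.params.guided_decoding_params import GuidedDecodingParams

# Constrain output to a small JSON object via a JSON schema dict.
json_guided = GuidedDecodingParams(
    json={
        "type": "object",
        "properties": {"sentiment": {"type": "string"}},
        "required": ["sentiment"],
    }
)

# Or constrain output to one of a fixed set of labels.
choice_guided = GuidedDecodingParams(choice=["positive", "negative", "neutral"])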
oumi.core.configs.params.model_params module#
- class oumi.core.configs.params.model_params.ModelParams(model_name: str = '???', adapter_model: Optional[str] = None, tokenizer_name: Optional[str] = None, tokenizer_pad_token: Optional[str] = None, tokenizer_kwargs: dict[str, typing.Any] = <factory>, model_max_length: Optional[int] = None, load_pretrained_weights: bool = True, trust_remote_code: bool = False, torch_dtype_str: str = 'float32', compile: bool = False, chat_template: Optional[str] = None, attn_implementation: Optional[str] = None, device_map: Optional[str] = 'auto', model_kwargs: dict[str, typing.Any] = <factory>, enable_liger_kernel: bool = False, shard_for_eval: bool = False, freeze_layers: list[str] = <factory>)[source]#
Bases:
BaseParams
- adapter_model: str | None = None#
The path to an adapter model to be applied on top of the base model.
If provided, this adapter will be loaded and applied to the base model. The adapter path could alternatively be specified in model_name.
- attn_implementation: str | None = None#
The attention implementation to use.
Valid options include:
None: Use the default attention implementation (“sdpa” for torch>=2.1.1, otherwise “eager”)
“sdpa”: Use PyTorch’s scaled dot-product attention
“flash_attention_2”: Use Flash Attention 2 for potentially faster computation. Requires “flash-attn” package to be installed
“eager”: Manual implementation of attention
- chat_template: str | None = None#
The chat template to use for formatting inputs.
If provided, this template will be used to format multi-turn conversations for models that support chat-like interactions.
Note
Different models may require specific chat templates. Consult the model’s documentation for the appropriate template to use.
- compile: bool = False#
Whether to JIT compile the model.
For training, do not set this param, and instead set TrainingParams.compile.
- device_map: str | None = 'auto'#
Specifies how to distribute the model’s layers across available devices.
“auto”: Automatically distribute the model across available devices
None: Load the entire model on the default device
Note
“auto” is generally recommended as it optimizes device usage, especially for large models that don’t fit on a single GPU.
- enable_liger_kernel: bool = False#
Whether to enable the Liger kernel for potential performance improvements.
Liger is an optimized CUDA kernel that can accelerate certain operations.
Tip
Enabling this may improve performance, but ensure compatibility with your model and hardware before use in production.
- freeze_layers: list[str]#
A list of layer names to freeze during training.
These layers will have their parameters set to not require gradients, effectively preventing them from being updated during the training process. This is useful for fine-tuning specific parts of a model while keeping other parts fixed.
- load_pretrained_weights: bool = True#
Whether to load the pretrained model’s weights.
If True, the model will be initialized with pretrained weights. If False, the model will be initialized from the pretrained config without loading weights.
- model_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the model’s constructor.
This allows for passing any model-specific parameters that are not covered by other fields in ModelParams.
Note
Use this for model-specific parameters or to enable experimental features.
- model_max_length: int | None = None#
The maximum sequence length the model can handle.
If specified, this will override the default max length of the model’s config.
Note
Setting this to a larger value may increase memory usage but allow for processing longer inputs. Ensure your hardware can support the chosen length.
- model_name: str = '???'#
The name or path of the model or LoRA adapter to use.
This can be a model identifier from the Oumi registry, HuggingFace Hub, or a path to a local directory containing model files.
The LoRA adapter can be specified here instead of in adapter_model. If so, this value is copied to adapter_model, and the appropriate base model is set here instead. The base model could either be in the same directory as the adapter, or specified in the adapter’s config file.
- shard_for_eval: bool = False#
Whether to shard the model for evaluation.
This is needed for large models that do not fit on a single GPU. It is used as the value for the parallelize argument in LM Harness.
- tokenizer_kwargs: dict[str, Any]#
Additional keyword arguments to pass into the tokenizer’s constructor.
This allows for passing any tokenizer-specific parameters that are not covered by other fields in ModelParams.
- tokenizer_name: str | None = None#
The name or path of the tokenizer to use.
If None, the tokenizer associated with model_name will be used. Specify this if you want to use a different tokenizer than the default for the model.
- tokenizer_pad_token: str | None = None#
The padding token used by the tokenizer.
If this is set, it will override the default padding token of the tokenizer and the padding token optionally defined in the tokenizer_kwargs.
- torch_dtype_str: str = 'float32'#
The data type to use for the model’s parameters as a string.
Valid options are:
- “float32” or “f32” or “float” for 32-bit floating point
- “float16” or “f16” or “half” for 16-bit floating point
- “bfloat16” or “bf16” for brain floating point
- “float64” or “f64” or “double” for 64-bit floating point
This string will be converted to the corresponding torch.dtype. Defaults to “float32” for full precision.
- trust_remote_code: bool = False#
Whether to allow loading remote code when loading the model.
If True, this allows loading and executing code from the model’s repository, which can be a security risk. Only set to True for models you trust.
Defaults to False for safety.
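A minimal sketch combining the most commonly adjusted fields above; the model id is an illustrative HuggingFace Hub identifier, not a recommendation.

from oumi.core.configs.params.model_params import ModelParams

model = ModelParams(
    model_name="meta-llama/Llama-3.1-8B-Instruct",  # Placeholder model id.
    torch_dtype_str="bfloat16",
    attn_implementation="sdpa",
    model_max_length=4096,
    device_map="auto",
    trust_remote_code=False,  # Only enable for repositories you trust.
)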
oumi.core.configs.params.peft_params module#
- class oumi.core.configs.params.peft_params.LoraWeightInitialization(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
Enum representing the supported weight initializations for LoRA adapters.
- DEFAULT = 'default'#
- EVA = 'eva'#
- GAUSSIAN = 'gaussian'#
- LOFTQ = 'loftq'#
- OLORA = 'olora'#
- PISA = 'pissa'#
- PISSA_NITER = 'pissa_niter_[number of iters]'#
- RANDOM = 'random'#
- class oumi.core.configs.params.peft_params.PeftParams(lora_r: int = 8, lora_alpha: int = 8, lora_dropout: float = 0.0, lora_target_modules: Optional[list[str]] = None, lora_modules_to_save: Optional[list[str]] = None, lora_bias: str = 'none', lora_init_weights: oumi.core.configs.params.peft_params.LoraWeightInitialization = <LoraWeightInitialization.DEFAULT: 'default'>, lora_task_type: peft.utils.peft_types.TaskType = <TaskType.CAUSAL_LM: 'CAUSAL_LM'>, q_lora: bool = False, q_lora_bits: int = 4, bnb_4bit_quant_type: str = 'fp4', use_bnb_nested_quant: bool = False, bnb_4bit_quant_storage: str = 'uint8', bnb_4bit_compute_dtype: str = 'float32', peft_save_mode: oumi.core.configs.params.peft_params.PeftSaveMode = <PeftSaveMode.ADAPTER_ONLY: 'adapter_only'>)[source]#
Bases:
BaseParams
- bnb_4bit_compute_dtype: str = 'float32'#
Compute type of the quantized parameters. It can be different than the input type, e.g., it can be set to a lower precision for improved speed.
The string will be converted to the corresponding torch.dtype.
Valid string options are:
- “float32” for 32-bit floating point
- “float16” for 16-bit floating point
- “bfloat16” for brain floating point
- “float64” for 64-bit floating point
Defaults to “float32” for full precision.
- bnb_4bit_quant_storage: str = 'uint8'#
The storage type for packing quantized 4-bit parameters.
Defaults to ‘uint8’ for efficient storage.
- bnb_4bit_quant_type: str = 'fp4'#
The type of 4-bit quantization to use.
Can be ‘fp4’ (float point 4) or ‘nf4’ (normal float 4).
- lora_alpha: int = 8#
The scaling factor for the LoRA update.
This value is typically set equal to lora_r or 2*lora_r for stable training.
- lora_bias: str = 'none'#
Bias type for LoRA.
Can be ‘none’, ‘all’ or ‘lora_only’:
- ‘none’: No biases are trained.
- ‘all’: All biases in the model are trained.
- ‘lora_only’: Only biases in LoRA layers are trained.
If ‘all’ or ‘lora_only’, the corresponding biases will be updated during training. Note that this means even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation.
For more details, see: huggingface/peft
- lora_dropout: float = 0.0#
The dropout probability applied to LoRA layers.
This helps prevent overfitting in the adaptation layers.
- lora_init_weights: LoraWeightInitialization = 'default'#
Passing LoraWeightInitialization.DEFAULT will use the weight initialization from Microsoft’s reference LoRA implementation.
- Other valid (LoraWeightInitialization) options include:
“random” which will use fully random initialization and is discouraged.
“gaussian” for Gaussian initialization.
“eva” for Explained Variance Adaptation (EVA) (https://arxiv.org/abs/2410.07170).
“loftq” for improved performance when LoRA is combined with quantization (https://arxiv.org/abs/2310.08659).
“olora” for Orthonormal Low-Rank Adaptation of Large Language Models (OLoRA) (https://arxiv.org/html/2406.01775v1).
“pissa” for Principal Singular values and Singular vectors Adaptation (PiSSA) (https://arxiv.org/abs/2404.02948).
- For more information, see the Hugging Face PEFT documentation.
- lora_modules_to_save: list[str] | None = None#
List of module names to unfreeze and train alongside LoRA parameters.
These modules will be fully fine-tuned, not adapted using LoRA. Use this to selectively train certain parts of the model in full precision.
- lora_r: int = 8#
The rank of the update matrices in LoRA.
A higher value allows for more expressive adaptations but increases the number of trainable parameters.
- lora_target_modules: list[str] | None = None#
List of module names to apply LoRA to.
If None, LoRA will be applied to all linear layers in the model. Specify module names to selectively apply LoRA to certain parts of the model.
- lora_task_type: TaskType = 'CAUSAL_LM'#
The task type for LoRA adaptation.
Defaults to CAUSAL_LM (Causal Language Modeling).
- peft_save_mode: PeftSaveMode = 'adapter_only'#
How to save the final model during PEFT training.
This option is only used if TrainingParams.save_final_model is True. By default, only the model adapter is saved to reduce disk usage. Options are defined in the PeftSaveMode enum and include:
- ADAPTER_ONLY: Only save the model adapter.
- ADAPTER_AND_BASE_MODEL: Save the base model in addition to the adapter.
- MERGED: Merge the adapter and base model’s weights and save as a single model.
- q_lora: bool = False#
Whether to use quantization for LoRA (Q-LoRA).
If True, enables quantization for more memory-efficient fine-tuning.
- q_lora_bits: int = 4#
The number of bits to use for quantization in Q-LoRA.
This is only used if q_lora is True.
Defaults to 4-bit quantization.
- to_bits_and_bytes() BitsAndBytesConfig [source]#
Creates a configuration for quantized models via BitsAndBytes.
The resulting configuration uses the instantiated peft parameters.
- use_bnb_nested_quant: bool = False#
Whether to use nested quantization.
Nested quantization can provide additional memory savings.
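A minimal sketch of a QLoRA-style configuration using the fields above; the target module names are placeholders that depend on the model architecture.

from oumi.core.configs.params.peft_params import (
    LoraWeightInitialization,
    PeftParams,
    PeftSaveMode,
)

peft = PeftParams(
    lora_r=16,
    lora_alpha=32,                             # Commonly lora_r or 2 * lora_r.
    lora_dropout=0.05,
    lora_target_modules=["q_proj", "v_proj"],  # Placeholder module names.
    lora_init_weights=LoraWeightInitialization.GAUSSIAN,
    q_lora=True,                               # 4-bit quantized base model.
    q_lora_bits=4,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    peft_save_mode=PeftSaveMode.MERGED,        # Save a single merged model at the end.
)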
- class oumi.core.configs.params.peft_params.PeftSaveMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Enum representing how to save the final model during PEFT training.
While models saved with any of these options can be loaded by Oumi, those saved with ADAPTER_ONLY are not self-contained; the base model will be loaded separately from the local HF cache or downloaded from HF Hub if not in the cache.
- ADAPTER_AND_BASE_MODEL = 'adapter_and_base_model'#
Save the base model in addition to the adapter.
This is similar to ADAPTER_ONLY, but the base model’s weights are also saved in the same directory as the adapter weights, making the output dir self-contained.
- ADAPTER_ONLY = 'adapter_only'#
Only save the model adapter.
Note that when loading this saved model, the base model will be loaded separately from the local HF cache or downloaded from HF Hub.
- MERGED = 'merged'#
Merge the adapter and base model’s weights and save as a single model.
Note that the resulting model is a standard HF Transformers model, and is no longer a PEFT model. A copy of the adapter before merging is saved in the “adapter/” subdirectory.
oumi.core.configs.params.profiler_params module#
- class oumi.core.configs.params.profiler_params.ProfilerParams(save_dir: Optional[str] = None, enable_cpu_profiling: bool = False, enable_cuda_profiling: bool = False, record_shapes: bool = False, profile_memory: bool = False, with_stack: bool = False, with_flops: bool = False, with_modules: bool = False, row_limit: int = 50, schedule: oumi.core.configs.params.profiler_params.ProfilerScheduleParams = <factory>)[source]#
Bases:
BaseParams
- enable_cpu_profiling: bool = False#
Whether to profile CPU activity.
Corresponds to torch.profiler.ProfilerActivity.CPU.
- enable_cuda_profiling: bool = False#
Whether to profile CUDA.
Corresponds to torch.profiler.ProfilerActivity.CUDA.
- profile_memory: bool = False#
Track tensor memory allocation/deallocation.
- record_shapes: bool = False#
Save information about operator’s input shapes.
- row_limit: int = 50#
Max number of rows to include into profiling report tables.
Set to -1 to make it unlimited.
- save_dir: str | None = None#
Directory where the profiling data will be saved to.
If not specified and profiling is enabled, then the profiler sub-dir will be used under output_dir.
- schedule: ProfilerScheduleParams#
Parameters that define what subset of training steps to profile.
- with_flops: bool = False#
Use a formula to estimate the FLOPs (floating point operations) of specific operators (matrix multiplication and 2D convolution).
- with_modules: bool = False#
Record the module hierarchy (including function names) corresponding to the callstack of the op.
- with_stack: bool = False#
Record source information (file and line number) for the ops.
- class oumi.core.configs.params.profiler_params.ProfilerScheduleParams(enable_schedule: bool = False, wait: int = 0, warmup: int = 1, active: int = 3, repeat: int = 1, skip_first: int = 1)[source]#
Bases:
BaseParams
Parameters that define what subset of training steps to profile.
Keeping profiling enabled for all training steps may be impractical as it may result in out-of-memory errors, extremely large trace files, and may interfere with regular training performance. This config can be used to enable PyTorch profiler only for a small number of training steps, which is not affected by such issues, and may still provide a useful signal for performance analysis.
- active: int = 3#
The number of training steps to do active recording (ProfilerAction.RECORD) in each profiling cycle.
- enable_schedule: bool = False#
Whether profiling schedule is enabled.
If False, then profiling is enabled for the entire process duration, and all schedule parameters below will be ignored.
- repeat: int = 1#
The optional number of profiling cycles.
Each cycle includes wait + warmup + active steps. A value of zero means that cycles will continue until profiling is finished.
- skip_first: int = 1#
The number of initial training steps to skip at the beginning of the profiling session (ProfilerAction.NONE).
- wait: int = 0#
The number of training steps to skip at the beginning of each profiling cycle (ProfilerAction.NONE). Each cycle includes wait + warmup + active steps.
- warmup: int = 1#
The number of training steps to do profiling warmup (ProfilerAction.WARMUP) in each profiling cycle.
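A minimal sketch that profiles only a short window of training steps using the schedule parameters above; the step counts are arbitrary.

from oumi.core.configs.params.profiler_params import (
    ProfilerParams,
    ProfilerScheduleParams,
)

profiler = ProfilerParams(
    enable_cuda_profiling=True,
    profile_memory=True,
    schedule=ProfilerScheduleParams(
        enable_schedule=True,
        skip_first=1,  # Ignore the first training step entirely.
        wait=0,
        warmup=1,      # One warmup step per cycle.
        active=3,      # Record three steps per cycle.
        repeat=2,      # Run two profiling cycles, then stop.
    ),
)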
oumi.core.configs.params.remote_params module#
- class oumi.core.configs.params.remote_params.RemoteParams(api_url: str | None = None, api_key: str | None = None, api_key_env_varname: str | None = None, max_retries: int = 3, connection_timeout: float = 20.0, num_workers: int = 1, politeness_policy: float = 0.0, batch_completion_window: str | None = '24h')[source]#
Bases:
BaseParams
Parameters for running inference against a remote API.
- api_key: str | None = None#
API key to use for authentication.
- api_key_env_varname: str | None = None#
Name of the environment variable containing the API key for authentication.
- api_url: str | None = None#
URL of the API endpoint to use for inference.
- batch_completion_window: str | None = '24h'#
Time window for batch completion. Currently only ‘24h’ is supported.
Only used for batch inference.
- connection_timeout: float = 20.0#
Timeout in seconds for a request to an API.
- max_retries: int = 3#
Maximum number of retries to attempt when calling an API.
- num_workers: int = 1#
Number of workers to use for parallel inference.
- politeness_policy: float = 0.0#
Politeness policy to use when calling an API.
If greater than zero, this is the amount of time in seconds a worker will sleep before making a subsequent request.
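A minimal sketch for a remote inference endpoint; the URL and environment variable name are placeholders.

from oumi.core.configs.params.remote_params import RemoteParams

remote = RemoteParams(
    api_url="https://api.example.com/v1/chat/completions",  # Placeholder endpoint.
    api_key_env_varname="MY_API_KEY",  # Read the key from this environment variable.
    num_workers=4,
    politeness_policy=0.5,             # Each worker sleeps 0.5s between requests.
    max_retries=3,
    connection_timeout=30.0,
)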
oumi.core.configs.params.telemetry_params module#
- class oumi.core.configs.params.telemetry_params.TelemetryParams(telemetry_dir: str | None = 'telemetry', collect_telemetry_for_all_ranks: bool = False, track_gpu_temperature: bool = False)[source]#
Bases:
BaseParams
- collect_telemetry_for_all_ranks: bool = False#
Whether to collect telemetry for all ranks.
By default, only the main rank’s telemetry stats are collected and saved.
- telemetry_dir: str | None = 'telemetry'#
Directory where the telemetry data will be saved to.
If not specified, then telemetry files will be written under output_dir. If a relative path is specified, then files will be written in a telemetry_dir sub-directory in output_dir.
- track_gpu_temperature: bool = False#
Whether to record GPU temperature.
If collect_telemetry_for_all_ranks is False, only the first GPU’s temperature is tracked. Otherwise, temperature is recorded for all GPUs.
oumi.core.configs.params.training_params module#
- class oumi.core.configs.params.training_params.MixedPrecisionDtype(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
Enum representing the dtype used for mixed precision training.
For more details on mixed-precision training, see: https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
- BF16 = 'bf16'#
Similar to fp16 mixed precision, but with bf16 instead.
This requires Ampere or higher NVIDIA architecture, or using CPU or Ascend NPU.
- FP16 = 'fp16'#
fp16 mixed precision.
Requires ModelParams.torch_dtype (the dtype of the model weights) to be fp32. The model weights and optimizer state are fp32, but some ops will run in fp16 to improve training speed.
- NONE = 'none'#
No mixed precision.
Uses ModelParams.torch_dtype as the dtype for all tensors (model weights, optimizer state, activations, etc.).
- class oumi.core.configs.params.training_params.SchedulerType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
str, Enum
Enum representing the supported learning rate schedulers.
For optional args for each scheduler, see src/oumi/builders/lr_schedules.py.
- CONSTANT = 'constant'#
Constant scheduler.
Keeps the learning rate constant throughout training.
- COSINE = 'cosine'#
Cosine scheduler.
Decays the learning rate following the decreasing part of a cosine curve.
- COSINE_WITH_MIN_LR = 'cosine_with_min_lr'#
Cosine with a minimum learning rate scheduler.
Similar to cosine scheduler, but maintains a minimum learning rate at the end.
- COSINE_WITH_RESTARTS = 'cosine_with_restarts'#
Cosine with restarts scheduler.
Decays the learning rate following a cosine curve with periodic restarts.
- LINEAR = 'linear'#
Linear scheduler.
Decreases the learning rate linearly from the initial value to 0 over the course of training.
- class oumi.core.configs.params.training_params.TrainerType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Enum representing the supported trainers.
- HF = 'hf'#
Generic HuggingFace trainer from transformers library.
This is the standard trainer provided by the Hugging Face Transformers library, suitable for a wide range of training tasks.
- OUMI = 'oumi'#
Custom generic trainer implementation.
This is a custom trainer implementation specific to the Oumi project, designed to provide additional flexibility and features.
- TRL_DPO = 'trl_dpo'#
Direct Preference Optimization trainer from trl library.
This trainer implements the Direct Preference Optimization algorithm for fine-tuning language models based on human preferences.
- TRL_SFT = 'trl_sft'#
Supervised fine-tuning trainer from trl library.
This trainer is specifically designed for supervised fine-tuning tasks using the TRL (Transformer Reinforcement Learning) library.
- class oumi.core.configs.params.training_params.TrainingParams(use_peft: bool = False, trainer_type: oumi.core.configs.params.training_params.TrainerType = <TrainerType.HF: 'hf'>, enable_gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: dict[str, typing.Any] = <factory>, output_dir: str = 'output', per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, gradient_accumulation_steps: int = 1, max_steps: int = -1, num_train_epochs: int = 3, save_epoch: bool = False, save_steps: int = 500, save_final_model: bool = True, seed: int = 42, run_name: Optional[str] = None, metrics_function: Optional[str] = None, log_level: str = 'info', dep_log_level: str = 'warning', enable_wandb: bool = False, enable_tensorboard: bool = True, logging_strategy: str = 'steps', logging_dir: Optional[str] = None, logging_steps: int = 50, logging_first_step: bool = False, eval_strategy: str = 'no', eval_steps: int = 500, learning_rate: float = 5e-05, lr_scheduler_type: str = 'linear', lr_scheduler_kwargs: dict[str, typing.Any] = <factory>, warmup_ratio: Optional[float] = None, warmup_steps: Optional[int] = None, optimizer: str = 'adamw_torch', weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, sgd_momentum: float = 0.0, mixed_precision_dtype: oumi.core.configs.params.training_params.MixedPrecisionDtype = <MixedPrecisionDtype.NONE: 'none'>, compile: bool = False, include_performance_metrics: bool = False, include_alternative_mfu_metrics: bool = False, log_model_summary: bool = False, resume_from_checkpoint: Optional[str] = None, try_resume_from_last_checkpoint: bool = False, dataloader_num_workers: Union[int, str] = 0, dataloader_prefetch_factor: Optional[int] = None, dataloader_main_process_only: Optional[bool] = None, ddp_find_unused_parameters: Optional[bool] = None, max_grad_norm: Optional[float] = 1.0, trainer_kwargs: dict[str, typing.Any] = <factory>, profiler: oumi.core.configs.params.profiler_params.ProfilerParams = <factory>, telemetry: oumi.core.configs.params.telemetry_params.TelemetryParams = <factory>, empty_device_cache_steps: Optional[int] = None, nccl_default_timeout_minutes: Optional[float] = None)[source]#
Bases:
BaseParams
- adam_beta1: float = 0.9#
The beta1 parameter for Adam-based optimizers.
Exponential decay rate for the first moment estimates. Default is 0.9.
- adam_beta2: float = 0.999#
The beta2 parameter for Adam-based optimizers.
Exponential decay rate for the second moment estimates. Default is 0.999.
- adam_epsilon: float = 1e-08#
Epsilon parameter for Adam-based optimizers.
Small constant for numerical stability. Default is 1e-08.
- compile: bool = False#
Whether to JIT compile the model.
This parameter should be used instead of ModelParams.compile for training.
- dataloader_main_process_only: bool | None = None#
Controls whether the dataloader is iterated through on the main process only.
If set to True, the dataloader is only iterated through on the main process (rank 0), then the batches are split and broadcast to each process. This can reduce the number of requests to the dataset, and helps ensure that each example is seen by max one GPU per epoch, but may become a performance bottleneck if a large number of GPUs is used.
If set to False, the dataloader is iterated through on each GPU process.
If set to None (default), then True or False is auto-selected based on heuristics (properties of dataset, the number of nodes and/or GPUs, etc).
NOTE: We recommend benchmarking your setup, then explicitly configuring True or False.
- dataloader_num_workers: int | str = 0#
Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.
You can also use the special value “auto” to select the number of dataloader workers using a simple heuristic based on the number of CPUs and GPUs per node. Note that accurately estimating the number of workers is difficult and depends on many factors (the properties of the model, dataset, VM, network, etc.), so you can start with “auto” and then experimentally tune the exact number for your specific case. If “auto” is requested, then at minimum 1 worker is guaranteed to be assigned.
- dataloader_prefetch_factor: int | None = None#
Number of batches loaded in advance by each worker.
2 means there will be a total of 2 * num_workers batches prefetched across all workers.
This is only used if dataloader_num_workers >= 1.
- ddp_find_unused_parameters: bool | None = None#
When using PyTorch’s DistributedDataParallel training, the value of this flag is passed to find_unused_parameters.
Will default to False if gradient checkpointing is used, True otherwise.
- dep_log_level: str = 'warning'#
The logging level for dependency loggers (e.g., HuggingFace, PyTorch).
Possible values are “debug”, “info”, “warning”, “error”, “critical”.
- empty_device_cache_steps: int | None = None#
Number of steps to wait before calling torch.<device>.empty_cache().
This parameter determines how frequently the GPU cache should be cleared during training. If set, it will trigger cache clearing every empty_device_cache_steps. If left as None, the cache will not be emptied automatically.
Setting this can help manage GPU memory usage, especially for large models or long training runs, but may impact performance if set too low.
- enable_gradient_checkpointing: bool = False#
Whether to enable gradient checkpointing to save memory at the expense of speed.
Gradient checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward pass, it recomputes these activations during the backward pass. This can make the training slower, but it can also significantly reduce memory usage.
- enable_tensorboard: bool = True#
Whether to enable TensorBoard logging.
If True, TensorBoard will be used for logging metrics and visualizations.
- enable_wandb: bool = False#
Whether to enable Weights & Biases (wandb) logging.
If True, wandb will be used for experiment tracking and visualization. Wandb will also log a summary of the training run, including hyperparameters, metrics, and other relevant information at the end of training.
After enabling, you must set the WANDB_API_KEY environment variable. Alternatively, you can use the wandb login command to authenticate.
- eval_steps: int = 500#
Number of update steps between two evaluations if eval_strategy=”steps”.
Ignored if eval_strategy is not “steps”.
- eval_strategy: str = 'no'#
The strategy to use for evaluation during training.
Possible values:
- “no”: No evaluation is done during training.
- “steps”: Evaluation is done every eval_steps.
- “epoch”: Evaluation is done at the end of each epoch.
- gradient_accumulation_steps: int = 1#
Number of update steps to accumulate before performing a backward/update pass.
This technique allows for effectively larger batch sizes and is especially useful when such batch sizes would not fit in memory. It works by accumulating gradients from multiple forward passes before performing a single optimization step. Note, however, that setting this to >1 can increase memory usage for training setups without existing gradient accumulation buffers (e.g., 1-GPU training).
- gradient_checkpointing_kwargs: dict[str, Any]#
Keyword arguments for gradient checkpointing.
The use_reentrant parameter is required and is recommended to be set to False. For more details, see: https://pytorch.org/docs/stable/checkpoint.html
- include_alternative_mfu_metrics: bool = False#
Whether to report alternative MFU (Model FLOPs Utilization) metrics.
These metrics are based on HuggingFace’s total_flos. This option is only used if include_performance_metrics is True.
- include_performance_metrics: bool = False#
Whether to include performance metrics such as token statistics.
- learning_rate: float = 5e-05#
The initial learning rate for the optimizer.
This value can be adjusted by the learning rate scheduler during training.
- log_level: str = 'info'#
The logging level for the main Oumi logger.
Possible values are “debug”, “info”, “warning”, “error”, “critical”.
- log_model_summary: bool = False#
Whether to print a model summary, including layer names.
- logging_dir: str | None = None#
The directory where training logs will be saved.
This includes TensorBoard logs and other training-related output.
- logging_first_step: bool = False#
Whether to log and evaluate the first global step.
If True, metrics will be logged and evaluation will be performed at the very beginning of training. Skipping the first step can be useful to avoid logging and evaluation of the initial random model.
The first step is usually not representative of the model’s performance, as it includes model compilation, optimizer initialization, and other setup steps.
- logging_steps: int = 50#
Number of update steps between two logs if logging_strategy=”steps”.
Ignored if logging_strategy is not “steps”.
- logging_strategy: str = 'steps'#
The strategy to use for logging during training.
Possible values are:
- “steps”: Log every logging_steps steps.
- “epoch”: Log at the end of each epoch.
- “no”: Disable logging.
- lr_scheduler_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the learning rate scheduler.
These arguments can be used to fine-tune the behavior of the chosen scheduler.
- lr_scheduler_type: str = 'linear'#
The type of learning rate scheduler to use.
Possible values include “linear”, “cosine”, “cosine_with_restarts”, “cosine_with_min_lr”, and “constant”.
See src/oumi/builders/lr_schedules.py for more details on each scheduler.
- max_grad_norm: float | None = 1.0#
Maximum gradient norm (for gradient clipping) to avoid exploding gradients which can destabilize training.
Defaults to 1.0. When set to 0.0 or None, gradient clipping will not be applied.
- max_steps: int = -1#
If set to a positive number, the total number of training steps to perform.
This parameter overrides num_train_epochs. If set to -1 (default), the number of training steps is determined by num_train_epochs.
- metrics_function: str | None = None#
The name of the metrics function in the Oumi registry to use for evaluation during training.
The method must accept as input a HuggingFace EvalPrediction and return a dictionary of metrics, with string keys mapping to metric values. A single metrics_function may compute multiple metrics.
- mixed_precision_dtype: MixedPrecisionDtype = 'none'#
The data type to use for mixed precision training.
Default is NONE, which means no mixed precision is used.
- nccl_default_timeout_minutes: float | None = None#
Default timeout for NCCL operations in minutes.
See: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
If unset, will use the default value of torch.distributed.init_process_group which is 10min.
- num_train_epochs: int = 3#
Total number of training epochs to perform (if max_steps is not specified).
An epoch is one complete pass through the entire training dataset. This parameter is ignored if max_steps is set to a positive number.
- optimizer: str = 'adamw_torch'#
The optimizer to use for training.
See pytorch documentation for more information on available optimizers: https://pytorch.org/docs/stable/optim.html
Default is “adamw_torch” (AdamW implemented by PyTorch).
- output_dir: str = 'output'#
Directory where the output files will be saved.
This includes checkpoints, evaluation results, and any other artifacts produced during the training process.
- per_device_eval_batch_size: int = 8#
Number of samples per batch on each device during evaluation.
Similar to per_device_train_batch_size, but used during evaluation phases. Can often be set higher than the train batch size as no gradients are stored.
- per_device_train_batch_size: int = 8#
Number of samples per batch on each device during training.
This parameter directly affects memory usage and training speed. Larger batch sizes generally lead to better utilization of GPU compute capabilities but require more memory.
- profiler: ProfilerParams#
Parameters for performance profiling.
This field contains configuration options for the profiler, which can be used to analyze the performance of the training process. It uses the ProfilerParams class to define specific profiling settings.
- resume_from_checkpoint: str | None = None#
Path to a checkpoint folder from which to resume training.
If specified, training will resume by first loading the model from this folder.
- run_name: str | None = None#
A unique identifier for the current training run.
This name is used to identify the run in logging outputs, saved model checkpoints, and experiment tracking tools like Weights & Biases or TensorBoard. It’s particularly useful when running multiple experiments or when you want to easily distinguish between different training sessions.
- save_epoch: bool = False#
Save a checkpoint at the end of every epoch.
When set to True, this ensures that a model checkpoint is saved after each complete pass through the training data. This can be useful for tracking model progress over time and for resuming training from a specific epoch if needed.
If both save_steps and save_epoch are set, then save_steps takes precedence.
- save_final_model: bool = True#
Whether to save the model at the end of training.
For different options for saving PEFT models, see PeftParams.peft_save_mode. This should normally be set to True to ensure the final trained model is saved. However, in some cases, you may want to disable it, for example:
- If saving a large model which takes a long time
- When quickly testing training speed or metrics
- During debugging or experimentation phases
- save_steps: int = 500#
Save a checkpoint every save_steps training steps.
This parameter determines the frequency of saving checkpoints during training based on the number of steps. If both save_steps and save_epoch are set, then save_steps takes precedence.
To disable saving checkpoints during training, set save_steps to 0 and save_epoch to False. If enabled, a checkpoint will be saved at the end of training if there are any residual steps left.
- seed: int = 42#
Random seed used for initialization.
This seed is passed to the trainer and to all downstream dependencies to ensure reproducibility of results. It affects random number generation in various parts of the training process, including data shuffling, weight initialization, and any stochastic operations.
- sgd_momentum: float = 0.0#
Momentum factor for SGD optimizer.
Only used when optimizer is set to “sgd”, and when trainer_type is set to OUMI. Default is 0.0.
- telemetry: TelemetryParams#
Parameters for telemetry.
This field contains telemetry configuration options.
- property telemetry_dir: Path | None#
Returns the telemetry stats output directory.
- trainer_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the Trainer.
This allows for customization of the Trainer beyond the standard parameters defined in this class. Any key-value pairs added here will be passed directly to the Trainer’s constructor.
- trainer_type: TrainerType = 'hf'#
The type of trainer to use for the training process.
Options are defined in the TrainerType enum and include:
- HF: HuggingFace’s Trainer
- TRL_SFT: TRL’s SFT Trainer
- TRL_DPO: TRL’s DPO Trainer
- OUMI: Custom generic trainer implementation
- try_resume_from_last_checkpoint: bool = False#
If True, attempt to resume from the last checkpoint in “output_dir”.
If a checkpoint is found, training will resume from the model/optimizer/scheduler states loaded from this checkpoint. If no checkpoint is found, training will continue without loading any intermediate checkpoints.
Note: If resume_from_checkpoint is specified and contains a non-empty path, this parameter has no effect.
- use_peft: bool = False#
Whether to use Parameter-Efficient Fine-Tuning (PEFT) techniques.
PEFT methods allow for efficient adaptation of pre-trained language models to specific tasks by only updating a small number of (extra) model parameters. This can significantly reduce memory usage and training time.
- warmup_ratio: float | None = None#
The ratio of total training steps used for a linear warmup from 0 to the learning rate.
If set along with warmup_steps, this value will be ignored.
- warmup_steps: int | None = None#
The number of steps for the warmup phase of the learning rate scheduler.
If set, will override the value of warmup_ratio.
- weight_decay: float = 0.0#
Weight decay (L2 penalty) to apply to the model’s parameters.
In the HF trainers and the OUMI trainer, this is automatically applied only to weight tensors and skips biases/layernorms.
Default is 0.0 (no weight decay).
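A minimal sketch of a supervised fine-tuning setup combining fields documented above; the output directory, run name, and hyperparameter values are illustrative only.

from oumi.core.configs.params.training_params import (
    MixedPrecisionDtype,
    TrainerType,
    TrainingParams,
)

training = TrainingParams(
    trainer_type=TrainerType.TRL_SFT,
    output_dir="output/sft_run",            # Placeholder output directory.
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,          # Effective batch size of 32 per device.
    num_train_epochs=1,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    mixed_precision_dtype=MixedPrecisionDtype.BF16,
    enable_gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    logging_steps=10,
    save_steps=500,
    run_name="sft-demo",                    # Placeholder run name.
)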