oumi.core.configs.params#
Submodules#
oumi.core.configs.params.base_params module#
- class oumi.core.configs.params.base_params.BaseParams[source]#
Bases:
object
Base class for all parameter classes.
This class provides a common interface for all parameter classes, and provides a finalize_and_validate method to recursively validate the parameters.
Subclasses should implement the __finalize_and_validate__ method to perform custom validation logic.
- __finalize_and_validate__() None [source]#
Finalizes and validates the parameters of this object.
This method can be overridden by subclasses to implement custom validation logic.
In case of validation errors, this method should raise a ValueError or other appropriate exception.
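For illustration, a minimal sketch of a custom parameter class. The class and field names here are hypothetical, and it is assumed that subclasses are dataclasses, as the parameter classes in this module are:

from dataclasses import dataclass

from oumi.core.configs.params.base_params import BaseParams


@dataclass
class MyCustomParams(BaseParams):
    # Hypothetical field, for illustration only.
    num_workers: int = 1

    def __finalize_and_validate__(self) -> None:
        # Invoked (recursively) via finalize_and_validate(); raise on invalid values.
        if self.num_workers < 1:
            raise ValueError("num_workers must be >= 1.")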
oumi.core.configs.params.data_params module#
- class oumi.core.configs.params.data_params.DataParams(train: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>, test: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>, validation: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>)[source]#
Bases:
BaseParams
- get_split(split: DatasetSplit) DatasetSplitParams [source]#
A public getter for individual dataset splits.
- test: DatasetSplitParams#
The input datasets used for testing. This field is currently unused.
- train: DatasetSplitParams#
The input datasets used for training.
- validation: DatasetSplitParams#
The input datasets used for validation.
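A minimal usage sketch; the dataset name below is only a placeholder:

from oumi.core.configs.params.data_params import (
    DataParams,
    DatasetParams,
    DatasetSplit,
    DatasetSplitParams,
)

data_params = DataParams(
    train=DatasetSplitParams(
        datasets=[DatasetParams(dataset_name="yahma/alpaca-cleaned")]  # placeholder dataset
    )
)

# Retrieve the parameters for a specific split via the public getter.
train_split = data_params.get_split(DatasetSplit.TRAIN)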
- class oumi.core.configs.params.data_params.DatasetParams(dataset_name: str = '???', dataset_path: Optional[str] = None, subset: Optional[str] = None, split: str = 'train', dataset_kwargs: dict[str, typing.Any] = <factory>, sample_count: Optional[int] = None, mixture_proportion: Optional[float] = None, shuffle: bool = False, seed: Optional[int] = None, shuffle_buffer_size: int = 1000, trust_remote_code: bool = False, transform_num_workers: Union[int, str, NoneType] = None)[source]#
Bases:
BaseParams
- dataset_kwargs: dict[str, Any]#
Keyword arguments to pass to the dataset constructor.
These arguments will be passed directly to the dataset constructor.
- dataset_name: str = '???'#
The name of the dataset to load. Required.
This field is used to retrieve the appropriate class from the dataset registry that can be used to instantiate and preprocess the data.
If dataset_path is not specified, then the raw data will be automatically downloaded from the huggingface hub or oumi registry. Otherwise, the dataset will be loaded from the specified dataset_path.
- dataset_path: str | None = None#
The path to the dataset to load.
This can be used to load a dataset of type dataset_name from a custom path.
If dataset_path is not specified, then the raw data will be automatically downloaded from the huggingface hub or oumi registry.
- mixture_proportion: float | None = None#
The proportion of examples from this dataset relative to other datasets in the mixture.
If specified, all datasets must supply this value. Must be a float in the range [0, 1.0]. The mixture_proportion for all input datasets must sum to 1.
Examples are sampled after the dataset has been sampled using sample_count if specified.
- sample_count: int | None = None#
The number of examples to sample from the dataset.
Must be non-negative. If sample_count is larger than the size of the dataset, then the required additional examples are sampled by looping over the original dataset.
- seed: int | None = None#
The random seed used for shuffling the dataset before sampling.
If set to None, shuffling will be non-deterministic.
- shuffle: bool = False#
Whether to shuffle the dataset before any sampling occurs.
- shuffle_buffer_size: int = 1000#
The size of the shuffle buffer used for shuffling the dataset before sampling.
- split: str = 'train'#
The split of the dataset to load.
This is typically one of “train”, “test”, or “validation”. Defaults to “train”.
- subset: str | None = None#
The subset of the dataset to load.
This is usually a subfolder within the dataset root.
- transform_num_workers: int | str | None = None#
Number of subprocesses to use for dataset post-processing (ds.transform()).
Multiprocessing is disabled by default (None).
You can also use the special value “auto” to let oumi automatically select the number of subprocesses.
Using multiple processes can speed up processing, e.g., for large or multi-modal datasets.
The parameter is only supported for Map (non-iterable) datasets.
- trust_remote_code: bool = False#
Whether to trust remote code when loading the dataset.
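A minimal construction sketch; the dataset name and split below are placeholders for illustration:

from oumi.core.configs.params.data_params import DatasetParams

dataset = DatasetParams(
    dataset_name="HuggingFaceH4/ultrachat_200k",  # placeholder HF Hub dataset id
    split="train_sft",
    sample_count=10_000,  # loops over the dataset if it has fewer examples
    shuffle=True,
    seed=42,              # makes pre-sampling shuffling deterministic
    trust_remote_code=False,
)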
- class oumi.core.configs.params.data_params.DatasetSplit(value)[source]#
Bases:
Enum
Enum representing the split for a dataset.
- TEST = 'test'#
- TRAIN = 'train'#
- VALIDATION = 'validation'#
- class oumi.core.configs.params.data_params.DatasetSplitParams(datasets: list[oumi.core.configs.params.data_params.DatasetParams] = <factory>, collator_name: Optional[str] = None, pack: bool = False, stream: bool = False, target_col: Optional[str] = None, mixture_strategy: str = 'first_exhausted', seed: Optional[int] = None, use_async_dataset: bool = False, use_torchdata: Optional[bool] = None)[source]#
Bases:
BaseParams
- collator_name: str | None = None#
Name of Oumi data collator.
Data collator controls how to form a mini-batch from individual dataset elements.
Valid options are:
- “text_with_padding”: Dynamically pads the inputs received to
the longest length.
- “vision_language_with_padding”: Uses VisionLanguageCollator
for image+text multi-modal data.
If None, then a default collator will be assigned.
- datasets: list[DatasetParams]#
The datasets in this split.
- mixture_strategy: str = 'first_exhausted'#
The strategy for mixing multiple datasets.
When multiple datasets are provided, this parameter determines how they are combined. Two strategies are available:
FIRST_EXHAUSTED: Samples from all datasets until one is fully represented in the mixture. This is the default strategy.
ALL_EXHAUSTED: Samples from all datasets until each one is fully represented in the mixture. This may lead to significant oversampling.
- pack: bool = False#
Whether to pack the text into constant-length chunks.
Each chunk will be the size of the model’s max input length. This will stream the dataset, and tokenize on the fly if the dataset isn’t already tokenized (i.e. has an input_ids column).
- seed: int | None = None#
The random seed used for mixing this dataset split, if specified.
If set to None, mixing will be non-deterministic.
- stream: bool = False#
Whether to stream the dataset.
- target_col: str | None = None#
The dataset column name containing the input for training/testing/validation.
- Deprecated:
This parameter is deprecated and will be removed in the future.
- use_async_dataset: bool = False#
Whether to use the PretrainingAsyncTextDataset instead of ConstantLengthDataset.
- Deprecated:
This parameter is deprecated and will be removed in the future.
- use_torchdata: bool | None = None#
Whether to use the torchdata library for dataset loading and processing.
If set to None, this setting may be auto-inferred.
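A minimal sketch of a mixed split; the dataset names are placeholders, and the lowercase string value for ALL_EXHAUSTED is assumed:

from oumi.core.configs.params.data_params import DatasetParams, DatasetSplitParams

train_split = DatasetSplitParams(
    datasets=[
        DatasetParams(dataset_name="dataset_a", mixture_proportion=0.7),  # placeholder names
        DatasetParams(dataset_name="dataset_b", mixture_proportion=0.3),
    ],  # proportions must sum to 1 when specified
    mixture_strategy="all_exhausted",  # assumed string value of ALL_EXHAUSTED
    seed=123,  # makes mixing deterministic
)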
oumi.core.configs.params.evaluation_params module#
- class oumi.core.configs.params.evaluation_params.AlpacaEvalTaskParams(evaluation_backend: str = '', task_name: str | None = None, num_samples: int | None = None, log_samples: bool | None = False, eval_kwargs: dict[str, ~typing.Any] = <factory>, evaluation_platform: str | None = '', version: float | None = 2.0)[source]#
Bases:
EvaluationTaskParams
Parameters for the AlpacaEval evaluation framework.
AlpacaEval is an LLM-based automatic evaluation suite that is fast, cheap, replicable, and validated against 20K human annotations. The latest version (AlpacaEval 2.0) contains 805 open-ended prompts (tatsu-lab/alpaca_eval). A model annotator (judge) evaluates the quality of the model’s responses to these questions and calculates win rates against reference responses. The default judge is GPT-4 Turbo.
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- version: float | None = 2.0#
The version of AlpacaEval to use. Options: 1.0 or 2.0 (default).
- class oumi.core.configs.params.evaluation_params.EvaluationBackend(value)[source]#
Bases:
Enum
Enum representing the evaluation backend to use.
- ALPACA_EVAL = 'alpaca_eval'#
- CUSTOM = 'custom'#
- LM_HARNESS = 'lm_harness'#
- class oumi.core.configs.params.evaluation_params.EvaluationTaskParams(evaluation_backend: str = '', task_name: str | None = None, num_samples: int | None = None, log_samples: bool | None = False, eval_kwargs: dict[str, ~typing.Any] = <factory>, evaluation_platform: str | None = '')[source]#
Bases:
BaseParams
Configuration parameters for model evaluation tasks.
Supported backends:
LM Harness: Framework for evaluating language models on standard benchmarks. A list of all supported tasks can be found at: EleutherAI/lm-evaluation-harness.
Alpaca Eval: Framework for evaluating language models on instruction-following and quality of responses on open-ended questions.
Custom: Users can register their own evaluation functions using the decorator @register_evaluation_function. The task_name should be the registry key for the custom evaluation function to be used.
Examples
# LM Harness evaluation on MMLU
params = EvaluationTaskParams(
    evaluation_backend="lm_harness",
    task_name="mmlu",
    eval_kwargs={"num_fewshot": 5},
)

# Alpaca Eval 2.0 evaluation
params = EvaluationTaskParams(
    evaluation_backend="alpaca_eval",
)

# Custom evaluation
@register_evaluation_function("my_evaluation_function")
def my_evaluation(task_params, config):
    accuracy = ...
    return EvaluationResult(task_result={"accuracy": accuracy})

params = EvaluationTaskParams(
    task_name="my_evaluation_function",
    evaluation_backend="custom",
)
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- evaluation_backend: str = ''#
The evaluation backend to use for the current task.
- evaluation_platform: str | None = ''#
DEPRECATED; Please use evaluation_backend instead.
- get_evaluation_backend() EvaluationBackend [source]#
Returns the evaluation backend as an Enum.
- static list_evaluation_backends() str [source]#
Returns a string listing all available evaluation backends.
- log_samples: bool | None = False#
Whether to log the samples used for evaluation.
If not set (False): the model samples used for evaluation will not be logged. If set to True: the model samples generated during inference and used for evaluation will be logged in backend_config.json. The backend may also log other intermediate results related to inference.
- num_samples: int | None = None#
Number of samples/examples to evaluate from this dataset.
Mostly for debugging, in order to reduce the runtime. If not set (None): the entire dataset is evaluated. If set, this must be a positive integer.
- task_name: str | None = None#
The task to evaluate or the custom evaluation function to use.
For LM Harness evaluations (when the evaluation_backend is set to EvaluationBackend.LM_HARNESS), the task_name corresponds to a predefined task to evaluate on (e.g. “mmlu”). A list of all tasks supported by the LM Harness backend can be found by running: lm-eval --tasks list.
For custom evaluations (when evaluation_backend is set to EvaluationBackend.CUSTOM), the task_name should be the registry key for the custom evaluation function to be used. Users can register new evaluation functions using the decorator @register_evaluation_function.
- class oumi.core.configs.params.evaluation_params.LMHarnessTaskParams(evaluation_backend: str = '', task_name: str | None = None, num_samples: int | None = None, log_samples: bool | None = False, eval_kwargs: dict[str, ~typing.Any] = <factory>, evaluation_platform: str | None = '', num_fewshot: int | None = None)[source]#
Bases:
EvaluationTaskParams
Parameters for the LM Harness evaluation framework.
LM Harness is a comprehensive benchmarking suite for evaluating language models across various tasks.
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- num_fewshot: int | None = None#
Number of few-shot examples (with responses) to add in the prompt, in order to teach the model how to respond to the specific dataset’s prompts.
If not set (None): LM Harness will decide the value. If set to 0: no few-shot examples will be added in the prompt.
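A minimal sketch of an LM Harness task configuration; the values are chosen for illustration:

from oumi.core.configs.params.evaluation_params import LMHarnessTaskParams

task = LMHarnessTaskParams(
    evaluation_backend="lm_harness",
    task_name="mmlu",
    num_fewshot=5,    # add 5 solved examples to each prompt
    num_samples=200,  # evaluate a subset only, e.g. while debugging
)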
oumi.core.configs.params.fsdp_params module#
- class oumi.core.configs.params.fsdp_params.AutoWrapPolicy(value)[source]#
Bases:
str, Enum
The auto wrap policies for FullyShardedDataParallel (FSDP).
- NO_WRAP = 'NO_WRAP'#
No automatic wrapping is performed.
- SIZE_BASED_WRAP = 'SIZE_BASED_WRAP'#
Wraps layers based on parameter count.
- TRANSFORMER_BASED_WRAP = 'TRANSFORMER_BASED_WRAP'#
Wraps layers based on the transformer block layer.
- class oumi.core.configs.params.fsdp_params.BackwardPrefetch(value)[source]#
Bases:
str, Enum
The backward prefetch options for FullyShardedDataParallel (FSDP).
- BACKWARD_POST = 'BACKWARD_POST'#
Enables less overlap but requires less memory usage.
- BACKWARD_PRE = 'BACKWARD_PRE'#
Enables the most overlap but increases memory usage the most.
- NO_PREFETCH = 'NO_PREFETCH'#
Disables backward prefetching altogether.
- class oumi.core.configs.params.fsdp_params.FSDPParams(enable_fsdp: bool = False, sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD, cpu_offload: bool = False, mixed_precision: str | None = None, backward_prefetch: BackwardPrefetch = BackwardPrefetch.BACKWARD_PRE, forward_prefetch: bool = False, use_orig_params: bool | None = None, state_dict_type: StateDictType = StateDictType.FULL_STATE_DICT, auto_wrap_policy: AutoWrapPolicy = AutoWrapPolicy.NO_WRAP, min_num_params: int = 100000, transformer_layer_cls: str | None = None, sync_module_states: bool = True)[source]#
Bases:
BaseParams
Configuration options for Pytorch’s FullyShardedDataParallel (FSDP) training.
- auto_wrap_policy: AutoWrapPolicy = 'NO_WRAP'#
Policy for automatically wrapping layers in FSDP.
- backward_prefetch: BackwardPrefetch = 'BACKWARD_PRE'#
Determines when to prefetch the next set of parameters.
Improves throughput by enabling communication and computation overlap in the backward pass at the cost of slightly increased memory usage.
- Options:
- BACKWARD_PRE: Enables the most overlap but increases memory
usage the most. This prefetches the next set of parameters before the current set of parameters’ gradient computation.
- BACKWARD_POST: Enables less overlap but requires less memory
usage. This prefetches the next set of parameters after the current set of parameters’ gradient computation.
- NO_PREFETCH: Disables backward prefetching altogether. This has no overlap and
does not increase memory usage. This may degrade throughput significantly.
- cpu_offload: bool = False#
If True, offloads parameters and gradients to CPU when not in use.
- enable_fsdp: bool = False#
If True, enables FullyShardedDataParallel training.
Allows training larger models by sharding models and gradients across multiple GPUs.
- forward_prefetch: bool = False#
If True, prefetches the forward pass results.
- min_num_params: int = 100000#
Minimum number of parameters for a layer to be wrapped when using size_based policy. This has no effect when using transformer_based policy.
- mixed_precision: str | None = None#
Enables mixed precision training.
Options: None, “fp16”, “bf16”.
- sharding_strategy: ShardingStrategy = 'FULL_SHARD'#
Determines how to shard model parameters across GPUs.
See
torch.distributed.fsdp.api.ShardingStrategy
for more details.
- Options:
- FULL_SHARD: Shards model parameters, gradients, and optimizer states.
Provides the most memory efficiency but may impact performance.
- SHARD_GRAD_OP: Shards gradients and optimizer states, but not model
parameters. Balances memory savings and performance.
- HYBRID_SHARD: Shards model parameters within a node and replicates them
across nodes.
- NO_SHARD: No sharding is applied. Parameters, gradients, and optimizer states
are kept in full on each GPU.
- HYBRID_SHARD_ZERO2: Apply SHARD_GRAD_OP within a node, and replicate
parameters across nodes.
Warning
- NO_SHARD option is deprecated and will be removed in a future release.
Please use DistributedDataParallel (DDP) instead.
- state_dict_type: StateDictType = 'FULL_STATE_DICT'#
Specifies the type of state dict to use for checkpointing.
- sync_module_states: bool = True#
If True, synchronizes module states across processes.
When enabled, each FSDP module broadcasts parameters and buffers from rank 0 to ensure replication across ranks.
- transformer_layer_cls: str | None = None#
Class name for transformer layers when using transformer_based policy.
This has no effect when using size_based policy.
- use_orig_params: bool | None = None#
If True, uses the PyTorch Module’s original parameters for FSDP.
For more information, see: https://pytorch.org/docs/stable/fsdp.html. If not specified, it will be automatically inferred based on other config values.
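A minimal FSDP configuration sketch; the transformer layer class name is a placeholder:

from oumi.core.configs.params.fsdp_params import (
    AutoWrapPolicy,
    BackwardPrefetch,
    FSDPParams,
    ShardingStrategy,
)

fsdp = FSDPParams(
    enable_fsdp=True,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=AutoWrapPolicy.TRANSFORMER_BASED_WRAP,
    transformer_layer_cls="LlamaDecoderLayer",  # placeholder transformer block class name
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    mixed_precision="bf16",
)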
- class oumi.core.configs.params.fsdp_params.ShardingStrategy(value)[source]#
Bases:
str, Enum
The sharding strategies for FullyShardedDataParallel (FSDP).
See
torch.distributed.fsdp.ShardingStrategy
for more details.
- FULL_SHARD = 'FULL_SHARD'#
Shards model parameters, gradients, and optimizer states. Provides the most memory efficiency but may impact performance.
- HYBRID_SHARD = 'HYBRID_SHARD'#
Shards model parameters within a node and replicates them across nodes.
- HYBRID_SHARD_ZERO2 = 'HYBRID_SHARD_ZERO2'#
Apply SHARD_GRAD_OP within a node, and replicate parameters across nodes.
- NO_SHARD = 'NO_SHARD'#
No sharding is applied. Parameters, gradients, and optimizer states are kept in full on each GPU.
- SHARD_GRAD_OP = 'SHARD_GRAD_OP'#
Shards gradients and optimizer states, but not model parameters. Balances memory savings and performance.
- class oumi.core.configs.params.fsdp_params.StateDictType(value)[source]#
Bases:
str, Enum
The supported state dict types for FullyShardedDataParallel (FSDP).
This controls how the model’s state dict will be saved during checkpointing, and how it can be consumed afterwards.
- FULL_STATE_DICT = 'FULL_STATE_DICT'#
The state dict will be saved in a non-sharded, unflattened format.
This is similar to checkpointing without FSDP.
- LOCAL_STATE_DICT = 'LOCAL_STATE_DICT'#
The state dict will be saved in a sharded, flattened format.
Since it’s flattened, this can only be used by FSDP.
- SHARDED_STATE_DICT = 'SHARDED_STATE_DICT'#
The state dict will be saved in a sharded, unflattened format.
This can be used by other parallel schemes.
oumi.core.configs.params.generation_params module#
- class oumi.core.configs.params.generation_params.GenerationParams(max_new_tokens: int = 1024, batch_size: Optional[int] = 1, exclude_prompt_from_response: bool = True, seed: Optional[int] = None, temperature: float = 0.0, top_p: float = 1.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, stop_strings: Optional[list[str]] = None, stop_token_ids: Optional[list[int]] = None, logit_bias: dict[typing.Any, float] = <factory>, min_p: float = 0.0, use_cache: bool = False, num_beams: int = 1, use_sampling: bool = False, guided_decoding: Optional[oumi.core.configs.params.guided_decoding_params.GuidedDecodingParams] = None)[source]#
Bases:
BaseParams
- batch_size: int | None = 1#
The number of sequences to generate in parallel.
Larger batch sizes can improve throughput but require more memory. Default is 1.
The value must either be positive or None, in which case the behavior is dependent on the downstream application. For example, LM Harness will automatically determine the largest batch size that will fit in memory.
For inference, this parameter is only used in NativeTextInferenceEngine.
- exclude_prompt_from_response: bool = True#
Whether to trim the model’s response and remove the prepended prompt.
- frequency_penalty: float = 0.0#
Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim.
- guided_decoding: GuidedDecodingParams | None = None#
Parameters for guided decoding.
- logit_bias: dict[Any, float]#
Modify the likelihood of specified tokens appearing in the completion.
Keys are tokens (specified by their token ID in the tokenizer), and values are the bias (-100 to 100). Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.
- max_new_tokens: int = 1024#
The maximum number of new tokens to generate.
This limits the length of the generated text to prevent excessively long outputs. Default is 1024 tokens.
- min_p: float = 0.0#
Sets a minimum probability threshold for token selection.
Tokens with probabilities below this threshold are filtered out before top-p or top-k sampling. This can help prevent the selection of highly improbable tokens. Default is 0.0 (no minimum threshold).
- num_beams: int = 1#
Number of beams for beam search. 1 means no beam search. Larger number of beams will make for a more thorough search for probable output token sequences, at the cost of increased computation time. Default is 1.
- presence_penalty: float = 0.0#
Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.
- seed: int | None = None#
Seed to use for random number determinism. If specified, APIs may use this parameter to make a best-effort at determinism.
- stop_strings: list[str] | None = None#
List of sequences where the API will stop generating further tokens.
- stop_token_ids: list[int] | None = None#
List of token ids for which the API will stop generating further tokens. This is only supported in VLLMInferenceEngine and NativeTextInferenceEngine.
- temperature: float = 0.0#
Controls randomness in the output.
Higher values (e.g., 1.0) make output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
- top_p: float = 1.0#
An alternative to temperature, called nucleus sampling.
It sets the cumulative probability threshold for token selection. For example, 0.9 means only considering the tokens comprising the top 90% probability mass.
- use_cache: bool = False#
Whether to use the model’s internal cache (key/value attentions) to speed up generation. Default is False.
- use_sampling: bool = False#
Whether to use sampling for next-token generation. If False, uses greedy decoding. Default is False.
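A minimal sampling configuration sketch; the stop string is a placeholder:

from oumi.core.configs.params.generation_params import GenerationParams

generation = GenerationParams(
    max_new_tokens=512,
    use_sampling=True,   # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
    stop_strings=["###"],  # placeholder stop sequence
    seed=42,
)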
oumi.core.configs.params.grpo_params module#
- class oumi.core.configs.params.grpo_params.GrpoParams(model_init_kwargs: dict[str, typing.Any] = <factory>, max_prompt_length: Optional[int] = None, max_completion_length: Optional[int] = None, num_generations: Optional[int] = None, temperature: float = 0.9, remove_unused_columns: bool = False, repetition_penalty: Optional[float] = 1.0, use_vllm: bool = False, vllm_device: Optional[str] = None, vllm_gpu_memory_utilization: float = 0.9, vllm_dtype: Optional[str] = None, vllm_max_model_len: Optional[int] = None)[source]#
Bases:
BaseParams
- max_completion_length: int | None = None#
Maximum length of the generated completion.
If unspecified (None), defaults to 256.
- max_prompt_length: int | None = None#
Maximum length of the prompt.
If the prompt is longer than this value, it will be truncated left. If unspecified (None), defaults to 512.
- model_init_kwargs: dict[str, Any]#
Keyword arguments for AutoModelForCausalLM.from_pretrained(…)
- num_generations: int | None = None#
Number of generations per prompt to sample.
The global batch size (num_processes * per_device_batch_size) must be divisible by this value. If unspecified (None), defaults to 8.
- remove_unused_columns: bool = False#
Whether to only keep the column “prompt” in the dataset.
If you use a custom reward function that requires any column other than “prompts” and “completions”, you should set it to False.
- repetition_penalty: float | None = 1.0#
Float that penalizes new tokens if they appear in the prompt/response so far.
Values > 1.0 encourage the model to use new tokens, while values < 1.0 encourage the model to repeat tokens.
- temperature: float = 0.9#
Temperature for sampling.
The higher the temperature, the more random the completions.
- use_vllm: bool = False#
Whether to use vLLM for generating completions.
If set to True, ensure that a GPU is kept unused for training, as vLLM will require one for generation.
- vllm_device: str | None = None#
Device where vLLM generation will run.
For example, “cuda:1”. If set to None, the system will automatically select the next available GPU after the last one used for training. This assumes that training has not already occupied all available GPUs. If only one device is available, the device will be shared between both training and vLLM.
- vllm_dtype: str | None = None#
Data type to use for vLLM generation.
If set to None, the data type will be automatically determined based on the model configuration. Find the supported values in the vLLM documentation.
- vllm_gpu_memory_utilization: float = 0.9#
Ratio (between 0 and 1) of GPU memory to reserve.
Fraction of VRAM reserved for the model weights, activations, and KV cache on the device dedicated to generation powered by vLLM. Higher values will increase the KV cache size and thus improve the model’s throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors during initialization.
- vllm_max_model_len: int | None = None#
The max_model_len to use for vLLM.
This could be useful when running with reduced vllm_gpu_memory_utilization, leading to a reduced KV cache size. If not set, vLLM will use the model context size, which might be much larger than the KV cache, leading to inefficiencies.
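A minimal GRPO configuration sketch with vLLM-powered generation; the values are chosen for illustration:

from oumi.core.configs.params.grpo_params import GrpoParams

grpo = GrpoParams(
    max_prompt_length=512,
    max_completion_length=256,
    num_generations=8,  # global batch size must be divisible by this value
    temperature=0.9,
    use_vllm=True,      # reserve one GPU for vLLM-powered generation
    vllm_gpu_memory_utilization=0.8,
)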
oumi.core.configs.params.guided_decoding_params module#
- class oumi.core.configs.params.guided_decoding_params.GuidedDecodingParams(json: Any | None = None, regex: str | None = None, choice: list[str] | None = None)[source]#
Bases:
BaseParams
Parameters for guided decoding.
The parameters are mutually exclusive. Only one of the parameters can be specified at a time.
- choice: list[str] | None = None#
List of allowed choices for the output.
Restricts model output to one of the provided choices. Useful for forcing the model to select from a predefined set of options.
- json: Any | None = None#
JSON schema, Pydantic model, or string to guide the output format.
Can be a dict containing a JSON schema, a Pydantic model class, or a string containing JSON schema. Used to enforce structured output from the model.
- regex: str | None = None#
Regular expression pattern to guide the output format.
Pattern that the model output must match. Can be used to enforce specific text formats or patterns.
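A minimal sketch showing the mutually exclusive fields; only one may be set per instance:

from oumi.core.configs.params.guided_decoding_params import GuidedDecodingParams

# Restrict the model's output to one of a fixed set of labels.
guided = GuidedDecodingParams(choice=["positive", "negative", "neutral"])

# Alternatively, enforce a JSON schema (illustrative schema).
schema = {"type": "object", "properties": {"label": {"type": "string"}}}
guided_json = GuidedDecodingParams(json=schema)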
oumi.core.configs.params.model_params module#
- class oumi.core.configs.params.model_params.ModelParams(model_name: str = '???', adapter_model: Optional[str] = None, tokenizer_name: Optional[str] = None, tokenizer_pad_token: Optional[str] = None, tokenizer_kwargs: dict[str, typing.Any] = <factory>, processor_kwargs: dict[str, typing.Any] = <factory>, model_max_length: Optional[int] = None, load_pretrained_weights: bool = True, trust_remote_code: bool = False, torch_dtype_str: str = 'float32', compile: bool = False, chat_template: Optional[str] = None, attn_implementation: Optional[str] = None, device_map: Optional[str] = 'auto', model_kwargs: dict[str, typing.Any] = <factory>, enable_liger_kernel: bool = False, shard_for_eval: bool = False, freeze_layers: list[str] = <factory>)[source]#
Bases:
BaseParams
- adapter_model: str | None = None#
The path to an adapter model to be applied on top of the base model.
If provided, this adapter will be loaded and applied to the base model. The adapter path could alternatively be specified in model_name.
- attn_implementation: str | None = None#
The attention implementation to use.
Valid options include:
None: Use the default attention implementation (sdpa for torch>=2.1.1, else eager)
“sdpa”: Use PyTorch’s scaled dot-product attention
“flash_attention_2”: Use Flash Attention 2 for potentially faster computation. Requires “flash-attn” package to be installed
“eager”: Manual implementation of attention
- chat_template: str | None = None#
The chat template to use for formatting inputs.
If provided, this template will be used to format multi-turn conversations for models that support chat-like interactions.
Note
Different models may require specific chat templates. Consult the model’s documentation for the appropriate template to use.
- compile: bool = False#
Whether to JIT compile the model.
For training, do not set this param, and instead set TrainingParams.compile.
- device_map: str | None = 'auto'#
Specifies how to distribute the model’s layers across available devices.
“auto”: Automatically distribute the model across available devices
None: Load the entire model on the default device
Note
“auto” is generally recommended as it optimizes device usage, especially for large models that don’t fit on a single GPU.
- enable_liger_kernel: bool = False#
Whether to enable the Liger kernel for potential performance improvements.
Liger is an optimized CUDA kernel that can accelerate certain operations.
Tip
Enabling this may improve performance, but ensure compatibility with your model and hardware before use in production.
- freeze_layers: list[str]#
A list of layer names to freeze during training.
These layers will have their parameters set to not require gradients, effectively preventing them from being updated during the training process. This is useful for fine-tuning specific parts of a model while keeping other parts fixed.
- load_pretrained_weights: bool = True#
Whether to load the pretrained model’s weights.
If True, the model will be initialized with pretrained weights. If False, the model will be initialized from the pretrained config without loading weights.
- model_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the model’s constructor.
This allows for passing any model-specific parameters that are not covered by other fields in ModelParams.
Note
Use this for model-specific parameters or to enable experimental features.
- model_max_length: int | None = None#
The maximum sequence length the model can handle.
If specified, this will override the default max length of the model’s config.
Note
Setting this to a larger value may increase memory usage but allow for processing longer inputs. Ensure your hardware can support the chosen length.
- model_name: str = '???'#
The name or path of the model or LoRA adapter to use.
This can be a model identifier from the Oumi registry, HuggingFace Hub, or a path to a local directory containing model files.
The LoRA adapter can be specified here instead of in adapter_model. If so, this value is copied to adapter_model, and the appropriate base model is set here instead. The base model could either be in the same directory as the adapter, or specified in the adapter’s config file.
- processor_kwargs: dict[str, Any]#
Additional keyword arguments to pass into the processor’s constructor.
Processors are used in Oumi for vision-language models to process image and text inputs. This field is optional and can be left empty for text-only models, or if not needed.
These params override model-specific default values for these kwargs, if present.
- shard_for_eval: bool = False#
Whether to shard the model for evaluation.
This is needed for large models that do not fit on a single GPU. It is used as the value for the parallelize argument in LM Harness.
- tokenizer_kwargs: dict[str, Any]#
Additional keyword arguments to pass into the tokenizer’s constructor.
This allows for passing any tokenizer-specific parameters that are not covered by other fields in ModelParams.
- tokenizer_name: str | None = None#
The name or path of the tokenizer to use.
If None, the tokenizer associated with model_name will be used. Specify this if you want to use a different tokenizer than the default for the model.
- tokenizer_pad_token: str | None = None#
The padding token used by the tokenizer.
If this is set, it will override the default padding token of the tokenizer and the padding token optionally defined in the tokenizer_kwargs.
- torch_dtype_str: str = 'float32'#
The data type to use for the model’s parameters as a string.
Valid options are:
- “float32” or “f32” or “float” for 32-bit floating point
- “float16” or “f16” or “half” for 16-bit floating point
- “bfloat16” or “bf16” for brain floating point
- “float64” or “f64” or “double” for 64-bit floating point
This string will be converted to the corresponding torch.dtype. Defaults to “float32” for full precision.
- trust_remote_code: bool = False#
Whether to allow loading remote code when loading the model.
If True, this allows loading and executing code from the model’s repository, which can be a security risk. Only set to True for models you trust.
Defaults to False for safety.
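A minimal model configuration sketch; the model id is a placeholder:

from oumi.core.configs.params.model_params import ModelParams

model = ModelParams(
    model_name="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    torch_dtype_str="bfloat16",
    attn_implementation="sdpa",
    model_max_length=4096,
    trust_remote_code=False,
)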
oumi.core.configs.params.peft_params module#
- class oumi.core.configs.params.peft_params.LoraWeightInitialization(value)[source]#
Bases:
str, Enum
Enum representing the supported weight initializations for LoRA adapters.
- DEFAULT = 'default'#
- EVA = 'eva'#
- GAUSSIAN = 'gaussian'#
- LOFTQ = 'loftq'#
- OLORA = 'olora'#
- PISA = 'pissa'#
- PISSA_NITER = 'pissa_niter_[number of iters]'#
- RANDOM = 'random'#
- class oumi.core.configs.params.peft_params.PeftParams(lora_r: int = 8, lora_alpha: int = 8, lora_dropout: float = 0.0, lora_target_modules: Optional[list[str]] = None, lora_modules_to_save: Optional[list[str]] = None, lora_bias: str = 'none', lora_init_weights: oumi.core.configs.params.peft_params.LoraWeightInitialization = <LoraWeightInitialization.DEFAULT: 'default'>, lora_task_type: peft.utils.peft_types.TaskType = <TaskType.CAUSAL_LM: 'CAUSAL_LM'>, q_lora: bool = False, q_lora_bits: int = 4, bnb_4bit_quant_type: str = 'fp4', use_bnb_nested_quant: bool = False, bnb_4bit_quant_storage: str = 'uint8', bnb_4bit_compute_dtype: str = 'float32', peft_save_mode: oumi.core.configs.params.peft_params.PeftSaveMode = <PeftSaveMode.ADAPTER_ONLY: 'adapter_only'>)[source]#
Bases:
BaseParams
- bnb_4bit_compute_dtype: str = 'float32'#
Compute type of the quantized parameters. It can be different than the input type, e.g., it can be set to a lower precision for improved speed.
The string will be converted to the corresponding torch.dtype.
Valid string options are:
- “float32” for 32-bit floating point
- “float16” for 16-bit floating point
- “bfloat16” for brain floating point
- “float64” for 64-bit floating point
Defaults to “float32” for full precision.
- bnb_4bit_quant_storage: str = 'uint8'#
The storage type for packing quantized 4-bit parameters.
Defaults to ‘uint8’ for efficient storage.
- bnb_4bit_quant_type: str = 'fp4'#
The type of 4-bit quantization to use.
Can be ‘fp4’ (float point 4) or ‘nf4’ (normal float 4).
- lora_alpha: int = 8#
The scaling factor for the LoRA update.
This value is typically set equal to lora_r or 2*lora_r for stable training.
- lora_bias: str = 'none'#
Bias type for LoRA.
Can be ‘none’, ‘all’ or ‘lora_only’:
- ‘none’: No biases are trained.
- ‘all’: All biases in the model are trained.
- ‘lora_only’: Only biases in LoRA layers are trained.
If ‘all’ or ‘lora_only’, the corresponding biases will be updated during training. Note that this means even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation.
For more details, see: huggingface/peft
- lora_dropout: float = 0.0#
The dropout probability applied to LoRA layers.
This helps prevent overfitting in the adaptation layers.
- lora_init_weights: LoraWeightInitialization = 'default'#
Passing LoraWeightInitialization.DEFAULT will use the underlying reference implementation of the corresponding model from Microsoft.
- Other valid (LoraWeightInitialization) options include:
“random” which will use fully random initialization and is discouraged.
“gaussian” for Gaussian initialization.
“eva” for Explained Variance Adaptation (EVA) (https://arxiv.org/abs/2410.07170).
“loftq” for improved performance when LoRA is combined with quantization (https://arxiv.org/abs/2310.08659).
“olora” for Orthonormal Low-Rank Adaptation of Large Language Models (OLoRA) (https://arxiv.org/html/2406.01775v1).
“pissa” for Principal Singular values and Singular vectors Adaptation (PiSSA) (https://arxiv.org/abs/2404.02948).
- For more information, see the Hugging Face PEFT documentation on weight initialization.
- lora_modules_to_save: list[str] | None = None#
List of module names to unfreeze and train alongside LoRA parameters.
These modules will be fully fine-tuned, not adapted using LoRA. Use this to selectively train certain parts of the model in full precision.
- lora_r: int = 8#
The rank of the update matrices in LoRA.
A higher value allows for more expressive adaptations but increases the number of trainable parameters.
- lora_target_modules: list[str] | None = None#
List of module names to apply LoRA to.
If None, LoRA will be applied to all linear layers in the model. Specify module names to selectively apply LoRA to certain parts of the model.
- lora_task_type: TaskType = 'CAUSAL_LM'#
The task type for LoRA adaptation.
Defaults to CAUSAL_LM (Causal Language Modeling).
- peft_save_mode: PeftSaveMode = 'adapter_only'#
How to save the final model during PEFT training.
This option is only used if TrainingParams.save_final_model is True. By default, only the model adapter is saved to reduce disk usage. Options are defined in the PeftSaveMode enum and include:
- ADAPTER_ONLY: Only save the model adapter.
- ADAPTER_AND_BASE_MODEL: Save the base model in addition to the adapter.
- MERGED: Merge the adapter and base model’s weights and save as a single model.
- q_lora: bool = False#
Whether to use quantization for LoRA (Q-LoRA).
If True, enables quantization for more memory-efficient fine-tuning.
- q_lora_bits: int = 4#
The number of bits to use for quantization in Q-LoRA.
This is only used if q_lora is True.
Defaults to 4-bit quantization.
- to_bits_and_bytes() BitsAndBytesConfig [source]#
Creates a configuration for quantized models via BitsAndBytes.
The resulting configuration uses the instantiated peft parameters.
- use_bnb_nested_quant: bool = False#
Whether to use nested quantization.
Nested quantization can provide additional memory savings.
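A minimal LoRA/QLoRA configuration sketch; the target module names are placeholders:

from oumi.core.configs.params.peft_params import PeftParams

peft = PeftParams(
    lora_r=16,
    lora_alpha=32,  # commonly lora_r or 2 * lora_r
    lora_dropout=0.05,
    lora_target_modules=["q_proj", "v_proj"],  # placeholder module names
    q_lora=True,                 # quantize the base model for QLoRA-style training
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)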
- class oumi.core.configs.params.peft_params.PeftSaveMode(value)[source]#
Bases:
Enum
Enum representing how to save the final model during PEFT training.
While models saved with any of these options can be loaded by Oumi, those saved with ADAPTER_ONLY are not self-contained; the base model will be loaded separately from the local HF cache or downloaded from HF Hub if not in the cache.
- ADAPTER_AND_BASE_MODEL = 'adapter_and_base_model'#
Save the base model in addition to the adapter.
This is similar to ADAPTER_ONLY, but the base model’s weights are also saved in the same directory as the adapter weights, making the output dir self-contained.
- ADAPTER_ONLY = 'adapter_only'#
Only save the model adapter.
Note that when loading this saved model, the base model will be loaded separately from the local HF cache or downloaded from HF Hub.
- MERGED = 'merged'#
Merge the adapter and base model’s weights and save as a single model.
Note that the resulting model is a standard HF Transformers model, and is no longer a PEFT model. A copy of the adapter before merging is saved in the “adapter/” subdirectory.
oumi.core.configs.params.profiler_params module#
- class oumi.core.configs.params.profiler_params.ProfilerParams(save_dir: Optional[str] = None, enable_cpu_profiling: bool = False, enable_cuda_profiling: bool = False, record_shapes: bool = False, profile_memory: bool = False, with_stack: bool = False, with_flops: bool = False, with_modules: bool = False, row_limit: int = 50, schedule: oumi.core.configs.params.profiler_params.ProfilerScheduleParams = <factory>)[source]#
Bases:
BaseParams
- enable_cpu_profiling: bool = False#
Whether to profile CPU activity.
Corresponds to torch.profiler.ProfilerActivity.CPU.
- enable_cuda_profiling: bool = False#
Whether to profile CUDA.
Corresponds to torch.profiler.ProfilerActivity.CUDA.
- profile_memory: bool = False#
Track tensor memory allocation/deallocation.
- record_shapes: bool = False#
Save information about operator’s input shapes.
- row_limit: int = 50#
Max number of rows to include into profiling report tables.
Set to -1 to make it unlimited.
- save_dir: str | None = None#
Directory where the profiling data will be saved to.
If not specified and profiling is enabled, then the profiler sub-dir will be used under output_dir.
- schedule: ProfilerScheduleParams#
Parameters that define what subset of training steps to profile.
- with_flops: bool = False#
Use formula to estimate the FLOPs (floating point operations) of specific operators (matrix multiplication and 2D convolution).
- with_modules: bool = False#
Record module hierarchy (including function names) corresponding to the callstack of the op.
- with_stack: bool = False#
Record source information (file and line number) for the ops.
- class oumi.core.configs.params.profiler_params.ProfilerScheduleParams(enable_schedule: bool = False, wait: int = 0, warmup: int = 1, active: int = 3, repeat: int = 1, skip_first: int = 1)[source]#
Bases:
BaseParams
Parameters that define what subset of training steps to profile.
Keeping profiling enabled for all training steps may be impractical as it may result in out-of-memory errors, extremely large trace files, and may interfere with regular training performance. This config can be used to enable PyTorch profiler only for a small number of training steps, which is not affected by such issues, and may still provide a useful signal for performance analysis.
- active: int = 3#
The number of training steps to do active recording (ProfilerAction.RECORD) in each profiling cycle.
- enable_schedule: bool = False#
Whether profiling schedule is enabled.
If False, then profiling is enabled for the entire process duration, and all schedule parameters below will be ignored.
- repeat: int = 1#
The optional number of profiling cycles.
Each cycle includes wait + warmup + active steps. A value of zero means that cycles continue until profiling finishes.
- skip_first: int = 1#
The number of initial training steps to skip at the beginning of profiling session (ProfilerAction.NONE).
- wait: int = 0#
The number of training steps to skip at the beginning of each profiling cycle (ProfilerAction.NONE). Each cycle includes wait + warmup + active steps.
- warmup: int = 1#
The number of training steps to do profiling warmup (ProfilerAction.WARMUP) in each profiling cycle.
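A minimal sketch enabling a profiling schedule; the save directory is a placeholder:

from oumi.core.configs.params.profiler_params import (
    ProfilerParams,
    ProfilerScheduleParams,
)

profiler = ProfilerParams(
    save_dir="profiler_output",  # placeholder directory
    enable_cpu_profiling=True,
    enable_cuda_profiling=True,
    schedule=ProfilerScheduleParams(
        enable_schedule=True,
        skip_first=10,  # ignore the first training steps
        wait=5,
        warmup=1,
        active=3,
        repeat=2,  # two wait + warmup + active cycles in total
    ),
)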
oumi.core.configs.params.remote_params module#
- class oumi.core.configs.params.remote_params.RemoteParams(api_url: str | None = None, api_key: str | None = None, api_key_env_varname: str | None = None, max_retries: int = 3, connection_timeout: float = 300.0, num_workers: int = 1, politeness_policy: float = 0.0, batch_completion_window: str | None = '24h')[source]#
Bases:
BaseParams
Parameters for running inference against a remote API.
- api_key: str | None = None#
API key to use for authentication.
- api_key_env_varname: str | None = None#
Name of the environment variable containing the API key for authentication.
- api_url: str | None = None#
URL of the API endpoint to use for inference.
- batch_completion_window: str | None = '24h'#
Time window for batch completion. Currently only ‘24h’ is supported.
Only used for batch inference.
- connection_timeout: float = 300.0#
Timeout in seconds for a request to an API.
- max_retries: int = 3#
Maximum number of retries to attempt when calling an API.
- num_workers: int = 1#
Number of workers to use for parallel inference.
- politeness_policy: float = 0.0#
Politeness policy to use when calling an API.
If greater than zero, this is the amount of time in seconds a worker will sleep before making a subsequent request.
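A minimal sketch for a remote API; the endpoint URL and environment variable name are placeholders:

from oumi.core.configs.params.remote_params import RemoteParams

remote = RemoteParams(
    api_url="https://api.example.com/v1/chat/completions",  # placeholder endpoint
    api_key_env_varname="MY_API_KEY",                       # placeholder env var name
    num_workers=4,
    politeness_policy=0.5,  # each worker sleeps 0.5 s between requests
    max_retries=3,
)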
oumi.core.configs.params.telemetry_params module#
- class oumi.core.configs.params.telemetry_params.TelemetryParams(telemetry_dir: str | None = 'telemetry', collect_telemetry_for_all_ranks: bool = False, track_gpu_temperature: bool = False)[source]#
Bases:
BaseParams
- collect_telemetry_for_all_ranks: bool = False#
Whether to collect telemetry for all ranks.
By default, only the main rank’s telemetry stats are collected and saved.
- telemetry_dir: str | None = 'telemetry'#
Directory where the telemetry data will be saved to.
If not specified, then telemetry files will be written under output_dir. If a relative path is specified, then files will be written in a telemetry_dir sub-directory in output_dir.
- track_gpu_temperature: bool = False#
Whether to record GPU temperature.
If collect_telemetry_for_all_ranks is False, only the first GPU’s temperature is tracked. Otherwise, temperature is recorded for all GPUs.
oumi.core.configs.params.training_params module#
- class oumi.core.configs.params.training_params.MixedPrecisionDtype(value)[source]#
Bases:
str, Enum
Enum representing the dtype used for mixed precision training.
For more details on mixed-precision training, see: https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
- BF16 = 'bf16'#
Similar to fp16 mixed precision, but with bf16 instead.
This requires Ampere or higher NVIDIA architecture, or using CPU or Ascend NPU.
- FP16 = 'fp16'#
fp16 mixed precision.
Requires ModelParams.torch_dtype (the dtype of the model weights) to be fp32. The model weights and optimizer state are fp32, but some ops will run in fp16 to improve training speed.
- NONE = 'none'#
No mixed precision.
Uses ModelParams.torch_dtype as the dtype for all tensors (model weights, optimizer state, activations, etc.).
- class oumi.core.configs.params.training_params.SchedulerType(value)[source]#
Bases:
str, Enum
Enum representing the supported learning rate schedulers.
For optional args for each scheduler, see src/oumi/builders/lr_schedules.py.
- CONSTANT = 'constant'#
Constant scheduler.
Keeps the learning rate constant throughout training.
- COSINE = 'cosine'#
Cosine scheduler.
Decays the learning rate following the decreasing part of a cosine curve.
- COSINE_WITH_MIN_LR = 'cosine_with_min_lr'#
Cosine with a minimum learning rate scheduler.
Similar to cosine scheduler, but maintains a minimum learning rate at the end.
- COSINE_WITH_RESTARTS = 'cosine_with_restarts'#
Cosine with restarts scheduler.
Decays the learning rate following a cosine curve with periodic restarts.
- LINEAR = 'linear'#
Linear scheduler.
Decreases the learning rate linearly from the initial value to 0 over the course of training.
- class oumi.core.configs.params.training_params.TrainerType(value)[source]#
Bases:
Enum
Enum representing the supported trainers.
- HF = 'hf'#
Generic HuggingFace trainer from transformers library.
This is the standard trainer provided by the Hugging Face Transformers library, suitable for a wide range of training tasks.
- OUMI = 'oumi'#
Custom generic trainer implementation.
This is a custom trainer implementation specific to the Oumi project, designed to provide additional flexibility and features.
- TRL_DPO = 'trl_dpo'#
Direct Preference Optimization trainer from trl library.
This trainer implements the Direct Preference Optimization algorithm for fine-tuning language models based on human preferences.
- TRL_GRPO = 'trl_grpo'#
Group Relative Policy Optimization trainer from trl library.
This trainer implements the Group Relative Policy Optimization algorithm introduced in the paper https://arxiv.org/pdf/2402.03300 for fine-tuning language models. Optionally, supports user-defined reward functions.
- TRL_SFT = 'trl_sft'#
Supervised fine-tuning trainer from trl library.
This trainer is specifically designed for supervised fine-tuning tasks using the TRL (Transformer Reinforcement Learning) library.
- VERL_GRPO = 'verl_grpo'#
Group Relative Policy Optimization trainer from verl library.
This trainer implements the Group Relative Policy Optimization algorithm introduced in the paper https://arxiv.org/pdf/2402.03300 for fine-tuning language models. Optionally, supports user-defined reward functions.
- class oumi.core.configs.params.training_params.TrainingParams(use_peft: bool = False, trainer_type: oumi.core.configs.params.training_params.TrainerType = <TrainerType.HF: 'hf'>, enable_gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: dict[str, typing.Any] = <factory>, output_dir: str = 'output', per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, gradient_accumulation_steps: int = 1, max_steps: int = -1, num_train_epochs: int = 3, save_epoch: bool = False, save_steps: int = 500, save_final_model: bool = True, seed: int = 42, data_seed: int = 42, use_deterministic: bool = False, full_determinism: bool = False, run_name: Optional[str] = None, metrics_function: Optional[str] = None, reward_functions: Optional[list[str]] = None, grpo: oumi.core.configs.params.grpo_params.GrpoParams = <factory>, log_level: str = 'info', dep_log_level: str = 'warning', enable_wandb: bool = False, enable_mlflow: bool = False, enable_tensorboard: bool = True, logging_strategy: str = 'steps', logging_dir: Optional[str] = None, logging_steps: int = 50, logging_first_step: bool = False, eval_strategy: str = 'no', eval_steps: int = 500, learning_rate: float = 5e-05, lr_scheduler_type: str = 'linear', lr_scheduler_kwargs: dict[str, typing.Any] = <factory>, warmup_ratio: Optional[float] = None, warmup_steps: Optional[int] = None, optimizer: str = 'adamw_torch', weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, sgd_momentum: float = 0.0, mixed_precision_dtype: oumi.core.configs.params.training_params.MixedPrecisionDtype = <MixedPrecisionDtype.NONE: 'none'>, compile: bool = False, include_performance_metrics: bool = False, include_alternative_mfu_metrics: bool = False, log_model_summary: bool = False, resume_from_checkpoint: Optional[str] = None, try_resume_from_last_checkpoint: bool = False, dataloader_num_workers: Union[int, str] = 0, dataloader_persistent_workers: bool = False, dataloader_prefetch_factor: Optional[int] = None, dataloader_main_process_only: Optional[bool] = None, ddp_find_unused_parameters: Optional[bool] = None, max_grad_norm: Optional[float] = 1.0, trainer_kwargs: dict[str, typing.Any] = <factory>, verl_config_overrides: dict[str, typing.Any] = <factory>, profiler: oumi.core.configs.params.profiler_params.ProfilerParams = <factory>, telemetry: oumi.core.configs.params.telemetry_params.TelemetryParams = <factory>, empty_device_cache_steps: Optional[int] = None, nccl_default_timeout_minutes: Optional[float] = None, label_ignore_index: Optional[int] = None)[source]#
Bases:
BaseParams
- adam_beta1: float = 0.9#
The beta1 parameter for Adam-based optimizers.
Exponential decay rate for the first moment estimates. Default is 0.9.
- adam_beta2: float = 0.999#
The beta2 parameter for Adam-based optimizers.
Exponential decay rate for the second moment estimates. Default is 0.999.
- adam_epsilon: float = 1e-08#
Epsilon parameter for Adam-based optimizers.
Small constant for numerical stability. Default is 1e-08.
- compile: bool = False#
Whether to JIT compile the model.
This parameter should be used instead of ModelParams.compile for training.
- data_seed: int = 42#
Random seed used for data sampling. This seed is used for the underlying generator when use_seedable_sampler is enabled. If None, the generator will use the current default seed from torch. Used only by the HuggingFace trainers.
- dataloader_main_process_only: bool | None = None#
Controls whether the dataloader is iterated through on the main process only.
If set to True, the dataloader is only iterated through on the main process (rank 0), then the batches are split and broadcast to each process. This can reduce the number of requests to the dataset, and helps ensure that each example is seen by max one GPU per epoch, but may become a performance bottleneck if a large number of GPUs is used.
If set to False, the dataloader is iterated through on each GPU process.
If set to None (default), then True or False is auto-selected based on heuristics (properties of dataset, the number of nodes and/or GPUs, etc).
NOTE: We recommend to benchmark your setup, and configure True or False.
- dataloader_num_workers: int | str = 0#
Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.
You can also use the special value “auto” to select the number of dataloader workers using a simple heuristic based on the number of CPUs and GPUs per node. Note that accurately estimating the optimal number of workers is difficult and depends on many factors (the properties of the model, dataset, VM, network, etc.), so you can start with “auto” and then experimentally tune the exact number for your specific case. If “auto” is requested, then at minimum 1 worker is guaranteed to be assigned.
- dataloader_persistent_workers: bool = False#
Whether to use persistent workers for data loading (HF Trainers only). If True, the data loader will not shut down the worker processes after a dataset has been consumed once, keeping the workers’ Dataset instances alive. This can potentially speed up training, but will increase RAM usage. Defaults to False.
- dataloader_prefetch_factor: int | None = None#
Number of batches loaded in advance by each worker.
2 means there will be a total of 2 * num_workers batches prefetched across all workers.
This is only used if dataloader_num_workers >= 1.
- ddp_find_unused_parameters: bool | None = None#
When using PyTorch’s DistributedDataParallel training, the value of this flag is passed to find_unused_parameters.
Will default to False if gradient checkpointing is used, True otherwise.
- dep_log_level: str = 'warning'#
The logging level for dependency loggers (e.g., HuggingFace, PyTorch).
Possible values are “debug”, “info”, “warning”, “error”, “critical”.
- empty_device_cache_steps: int | None = None#
Number of steps to wait before calling torch.<device>.empty_cache().
This parameter determines how frequently the GPU cache should be cleared during training. If set, it will trigger cache clearing every empty_device_cache_steps. If left as None, the cache will not be emptied automatically.
Setting this can help manage GPU memory usage, especially for large models or long training runs, but may impact performance if set too low.
- enable_gradient_checkpointing: bool = False#
Whether to enable gradient checkpointing to save memory at the expense of speed.
Gradient checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward pass, it recomputes these activations during the backward pass. This can make the training slower, but it can also significantly reduce memory usage.
- enable_mlflow: bool = False#
Whether to enable MLflow logging.
If True, MLflow will be used for experiment tracking and visualization. If you want to use MLflow, you must set the MLFLOW_TRACKING_URI environment variable to specify the tracking server URI and the MLFLOW_EXPERIMENT_ID or MLFLOW_EXPERIMENT_NAME environment variable to specify the experiment to report the run to.
- enable_tensorboard: bool = True#
Whether to enable TensorBoard logging.
If True, TensorBoard will be used for logging metrics and visualizations.
- enable_wandb: bool = False#
Whether to enable Weights & Biases (wandb) logging.
If True, wandb will be used for experiment tracking and visualization. Wandb will also log a summary of the training run, including hyperparameters, metrics, and other relevant information at the end of training.
After enabling, you must set the WANDB_API_KEY environment variable. Alternatively, you can use the wandb login command to authenticate.
- eval_steps: int = 500#
Number of update steps between two evaluations if eval_strategy=”steps”.
Ignored if eval_strategy is not “steps”.
- eval_strategy: str = 'no'#
The strategy to use for evaluation during training.
Possible values:
- “no”: No evaluation is done during training.
- “steps”: Evaluation is done every eval_steps.
- “epoch”: Evaluation is done at the end of each epoch.
- full_determinism: bool = False#
If True, enable_full_determinism() is called instead of set_seed() to ensure reproducible results in distributed training. This will only affect HF trainers. Important: this will negatively impact performance, so only use it for debugging.
- gradient_accumulation_steps: int = 1#
Number of update steps to accumulate before performing a backward/update pass.
This technique allows for effectively larger batch sizes and is especially useful when such batch sizes would not fit in memory. It works by accumulating gradients from multiple forward passes before performing a single optimization step. Note, however, that setting this to a value greater than 1 can increase memory usage for training setups without existing gradient accumulation buffers (e.g., single-GPU training).
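For example, the effective global batch size in a standard data-parallel setup can be estimated as follows (values are illustrative):

    per_device_train_batch_size = 8
    gradient_accumulation_steps = 4
    num_devices = 2  # e.g., 2 GPUs

    effective_batch_size = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    print(effective_batch_size)  # 64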
- gradient_checkpointing_kwargs: dict[str, Any]#
Keyword arguments for gradient checkpointing.
The use_reentrant parameter is required and is recommended to be set to False. For more details, see: https://pytorch.org/docs/stable/checkpoint.html
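An illustrative value, following the recommendation above:

    # use_reentrant is required; False is the recommended setting.
    gradient_checkpointing_kwargs = {"use_reentrant": False}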
- grpo: GrpoParams#
Parameters for GRPO training.
- include_alternative_mfu_metrics: bool = False#
Whether to report alternative MFU (Model FLOPs Utilization) metrics.
These metrics are based on HuggingFace’s total_flos. This option is only used if include_performance_metrics is True.
- include_performance_metrics: bool = False#
Whether to include performance metrics such as token statistics.
- label_ignore_index: int | None = None#
Tokens with this label value don’t contribute to the loss computation. For example, this can be PAD, or image tokens. -100 is the PyTorch convention. Refer to the ignore_index parameter of torch.nn.CrossEntropyLoss() for more details.
If unspecified (None), then the default model-specific preferences configured in Oumi may be used.
Users should only set label_ignore_index if the default behavior is not satisfactory, or for new models not yet fully-integrated by Oumi.
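For illustration, the PyTorch-level behavior this corresponds to:

    import torch

    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)
    logits = torch.randn(4, 10)                 # 4 tokens, vocabulary of 10
    labels = torch.tensor([1, -100, 3, -100])   # -100 marks ignored positions
    loss = loss_fn(logits, labels)              # ignored positions add no loss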
- learning_rate: float = 5e-05#
The initial learning rate for the optimizer.
This value can be adjusted by the learning rate scheduler during training.
- log_level: str = 'info'#
The logging level for the main Oumi logger.
Possible values are “debug”, “info”, “warning”, “error”, “critical”.
- log_model_summary: bool = False#
Whether to print a model summary, including layer names.
- logging_dir: str | None = None#
The directory where training logs will be saved.
This includes TensorBoard logs and other training-related output.
- logging_first_step: bool = False#
Whether to log and evaluate the first global step.
If True, metrics will be logged and evaluation will be performed at the very beginning of training. Skipping the first step can be useful to avoid logging and evaluation of the initial random model.
The first step is usually not representative of the model’s performance, as it includes model compilation, optimizer initialization, and other setup steps.
- logging_steps: int = 50#
Number of update steps between two logs if logging_strategy=”steps”.
Ignored if logging_strategy is not “steps”.
- logging_strategy: str = 'steps'#
The strategy to use for logging during training.
Possible values are:
- “steps”: Log every logging_steps steps.
- “epoch”: Log at the end of each epoch.
- “no”: Disable logging.
- lr_scheduler_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the learning rate scheduler.
These arguments can be used to fine-tune the behavior of the chosen scheduler.
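An illustrative value (the accepted keys depend on the chosen scheduler; check the corresponding scheduler builder for the kwargs it actually accepts):

    # Hypothetical example for a cosine-with-restarts style scheduler.
    lr_scheduler_kwargs = {"num_cycles": 2}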
- lr_scheduler_type: str = 'linear'#
The type of learning rate scheduler to use.
Possible values include “linear”, “cosine”, “cosine_with_restarts”, “cosine_with_min_lr”, and “constant”.
See src/oumi/builders/lr_schedules.py for more details on each scheduler.
- max_grad_norm: float | None = 1.0#
Maximum gradient norm (for gradient clipping) to avoid exploding gradients, which can destabilize training.
Defaults to 1.0. When set to 0.0 or None, gradient clipping will not be applied.
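For illustration, the equivalent PyTorch-level operation for max_grad_norm=1.0:

    import torch

    model = torch.nn.Linear(4, 2)
    loss = model(torch.randn(3, 4)).sum()
    loss.backward()
    # Clip gradients so their global norm does not exceed 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)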
- max_steps: int = -1#
If set to a positive number, the total number of training steps to perform.
This parameter overrides num_train_epochs. If set to -1 (default), the number of training steps is determined by num_train_epochs.
- metrics_function: str | None = None#
The name of the metrics function in the Oumi registry to use for evaluation during training.
The method must accept as input a HuggingFace EvalPrediction and return a dictionary of metrics, with string keys mapping to metric values. A single metrics_function may compute multiple metrics.
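A hypothetical example of such a function (registration in the Oumi registry is not shown):

    import numpy as np
    from transformers import EvalPrediction

    # Hypothetical metrics function; name and logic are illustrative only.
    def token_accuracy_metrics(eval_pred: EvalPrediction) -> dict[str, float]:
        predictions = np.argmax(eval_pred.predictions, axis=-1)
        accuracy = float((predictions == eval_pred.label_ids).mean())
        return {"accuracy": accuracy}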
- mixed_precision_dtype: MixedPrecisionDtype = 'none'#
The data type to use for mixed precision training.
Default is NONE, which means no mixed precision is used.
- nccl_default_timeout_minutes: float | None = None#
Default timeout for NCCL operations in minutes.
See: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
If unset, the default value of torch.distributed.init_process_group (10 minutes) will be used.
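For reference, a 30-minute timeout roughly corresponds to the following PyTorch-level call (shown for illustration only; it requires a properly configured distributed environment with NCCL available to actually run):

    from datetime import timedelta
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))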
- num_train_epochs: int = 3#
Total number of training epochs to perform (if max_steps is not specified).
An epoch is one complete pass through the entire training dataset. This parameter is ignored if max_steps is set to a positive number.
- optimizer: str = 'adamw_torch'#
The optimizer to use for training.
See pytorch documentation for more information on available optimizers: https://pytorch.org/docs/stable/optim.html
Default is “adamw_torch” (AdamW implemented by PyTorch).
- output_dir: str = 'output'#
Directory where the output files will be saved.
This includes checkpoints, evaluation results, and any other artifacts produced during the training process.
- per_device_eval_batch_size: int = 8#
Number of samples per batch on each device during evaluation.
Similar to per_device_train_batch_size, but used during evaluation phases. Can often be set higher than the train batch size as no gradients are stored.
- per_device_train_batch_size: int = 8#
Number of samples per batch on each device during training.
This parameter directly affects memory usage and training speed. Larger batch sizes generally lead to better utilization of GPU compute capabilities but require more memory.
- profiler: ProfilerParams#
Parameters for performance profiling.
This field contains configuration options for the profiler, which can be used to analyze the performance of the training process. It uses the ProfilerParams class to define specific profiling settings.
- resume_from_checkpoint: str | None = None#
Path to a checkpoint folder from which to resume training.
If specified, training will resume by first loading the model from this folder.
- reward_functions: list[str] | None = None#
The names of the reward functions in the Oumi registry to use for reinforcement learning.
Only supported with the TRL_GRPO and VERL_GRPO trainers. Currently, VERL_GRPO only supports specifying a single reward function.
For TRL_GRPO, refer to https://huggingface.co/docs/trl/main/en/grpo_trainer for documentation about the function signature.
For VERL_GRPO, refer to https://verl.readthedocs.io/en/latest/preparation/reward_function.html for documentation about the function signature.
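A hypothetical TRL-style reward function (the exact signature depends on your TRL version, and registration in the Oumi registry is not shown):

    # Illustrative only: favors shorter completions.
    def brevity_reward(completions, **kwargs) -> list[float]:
        return [-float(len(c)) for c in completions]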
- run_name: str | None = None#
A unique identifier for the current training run.
This name is used to identify the run in logging outputs, saved model checkpoints, and experiment tracking tools like Weights & Biases or TensorBoard. It’s particularly useful when running multiple experiments or when you want to easily distinguish between different training sessions.
- save_epoch: bool = False#
Save a checkpoint at the end of every epoch.
When set to True, this ensures that a model checkpoint is saved after each complete pass through the training data. This can be useful for tracking model progress over time and for resuming training from a specific epoch if needed.
If both save_steps and save_epoch are set, then save_steps takes precedence.
- save_final_model: bool = True#
Whether to save the model at the end of training.
For different options for saving PEFT models, see PeftParams.peft_save_mode. This should normally be set to True to ensure the final trained model is saved. However, in some cases, you may want to disable it, for example:
- If saving a large model which takes a long time
- When quickly testing training speed or metrics
- During debugging or experimentation phases
- save_steps: int = 500#
Save a checkpoint every save_steps training steps.
This parameter determines the frequency of saving checkpoints during training based on the number of steps. If both save_steps and save_epoch are set, then save_steps takes precedence.
To disable saving checkpoints during training, set save_steps to 0 and save_epoch to False. If enabled, a checkpoint will be saved at the end of training if there are any residual steps left.
- seed: int = 42#
Random seed used for initialization.
This seed is passed to the trainer and to all downstream dependencies to ensure reproducibility of results. It affects random number generation in various parts of the training process, including data shuffling, weight initialization, and any stochastic operations.
- sgd_momentum: float = 0.0#
Momentum factor for SGD optimizer.
Only used when optimizer is set to “sgd”, and when trainer_type is set to OUMI. Default is 0.0.
- telemetry: TelemetryParams#
Parameters for telemetry.
This field contains telemetry configuration options.
- property telemetry_dir: Path | None#
Returns the telemetry stats output directory.
- trainer_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the HF/TRL Trainer.
This allows for customization of the Trainer beyond the standard parameters defined in this class. Any key-value pairs added here will be passed directly to the Trainer’s constructor. Note that this field is only used for HuggingFace and TRL trainers (TRL_SFT, TRL_DPO, TRL_GRPO, HF).
- trainer_type: TrainerType = 'hf'#
The type of trainer to use for the training process.
Options are defined in the TrainerType enum and include:
- HF: HuggingFace’s Trainer
- TRL_SFT: TRL’s SFT Trainer
- TRL_DPO: TRL’s DPO Trainer
- TRL_GRPO: TRL’s GRPO Trainer
- OUMI: Custom generic trainer implementation
- VERL_GRPO: verl’s GRPO Trainer
- try_resume_from_last_checkpoint: bool = False#
If True, attempt to resume from the last checkpoint in “output_dir”.
If a checkpoint is found, training will resume from the model/optimizer/scheduler states loaded from this checkpoint. If no checkpoint is found, training will continue without loading any intermediate checkpoints.
Note: If resume_from_checkpoint is specified and contains a non-empty path, this parameter has no effect.
- use_deterministic: bool = False#
Whether to use deterministic algorithms for reproducibility. If set to True, this will only allow those CuDNN algorithms that are (believed to be) deterministic. Please refer to https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html for more details. If using distributed training, this will override ddp_find_unused_parameters to False, use ddp_broadcast_buffers, and disable gradient checkpointing. Note that this will not guarantee full reproducibility, but it will help to reduce the variance between runs.
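For reference, the PyTorch-level switches this option roughly corresponds to (an illustrative sketch, not Oumi’s exact setup code):

    import torch

    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True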
- use_peft: bool = False#
Whether to use Parameter-Efficient Fine-Tuning (PEFT) techniques.
PEFT methods allow for efficient adaptation of pre-trained language models to specific tasks by only updating a small number of (extra) model parameters. This can significantly reduce memory usage and training time.
- verl_config_overrides: dict[str, Any]#
Values to override in the verl config.
This field is only used for the VERL_GRPO trainer. To see supported params in verl, see: https://verl.readthedocs.io/en/latest/examples/config.html
The verl config is a nested dict, so the kwargs should be structured accordingly. For example, to set actor_rollout_ref.actor.use_kl_loss to True, you can use: {“actor_rollout_ref”: {“actor”: {“use_kl_loss”: True}}}.
The priority of setting verl config params, from highest to lowest, is:
1. Values specified by this field.
2. Values automatically set by Oumi in src/oumi/core/trainers/verl_grpo_trainer.py:_create_config() for verl params which have corresponding Oumi params. For example, Oumi’s training.output_dir -> verl’s trainer.default_local_dir.
3. Default verl config values in src/oumi/core/trainers/verl_trainer_config.yaml.
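The override example above, written out as a nested Python dict for clarity:

    verl_config_overrides = {
        "actor_rollout_ref": {
            "actor": {
                "use_kl_loss": True,
            },
        },
    }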
- warmup_ratio: float | None = None#
The ratio of total training steps used for a linear warmup from 0 to the learning rate.
If set along with warmup_steps, this value will be ignored.
- warmup_steps: int | None = None#
The number of steps for the warmup phase of the learning rate scheduler.
If set, will override the value of warmup_ratio.
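A small arithmetic sketch of how the two warmup options interact (values are illustrative):

    total_training_steps = 1_000
    warmup_ratio = 0.1
    warmup_steps_from_ratio = int(total_training_steps * warmup_ratio)  # 100 steps
    # If warmup_steps were also set (e.g., 200), it would take precedence
    # and warmup_ratio would be ignored.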
- weight_decay: float = 0.0#
Weight decay (L2 penalty) to apply to the model’s parameters.
In the HF trainers and the OUMI trainer, this is automatically applied only to weight tensors and skips biases/layernorms.
Default is 0.0 (no weight decay).
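A minimal sketch of the common “skip biases/norms” parameter-grouping pattern (Oumi’s actual grouping logic may differ in detail):

    import torch

    def build_param_groups(model: torch.nn.Module, weight_decay: float):
        decay, no_decay = [], []
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            # 1-D tensors cover biases and layernorm weights.
            if param.ndim <= 1 or name.endswith(".bias"):
                no_decay.append(param)
            else:
                decay.append(param)
        return [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ]

    optimizer = torch.optim.AdamW(build_param_groups(torch.nn.Linear(8, 2), 0.01))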