oumi.core.configs.params#
Submodules#
oumi.core.configs.params.base_params module#
- class oumi.core.configs.params.base_params.BaseParams[source]#
Bases:
object
Base class for all parameter classes.
This class provides a common interface for all parameter classes, and provides a finalize_and_validate method to recursively validate the parameters.
Subclasses should implement the __finalize_and_validate__ method to perform custom validation logic.
- __finalize_and_validate__() None [source]#
Finalizes and validates the parameters of this object.
This method can be overridden by subclasses to implement custom validation logic.
In case of validation errors, this method should raise a ValueError or other appropriate exception.
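For illustration, a minimal sketch of a custom parameter class. The class and field names here are hypothetical, and it is assumed that subclasses are dataclasses, as the parameter classes in this module are:

from dataclasses import dataclass

from oumi.core.configs.params.base_params import BaseParams


@dataclass
class MyCustomParams(BaseParams):
    # Hypothetical field, for illustration only.
    num_workers: int = 1

    def __finalize_and_validate__(self) -> None:
        # Invoked (recursively) via finalize_and_validate(); raise on invalid values.
        if self.num_workers < 1:
            raise ValueError("num_workers must be >= 1.")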
oumi.core.configs.params.data_params module#
- class oumi.core.configs.params.data_params.DataParams(train: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>, test: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>, validation: oumi.core.configs.params.data_params.DatasetSplitParams = <factory>)[source]#
Bases:
BaseParams
- get_split(split: DatasetSplit) DatasetSplitParams [source]#
A public getter for individual dataset splits.
- test: DatasetSplitParams#
The input datasets used for testing. This field is currently unused.
- train: DatasetSplitParams#
The input datasets used for training.
- validation: DatasetSplitParams#
The input datasets used for validation.
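A minimal usage sketch; the dataset name below is only a placeholder:

from oumi.core.configs.params.data_params import (
    DataParams,
    DatasetParams,
    DatasetSplit,
    DatasetSplitParams,
)

data_params = DataParams(
    train=DatasetSplitParams(
        datasets=[DatasetParams(dataset_name="yahma/alpaca-cleaned")]  # placeholder dataset
    )
)

# Retrieve the parameters for a specific split via the public getter.
train_split = data_params.get_split(DatasetSplit.TRAIN)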
- class oumi.core.configs.params.data_params.DatasetParams(dataset_name: str = '???', dataset_path: Optional[str] = None, subset: Optional[str] = None, split: str = 'train', dataset_kwargs: dict[str, typing.Any] = <factory>, sample_count: Optional[int] = None, mixture_proportion: Optional[float] = None, shuffle: bool = False, seed: Optional[int] = None, shuffle_buffer_size: int = 1000, trust_remote_code: bool = False, transform_num_workers: Union[int, str, NoneType] = None)[source]#
Bases:
BaseParams
- dataset_kwargs: dict[str, Any]#
Keyword arguments to pass to the dataset constructor.
These arguments will be passed directly to the dataset constructor.
- dataset_name: str = '???'#
The name of the dataset to load. Required.
This field is used to retrieve the appropriate class from the dataset registry that can be used to instantiate and preprocess the data.
If dataset_path is not specified, then the raw data will be automatically downloaded from the huggingface hub or oumi registry. Otherwise, the dataset will be loaded from the specified dataset_path.
- dataset_path: str | None = None#
The path to the dataset to load.
This can be used to load a dataset of type dataset_name from a custom path.
If dataset_path is not specified, then the raw data will be automatically downloaded from the huggingface hub or oumi registry.
- mixture_proportion: float | None = None#
The proportion of examples from this dataset relative to other datasets in the mixture.
If specified, all datasets must supply this value. Must be a float in the range [0, 1.0]. The mixture_proportion for all input datasets must sum to 1.
Examples are sampled after the dataset has been sampled using sample_count if specified.
- sample_count: int | None = None#
The number of examples to sample from the dataset.
Must be non-negative. If sample_count is larger than the size of the dataset, then the required additional examples are sampled by looping over the original dataset.
- seed: int | None = None#
The random seed used for shuffling the dataset before sampling.
If set to None, shuffling will be non-deterministic.
- shuffle: bool = False#
Whether to shuffle the dataset before any sampling occurs.
- shuffle_buffer_size: int = 1000#
The size of the shuffle buffer used for shuffling the dataset before sampling.
- split: str = 'train'#
The split of the dataset to load.
This is typically one of “train”, “test”, or “validation”. Defaults to “train”.
- subset: str | None = None#
The subset of the dataset to load.
This is usually a subfolder within the dataset root.
- transform_num_workers: int | str | None = None#
Number of subprocesses to use for dataset post-processing (ds.transform()).
Multiprocessing is disabled by default (None).
You can also use the special value “auto” to let oumi automatically select the number of subprocesses.
Using multiple processes can speed up processing, e.g., for large or multi-modal datasets.
The parameter is only supported for Map (non-iterable) datasets.
- trust_remote_code: bool = False#
Whether to trust remote code when loading the dataset.
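A minimal construction sketch; the dataset name and split below are placeholders for illustration:

from oumi.core.configs.params.data_params import DatasetParams

dataset = DatasetParams(
    dataset_name="HuggingFaceH4/ultrachat_200k",  # placeholder HF Hub dataset id
    split="train_sft",
    sample_count=10_000,  # loops over the dataset if it has fewer examples
    shuffle=True,
    seed=42,              # makes pre-sampling shuffling deterministic
    trust_remote_code=False,
)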
- class oumi.core.configs.params.data_params.DatasetSplit(value)[source]#
Bases:
Enum
Enum representing the split for a dataset.
- TEST = 'test'#
- TRAIN = 'train'#
- VALIDATION = 'validation'#
- class oumi.core.configs.params.data_params.DatasetSplitParams(datasets: list[oumi.core.configs.params.data_params.DatasetParams] = <factory>, collator_name: Optional[str] = None, pack: bool = False, stream: bool = False, target_col: Optional[str] = None, mixture_strategy: str = 'first_exhausted', seed: Optional[int] = None, use_async_dataset: bool = False, use_torchdata: Optional[bool] = None)[source]#
Bases:
BaseParams
- collator_name: str | None = None#
Name of Oumi data collator.
Data collator controls how to form a mini-batch from individual dataset elements.
Valid options are:
- “text_with_padding”: Dynamically pads the inputs received to
the longest length.
- “vision_language_with_padding”: Uses VisionLanguageCollator
for image+text multi-modal data.
If None, then a default collator will be assigned.
- datasets: list[DatasetParams]#
The datasets in this split.
- mixture_strategy: str = 'first_exhausted'#
The strategy for mixing multiple datasets.
When multiple datasets are provided, this parameter determines how they are combined. Two strategies are available:
FIRST_EXHAUSTED: Samples from all datasets until one is fully represented in the mixture. This is the default strategy.
ALL_EXHAUSTED: Samples from all datasets until each one is fully represented in the mixture. This may lead to significant oversampling.
- pack: bool = False#
Whether to pack the text into constant-length chunks.
Each chunk will be the size of the model’s max input length. This will stream the dataset, and tokenize on the fly if the dataset isn’t already tokenized (i.e. has an input_ids column).
- seed: int | None = None#
The random seed used for mixing this dataset split, if specified.
If set to None, mixing will be non-deterministic.
- stream: bool = False#
Whether to stream the dataset.
- target_col: str | None = None#
The dataset column name containing the input for training/testing/validation.
- Deprecated:
This parameter is deprecated and will be removed in the future.
- use_async_dataset: bool = False#
Whether to use the PretrainingAsyncTextDataset instead of ConstantLengthDataset.
- Deprecated:
This parameter is deprecated and will be removed in the future.
- use_torchdata: bool | None = None#
Whether to use the torchdata library for dataset loading and processing.
If set to None, this setting may be auto-inferred.
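A minimal sketch of a mixed split; the dataset names are placeholders, and the lowercase string value for ALL_EXHAUSTED is assumed:

from oumi.core.configs.params.data_params import DatasetParams, DatasetSplitParams

train_split = DatasetSplitParams(
    datasets=[
        DatasetParams(dataset_name="dataset_a", mixture_proportion=0.7),  # placeholder names
        DatasetParams(dataset_name="dataset_b", mixture_proportion=0.3),
    ],  # proportions must sum to 1 when specified
    mixture_strategy="all_exhausted",  # assumed string value of ALL_EXHAUSTED
    seed=123,  # makes mixing deterministic
)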
oumi.core.configs.params.evaluation_params module#
- class oumi.core.configs.params.evaluation_params.AlpacaEvalTaskParams(evaluation_backend: str = '', task_name: str | None = None, num_samples: int | None = None, log_samples: bool | None = False, eval_kwargs: dict[str, ~typing.Any] = <factory>, evaluation_platform: str | None = '', version: float | None = 2.0)[source]#
Bases:
EvaluationTaskParams
Parameters for the AlpacaEval evaluation framework.
AlpacaEval is an LLM-based automatic evaluation suite that is fast, cheap, replicable, and validated against 20K human annotations. The latest version (AlpacaEval 2.0) contains 805 open-ended prompts (tatsu-lab/alpaca_eval). A model annotator (judge) evaluates the quality of the model’s responses to these questions and calculates win rates against reference responses. The default judge is GPT-4 Turbo.
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- version: float | None = 2.0#
The version of AlpacaEval to use. Options: 1.0 or 2.0 (default).
- class oumi.core.configs.params.evaluation_params.EvaluationBackend(value)[source]#
Bases:
Enum
Enum representing the evaluation backend to use.
- ALPACA_EVAL = 'alpaca_eval'#
- CUSTOM = 'custom'#
- LM_HARNESS = 'lm_harness'#
- class oumi.core.configs.params.evaluation_params.EvaluationTaskParams(evaluation_backend: str = '', task_name: str | None = None, num_samples: int | None = None, log_samples: bool | None = False, eval_kwargs: dict[str, ~typing.Any] = <factory>, evaluation_platform: str | None = '')[source]#
Bases:
BaseParams
Configuration parameters for model evaluation tasks.
Supported backends:
LM Harness: Framework for evaluating language models on standard benchmarks. A list of all supported tasks can be found at: EleutherAI/lm-evaluation-harness.
Alpaca Eval: Framework for evaluating language models on instruction-following and quality of responses on open-ended questions.
Custom: Users can register their own evaluation functions using the decorator @register_evaluation_function. The task_name should be the registry key for the custom evaluation function to be used.
Examples
# LM Harness evaluation on MMLU
params = EvaluationTaskParams(
    evaluation_backend="lm_harness",
    task_name="mmlu",
    eval_kwargs={"num_fewshot": 5},
)

# Alpaca Eval 2.0 evaluation
params = EvaluationTaskParams(
    evaluation_backend="alpaca_eval",
)

# Custom evaluation
@register_evaluation_function("my_evaluation_function")
def my_evaluation(task_params, config):
    accuracy = ...
    return EvaluationResult(task_result={"accuracy": accuracy})

params = EvaluationTaskParams(
    task_name="my_evaluation_function",
    evaluation_backend="custom",
)
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- evaluation_backend: str = ''#
The evaluation backend to use for the current task.
- evaluation_platform: str | None = ''#
DEPRECATED; Please use evaluation_backend instead.
- get_evaluation_backend() EvaluationBackend [source]#
Returns the evaluation backend as an Enum.
- static list_evaluation_backends() str [source]#
Returns a string listing all available evaluation backends.
- log_samples: bool | None = False#
Whether to log the samples used for evaluation.
If not set (False): the model samples used for evaluation will not be logged. If set to True: the model samples generated during inference and used for evaluation will be logged in backend_config.json. The backend may also log other intermediate results related to inference.
- num_samples: int | None = None#
Number of samples/examples to evaluate from this dataset.
Mostly for debugging, in order to reduce the runtime. If not set (None): the entire dataset is evaluated. If set, this must be a positive integer.
- task_name: str | None = None#
The task to evaluate or the custom evaluation function to use.
For LM Harness evaluations (when the evaluation_backend is set to EvaluationBackend.LM_HARNESS), the task_name corresponds to a predefined task to evaluate on (e.g. “mmlu”). A list of all tasks supported by the LM Harness backend can be found by running: lm-eval --tasks list.
For custom evaluations (when evaluation_backend is set to EvaluationBackend.CUSTOM), the task_name should be the registry key for the custom evaluation function to be used. Users can register new evaluation functions using the decorator @register_evaluation_function.
- class oumi.core.configs.params.evaluation_params.LMHarnessTaskParams(evaluation_backend: str = '', task_name: str | None = None, num_samples: int | None = None, log_samples: bool | None = False, eval_kwargs: dict[str, ~typing.Any] = <factory>, evaluation_platform: str | None = '', num_fewshot: int | None = None)[source]#
Bases:
EvaluationTaskParams
Parameters for the LM Harness evaluation framework.
LM Harness is a comprehensive benchmarking suite for evaluating language models across various tasks.
- eval_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the evaluation function.
This allows for passing any evaluation-specific parameters that are not covered by other fields in TaskParams classes.
- num_fewshot: int | None = None#
Number of few-shot examples (with responses) to add in the prompt, in order to teach the model how to respond to the specific dataset’s prompts.
If not set (None): LM Harness will decide the value. If set to 0: no few-shot examples will be added in the prompt.
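A minimal sketch of an LM Harness task configuration; the values are chosen for illustration:

from oumi.core.configs.params.evaluation_params import LMHarnessTaskParams

task = LMHarnessTaskParams(
    evaluation_backend="lm_harness",
    task_name="mmlu",
    num_fewshot=5,    # add 5 solved examples to each prompt
    num_samples=200,  # evaluate a subset only, e.g. while debugging
)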
oumi.core.configs.params.fsdp_params module#
- class oumi.core.configs.params.fsdp_params.AutoWrapPolicy(value)[source]#
Bases:
str, Enum
The auto wrap policies for FullyShardedDataParallel (FSDP).
- NO_WRAP = 'NO_WRAP'#
No automatic wrapping is performed.
- SIZE_BASED_WRAP = 'SIZE_BASED_WRAP'#
Wraps layers based on parameter count.
- TRANSFORMER_BASED_WRAP = 'TRANSFORMER_BASED_WRAP'#
Wraps layers based on the transformer block layer.
- class oumi.core.configs.params.fsdp_params.BackwardPrefetch(value)[source]#
Bases:
str, Enum
The backward prefetch options for FullyShardedDataParallel (FSDP).
- BACKWARD_POST = 'BACKWARD_POST'#
Enables less overlap but requires less memory usage.
- BACKWARD_PRE = 'BACKWARD_PRE'#
Enables the most overlap but increases memory usage the most.
- NO_PREFETCH = 'NO_PREFETCH'#
Disables backward prefetching altogether.
- class oumi.core.configs.params.fsdp_params.FSDPParams(enable_fsdp: bool = False, sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD, cpu_offload: bool = False, mixed_precision: str | None = None, backward_prefetch: BackwardPrefetch = BackwardPrefetch.BACKWARD_PRE, forward_prefetch: bool = False, use_orig_params: bool | None = None, state_dict_type: StateDictType = StateDictType.FULL_STATE_DICT, auto_wrap_policy: AutoWrapPolicy = AutoWrapPolicy.NO_WRAP, min_num_params: int = 100000, transformer_layer_cls: str | None = None, sync_module_states: bool = True)[source]#
Bases:
BaseParams
Configuration options for Pytorch’s FullyShardedDataParallel (FSDP) training.
- auto_wrap_policy: AutoWrapPolicy = 'NO_WRAP'#
Policy for automatically wrapping layers in FSDP.
- backward_prefetch: BackwardPrefetch = 'BACKWARD_PRE'#
Determines when to prefetch the next set of parameters.
Improves throughput by enabling communication and computation overlap in the backward pass at the cost of slightly increased memory usage.
- Options:
- BACKWARD_PRE: Enables the most overlap but increases memory
usage the most. This prefetches the next set of parameters before the current set of parameters’ gradient computation.
- BACKWARD_POST: Enables less overlap but requires less memory
usage. This prefetches the next set of parameters after the current set of parameters’ gradient computation.
- NO_PREFETCH: Disables backward prefetching altogether. This has no overlap and
does not increase memory usage. This may degrade throughput significantly.
- cpu_offload: bool = False#
If True, offloads parameters and gradients to CPU when not in use.
- enable_fsdp: bool = False#
If True, enables FullyShardedDataParallel training.
Allows training larger models by sharding models and gradients across multiple GPUs.
- forward_prefetch: bool = False#
If True, prefetches the forward pass results.
- min_num_params: int = 100000#
Minimum number of parameters for a layer to be wrapped when using size_based policy. This has no effect when using transformer_based policy.
- mixed_precision: str | None = None#
Enables mixed precision training.
Options: None, “fp16”, “bf16”.
- sharding_strategy: ShardingStrategy = 'FULL_SHARD'#
Determines how to shard model parameters across GPUs.
See
torch.distributed.fsdp.api.ShardingStrategy
for more details.
- Options:
- FULL_SHARD: Shards model parameters, gradients, and optimizer states.
Provides the most memory efficiency but may impact performance.
- SHARD_GRAD_OP: Shards gradients and optimizer states, but not model
parameters. Balances memory savings and performance.
- HYBRID_SHARD: Shards model parameters within a node and replicates them
across nodes.
- NO_SHARD: No sharding is applied. Parameters, gradients, and optimizer states
are kept in full on each GPU.
- HYBRID_SHARD_ZERO2: Apply SHARD_GRAD_OP within a node, and replicate
parameters across nodes.
Warning
- NO_SHARD option is deprecated and will be removed in a future release.
Please use DistributedDataParallel (DDP) instead.
- state_dict_type: StateDictType = 'FULL_STATE_DICT'#
Specifies the type of state dict to use for checkpointing.
- sync_module_states: bool = True#
If True, synchronizes module states across processes.
When enabled, each FSDP module broadcasts parameters and buffers from rank 0 to ensure replication across ranks.
- transformer_layer_cls: str | None = None#
Class name for transformer layers when using transformer_based policy.
This has no effect when using size_based policy.
- use_orig_params: bool | None = None#
If True, uses the PyTorch Module’s original parameters for FSDP.
For more information, see: https://pytorch.org/docs/stable/fsdp.html. If not specified, it will be automatically inferred based on other config values.
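A minimal FSDP configuration sketch; the transformer layer class name is a placeholder:

from oumi.core.configs.params.fsdp_params import (
    AutoWrapPolicy,
    BackwardPrefetch,
    FSDPParams,
    ShardingStrategy,
)

fsdp = FSDPParams(
    enable_fsdp=True,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=AutoWrapPolicy.TRANSFORMER_BASED_WRAP,
    transformer_layer_cls="LlamaDecoderLayer",  # placeholder transformer block class name
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    mixed_precision="bf16",
)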
- class oumi.core.configs.params.fsdp_params.ShardingStrategy(value)[source]#
Bases:
str, Enum
The sharding strategies for FullyShardedDataParallel (FSDP).
See
torch.distributed.fsdp.ShardingStrategy
for more details.
- FULL_SHARD = 'FULL_SHARD'#
Shards model parameters, gradients, and optimizer states. Provides the most memory efficiency but may impact performance.
- HYBRID_SHARD = 'HYBRID_SHARD'#
Shards model parameters within a node and replicates them across nodes.
- HYBRID_SHARD_ZERO2 = 'HYBRID_SHARD_ZERO2'#
Apply SHARD_GRAD_OP within a node, and replicate parameters across nodes.
- NO_SHARD = 'NO_SHARD'#
No sharding is applied. Parameters, gradients, and optimizer states are kept in full on each GPU.
- SHARD_GRAD_OP = 'SHARD_GRAD_OP'#
Shards gradients and optimizer states, but not model parameters. Balances memory savings and performance.
- class oumi.core.configs.params.fsdp_params.StateDictType(value)[source]#
Bases:
str, Enum
The supported state dict types for FullyShardedDataParallel (FSDP).
This controls how the model’s state dict will be saved during checkpointing, and how it can be consumed afterwards.
- FULL_STATE_DICT = 'FULL_STATE_DICT'#
The state dict will be saved in a non-sharded, unflattened format.
This is similar to checkpointing without FSDP.
- LOCAL_STATE_DICT = 'LOCAL_STATE_DICT'#
The state dict will be saved in a sharded, flattened format.
Since it’s flattened, this can only be used by FSDP.
- SHARDED_STATE_DICT = 'SHARDED_STATE_DICT'#
The state dict will be saved in a sharded, unflattened format.
This can be used by other parallel schemes.
oumi.core.configs.params.generation_params module#
- class oumi.core.configs.params.generation_params.GenerationParams(max_new_tokens: int = 1024, batch_size: Optional[int] = 1, exclude_prompt_from_response: bool = True, seed: Optional[int] = None, temperature: float = 0.0, top_p: float = 1.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, stop_strings: Optional[list[str]] = None, stop_token_ids: Optional[list[int]] = None, logit_bias: dict[typing.Any, float] = <factory>, min_p: float = 0.0, use_cache: bool = False, num_beams: int = 1, use_sampling: bool = False, guided_decoding: Optional[oumi.core.configs.params.guided_decoding_params.GuidedDecodingParams] = None)[source]#
Bases:
BaseParams
- batch_size: int | None = 1#
The number of sequences to generate in parallel.
Larger batch sizes can improve throughput but require more memory. Default is 1.
The value must either be positive or None, in which case the behavior is dependent on the downstream application. For example, LM Harness will automatically determine the largest batch size that will fit in memory.
For inference, this parameter is only used in NativeTextInferenceEngine.
- exclude_prompt_from_response: bool = True#
Whether to trim the model’s response and remove the prepended prompt.
- frequency_penalty: float = 0.0#
Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim.
- guided_decoding: GuidedDecodingParams | None = None#
Parameters for guided decoding.
- logit_bias: dict[Any, float]#
Modify the likelihood of specified tokens appearing in the completion.
Keys are tokens (specified by their token ID in the tokenizer), and values are the bias (-100 to 100). Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.
- max_new_tokens: int = 1024#
The maximum number of new tokens to generate.
This limits the length of the generated text to prevent excessively long outputs. Default is 1024 tokens.
- min_p: float = 0.0#
Sets a minimum probability threshold for token selection.
Tokens with probabilities below this threshold are filtered out before top-p or top-k sampling. This can help prevent the selection of highly improbable tokens. Default is 0.0 (no minimum threshold).
- num_beams: int = 1#
Number of beams for beam search. 1 means no beam search. Larger number of beams will make for a more thorough search for probable output token sequences, at the cost of increased computation time. Default is 1.
- presence_penalty: float = 0.0#
Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.
- seed: int | None = None#
Seed to use for random number determinism. If specified, APIs may use this parameter to make a best-effort at determinism.
- stop_strings: list[str] | None = None#
List of sequences where the API will stop generating further tokens.
- stop_token_ids: list[int] | None = None#
List of token ids for which the API will stop generating further tokens. This is only supported in VLLMInferenceEngine and NativeTextInferenceEngine.
- temperature: float = 0.0#
Controls randomness in the output.
Higher values (e.g., 1.0) make output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
- top_p: float = 1.0#
An alternative to temperature, called nucleus sampling.
It sets the cumulative probability threshold for token selection. For example, 0.9 means only considering the tokens comprising the top 90% probability mass.
- use_cache: bool = False#
Whether to use the model’s internal cache (key/value attentions) to speed up generation. Default is False.
- use_sampling: bool = False#
Whether to use sampling for next-token generation. If False, uses greedy decoding. Default is False.
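A minimal sampling configuration sketch; the stop string is a placeholder:

from oumi.core.configs.params.generation_params import GenerationParams

generation = GenerationParams(
    max_new_tokens=512,
    use_sampling=True,   # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
    stop_strings=["###"],  # placeholder stop sequence
    seed=42,
)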
oumi.core.configs.params.grpo_params module#
- class oumi.core.configs.params.grpo_params.GrpoParams(model_init_kwargs: dict[str, typing.Any] = <factory>, max_prompt_length: Optional[int] = None, max_completion_length: Optional[int] = None, num_generations: Optional[int] = None, temperature: float = 0.9, remove_unused_columns: bool = False, repetition_penalty: Optional[float] = 1.0, use_vllm: bool = False, vllm_device: Optional[str] = None, vllm_gpu_memory_utilization: float = 0.9, vllm_dtype: Optional[str] = None, vllm_max_model_len: Optional[int] = None)[source]#
Bases:
BaseParams
- max_completion_length: int | None = None#
Maximum length of the generated completion.
If unspecified (None), defaults to 256.
- max_prompt_length: int | None = None#
Maximum length of the prompt.
If the prompt is longer than this value, it will be truncated left. If unspecified (None), defaults to 512.
- model_init_kwargs: dict[str, Any]#
Keyword arguments for AutoModelForCausalLM.from_pretrained(…)
- num_generations: int | None = None#
Number of generations per prompt to sample.
The global batch size (num_processes * per_device_batch_size) must be divisible by this value. If unspecified (None), defaults to 8.
- remove_unused_columns: bool = False#
Whether to only keep the column “prompt” in the dataset.
If you use a custom reward function that requires any column other than “prompts” and “completions”, you should set it to False.
- repetition_penalty: float | None = 1.0#
Float that penalizes new tokens if they appear in the prompt/response so far.
Values > 1.0 encourage the model to use new tokens, while values < 1.0 encourage the model to repeat tokens.
- temperature: float = 0.9#
Temperature for sampling.
The higher the temperature, the more random the completions.
- use_vllm: bool = False#
Whether to use vLLM for generating completions.
If set to True, ensure that a GPU is kept unused for training, as vLLM will require one for generation.
- vllm_device: str | None = None#
Device where vLLM generation will run.
For example, “cuda:1”. If set to None, the system will automatically select the next available GPU after the last one used for training. This assumes that training has not already occupied all available GPUs. If only one device is available, the device will be shared between both training and vLLM.
- vllm_dtype: str | None = None#
Data type to use for vLLM generation.
If set to None, the data type will be automatically determined based on the model configuration. Find the supported values in the vLLM documentation.
- vllm_gpu_memory_utilization: float = 0.9#
Ratio (between 0 and 1) of GPU memory to reserve.
Fraction of VRAM reserved for the model weights, activations, and KV cache on the device dedicated to generation powered by vLLM. Higher values will increase the KV cache size and thus improve the model’s throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors during initialization.
- vllm_max_model_len: int | None = None#
The max_model_len to use for vLLM.
This could be useful when running with reduced vllm_gpu_memory_utilization, leading to a reduced KV cache size. If not set, vLLM will use the model context size, which might be much larger than the KV cache, leading to inefficiencies.
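A minimal GRPO configuration sketch with vLLM-powered generation; the values are chosen for illustration:

from oumi.core.configs.params.grpo_params import GrpoParams

grpo = GrpoParams(
    max_prompt_length=512,
    max_completion_length=256,
    num_generations=8,  # global batch size must be divisible by this value
    temperature=0.9,
    use_vllm=True,      # reserve one GPU for vLLM-powered generation
    vllm_gpu_memory_utilization=0.8,
)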
oumi.core.configs.params.guided_decoding_params module#
- class oumi.core.configs.params.guided_decoding_params.GuidedDecodingParams(json: Any | None = None, regex: str | None = None, choice: list[str] | None = None)[source]#
Bases:
BaseParams
Parameters for guided decoding.
The parameters are mutually exclusive. Only one of the parameters can be specified at a time.
- choice: list[str] | None = None#
List of allowed choices for the output.
Restricts model output to one of the provided choices. Useful for forcing the model to select from a predefined set of options.
- json: Any | None = None#
JSON schema, Pydantic model, or string to guide the output format.
Can be a dict containing a JSON schema, a Pydantic model class, or a string containing JSON schema. Used to enforce structured output from the model.
- regex: str | None = None#
Regular expression pattern to guide the output format.
Pattern that the model output must match. Can be used to enforce specific text formats or patterns.
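A minimal sketch showing the mutually exclusive fields; only one may be set per instance:

from oumi.core.configs.params.guided_decoding_params import GuidedDecodingParams

# Restrict the model's output to one of a fixed set of labels.
guided = GuidedDecodingParams(choice=["positive", "negative", "neutral"])

# Alternatively, enforce a JSON schema (illustrative schema).
schema = {"type": "object", "properties": {"label": {"type": "string"}}}
guided_json = GuidedDecodingParams(json=schema)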
oumi.core.configs.params.model_params module#
- class oumi.core.configs.params.model_params.ModelParams(model_name: str = '???', adapter_model: Optional[str] = None, tokenizer_name: Optional[str] = None, tokenizer_pad_token: Optional[str] = None, tokenizer_kwargs: dict[str, typing.Any] = <factory>, processor_kwargs: dict[str, typing.Any] = <factory>, model_max_length: Optional[int] = None, load_pretrained_weights: bool = True, trust_remote_code: bool = False, torch_dtype_str: str = 'float32', compile: bool = False, chat_template: Optional[str] = None, attn_implementation: Optional[str] = None, device_map: Optional[str] = 'auto', model_kwargs: dict[str, typing.Any] = <factory>, enable_liger_kernel: bool = False, shard_for_eval: bool = False, freeze_layers: list[str] = <factory>)[source]#
Bases:
BaseParams
- adapter_model: str | None = None#
The path to an adapter model to be applied on top of the base model.
If provided, this adapter will be loaded and applied to the base model. The adapter path could alternatively be specified in model_name.
- attn_implementation: str | None = None#
The attention implementation to use.
Valid options include:
None: Use the default attention implementation (sdpa for torch>=2.1.1, else eager)
“sdpa”: Use PyTorch’s scaled dot-product attention
“flash_attention_2”: Use Flash Attention 2 for potentially faster computation. Requires “flash-attn” package to be installed
“eager”: Manual implementation of attention
- chat_template: str | None = None#
The chat template to use for formatting inputs.
If provided, this template will be used to format multi-turn conversations for models that support chat-like interactions.
Note
Different models may require specific chat templates. Consult the model’s documentation for the appropriate template to use.
- compile: bool = False#
Whether to JIT compile the model.
For training, do not set this param, and instead set TrainingParams.compile.
- device_map: str | None = 'auto'#
Specifies how to distribute the model’s layers across available devices.
“auto”: Automatically distribute the model across available devices
None: Load the entire model on the default device
Note
“auto” is generally recommended as it optimizes device usage, especially for large models that don’t fit on a single GPU.
- enable_liger_kernel: bool = False#
Whether to enable the Liger kernel for potential performance improvements.
Liger is an optimized CUDA kernel that can accelerate certain operations.
Tip
Enabling this may improve performance, but ensure compatibility with your model and hardware before use in production.
- freeze_layers: list[str]#
A list of layer names to freeze during training.
These layers will have their parameters set to not require gradients, effectively preventing them from being updated during the training process. This is useful for fine-tuning specific parts of a model while keeping other parts fixed.
- load_pretrained_weights: bool = True#
Whether to load the pretrained model’s weights.
If True, the model will be initialized with pretrained weights. If False, the model will be initialized from the pretrained config without loading weights.
- model_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the model’s constructor.
This allows for passing any model-specific parameters that are not covered by other fields in ModelParams.
Note
Use this for model-specific parameters or to enable experimental features.
- model_max_length: int | None = None#
The maximum sequence length the model can handle.
If specified, this will override the default max length of the model’s config.
Note
Setting this to a larger value may increase memory usage but allow for processing longer inputs. Ensure your hardware can support the chosen length.
- model_name: str = '???'#
The name or path of the model or LoRA adapter to use.
This can be a model identifier from the Oumi registry, HuggingFace Hub, or a path to a local directory containing model files.
The LoRA adapter can be specified here instead of in adapter_model. If so, this value is copied to adapter_model, and the appropriate base model is set here instead. The base model could either be in the same directory as the adapter, or specified in the adapter’s config file.
- processor_kwargs: dict[str, Any]#
Additional keyword arguments to pass into the processor’s constructor.
Processors are used in Oumi for vision-language models to process image and text inputs. This field is optional and can be left empty for text-only models, or if not needed.
These params override model-specific default values for these kwargs, if present.
- shard_for_eval: bool = False#
Whether to shard the model for evaluation.
This is needed for large models that do not fit on a single GPU. It is used as the value for the parallelize argument in LM Harness.
- tokenizer_kwargs: dict[str, Any]#
Additional keyword arguments to pass into the tokenizer’s constructor.
This allows for passing any tokenizer-specific parameters that are not covered by other fields in ModelParams.
- tokenizer_name: str | None = None#
The name or path of the tokenizer to use.
If None, the tokenizer associated with model_name will be used. Specify this if you want to use a different tokenizer than the default for the model.
- tokenizer_pad_token: str | None = None#
The padding token used by the tokenizer.
If this is set, it will override the default padding token of the tokenizer and the padding token optionally defined in the tokenizer_kwargs.
- torch_dtype_str: str = 'float32'#
The data type to use for the model’s parameters as a string.
Valid options are:
- “float32” or “f32” or “float” for 32-bit floating point
- “float16” or “f16” or “half” for 16-bit floating point
- “bfloat16” or “bf16” for brain floating point
- “float64” or “f64” or “double” for 64-bit floating point
This string will be converted to the corresponding torch.dtype. Defaults to “float32” for full precision.
- trust_remote_code: bool = False#
Whether to allow loading remote code when loading the model.
If True, this allows loading and executing code from the model’s repository, which can be a security risk. Only set to True for models you trust.
Defaults to False for safety.
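A minimal model configuration sketch; the model id is a placeholder:

from oumi.core.configs.params.model_params import ModelParams

model = ModelParams(
    model_name="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    torch_dtype_str="bfloat16",
    attn_implementation="sdpa",
    model_max_length=4096,
    trust_remote_code=False,
)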
oumi.core.configs.params.peft_params module#
- class oumi.core.configs.params.peft_params.LoraWeightInitialization(value)[source]#
Bases:
str, Enum
Enum representing the supported weight initializations for LoRA adapters.
- DEFAULT = 'default'#
- EVA = 'eva'#
- GAUSSIAN = 'gaussian'#
- LOFTQ = 'loftq'#
- OLORA = 'olora'#
- PISA = 'pissa'#
- PISSA_NITER = 'pissa_niter_[number of iters]'#
- RANDOM = 'random'#
- class oumi.core.configs.params.peft_params.PeftParams(lora_r: int = 8, lora_alpha: int = 8, lora_dropout: float = 0.0, lora_target_modules: Optional[list[str]] = None, lora_modules_to_save: Optional[list[str]] = None, lora_bias: str = 'none', lora_init_weights: oumi.core.configs.params.peft_params.LoraWeightInitialization = <LoraWeightInitialization.DEFAULT: 'default'>, lora_task_type: peft.utils.peft_types.TaskType = <TaskType.CAUSAL_LM: 'CAUSAL_LM'>, q_lora: bool = False, q_lora_bits: int = 4, bnb_4bit_quant_type: str = 'fp4', use_bnb_nested_quant: bool = False, bnb_4bit_quant_storage: str = 'uint8', bnb_4bit_compute_dtype: str = 'float32', peft_save_mode: oumi.core.configs.params.peft_params.PeftSaveMode = <PeftSaveMode.ADAPTER_ONLY: 'adapter_only'>)[source]#
Bases:
BaseParams
- bnb_4bit_compute_dtype: str = 'float32'#
Compute type of the quantized parameters. It can be different than the input type, e.g., it can be set to a lower precision for improved speed.
The string will be converted to the corresponding torch.dtype.
Valid string options are:
- “float32” for 32-bit floating point
- “float16” for 16-bit floating point
- “bfloat16” for brain floating point
- “float64” for 64-bit floating point
Defaults to “float32” for full precision.
- bnb_4bit_quant_storage: str = 'uint8'#
The storage type for packing quantized 4-bit parameters.
Defaults to ‘uint8’ for efficient storage.
- bnb_4bit_quant_type: str = 'fp4'#
The type of 4-bit quantization to use.
Can be ‘fp4’ (float point 4) or ‘nf4’ (normal float 4).
- lora_alpha: int = 8#
The scaling factor for the LoRA update.
This value is typically set equal to lora_r or 2*lora_r for stable training.
- lora_bias: str = 'none'#
Bias type for LoRA.
Can be ‘none’, ‘all’ or ‘lora_only’:
- ‘none’: No biases are trained.
- ‘all’: All biases in the model are trained.
- ‘lora_only’: Only biases in LoRA layers are trained.
If ‘all’ or ‘lora_only’, the corresponding biases will be updated during training. Note that this means even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation.
For more details, see: huggingface/peft
- lora_dropout: float = 0.0#
The dropout probability applied to LoRA layers.
This helps prevent overfitting in the adaptation layers.
- lora_init_weights: LoraWeightInitialization = 'default'#
Passing LoraWeightInitialization.DEFAULT will use the underlying reference implementation of the corresponding model from Microsoft.
- Other valid (LoraWeightInitialization) options include:
“random” which will use fully random initialization and is discouraged.
“gaussian” for Gaussian initialization.
“eva” for Explained Variance Adaptation (EVA) (https://arxiv.org/abs/2410.07170).
“loftq” for improved performance when LoRA is combined with quantization (https://arxiv.org/abs/2310.08659).
“olora” for Orthonormal Low-Rank Adaptation of Large Language Models (OLoRA) (https://arxiv.org/html/2406.01775v1).
“pissa” for Principal Singular values and Singular vectors Adaptation (PiSSA) (https://arxiv.org/abs/2404.02948).
- For more information, see the Hugging Face PEFT documentation on weight initialization.
- lora_modules_to_save: list[str] | None = None#
List of module names to unfreeze and train alongside LoRA parameters.
These modules will be fully fine-tuned, not adapted using LoRA. Use this to selectively train certain parts of the model in full precision.
- lora_r: int = 8#
The rank of the update matrices in LoRA.
A higher value allows for more expressive adaptations but increases the number of trainable parameters.
- lora_target_modules: list[str] | None = None#
List of module names to apply LoRA to.
If None, LoRA will be applied to all linear layers in the model. Specify module names to selectively apply LoRA to certain parts of the model.
- lora_task_type: TaskType = 'CAUSAL_LM'#
The task type for LoRA adaptation.
Defaults to CAUSAL_LM (Causal Language Modeling).
- peft_save_mode: PeftSaveMode = 'adapter_only'#
How to save the final model during PEFT training.
This option is only used if TrainingParams.save_final_model is True. By default, only the model adapter is saved to reduce disk usage. Options are defined in the PeftSaveMode enum and include:
- ADAPTER_ONLY: Only save the model adapter.
- ADAPTER_AND_BASE_MODEL: Save the base model in addition to the adapter.
- MERGED: Merge the adapter and base model’s weights and save as a single model.
- q_lora: bool = False#
Whether to use quantization for LoRA (Q-LoRA).
If True, enables quantization for more memory-efficient fine-tuning.
- q_lora_bits: int = 4#
The number of bits to use for quantization in Q-LoRA.
This is only used if q_lora is True.
Defaults to 4-bit quantization.
- to_bits_and_bytes() BitsAndBytesConfig [source]#
Creates a configuration for quantized models via BitsAndBytes.
The resulting configuration uses the instantiated peft parameters.
- use_bnb_nested_quant: bool = False#
Whether to use nested quantization.
Nested quantization can provide additional memory savings.
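A minimal LoRA/QLoRA configuration sketch; the target module names are placeholders:

from oumi.core.configs.params.peft_params import PeftParams

peft = PeftParams(
    lora_r=16,
    lora_alpha=32,  # commonly lora_r or 2 * lora_r
    lora_dropout=0.05,
    lora_target_modules=["q_proj", "v_proj"],  # placeholder module names
    q_lora=True,                 # quantize the base model for QLoRA-style training
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)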
- class oumi.core.configs.params.peft_params.PeftSaveMode(value)[source]#
Bases:
Enum
Enum representing how to save the final model during PEFT training.
While models saved with any of these options can be loaded by Oumi, those saved with ADAPTER_ONLY are not self-contained; the base model will be loaded separately from the local HF cache or downloaded from HF Hub if not in the cache.
- ADAPTER_AND_BASE_MODEL = 'adapter_and_base_model'#
Save the base model in addition to the adapter.
This is similar to ADAPTER_ONLY, but the base model’s weights are also saved in the same directory as the adapter weights, making the output dir self-contained.
- ADAPTER_ONLY = 'adapter_only'#
Only save the model adapter.
Note that when loading this saved model, the base model will be loaded separately from the local HF cache or downloaded from HF Hub.
- MERGED = 'merged'#
Merge the adapter and base model’s weights and save as a single model.
Note that the resulting model is a standard HF Transformers model, and is no longer a PEFT model. A copy of the adapter before merging is saved in the “adapter/” subdirectory.
oumi.core.configs.params.profiler_params module#
- class oumi.core.configs.params.profiler_params.ProfilerParams(save_dir: Optional[str] = None, enable_cpu_profiling: bool = False, enable_cuda_profiling: bool = False, record_shapes: bool = False, profile_memory: bool = False, with_stack: bool = False, with_flops: bool = False, with_modules: bool = False, row_limit: int = 50, schedule: oumi.core.configs.params.profiler_params.ProfilerScheduleParams = <factory>)[source]#
Bases:
BaseParams
- enable_cpu_profiling: bool = False#
Whether to profile CPU activity.
Corresponds to torch.profiler.ProfilerActivity.CPU.
- enable_cuda_profiling: bool = False#
Whether to profile CUDA.
Corresponds to torch.profiler.ProfilerActivity.CUDA.
- profile_memory: bool = False#
Track tensor memory allocation/deallocation.
- record_shapes: bool = False#
Save information about operator’s input shapes.
- row_limit: int = 50#
Max number of rows to include into profiling report tables.
Set to -1 to make it unlimited.
- save_dir: str | None = None#
Directory where the profiling data will be saved to.
If not specified and profiling is enabled, then the profiler sub-dir will be used under output_dir.
- schedule: ProfilerScheduleParams#
Parameters that define what subset of training steps to profile.
- with_flops: bool = False#
Use formula to estimate the FLOPs (floating point operations) of specific operators (matrix multiplication and 2D convolution).
- with_modules: bool = False#
Record module hierarchy (including function names) corresponding to the callstack of the op.
- with_stack: bool = False#
Record source information (file and line number) for the ops.
- class oumi.core.configs.params.profiler_params.ProfilerScheduleParams(enable_schedule: bool = False, wait: int = 0, warmup: int = 1, active: int = 3, repeat: int = 1, skip_first: int = 1)[source]#
Bases:
BaseParams
Parameters that define what subset of training steps to profile.
Keeping profiling enabled for all training steps may be impractical as it may result in out-of-memory errors, extremely large trace files, and may interfere with regular training performance. This config can be used to enable PyTorch profiler only for a small number of training steps, which is not affected by such issues, and may still provide a useful signal for performance analysis.
- active: int = 3#
The number of training steps to do active recording (ProfilerAction.RECORD) in each profiling cycle.
- enable_schedule: bool = False#
Whether profiling schedule is enabled.
If False, then profiling is enabled for the entire process duration, and all schedule parameters below will be ignored.
- repeat: int = 1#
The optional number of profiling cycles.
Each cycle includes wait + warmup + active steps. A value of zero means that cycles continue until profiling finishes.
- skip_first: int = 1#
The number of initial training steps to skip at the beginning of profiling session (ProfilerAction.NONE).
- wait: int = 0#
The number of training steps to skip at the beginning of each profiling cycle (ProfilerAction.NONE). Each cycle includes wait + warmup + active steps.
- warmup: int = 1#
The number of training steps to do profiling warmup (ProfilerAction.WARMUP) in each profiling cycle.
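A minimal sketch enabling a profiling schedule; the save directory is a placeholder:

from oumi.core.configs.params.profiler_params import (
    ProfilerParams,
    ProfilerScheduleParams,
)

profiler = ProfilerParams(
    save_dir="profiler_output",  # placeholder directory
    enable_cpu_profiling=True,
    enable_cuda_profiling=True,
    schedule=ProfilerScheduleParams(
        enable_schedule=True,
        skip_first=10,  # ignore the first training steps
        wait=5,
        warmup=1,
        active=3,
        repeat=2,  # two wait + warmup + active cycles in total
    ),
)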
oumi.core.configs.params.remote_params module#
- class oumi.core.configs.params.remote_params.RemoteParams(api_url: str | None = None, api_key: str | None = None, api_key_env_varname: str | None = None, max_retries: int = 3, connection_timeout: float = 300.0, num_workers: int = 1, politeness_policy: float = 0.0, batch_completion_window: str | None = '24h')[source]#
Bases:
BaseParams
Parameters for running inference against a remote API.
- api_key: str | None = None#
API key to use for authentication.
- api_key_env_varname: str | None = None#
Name of the environment variable containing the API key for authentication.
- api_url: str | None = None#
URL of the API endpoint to use for inference.
- batch_completion_window: str | None = '24h'#
Time window for batch completion. Currently only ‘24h’ is supported.
Only used for batch inference.
- connection_timeout: float = 300.0#
Timeout in seconds for a request to an API.
- max_retries: int = 3#
Maximum number of retries to attempt when calling an API.
- num_workers: int = 1#
Number of workers to use for parallel inference.
- politeness_policy: float = 0.0#
Politeness policy to use when calling an API.
If greater than zero, this is the amount of time in seconds a worker will sleep before making a subsequent request.
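A minimal sketch for a remote API; the endpoint URL and environment variable name are placeholders:

from oumi.core.configs.params.remote_params import RemoteParams

remote = RemoteParams(
    api_url="https://api.example.com/v1/chat/completions",  # placeholder endpoint
    api_key_env_varname="MY_API_KEY",                       # placeholder env var name
    num_workers=4,
    politeness_policy=0.5,  # each worker sleeps 0.5 s between requests
    max_retries=3,
)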
oumi.core.configs.params.telemetry_params module#
- class oumi.core.configs.params.telemetry_params.TelemetryParams(telemetry_dir: str | None = 'telemetry', collect_telemetry_for_all_ranks: bool = False, track_gpu_temperature: bool = False)[source]#
Bases:
BaseParams
- collect_telemetry_for_all_ranks: bool = False#
Whether to collect telemetry for all ranks.
By default, only the main rank’s telemetry stats are collected and saved.
- telemetry_dir: str | None = 'telemetry'#
Directory where the telemetry data will be saved to.
If not specified, then telemetry files will be written under output_dir. If a relative path is specified, then files will be written in a telemetry_dir sub-directory in output_dir.
- track_gpu_temperature: bool = False#
Whether to record GPU temperature.
If collect_telemetry_for_all_ranks is False, only the first GPU’s temperature is tracked. Otherwise, temperature is recorded for all GPUs.
oumi.core.configs.params.training_params module#
- class oumi.core.configs.params.training_params.MixedPrecisionDtype(value)[source]#
Bases:
str, Enum
Enum representing the dtype used for mixed precision training.
For more details on mixed-precision training, see: https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
- BF16 = 'bf16'#
Similar to fp16 mixed precision, but with bf16 instead.
This requires Ampere or higher NVIDIA architecture, or using CPU or Ascend NPU.
- FP16 = 'fp16'#
fp16 mixed precision.
Requires ModelParams.torch_dtype (the dtype of the model weights) to be fp32. The model weights and optimizer state are fp32, but some ops will run in fp16 to improve training speed.
- NONE = 'none'#
No mixed precision.
Uses ModelParams.torch_dtype as the dtype for all tensors (model weights, optimizer state, activations, etc.).
- class oumi.core.configs.params.training_params.SchedulerType(value)[source]#
Bases:
str, Enum
Enum representing the supported learning rate schedulers.
For optional args for each scheduler, see src/oumi/builders/lr_schedules.py.
- CONSTANT = 'constant'#
Constant scheduler.
Keeps the learning rate constant throughout training.
- COSINE = 'cosine'#
Cosine scheduler.
Decays the learning rate following the decreasing part of a cosine curve.
- COSINE_WITH_MIN_LR = 'cosine_with_min_lr'#
Cosine with a minimum learning rate scheduler.
Similar to cosine scheduler, but maintains a minimum learning rate at the end.
- COSINE_WITH_RESTARTS = 'cosine_with_restarts'#
Cosine with restarts scheduler.
Decays the learning rate following a cosine curve with periodic restarts.
- LINEAR = 'linear'#
Linear scheduler.
Decreases the learning rate linearly from the initial value to 0 over the course of training.
- class oumi.core.configs.params.training_params.TrainerType(value)[source]#
Bases:
Enum
Enum representing the supported trainers.
- HF = 'hf'#
Generic HuggingFace trainer from transformers library.
This is the standard trainer provided by the Hugging Face Transformers library, suitable for a wide range of training tasks.
- OUMI = 'oumi'#
Custom generic trainer implementation.
This is a custom trainer implementation specific to the Oumi project, designed to provide additional flexibility and features.
- TRL_DPO = 'trl_dpo'#
Direct Preference Optimization trainer from trl library.
This trainer implements the Direct Preference Optimization algorithm for fine-tuning language models based on human preferences.
- TRL_GRPO = 'trl_grpo'#
Group Relative Policy Optimization trainer from trl library.
This trainer implements the Group Relative Policy Optimization algorithm introduced in the paper https://arxiv.org/pdf/2402.03300 for fine-tuning language models. Optionally, supports user-defined reward functions.
- TRL_SFT = 'trl_sft'#
Supervised fine-tuning trainer from trl library.
This trainer is specifically designed for supervised fine-tuning tasks using the TRL (Transformer Reinforcement Learning) library.
- VERL_GRPO = 'verl_grpo'#
Group Relative Policy Optimization trainer from verl library.
This trainer implements the Group Relative Policy Optimization algorithm introduced in the paper https://arxiv.org/pdf/2402.03300 for fine-tuning language models. Optionally, supports user-defined reward functions.
- class oumi.core.configs.params.training_params.TrainingParams(use_peft: bool = False, trainer_type: oumi.core.configs.params.training_params.TrainerType = <TrainerType.HF: 'hf'>, enable_gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: dict[str, typing.Any] = <factory>, output_dir: str = 'output', per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, gradient_accumulation_steps: int = 1, max_steps: int = -1, num_train_epochs: int = 3, save_epoch: bool = False, save_steps: int = 500, save_final_model: bool = True, seed: int = 42, data_seed: int = 42, use_deterministic: bool = False, full_determinism: bool = False, run_name: Optional[str] = None, metrics_function: Optional[str] = None, reward_functions: Optional[list[str]] = None, grpo: oumi.core.configs.params.grpo_params.GrpoParams = <factory>, log_level: str = 'info', dep_log_level: str = 'warning', enable_wandb: bool = False, enable_mlflow: bool = False, enable_tensorboard: bool = True, logging_strategy: str = 'steps', logging_dir: Optional[str] = None, logging_steps: int = 50, logging_first_step: bool = False, eval_strategy: str = 'no', eval_steps: int = 500, learning_rate: float = 5e-05, lr_scheduler_type: str = 'linear', lr_scheduler_kwargs: dict[str, typing.Any] = <factory>, warmup_ratio: Optional[float] = None, warmup_steps: Optional[int] = None, optimizer: str = 'adamw_torch', weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, sgd_momentum: float = 0.0, mixed_precision_dtype: oumi.core.configs.params.training_params.MixedPrecisionDtype = <MixedPrecisionDtype.NONE: 'none'>, compile: bool = False, include_performance_metrics: bool = False, include_alternative_mfu_metrics: bool = False, log_model_summary: bool = False, resume_from_checkpoint: Optional[str] = None, try_resume_from_last_checkpoint: bool = False, dataloader_num_workers: Union[int, str] = 0, dataloader_persistent_workers: bool = False, dataloader_prefetch_factor: Optional[int] = None, dataloader_main_process_only: Optional[bool] = None, ddp_find_unused_parameters: Optional[bool] = None, max_grad_norm: Optional[float] = 1.0, trainer_kwargs: dict[str, typing.Any] = <factory>, verl_config_overrides: dict[str, typing.Any] = <factory>, profiler: oumi.core.configs.params.profiler_params.ProfilerParams = <factory>, telemetry: oumi.core.configs.params.telemetry_params.TelemetryParams = <factory>, empty_device_cache_steps: Optional[int] = None, nccl_default_timeout_minutes: Optional[float] = None, label_ignore_index: Optional[int] = None)[source]#
Bases:
BaseParams
- adam_beta1: float = 0.9#
The beta1 parameter for Adam-based optimizers.
Exponential decay rate for the first moment estimates. Default is 0.9.
- adam_beta2: float = 0.999#
The beta2 parameter for Adam-based optimizers.
Exponential decay rate for the second moment estimates. Default is 0.999.
- adam_epsilon: float = 1e-08#
Epsilon parameter for Adam-based optimizers.
Small constant for numerical stability. Default is 1e-08.
- compile: bool = False#
Whether to JIT compile the model.
This parameter should be used instead of ModelParams.compile for training.
- data_seed: int = 42#
Random seed used for data sampling. This seed is used for the underlying generator when use_seedable_sampler is enabled. If None, the generator will use the current default seed from torch. Used only by the HuggingFace trainers.
- dataloader_main_process_only: bool | None = None#
Controls whether the dataloader is iterated through on the main process only.
If set to True, the dataloader is only iterated through on the main process (rank 0), then the batches are split and broadcast to each process. This can reduce the number of requests to the dataset, and helps ensure that each example is seen by max one GPU per epoch, but may become a performance bottleneck if a large number of GPUs is used.
If set to False, the dataloader is iterated through on each GPU process.
If set to None (default), then True or False is auto-selected based on heuristics (properties of dataset, the number of nodes and/or GPUs, etc).
NOTE: We recommend to benchmark your setup, and configure True or False.
- dataloader_num_workers: int | str = 0#
Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.
You can also use the special value “auto” to select the number of dataloader workers using a simple heuristic based on the number of CPUs and GPUs per node. Note that accurately estimating the optimal number of workers is difficult and depends on many factors (the properties of the model, dataset, VM, network, etc.), so you can start with “auto” and then experimentally tune the exact number for your specific case. If “auto” is requested, then at minimum 1 worker is guaranteed to be assigned.
- dataloader_persistent_workers: bool = False#
Whether to use persistent workers for data loading (HF Trainers only). If True, the data loader will not shut down the worker processes after a dataset has been consumed once, keeping the workers’ Dataset instances alive. This can potentially speed up training, but will increase RAM usage. Defaults to False.
- dataloader_prefetch_factor: int | None = None#
Number of batches loaded in advance by each worker.
2 means there will be a total of 2 * num_workers batches prefetched across all workers.
This is only used if dataloader_num_workers >= 1.
- ddp_find_unused_parameters: bool | None = None#
When using PyTorch’s DistributedDataParallel training, the value of this flag is passed to find_unused_parameters.
Will default to False if gradient checkpointing is used, True otherwise.
- dep_log_level: str = 'warning'#
The logging level for dependency loggers (e.g., HuggingFace, PyTorch).
Possible values are “debug”, “info”, “warning”, “error”, “critical”.
- empty_device_cache_steps: int | None = None#
Number of steps to wait before calling torch.<device>.empty_cache().
This parameter determines how frequently the GPU cache should be cleared during training. If set, it will trigger cache clearing every empty_device_cache_steps. If left as None, the cache will not be emptied automatically.
Setting this can help manage GPU memory usage, especially for large models or long training runs, but may impact performance if set too low.
- enable_gradient_checkpointing: bool = False#
Whether to enable gradient checkpointing to save memory at the expense of speed.
Gradient checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward pass, it recomputes these activations during the backward pass. This can make the training slower, but it can also significantly reduce memory usage.
- enable_mlflow: bool = False#
Whether to enable MLflow logging.
If True, MLflow will be used for experiment tracking and visualization. If you want to use MLflow, you must set the MLFLOW_TRACKING_URI environment variable to specify the tracking server URI and the MLFLOW_EXPERIMENT_ID or MLFLOW_EXPERIMENT_NAME environment variable to specify the experiment to report the run to.
- enable_tensorboard: bool = True#
Whether to enable TensorBoard logging.
If True, TensorBoard will be used for logging metrics and visualizations.
- enable_wandb: bool = False#
Whether to enable Weights & Biases (wandb) logging.
If True, wandb will be used for experiment tracking and visualization. Wandb will also log a summary of the training run, including hyperparameters, metrics, and other relevant information at the end of training.
After enabling, you must set the WANDB_API_KEY environment variable. Alternatively, you can use the wandb login command to authenticate.
- eval_steps: int = 500#
Number of update steps between two evaluations if eval_strategy=”steps”.
Ignored if eval_strategy is not “steps”.
- eval_strategy: str = 'no'#
The strategy to use for evaluation during training.
Possible values:
- “no”: No evaluation is done during training.
- “steps”: Evaluation is done every eval_steps.
- “epoch”: Evaluation is done at the end of each epoch.
- full_determinism: bool = False#
If True, enable_full_determinism() is called instead of set_seed() to ensure reproducible results in distributed training. This will only affect HF trainers. Important: this will negatively impact performance, so only use it for debugging.
- gradient_accumulation_steps: int = 1#
Number of update steps to accumulate before performing a backward/update pass.
This technique allows for effectively larger batch sizes and is especially useful when such batch sizes would not fit in memory. It works by accumulating gradients from multiple forward passes before performing a single optimization step. Note, however, that setting this to a value greater than 1 can increase memory usage for training setups without existing gradient accumulation buffers (e.g., single-GPU training).
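For example, the effective global batch size in a standard data-parallel setup can be estimated as follows (values are illustrative):

    per_device_train_batch_size = 8
    gradient_accumulation_steps = 4
    num_devices = 2  # e.g., 2 GPUs

    effective_batch_size = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    print(effective_batch_size)  # 64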
- gradient_checkpointing_kwargs: dict[str, Any]#
Keyword arguments for gradient checkpointing.
The use_reentrant parameter is required and is recommended to be set to False. For more details, see: https://pytorch.org/docs/stable/checkpoint.html
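An illustrative value, following the recommendation above:

    # use_reentrant is required; False is the recommended setting.
    gradient_checkpointing_kwargs = {"use_reentrant": False}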
- grpo: GrpoParams#
Parameters for GRPO training.
- include_alternative_mfu_metrics: bool = False#
Whether to report alternative MFU (Model FLOPs Utilization) metrics.
These metrics are based on HuggingFace’s total_flos. This option is only used if include_performance_metrics is True.
- include_performance_metrics: bool = False#
Whether to include performance metrics such as token statistics.
- label_ignore_index: int | None = None#
Tokens with this label value don’t contribute to the loss computation. For example, this can be PAD, or image tokens. -100 is the PyTorch convention. Refer to the ignore_index parameter of torch.nn.CrossEntropyLoss() for more details.
If unspecified (None), then the default model-specific preferences configured in Oumi may be used.
Users should only set label_ignore_index if the default behavior is not satisfactory, or for new models not yet fully-integrated by Oumi.
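For illustration, the PyTorch-level behavior this corresponds to:

    import torch

    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)
    logits = torch.randn(4, 10)                 # 4 tokens, vocabulary of 10
    labels = torch.tensor([1, -100, 3, -100])   # -100 marks ignored positions
    loss = loss_fn(logits, labels)              # ignored positions add no loss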
- learning_rate: float = 5e-05#
The initial learning rate for the optimizer.
This value can be adjusted by the learning rate scheduler during training.
- log_level: str = 'info'#
The logging level for the main Oumi logger.
Possible values are “debug”, “info”, “warning”, “error”, “critical”.
- log_model_summary: bool = False#
Whether to print a model summary, including layer names.
- logging_dir: str | None = None#
The directory where training logs will be saved.
This includes TensorBoard logs and other training-related output.
- logging_first_step: bool = False#
Whether to log and evaluate the first global step.
If True, metrics will be logged and evaluation will be performed at the very beginning of training. Skipping the first step can be useful to avoid logging and evaluation of the initial random model.
The first step is usually not representative of the model’s performance, as it includes model compilation, optimizer initialization, and other setup steps.
- logging_steps: int = 50#
Number of update steps between two logs if logging_strategy=”steps”.
Ignored if logging_strategy is not “steps”.
- logging_strategy: str = 'steps'#
The strategy to use for logging during training.
Possible values are:
- “steps”: Log every logging_steps steps.
- “epoch”: Log at the end of each epoch.
- “no”: Disable logging.
- lr_scheduler_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the learning rate scheduler.
These arguments can be used to fine-tune the behavior of the chosen scheduler.
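An illustrative value (the accepted keys depend on the chosen scheduler; check the corresponding scheduler builder for the kwargs it actually accepts):

    # Hypothetical example for a cosine-with-restarts style scheduler.
    lr_scheduler_kwargs = {"num_cycles": 2}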
- lr_scheduler_type: str = 'linear'#
The type of learning rate scheduler to use.
Possible values include “linear”, “cosine”, “cosine_with_restarts”, “cosine_with_min_lr”, and “constant”.
See src/oumi/builders/lr_schedules.py for more details on each scheduler.
- max_grad_norm: float | None = 1.0#
Maximum gradient norm (for gradient clipping) to avoid exploding gradients, which can destabilize training.
Defaults to 1.0. When set to 0.0 or None, gradient clipping will not be applied.
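For illustration, the equivalent PyTorch-level operation for max_grad_norm=1.0:

    import torch

    model = torch.nn.Linear(4, 2)
    loss = model(torch.randn(3, 4)).sum()
    loss.backward()
    # Clip gradients so their global norm does not exceed 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)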
- max_steps: int = -1#
If set to a positive number, the total number of training steps to perform.
This parameter overrides num_train_epochs. If set to -1 (default), the number of training steps is determined by num_train_epochs.
- metrics_function: str | None = None#
The name of the metrics function in the Oumi registry to use for evaluation during training.
The method must accept as input a HuggingFace EvalPrediction and return a dictionary of metrics, with string keys mapping to metric values. A single metrics_function may compute multiple metrics.
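A hypothetical example of such a function (registration in the Oumi registry is not shown):

    import numpy as np
    from transformers import EvalPrediction

    # Hypothetical metrics function; name and logic are illustrative only.
    def token_accuracy_metrics(eval_pred: EvalPrediction) -> dict[str, float]:
        predictions = np.argmax(eval_pred.predictions, axis=-1)
        accuracy = float((predictions == eval_pred.label_ids).mean())
        return {"accuracy": accuracy}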
- mixed_precision_dtype: MixedPrecisionDtype = 'none'#
The data type to use for mixed precision training.
Default is NONE, which means no mixed precision is used.
- nccl_default_timeout_minutes: float | None = None#
Default timeout for NCCL operations in minutes.
See: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
If unset, the default value of torch.distributed.init_process_group (10 minutes) will be used.
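For reference, a 30-minute timeout roughly corresponds to the following PyTorch-level call (shown for illustration only; it requires a properly configured distributed environment with NCCL available to actually run):

    from datetime import timedelta
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))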
- num_train_epochs: int = 3#
Total number of training epochs to perform (if max_steps is not specified).
An epoch is one complete pass through the entire training dataset. This parameter is ignored if max_steps is set to a positive number.
- optimizer: str = 'adamw_torch'#
The optimizer to use for training.
See pytorch documentation for more information on available optimizers: https://pytorch.org/docs/stable/optim.html
Default is “adamw_torch” (AdamW implemented by PyTorch).
- output_dir: str = 'output'#
Directory where the output files will be saved.
This includes checkpoints, evaluation results, and any other artifacts produced during the training process.
- per_device_eval_batch_size: int = 8#
Number of samples per batch on each device during evaluation.
Similar to per_device_train_batch_size, but used during evaluation phases. Can often be set higher than the train batch size as no gradients are stored.
- per_device_train_batch_size: int = 8#
Number of samples per batch on each device during training.
This parameter directly affects memory usage and training speed. Larger batch sizes generally lead to better utilization of GPU compute capabilities but require more memory.
- profiler: ProfilerParams#
Parameters for performance profiling.
This field contains configuration options for the profiler, which can be used to analyze the performance of the training process. It uses the ProfilerParams class to define specific profiling settings.
- resume_from_checkpoint: str | None = None#
Path to a checkpoint folder from which to resume training.
If specified, training will resume by first loading the model from this folder.
- reward_functions: list[str] | None = None#
The names of the reward functions in the Oumi registry to use for reinforcement learning.
Only supported with the TRL_GRPO and VERL_GRPO trainers. Currently, VERL_GRPO only supports specifying a single reward function.
For TRL_GRPO, refer to https://huggingface.co/docs/trl/main/en/grpo_trainer for documentation about the function signature.
For VERL_GRPO, refer to https://verl.readthedocs.io/en/latest/preparation/reward_function.html for documentation about the function signature.
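A hypothetical TRL-style reward function (the exact signature depends on your TRL version, and registration in the Oumi registry is not shown):

    # Illustrative only: favors shorter completions.
    def brevity_reward(completions, **kwargs) -> list[float]:
        return [-float(len(c)) for c in completions]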
- run_name: str | None = None#
A unique identifier for the current training run.
This name is used to identify the run in logging outputs, saved model checkpoints, and experiment tracking tools like Weights & Biases or TensorBoard. It’s particularly useful when running multiple experiments or when you want to easily distinguish between different training sessions.
- save_epoch: bool = False#
Save a checkpoint at the end of every epoch.
When set to True, this ensures that a model checkpoint is saved after each complete pass through the training data. This can be useful for tracking model progress over time and for resuming training from a specific epoch if needed.
If both save_steps and save_epoch are set, then save_steps takes precedence.
- save_final_model: bool = True#
Whether to save the model at the end of training.
For different options for saving PEFT models, see PeftParams.peft_save_mode. This should normally be set to True to ensure the final trained model is saved. However, in some cases, you may want to disable it, for example:
- If saving a large model which takes a long time
- When quickly testing training speed or metrics
- During debugging or experimentation phases
- save_steps: int = 500#
Save a checkpoint every save_steps training steps.
This parameter determines the frequency of saving checkpoints during training based on the number of steps. If both save_steps and save_epoch are set, then save_steps takes precedence.
To disable saving checkpoints during training, set save_steps to 0 and save_epoch to False. If enabled, a checkpoint will be saved at the end of training if there are any residual steps left.
- seed: int = 42#
Random seed used for initialization.
This seed is passed to the trainer and to all downstream dependencies to ensure reproducibility of results. It affects random number generation in various parts of the training process, including data shuffling, weight initialization, and any stochastic operations.
- sgd_momentum: float = 0.0#
Momentum factor for SGD optimizer.
Only used when optimizer is set to “sgd”, and when trainer_type is set to OUMI. Default is 0.0.
- telemetry: TelemetryParams#
Parameters for telemetry.
This field contains telemetry configuration options.
- property telemetry_dir: Path | None#
Returns the telemetry stats output directory.
- trainer_kwargs: dict[str, Any]#
Additional keyword arguments to pass to the HF/TRL Trainer.
This allows for customization of the Trainer beyond the standard parameters defined in this class. Any key-value pairs added here will be passed directly to the Trainer’s constructor. Note that this field is only used for HuggingFace and TRL trainers (TRL_SFT, TRL_DPO, TRL_GRPO, HF).
- trainer_type: TrainerType = 'hf'#
The type of trainer to use for the training process.
Options are defined in the TrainerType enum and include:
- HF: HuggingFace’s Trainer
- TRL_SFT: TRL’s SFT Trainer
- TRL_DPO: TRL’s DPO Trainer
- TRL_GRPO: TRL’s GRPO Trainer
- OUMI: Custom generic trainer implementation
- VERL_GRPO: verl’s GRPO Trainer
- try_resume_from_last_checkpoint: bool = False#
If True, attempt to resume from the last checkpoint in “output_dir”.
If a checkpoint is found, training will resume from the model/optimizer/scheduler states loaded from this checkpoint. If no checkpoint is found, training will continue without loading any intermediate checkpoints.
Note: If resume_from_checkpoint is specified and contains a non-empty path, this parameter has no effect.
- use_deterministic: bool = False#
Whether to use deterministic algorithms for reproducibility. If set to True, this will only allow those CuDNN algorithms that are (believed to be) deterministic. Please refer to https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html for more details. If using distributed training, this will override ddp_find_unused_parameters to False, use ddp_broadcast_buffers, and disable gradient checkpointing. Note that this will not guarantee full reproducibility, but it will help to reduce the variance between runs.
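For reference, the PyTorch-level switches this option roughly corresponds to (an illustrative sketch, not Oumi’s exact setup code):

    import torch

    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True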
- use_peft: bool = False#
Whether to use Parameter-Efficient Fine-Tuning (PEFT) techniques.
PEFT methods allow for efficient adaptation of pre-trained language models to specific tasks by only updating a small number of (extra) model parameters. This can significantly reduce memory usage and training time.
- verl_config_overrides: dict[str, Any]#
Values to override in the verl config.
This field is only used for the VERL_GRPO trainer. To see supported params in verl, see: https://verl.readthedocs.io/en/latest/examples/config.html
The verl config is a nested dict, so the kwargs should be structured accordingly. For example, to set actor_rollout_ref.actor.use_kl_loss to True, you can use: {“actor_rollout_ref”: {“actor”: {“use_kl_loss”: True}}}.
The priority of setting verl config params, from highest to lowest, is:
1. Values specified by this field.
2. Values automatically set by Oumi in src/oumi/core/trainers/verl_grpo_trainer.py:_create_config() for verl params which have corresponding Oumi params. For example, Oumi’s training.output_dir -> verl’s trainer.default_local_dir.
3. Default verl config values in src/oumi/core/trainers/verl_trainer_config.yaml.
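The override example above, written out as a nested Python dict for clarity:

    verl_config_overrides = {
        "actor_rollout_ref": {
            "actor": {
                "use_kl_loss": True,
            },
        },
    }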
- warmup_ratio: float | None = None#
The ratio of total training steps used for a linear warmup from 0 to the learning rate.
If set along with warmup_steps, this value will be ignored.
- warmup_steps: int | None = None#
The number of steps for the warmup phase of the learning rate scheduler.
If set, will override the value of warmup_ratio.
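A small arithmetic sketch of how the two warmup options interact (values are illustrative):

    total_training_steps = 1_000
    warmup_ratio = 0.1
    warmup_steps_from_ratio = int(total_training_steps * warmup_ratio)  # 100 steps
    # If warmup_steps were also set (e.g., 200), it would take precedence
    # and warmup_ratio would be ignored.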
- weight_decay: float = 0.0#
Weight decay (L2 penalty) to apply to the model’s parameters.
In the HF trainers and the OUMI trainer, this is automatically applied only to weight tensors and skips biases/layernorms.
Default is 0.0 (no weight decay).
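A minimal sketch of the common “skip biases/norms” parameter-grouping pattern (Oumi’s actual grouping logic may differ in detail):

    import torch

    def build_param_groups(model: torch.nn.Module, weight_decay: float):
        decay, no_decay = [], []
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            # 1-D tensors cover biases and layernorm weights.
            if param.ndim <= 1 or name.endswith(".bias"):
                no_decay.append(param)
            else:
                decay.append(param)
        return [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ]

    optimizer = torch.optim.AdamW(build_param_groups(torch.nn.Linear(8, 2), 0.01))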