oumi.core.callbacks#

Trainer callbacks module for the Oumi (Open Universal Machine Intelligence) library.

This module provides trainer callbacks that customize the behavior of the training loop in the Oumi Trainer. Callbacks can inspect training loop state for progress reporting, logging, early stopping, and more.
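The callbacks below all follow the same event-hook pattern: the trainer invokes optional `on_*` methods at fixed points in the loop. A minimal self-contained sketch of that pattern, where `StubTrainerCallback` is an illustrative stand-in for `BaseTrainerCallback` (i.e. `transformers.TrainerCallback`) and is not part of the library:

```python
class StubTrainerCallback:
    """Illustrative stand-in for TrainerCallback: hooks default to no-ops."""

    def on_step_end(self, args, state=None, control=None, **kwargs):
        pass

    def on_log(self, args, state=None, control=None, **kwargs):
        pass


class StepCounterCallback(StubTrainerCallback):
    """Example callback that counts completed optimizer steps."""

    def __init__(self):
        self.steps_completed = 0

    def on_step_end(self, args, state=None, control=None, **kwargs):
        self.steps_completed += 1


# Simulated training loop: the trainer would call the hook after each step.
callback = StepCounterCallback()
for _ in range(3):
    callback.on_step_end(args=None)
print(callback.steps_completed)  # 3
```

A real callback would subclass `transformers.TrainerCallback` and be passed to the trainer's `callbacks` list.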

oumi.core.callbacks.BaseTrainerCallback#

alias of TrainerCallback

class oumi.core.callbacks.HfMfuTrainerCallback(dtype: dtype)[source]#

Bases: TrainerCallback

Trainer callback to calculate the MFU of the model during training.

Relies on the model's FLOPs estimate computed by HuggingFace in the total_flos metric.

on_log(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called after logging the last logs.

on_step_begin(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the beginning of each train step.

on_step_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of each train step.

Note that this will be called after all gradient accumulation substeps.
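MFU (Model FLOPs Utilization) is the ratio of achieved FLOP/s to the device's theoretical peak FLOP/s. A minimal sketch of how MFU can be derived from two readings of a cumulative FLOPs counter such as `total_flos` (the helper name and numbers are illustrative; 3.12e14 is roughly A100 BF16 peak):

```python
def mfu_from_total_flos(prev_flos: float, curr_flos: float,
                        elapsed_seconds: float, peak_flops: float) -> float:
    """MFU = achieved FLOP/s divided by theoretical peak FLOP/s."""
    achieved_flops_per_sec = (curr_flos - prev_flos) / elapsed_seconds
    return achieved_flops_per_sec / peak_flops


# Example: 1e15 cumulative FLOPs done over 10 s on a ~3.12e14 FLOP/s device.
mfu = mfu_from_total_flos(0.0, 1e15, 10.0, 3.12e14)  # ~0.32, i.e. ~32% MFU
```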

class oumi.core.callbacks.MfuTrainerCallback(dtype: dtype, num_params: int, sequence_length: int, num_layers: int | None = None, num_attention_heads: int | None = None, attention_head_size: int | None = None, add_rematerialization: bool = False)[source]#

Bases: TrainerCallback

Trainer callback to calculate the MFU of the model during training.

Should be compatible with all trainers that inherit from transformers.Trainer.

on_log(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called after logging the last logs.

on_step_begin(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the beginning of each train step.

on_step_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of each train step.

Note that this will be called after all gradient accumulation substeps.
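Unlike the HuggingFace-based variant above, this callback estimates FLOPs analytically from the constructor arguments (`num_params`, `sequence_length`, etc.). A common approximation is ~6 FLOPs per parameter per token for a combined forward and backward pass, with rematerialization (activation checkpointing) adding roughly one extra forward pass. A sketch of that estimate under those assumptions (the helper name is illustrative, not the library's internal formula):

```python
def estimate_step_flops(num_params: int, tokens_per_step: int,
                        add_rematerialization: bool = False) -> int:
    """Approximate FLOPs per optimizer step for a dense transformer."""
    # Forward ~ 2*N FLOPs/token, backward ~ 4*N: ~6*N per token in total.
    flops_per_token = 6 * num_params
    if add_rematerialization:
        # Rematerialization re-runs the forward pass: ~2*N extra per token.
        flops_per_token += 2 * num_params
    return flops_per_token * tokens_per_step


flops = estimate_step_flops(1_000, 10)               # 60_000
flops_remat = estimate_step_flops(1_000, 10, True)   # 80_000
```

This approximation ignores the attention term, which is why the constructor also accepts `num_layers`, `num_attention_heads`, and `attention_head_size` for a finer-grained estimate.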

class oumi.core.callbacks.NanInfDetectionCallback(metrics: list[str])[source]#

Bases: TrainerCallback

Trainer callback to detect abnormal values (NaN, INF) of selected metrics.

For example, a NaN loss value is an almost certain sign that training is going badly; in that case it is best to detect the condition early and fail.

on_log(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called after logging the last logs.
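A self-contained sketch of the detection logic this callback performs on each log event: scan the watched metrics and fail fast on NaN or INF (the helper name is illustrative, not the library's internal implementation):

```python
import math


def check_metrics(logs: dict, metrics: list[str]) -> None:
    """Raise early if any watched metric is NaN or infinite."""
    for name in metrics:
        value = logs.get(name)
        if value is not None and (math.isnan(value) or math.isinf(value)):
            raise RuntimeError(f"Abnormal value for metric '{name}': {value}")


check_metrics({"loss": 0.42}, ["loss"])                  # OK, no exception
# check_metrics({"loss": float("nan")}, ["loss"])        # raises RuntimeError
```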

class oumi.core.callbacks.ProfilerStepCallback(profiler)[source]#

Bases: TrainerCallback

Trainer callback to notify the PyTorch profiler about training step completion.

Also adds microstep function labels using torch.profiler.record_function().

on_step_begin(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the beginning of a training step.

If using gradient accumulation, one training step might take several inputs.

on_step_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of each train step.

Note that this will be called after all gradient accumulation substeps.

on_substep_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of a substep during gradient accumulation.
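The essential job of this callback is to call the profiler's `step()` once per completed optimizer step so the profiler's schedule (wait/warmup/active phases) advances correctly. A minimal sketch, where `StubProfiler` is an illustrative stand-in for `torch.profiler.profile` (which exposes the same `step()` method):

```python
class StubProfiler:
    """Illustrative stand-in for torch.profiler.profile."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


class ProfilerStepSketch:
    """Advance the profiler once per completed optimizer step."""

    def __init__(self, profiler):
        self._profiler = profiler

    def on_step_end(self, args, state=None, control=None, **kwargs):
        self._profiler.step()


profiler = StubProfiler()
cb = ProfilerStepSketch(profiler)
for _ in range(4):          # simulate four optimizer steps
    cb.on_step_end(args=None)
```

Labeling microsteps with `torch.profiler.record_function("...")` additionally makes gradient-accumulation substeps visible as named ranges in the trace.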

class oumi.core.callbacks.TelemetryCallback(skip_first_steps: int = 1, world_process_zero_only: bool = True, include_timer_metrics: bool = False, track_gpu_temperature: bool = False, output_dir: Path | None = None)[source]#

Bases: TrainerCallback

Trainer callback to collect sub-step/step/epoch timings.

Based on oumi.performance.telemetry.TelemetryTracker.

on_epoch_begin(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the beginning of an epoch.

on_epoch_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of an epoch.

on_log(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called after logging the last logs.

on_step_begin(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the beginning of a training step.

If using gradient accumulation, one training step might take several inputs.

on_step_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of each train step.

Note that this will be called after all gradient accumulation substeps.

on_substep_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of a substep during gradient accumulation.

on_train_end(args: TrainingArguments | TrainingParams, state: TrainerState | None = None, control: TrainerControl | None = None, **kwargs)[source]#

Event called at the end of training.
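The hooks above pair naturally: `on_step_begin` starts a timer that `on_step_end` stops, and similarly for epochs and substeps. A self-contained sketch of that step-timing pattern (the class name is illustrative; the real callback delegates to oumi.performance.telemetry.TelemetryTracker and can also track GPU temperature):

```python
import time


class StepTimerSketch:
    """Record per-step wall-clock durations, as TelemetryCallback does."""

    def __init__(self):
        self._start = None
        self.step_seconds = []

    def on_step_begin(self, args, state=None, control=None, **kwargs):
        self._start = time.perf_counter()

    def on_step_end(self, args, state=None, control=None, **kwargs):
        if self._start is not None:
            self.step_seconds.append(time.perf_counter() - self._start)
            self._start = None


timer = StepTimerSketch()
timer.on_step_begin(args=None)
timer.on_step_end(args=None)    # one duration recorded
```

With `world_process_zero_only=True`, timings would be collected only on the main process, and `skip_first_steps` discards warmup steps that would otherwise skew the statistics.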