Monitoring & Debugging#
Introduction#
Effective monitoring and debugging tools are crucial for successful model training. In this guide, we’ll cover key signals to monitor during training, how to interpret them, and how to address common issues.
In particular, we’ll cover:
Tracking training metrics and progress in real-time
Monitoring system resource utilization (GPU, Memory, CPU) and training efficiency (MFU, Throughput)
Debugging training issues and performance bottlenecks
Training Monitoring#
Training Metrics#
During training, Oumi provides detailed progress information in the terminal:
Epoch 1/3: 100%|██████████| 1000/1000 [00:45<00:00, 22.22it/s]
{
    'loss': 2.1234,
    'learning_rate': 5e-5,
    'epoch': 1.0,
    'tokens_per_second': 156.7,  # Tokens/second
}
If an output_dir is specified, the trainer also saves logs and artifacts to that directory.
This includes training metrics, the full configuration, and environment and package versions, to support reproducibility and debugging.
Let’s see how to configure logging in more detail:
training:
  # Logging frequency
  logging_steps: 50
  logging_strategy: "steps" # "steps", "epoch", or "no"
  logging_first_step: false # the first step is not logged by default, as it's often an initialization step
  # Log levels
  log_level: "info"
  dep_log_level: "warning" # the default log level for dependencies (e.g. transformers, tokenizers, etc.)
  # Performance metrics
  include_performance_metrics: true # enable performance metrics (throughput, MFU, etc.)
  log_model_summary: true # log a summary of the model architecture and parameters
  # Logging output
  output_dir: "output/my_run" # the directory to save logs and artifacts
  run_name: "my_run" # the name of the run
  # Third-party integrations
  enable_wandb: true # enable Weights & Biases integration
  enable_tensorboard: true # enable TensorBoard integration
Note:
If training.enable_wandb is set to true, make sure you have a W&B account and API key configured.
If training.enable_tensorboard is set to true, you can launch TensorBoard locally by running tensorboard --logdir <output_dir> --port 6006, or use the VS Code TensorBoard extension.
Remote Job Logs#
When running locally, accessing all this information is straightforward, as everything is saved to the specified output_dir.
However, when deploying training jobs on remote clusters or cloud instances, this can be a bit more challenging.
Here are some ways to access logs and artifacts from remote jobs on your local machine:
View Live Logs
To view live logs of your remote jobs:
# For SkyPilot managed clusters
sky logs oumi-cluster
# For a specific job ID
sky logs oumi-cluster JOB_ID
Download Logs
To download logs and artifacts from remote jobs:
# Download all logs from a cluster
sky logs --sync-down oumi-cluster
# Download specific files
rsync -Pavz oumi-cluster:/path/to/file .
Log Locations
Common log locations:
Training logs: output/<run_name>/logs/
TensorBoard logs: output/<run_name>/tensorboard/
Profiler traces: output/<run_name>/profiler/
Telemetry data: output/<run_name>/telemetry/
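Once the logs have been synced down, a few lines of Python can give a quick overview of what a run produced. This is just a convenience sketch; it assumes the default layout above and a hypothetical run directory of output/my_run:
from pathlib import Path

run_dir = Path("output/my_run")  # hypothetical run directory
for sub in ("logs", "tensorboard", "profiler", "telemetry"):
    path = run_dir / sub
    if path.exists():
        # Count files in each artifact directory to confirm the run produced output.
        num_files = sum(1 for p in path.rglob("*") if p.is_file())
        print(f"{sub}: {num_files} file(s)")
    else:
        print(f"{sub}: not found (the corresponding feature may be disabled)")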
Torch Profiler Integration#
Oumi integrates PyTorch’s profiler for detailed performance analysis.
The PyTorch profiler provides comprehensive performance insights, including CPU and CUDA operation timings, memory usage analysis, tensor shape and size tracking, FLOPS calculations, module-level profiling, and Chrome trace export for visualization.
See the ProfilerParams class for more details.
Here’s an example configuration:
training:
  profiler:
    # Enable profiling
    enable_cpu_profiling: true
    enable_cuda_profiling: true
    profile_memory: false
    # Configure profiling schedule
    schedule:
      enable_schedule: true
      skip_first: 5
      wait: 1
      warmup: 1
      active: 3
      repeat: 2
    # Output configuration, enable as needed
    save_dir: "profiler_output"
    record_shapes: false
    with_stack: false
    with_flops: false
    with_modules: false
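For orientation, the schedule fields above map onto PyTorch's torch.profiler.schedule(). The raw-PyTorch sketch below is illustrative only, with placeholder model and data; it is not how Oumi wires the profiler internally:
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(512, 512)                    # placeholder model
batches = [torch.randn(8, 512) for _ in range(16)]   # placeholder data

# skip_first / wait / warmup / active / repeat mirror the schedule fields above.
prof_schedule = schedule(skip_first=5, wait=1, warmup=1, active=3, repeat=2)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    schedule=prof_schedule,
    on_trace_ready=tensorboard_trace_handler("profiler_output"),
    profile_memory=False,
    record_shapes=False,
) as prof:
    for batch in batches:
        model(batch)   # stand-in for a real training step
        prof.step()    # advance the profiling schedule each iteration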
Distributed Training#
The telemetry logger provides comprehensive system monitoring by tracking CPU timing metrics, CUDA operation timing, GPU memory usage, and GPU temperature.
It also collects cross-rank statistics for distributed training scenarios and monitors overall system resource utilization to give a complete picture of training performance.
In distributed training, each rank (GPU) collects its own telemetry data. The telemetry tracker provides methods to:
Collect telemetry from all ranks using get_summaries_from_all_ranks()
Compute cross-rank statistics using compute_cross_rank_summaries()
Analyze performance variations across GPUs
Here’s an example configuration:
training:
  telemetry:
    # Collection settings
    collect_telemetry_for_all_ranks: true # Collect from all ranks
    track_gpu_temperature: true # Monitor GPU temperatures
  # Enable performance metrics for telemetry
  include_performance_metrics: true
The cross-rank statistics give valuable insight into distributed training performance. They help identify load imbalances between GPUs, stragglers that slow down distributed execution, variations in memory usage patterns, temperature differences between devices, and timing variations in critical operations.
For example, in a 4-GPU training setup, you might see output like:
Cross-Rank Statistics:
  GPU Temperature:
    max:
      mean: 75.2C     # Average of max temperatures across ranks
      std_dev: 2.1C   # Variation in max temperatures
      min: 72.0C      # Lowest max temperature among ranks
      max: 78.0C      # Highest max temperature among ranks
    mean:
      mean: 71.5C     # Average of mean temperatures
      std_dev: 1.8C   # Variation in mean temperatures
      min: 69.0C      # Lowest mean temperature
      max: 74.0C      # Highest mean temperature
  Timers:
    model_forward:
      mean:
        mean: 0.096s    # Average forward pass time across ranks
        std_dev: 0.008s # Variation in forward pass times
        min: 0.085s     # Fastest rank
        max: 0.105s     # Slowest rank
      max:
        mean: 0.125s    # Average of max forward times
        std_dev: 0.010s # Variation in max times
        min: 0.115s     # Lowest max time
        max: 0.142s     # Highest max time
    preprocessing:
      mean:
        mean: 0.016s    # Average preprocessing time
        std_dev: 0.003s # Variation in preprocessing
        min: 0.012s     # Fastest rank
        max: 0.020s     # Slowest rank
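On the main process, you can produce a similar aggregation yourself using the two methods listed earlier. The call to compute_cross_rank_summaries below is a sketch only; its exact arguments are an assumption, so check the TelemetryTracker API before relying on it:
from oumi.performance.telemetry import TelemetryTracker

telemetry = TelemetryTracker()
# ... run the training loop on every rank, using the telemetry timers ...

# Gather per-rank summaries; typically only the main process receives the full list.
rank_summaries = telemetry.get_summaries_from_all_ranks()
if rank_summaries:
    # ASSUMPTION: the argument structure shown here may differ from the real API.
    cross_rank = telemetry.compute_cross_rank_summaries(
        rank_summaries,
        measurement_names={"gpu_temperature": {"max", "mean"}},
    )
    print(cross_rank)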
Adding Telemetry to Your Code#
When writing your own training code or pre-processing pipelines, you can add telemetry by using the TelemetryTracker class.
Here’s an example:
from oumi.performance.telemetry import TelemetryTracker

# Initialize telemetry
telemetry = TelemetryTracker()

for step, batch in enumerate(dataloader):
    # Track CUDA operations (e.g. forward pass)
    with telemetry.cuda_timer("forward_pass"):
        outputs = model(batch)

    # Track CPU operations (e.g. loss computation and backward pass)
    with telemetry.timer("backward_pass"):
        loss = criterion(outputs, batch["labels"])
        loss.backward()

    optimizer.step()
    optimizer.zero_grad()

    # Monitor GPU resources periodically
    if step % 50 == 0:
        telemetry.log_gpu_memory()
        telemetry.record_gpu_temperature()

    if step % 100 == 0:
        # Print a summary of the telemetry collected so far
        telemetry.print_summary()
Example summary output:
Telemetry Summary (my-machine):
Total time: 3600.0 seconds
CPU Timers:
  total_training:
    Total: 3600.0s Mean: 3600.0s
    Count: 1 Percentage of total time: 100.00%
  epoch:
    Total: 3590.0s Mean: 897.5s
    Count: 4 Percentage of total time: 99.72%
  backward_pass:
    Total: 720.0s Mean: 0.18s
    Count: 4000 Percentage of total time: 20.00%
CUDA Timers:
  forward_pass:
    Total: 2160.0s Mean: 0.54s
    Count: 4000 Percentage of total time: 60.00%
Peak GPU memory usage: 14.5 GB
GPU temperature: Mean: 72°C Max: 76°C
The telemetry tracker provides:
CPU timers via the timer() context manager
CUDA timers via the cuda_timer() context manager
GPU memory tracking via log_gpu_memory()
GPU temperature monitoring via record_gpu_temperature()
Automatic statistics calculation and reporting via print_summary()
For distributed training scenarios, see the Distributed Training section above.
Callbacks#
Oumi provides several callbacks for monitoring training progress and performance:
training:
  # Enable performance metrics, MFU monitoring, NaN/Inf detection, and telemetry callbacks
  include_performance_metrics: true
Model Flops Utilization (MFU)#
MFU measures how efficiently your model utilizes the available compute resources:
training:
  # Enable MFU monitoring
  include_performance_metrics: true

# For accurate MFU calculation
model:
  model_max_length: 2048 # Must be set for MFU calculation
MFU monitoring provides:
Step-level MFU: Efficiency during actual training steps
Overall training MFU: Efficiency including all overhead
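As a rough sanity check (not Oumi's exact formula), MFU is the ratio of achieved training FLOP/s to the hardware's theoretical peak, with achieved FLOP/s commonly approximated as 6 × parameter count × tokens per second:
def approximate_mfu(
    num_params: float,             # model parameter count, e.g. 7e9
    tokens_per_second: float,      # measured training throughput
    peak_flops_per_device: float,  # e.g. ~312e12 for an A100 in BF16
    num_devices: int = 1,
) -> float:
    """Back-of-the-envelope MFU using the common 6 * N * tokens/s approximation."""
    achieved_flops_per_second = 6 * num_params * tokens_per_second
    return achieved_flops_per_second / (peak_flops_per_device * num_devices)

# Example: a 7B-parameter model training at 1,500 tokens/s on a single A100.
print(f"MFU ~ {approximate_mfu(7e9, 1_500, 312e12):.1%}")  # roughly 20%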
Telemetry Callback#
While the telemetry tracker can be used directly in your code, Oumi also provides a TelemetryCallback that integrates with the training loop to automatically collect timing metrics, GPU stats, and other telemetry data.
The callback will automatically:
Time each phase of training (microsteps, steps, epochs)
Log GPU temperature if enabled
Save detailed telemetry data to files:
telemetry_callback_metrics_rankXXXX.json: Training metrics
telemetry_callback_rankXXXX.json: Timing statistics
telemetry_callback_all_ranks.json: Cross-rank data (distributed training)
telemetry_callback_gpu_temperature_summary.json: GPU temperature stats
For distributed training, the callback can collect data from all ranks and compute cross-rank statistics. This is controlled by the world_process_zero_only parameter:
If True: only collect data on the main process (rank 0)
If False: collect data from all ranks and compute cross-rank statistics
Example output files in distributed training:
output/my_run/
├── telemetry_callback_metrics_rank0000.json # Training metrics from rank 0
├── telemetry_callback_rank0000.json # Timing stats from rank 0
├── telemetry_callback_all_ranks.json # Data from all ranks
└── telemetry_callback_gpu_temperature_summary.json # Cross-rank GPU stats
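These are plain JSON files, so they can be inspected with standard tooling after a run. The snippet below loads the rank-0 timing file from the example tree above; the exact key structure depends on what was collected:
import json
from pathlib import Path

run_dir = Path("output/my_run")
timing_file = run_dir / "telemetry_callback_rank0000.json"

timing = json.loads(timing_file.read_text())
# Top-level keys vary with what was collected (timers, GPU stats, etc.).
print(list(timing.keys()))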
The callback integrates with both W&B and TensorBoard logging, adding telemetry metrics to your training logs when include_timer_metrics=True:
rank000_microsteps_mean: 0.118s
rank000_steps_mean: 0.453s
rank000_gpu_temperature_max: 75.2C
...
Next Steps#
Explore training methods for specific tasks
Check out configuration options for detailed settings