Training Environments#

Training machine learning models requires different environments as you progress from initial experimentation and debugging to large-scale deployment.

Oumi supports training in various environments to suit different workflows and preferences. Moving between environments is streamlined through consistent configuration:

  • The train.yaml config file defines your model, dataset, and training parameters (see the sketch after this list),

  • The job_config.yaml file specifies your compute resource requirements (optional when training locally).
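
For illustration, here is a minimal sketch of the two files. The field names and example values are assumptions chosen to show the split between training parameters and resource requirements; they are not guaranteed to match the exact Oumi configuration schema, so treat the official configuration reference as authoritative.

```yaml
# train.yaml -- illustrative sketch of a training config (keys are assumptions).
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"  # example model id

data:
  train:
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"          # example dataset id

training:
  output_dir: "output/smollm2-sft"
  per_device_train_batch_size: 4
  learning_rate: 2.0e-5
  max_steps: 100
```

```yaml
# job_config.yaml -- illustrative sketch of a job config (keys are assumptions;
# only needed when launching on remote resources).
name: smollm2-sft
resources:
  cloud: gcp              # example cloud provider
  accelerators: "A100:1"  # request one A100 GPU
run: |
  oumi train -c train.yaml
```

Because all training parameters live in train.yaml, the same file can be reused unchanged as you move from a laptop to a notebook to a remote cluster; only the job config changes.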

Environment Overview#

| Environment | Best For | Advantages | Resource Scale | Setup Complexity |
|---|---|---|---|---|
| Local | Initial development, algorithmic testing | Rapid development cycles with immediate feedback loops | CPU only, single GPU, or multi-GPU (1-8) | 🟒 Easy: Python + GPU drivers |
| VSCode, Cursor | Debugging | Step-by-step debugging with seamless Git integration and remote development support, so you can debug code running on a remote GPU machine | CPU only, single GPU, or multi-GPU (1-8) | 🟑 Moderate: IDE setup + extensions |
| Notebooks | Research, interactive experimentation, visualization | Fluid experimentation with real-time code execution and immediate feedback | CPU only, single GPU, or multi-GPU (1-8) | 🟒 Easy: Jupyter setup |
| Remote | Production training, large-scale deployment, hyperparameter tuning | Enterprise-grade deployment with automated resource allocation and cluster management; integrates seamlessly with major cloud providers | Multi-node deployments (16+ GPUs) up to frontier scale (1000+ GPUs); see the job config sketch below the table | Scales with size: 🟑 Moderate for a single node (1-8 GPUs); πŸ”΄ Complex for multi-node (16-64 GPUs); πŸ”΄ Advanced for large clusters (64+ GPUs) |
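
To make the Remote row concrete, the sketch below shows that scaling up is primarily a change to the job config, not the training config: the same hypothetical train.yaml from earlier is reused, and only the requested resources grow. The keys shown (num_nodes, accelerators) and the launch line are illustrative assumptions; the exact schema and the recommended multi-node launch command are defined in Oumi's job and launcher documentation.

```yaml
# job_config.yaml -- hypothetical multi-node variant (keys are assumptions).
name: smollm2-sft-16gpu
num_nodes: 2               # 2 nodes x 8 GPUs = 16 GPUs total
resources:
  cloud: gcp               # example cloud provider
  accelerators: "A100:8"   # 8 GPUs per node
run: |
  # Reuses the same train.yaml from the local workflow. Multi-node runs may
  # need a distributed launch wrapper (e.g. torchrun); see the Oumi launcher
  # docs for the exact command.
  oumi train -c train.yaml
```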

Next Steps#