# Training Environments
Training machine learning models requires different environments as you progress from initial experimentation and debugging to large-scale deployment. Oumi supports training in a variety of environments to suit different workflows and preferences, and moving between them is streamlined through a consistent pair of configuration files:
- The `train.yaml` config file outlines your model, dataset, and training parameters.
- The `job_config.yaml` file contains your resource requirements (optional for training locally).
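For reference, a minimal `train.yaml` might look like the sketch below. This is illustrative rather than authoritative: the `model`/`data`/`training` sections follow Oumi's general config structure, but the specific model, dataset, and parameter values are placeholder assumptions; consult the configuration docs for the exact schema.

```yaml
# Sketch of a minimal train.yaml (values are illustrative assumptions;
# see the configuration docs for the authoritative schema).
model:
  model_name: "meta-llama/Llama-3.2-1B-Instruct"  # any HF model ID

data:
  train:
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"      # example SFT dataset

training:
  trainer_type: "TRL_SFT"               # supervised fine-tuning
  per_device_train_batch_size: 2
  max_steps: 100                        # keep runs short while iterating
  output_dir: "output/llama-3.2-1b.sft"
```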
## Environment Overview
| Environment | Best For | Advantages | Resource Scale | Setup Complexity |
|---|---|---|---|---|
| Local | Initial development, algorithmic testing | Rapid development cycles with immediate feedback loops | CPU only, Single GPU, Multi-GPU (1-8) | 🟢 Easy |
| VSCode | Debugging | Step-by-step debugging with seamless Git integration and remote development support, letting you debug code running on a remote GPU machine | CPU only, Single GPU, Multi-GPU (1-8) | 🟡 Moderate |
| Notebooks | Research, interactive experimentation, visualization | Fluid experimentation with real-time code execution and immediate feedback | CPU only, Single GPU, Multi-GPU (1-8) | 🟢 Easy |
| Remote clusters | Production training, large-scale deployment, hyperparameter tuning | Enterprise-grade deployment with automated resource allocation and cluster management; integrates seamlessly with major cloud providers | Multi-node deployments (16+ GPUs) | Scales with size |
## Recommended Workflow
While Oumi supports multiple training environments, we recommend a systematic progression through development stages:
1. **Start Local and Small**: Begin with local development using smaller models (like LLaMA-3.2-1B) to establish core functionality. If you are on CPU, even smaller models like `SmolLM-135m` and `gpt2` are recommended for faster experimentation.
2. **Debug in VSCode (or IDE of choice)**: Leverage VSCode's step-by-step debugging and remote development support, which lets you debug code running on a remote GPU machine (much easier than print statements everywhere).
3. **Scale to Small Distributed**: Test on multi-GPU setups (e.g., 1x8 GPU configurations) to validate distributed training.
4. **Deploy to Cluster**: Scale to cloud providers (GCP, AWS, Lambda Labs, Polaris, Frontier, etc.) or custom clusters when ready for full-scale training; a sketch of a multi-node job config follows this list.
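For the cluster step, the same `train.yaml` is reused and only the `job_config.yaml` changes to describe where and how the job runs. The sketch below assumes a two-node GCP deployment; the field names mirror Oumi's job config structure, but the cloud, accelerator string, and run command here are illustrative assumptions rather than a verified recipe (see the remote training docs).

```yaml
# Sketch of a multi-node job_config.yaml (illustrative assumptions;
# see the remote training docs for the exact schema).
name: llama-sft-multinode
num_nodes: 2                   # scale out after single-node training works
resources:
  cloud: gcp                   # e.g., gcp, aws, lambda
  accelerators: "A100:8"       # 8 GPUs per node
working_dir: .                 # synced to the cluster
run: |
  oumi distributed torchrun -m oumi train -c train.yaml
```

With both files in place, the job could then be submitted through the Oumi launcher CLI (e.g., something like `oumi launch up -c job_config.yaml`, assuming the documented launcher command for your cloud).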
## Next Steps
- Check out configuration options
- Set up monitoring tools
- Explore remote training