# Training Environments
Training machine learning models requires different environments as you progress from initial experimentation and debugging to large-scale deployment. Oumi supports training in a variety of environments to suit different workflows and preferences, and moving between them is streamlined through a consistent pair of configuration files:
- The `train.yaml` config file outlines your model, dataset, and training parameters.
- The `job_config.yaml` file contains your resource requirements (optional for training locally).
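For reference, a minimal `train.yaml` might look like the sketch below. This is illustrative rather than authoritative: the `model`/`data`/`training` sections follow Oumi's general config structure, but the specific model, dataset, and parameter values are placeholder assumptions; consult the configuration docs for the exact schema.

```yaml
# Sketch of a minimal train.yaml (values are illustrative assumptions;
# see the configuration docs for the authoritative schema).
model:
  model_name: "meta-llama/Llama-3.2-1B-Instruct"  # any HF model ID

data:
  train:
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"      # example SFT dataset

training:
  trainer_type: "TRL_SFT"               # supervised fine-tuning
  per_device_train_batch_size: 2
  max_steps: 100                        # keep runs short while iterating
  output_dir: "output/llama-3.2-1b.sft"
```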
## Environment Overview
| Environment | Best For | Advantages | Resource Scale | Setup Complexity |
|---|---|---|---|---|
| Local | Initial development, algorithmic testing | Rapid development cycles with immediate feedback loops | CPU only, Single GPU, Multi-GPU (1-8) | 🟢 Easy |
| VSCode | Debugging | Step-by-step debugging with seamless Git integration and remote development support, letting you debug code running on a remote GPU machine | CPU only, Single GPU, Multi-GPU (1-8) | 🟡 Moderate |
| Notebooks | Research, interactive experimentation, visualization | Fluid experimentation with real-time code execution and immediate feedback | CPU only, Single GPU, Multi-GPU (1-8) | 🟢 Easy |
| Remote clusters | Production training, large-scale deployment, hyperparameter tuning | Enterprise-grade deployment with automated resource allocation and cluster management; integrates seamlessly with major cloud providers | Multi-node deployments (16+ GPUs) | Scales with size |
## Recommended Workflow
While Oumi supports multiple training environments, we recommend a systematic progression through development stages:
1. **Start Local and Small**: Begin with local development using smaller models (like LLaMA-3.2-1B) to establish core functionality. If you are on CPU, even smaller models like `SmolLM-135m` and `gpt2` are recommended for faster experimentation.
2. **Debug in VSCode (or IDE of choice)**: Leverage VSCode's step-by-step debugging and remote development support, which lets you debug code running on a remote GPU machine (much easier than print statements everywhere).
3. **Scale to Small Distributed**: Test on multi-GPU setups (e.g., 1x8 GPU configurations) to validate distributed training.
4. **Deploy to Cluster**: Scale to cloud providers (GCP, AWS, Lambda Labs, Polaris, Frontier, etc.) or custom clusters when ready for full-scale training; a sketch of a multi-node job config follows this list.
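For the cluster step, the same `train.yaml` is reused and only the `job_config.yaml` changes to describe where and how the job runs. The sketch below assumes a two-node GCP deployment; the field names mirror Oumi's job config structure, but the cloud, accelerator string, and run command here are illustrative assumptions rather than a verified recipe (see the remote training docs).

```yaml
# Sketch of a multi-node job_config.yaml (illustrative assumptions;
# see the remote training docs for the exact schema).
name: llama-sft-multinode
num_nodes: 2                   # scale out after single-node training works
resources:
  cloud: gcp                   # e.g., gcp, aws, lambda
  accelerators: "A100:8"       # 8 GPUs per node
working_dir: .                 # synced to the cluster
run: |
  oumi distributed torchrun -m oumi train -c train.yaml
```

With both files in place, the job could then be submitted through the Oumi launcher CLI (e.g., something like `oumi launch up -c job_config.yaml`, assuming the documented launcher command for your cloud).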
## Next Steps
- Check out configuration options
- Set up monitoring tools
- Explore remote training