# Troubleshooting

## Getting Help

Running into a problem? Check in with us on Discord; we're happy to help!

Still can't find a solution? Let us know by filing a new GitHub Issue.
## Common Issues

### Pre-commit hook errors with VS Code

When committing changes, you may encounter errors from pre-commit hooks about missing imports. To fix this, start your VS Code instance after activating your conda environment:

```shell
# Inside the Oumi directory:
conda activate oumi
code .
```
### Out of Memory (OOM)

See the Out of Memory (OOM) guide for more information.
### Launching Remote Jobs Fails due to File Mounts

When running a remote job using a command like:

```shell
oumi launch up -c your/config/file.yaml
```

it's common to see failures with errors like:

```
ValueError: File mount source '~/.netrc' does not exist locally. To fix: check if it exists, and correct the path.
```
These errors indicate that your JobConfig references a file that does not exist on your local machine. If the mount is unneeded, you can remove the offending line from the `file_mounts` section of your YAML file to resolve the error. Otherwise, here's how to resolve the error for specific files often mounted by Oumi jobs:
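For example, if your job doesn't need a given credentials file, delete its entry from `file_mounts`. A minimal sketch, assuming the common `destination: source` mapping used by SkyPilot-style job configs (the exact entries depend on your config):

```yaml
# JobConfig (excerpt) - delete the line for any mount you don't need
file_mounts:
  ~/.netrc: ~/.netrc                          # WandB credentials
  ~/.cache/huggingface/token: ~/.cache/huggingface/token  # Hugging Face token
```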
- `~/.netrc`: This file contains your Weights and Biases (WandB) credentials, which are needed to log your run's metrics to WandB. To fix, follow these instructions. If you don't require WandB logging, disable either `TrainingParams.enable_wandb` or `EvaluationConfig.enable_wandb`, for training and evaluation jobs respectively. This is needed in addition to removing the file mount to prevent an error.
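For a training job, disabling WandB logging looks like the following YAML fragment (a minimal sketch, assuming the standard Oumi training-config layout where `TrainingParams` maps to the `training` section):

```yaml
training:
  enable_wandb: false
```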
- `~/.cache/huggingface/token`: This file contains your Hugging Face credentials, which are needed to access gated datasets/models on the Hugging Face Hub. To fix, follow these instructions.
### Training Stability & NaN Loss

If training diverges or your loss becomes NaN, try the following:

- Lower the initial learning rate
- Enable gradient clipping (or apply stricter clipping if it's already enabled)
- Add learning rate warmup
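Gradient clipping bounds the size of each update step. The idea can be sketched in plain Python (a simplified, framework-free illustration; real trainers clip tensors, not lists):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads  # already within bounds; leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

A `max_grad_norm` of 0.5, as in the example below, means any gradient whose global norm exceeds 0.5 is rescaled to exactly 0.5 before the optimizer step.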
A configuration applying these mitigations:

```python
from oumi.core.configs import TrainingConfig, TrainingParams

config = TrainingConfig(
    training=TrainingParams(
        max_grad_norm=0.5,
        optimizer="adamw_torch_fused",
        warmup_ratio=0.01,
        lr_scheduler_type="cosine",
        learning_rate=1e-5,
    ),
)
```
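The `warmup_ratio` and cosine schedule combine into a learning-rate curve like the following pure-Python sketch (the helper name and exact decay shape are assumptions for illustration; the HF-style cosine schedule is similar):

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.01):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly so early steps take small, stable updates.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup keeps the very first updates small, which often prevents the early loss spikes that lead to NaNs.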
### Inference Issues

- Ensure input data is correctly formatted and preprocessed
- Validate that the inference engine is compatible with your model type
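A quick sanity check on chat-style input records can catch formatting problems before they reach the engine. This is a generic sketch, not an Oumi API; the field names (`messages`, `role`, `content`) are assumptions based on the common chat format:

```python
def validate_chat_example(example):
    """Return True if a chat-format record has valid roles and non-empty content."""
    messages = example.get("messages", [])
    if not messages:
        return False
    allowed_roles = {"system", "user", "assistant"}
    for msg in messages:
        if msg.get("role") not in allowed_roles or not msg.get("content"):
            return False
    return True
```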
### Quantization-Specific Issues

Decreased model performance:

- Increase the `lora_r` and `lora_alpha` parameters in `oumi.core.configs.PeftParams`
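A YAML fragment raising both parameters might look like this (the values are illustrative, not recommended defaults; `lora_alpha` is often kept at roughly twice `lora_r`):

```yaml
peft:
  lora_r: 32
  lora_alpha: 64
```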