Troubleshooting#

Getting Help#

Running into a problem? Check in with us on Discord; we’re happy to help!

Still can’t find a solution? Let us know by filing a new GitHub Issue.

Common Issues#

Installing on Windows#

If you’d like to use Oumi on Windows, we strongly suggest using Windows Subsystem for Linux (WSL).

Installing natively on Windows outside of a WSL environment can lead to installation errors such as:

ERROR: Could not find a version that satisfies the requirement ... (from versions: none)

or runtime errors like:

ModuleNotFoundError: No module named 'resource'

Installing on Mac#

Oumi supports only Apple Silicon Macs, not Intel Macs, because PyTorch has dropped support for the latter. Installing on an Intel Mac can lead to errors like:

Using Python 3.11.11 environment at: /Users/moonshine/miniconda3/envs/oumi
  × No solution found when resolving dependencies:
  ╰─▶ Because only the following versions of torch are available:
          torch<=2.5.0
          torch==2.5.1
          torch>2.6.0
      and torch>=2.5.0,<=2.5.1 has no wheels with a matching platform tag
      (e.g., `macosx_10_16_x86_64`), we can conclude that torch>=2.5.0,<=2.5.1
      cannot be used.
      And because oumi==0.1.dev1313+g33c1fa9 depends on torch>=2.5.0,<2.6.0,
      we can conclude that oumi==0.1.dev1313+g33c1fa9 cannot be used.
      And because only oumi[dev]==0.1.dev1313+g33c1fa9 is available and
      you require oumi[dev], we can conclude that your requirements are
      unsatisfiable.

      hint: Wheels are available for `torch` (v2.5.1) on the following
      platforms: `manylinux1_x86_64`, `manylinux2014_aarch64`,
      `macosx_11_0_arm64`, `win_amd64`
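Before installing, you can check which architecture your Python runtime sees. The helper below is an illustrative sketch (not part of Oumi); it returns True only on an Apple Silicon Mac:

```python
import platform

def is_apple_silicon() -> bool:
    """Return True on an Apple Silicon (arm64) Mac, False elsewhere.

    Illustrative helper for checking the Apple Silicon requirement:
    platform.system() is "Darwin" on macOS, and platform.machine()
    is "arm64" on Apple Silicon (vs "x86_64" on Intel Macs).
    """
    return platform.system() == "Darwin" and platform.machine() == "arm64"
```

On an Intel Mac this returns False, matching the unsupported case above.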

Pre-commit hook errors with VS Code#

  • When committing changes, you may encounter pre-commit hook errors related to missing imports.

  • To fix this, make sure to start your VS Code instance after activating your conda environment:

    conda activate oumi
    code .  # inside the Oumi directory
    
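If you want to verify the fix took effect, a quick sanity check is to confirm the process inherited your conda environment. The helper and environment name below are illustrative, not part of Oumi; `CONDA_DEFAULT_ENV` is the variable `conda activate` sets:

```python
import os

def launched_from_env(expected: str = "oumi") -> bool:
    """Check whether this process inherited the expected conda environment.

    `conda activate` exports CONDA_DEFAULT_ENV; if VS Code (and the
    pre-commit hooks it spawns) was launched without it, imports resolve
    against the wrong interpreter. Run this from VS Code's integrated
    terminal or a Python cell to confirm the environment was inherited.
    """
    return os.environ.get("CONDA_DEFAULT_ENV") == expected
```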

Out of Memory (OOM)#

See Out of Memory (OOM) for more information.

Launching Remote Jobs Fails due to File Mounts#

When running a remote job using a command like:

oumi launch up -c your/config/file.yaml

It’s common to see failures with errors like:

ValueError: File mount source '~/.netrc' does not exist locally. To fix: check if it exists, and correct the path.

These errors indicate that your JobConfig references a file that does not exist on your local machine. If the file is unneeded, remove the offending line from your YAML file’s file_mounts to resolve the error. Otherwise, here’s how to resolve the error for specific files commonly mounted by Oumi jobs:

  • ~/.netrc: This file contains your Weights and Biases (WandB) credentials, which are needed to log your run’s metrics to WandB.

    • To fix, follow these instructions

    • If you don’t require WandB logging, disable TrainingParams.enable_wandb or EvaluationConfig.enable_wandb for training and evaluation jobs, respectively. This must be done in addition to removing the file mount to prevent an error.

  • ~/.cache/huggingface/token: This file contains your Hugging Face credentials, which are needed to access gated datasets/models on the Hugging Face Hub.
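To catch these failures before launching, you can pre-check the mount sources yourself. The sketch below assumes a SkyPilot-style file_mounts mapping of remote path to local path, as loaded from your YAML config; the helper name is hypothetical:

```python
from pathlib import Path

def missing_mounts(file_mounts: dict) -> list:
    """Return the local mount sources that do not exist on this machine.

    `file_mounts` maps remote destination paths to local source paths
    (SkyPilot-style). Any source returned here would trigger the
    "does not exist locally" ValueError at launch time.
    """
    return [
        src
        for src in file_mounts.values()
        if not Path(src).expanduser().exists()
    ]
```

Running this over your config before `oumi launch up` lets you remove or fix each offending entry in one pass.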

Training Stability & NaN Loss#

If your training loss diverges or becomes NaN, try the following:

  • Lower the initial learning rate

  • Enable gradient clipping (or, apply further clipping if already enabled)

  • Add learning rate warmup

from oumi.core.configs import TrainingConfig, TrainingParams

config = TrainingConfig(
    training=TrainingParams(
        max_grad_norm=0.5,
        optimizer="adamw_torch_fused",
        warmup_ratio=0.01,
        lr_scheduler_type="cosine",
        learning_rate=1e-5,
    ),
)
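Independent of the config above, it also helps to detect instability as early as possible. A minimal check for a bad loss value (a generic sketch, not an Oumi API) looks like:

```python
import math

def loss_is_unstable(loss: float) -> bool:
    """Flag NaN or Inf loss values so a run can be stopped early.

    Calling this on each step's loss lets you halt and retune (e.g.,
    lower the learning rate or tighten max_grad_norm) instead of
    wasting compute on a diverged run.
    """
    return math.isnan(loss) or math.isinf(loss)
```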

Inference Issues#

Quantization-Specific Issues#

Decreased model performance: