<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# OpenEnv GRPO with trl

In this tutorial notebook, we're going to use Oumi to train an agentic model on an [OpenEnv](https://github.com/meta-pytorch/OpenEnv) Echo reinforcement learning (RL) environment with the GRPO algorithm. To achieve this, we use the trl library by Hugging Face with a custom rollout function to interact with the vLLM server and OpenEnv environment. This notebook is derived from trl's [Echo environment example](https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/echo.py).

# 📋 Prerequisites

❗**NOTICE:** This notebook must run on a machine with at least two GPUs: one for the vLLM inference server and one for the trainer.

## Oumi Installation

First, let's install the latest versions of Oumi and OpenEnv. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [None]:
!pip install uv && uv pip install "oumi[gpu] @ git+https://github.com/oumi-ai/oumi.git"
!uv pip install git+https://github.com/meta-pytorch/OpenEnv.git

In [1]:
import os
from pathlib import Path

tutorial_dir = "openenv_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable warnings from HF.

# Start OpenEnv and vLLM servers

We need to run two servers alongside the trl trainer. The OpenEnv server receives actions from the LLM and returns the updated state and reward. The vLLM server handles inference; during training, the trainer sends it the updated model weights. We start each server in its own subprocess.

In [2]:
%%writefile $tutorial_dir/start_openenv_server.py

import os
import subprocess
import sys
import threading
import time
from pathlib import Path

import requests


def stream_output(pipe, prefix=""):
    """Stream output lines from subprocess pipe to stdout."""
    for line in iter(pipe.readline, ""):
        print(f"{prefix}{line}", end="")
    pipe.close()


print("⚡ Starting FastAPI server for Echo Environment...")

work_dir = str(Path.cwd().parent.absolute())

server_process = subprocess.Popen(
    [
        sys.executable,
        "-m",
        "uvicorn",
        "envs.echo_env.server.app:app",
        "--host",
        "0.0.0.0",
        "--port",
        "8001",
    ],
    env={**os.environ, "PYTHONPATH": f"{work_dir}/src"},
    # Discard stdout: a pipe that is never read can fill up and block the server.
    stdout=subprocess.DEVNULL,
    stderr=subprocess.PIPE,
    text=True,
    cwd=work_dir,
)

# Start a background thread to stream the server's stderr
threading.Thread(
    target=stream_output, args=(server_process.stderr, "🔥 [stderr] "), daemon=True
).start()

print("⏳ Waiting for server to start...")
time.sleep(5)

try:
    response = requests.get("http://0.0.0.0:8001/health", timeout=2)
    response.raise_for_status()  # Treat a non-2xx response as a failed start.
    print("\n✅ Echo Environment server is running!")
except Exception as e:
    print(f"\n❌ Server failed to start: {e}")
    print("\n📋 Checking error output...")
    server_process.poll()
    if server_process.stderr:
        stderr = server_process.stderr.read()
        if stderr:
            print(stderr)
    raise

try:
    input("Press Enter to exit...\n")
finally:
    print("üõë Stopping server...")
    server_process.terminate()
    server_process.wait()

Overwriting openenv_tutorial/start_openenv_server.py


In [3]:
import subprocess
import sys

# Start both servers in the background
server1 = subprocess.Popen(
    [
        "bash",
        "-c",
        (
            "CUDA_VISIBLE_DEVICES=0 trl vllm-serve "
            "--model Qwen/Qwen2.5-0.5B-Instruct "
            "--log-level warning "
            "--host 0.0.0.0 --port 8000"
        ),
    ]
)
# Use the notebook's interpreter so the script runs in the same environment.
server2 = subprocess.Popen(
    [sys.executable, f"{tutorial_dir}/start_openenv_server.py"]
)

print("Servers started. PIDs:", server1.pid, server2.pid)

Servers started. PIDs: 3787594 3787595


In [4]:
import time

import requests

URL = "http://0.0.0.0:8000/health"


def check_vllm_health():
    """Checks if the vLLM server is healthy."""
    try:
        response = requests.get(URL, timeout=3)
        if response.status_code == 200:
            print("✅ vLLM server is healthy!")
            return True
        else:
            print(f"⚠️ Server responded with {response.status_code}")
    except requests.RequestException as e:
        print(f"❌ Server not ready: {e}")
    return False


max_retries = 24
for attempt in range(1, max_retries + 1):
    if check_vllm_health():
        break
    time.sleep(5)
else:
    print(f"❌ Failed to start vLLM server after {max_retries} attempts.")

⚡ Starting FastAPI server for Echo Environment...
⏳ Waiting for server to start...
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7d30b0f93950>: Failed to establish a new connection: [Errno 111] Connection refused'))
🔥 [stderr] INFO:     Started server process [3787596]
🔥 [stderr] INFO:     Waiting for application startup.
🔥 [stderr] INFO:     Application startup complete.
🔥 [stderr] INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)

✅ Echo Environment server is running!
Press Enter to exit...
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7d30b0f9df50>: Failed to establish a new connection: [Errno 111] Connection refused'))
❌ Server not ready: HTTPConnectionPool(host='0.

`torch_dtype` is deprecated! Use `dtype` instead!


INFO 10-31 17:31:42 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=16384.
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7d30b0fad190>: Failed to establish a new connection: [Errno 111] Connection refused'))
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7d30b0fa44d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
INFO 10-31 17:31:51 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:52 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:52 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='Qwen/Qwen2.5-0.5B-Instruct



[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:53 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:53 [gpu_model_runner.py:2338] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:53 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pi

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.39it/s]
(EngineCore_DP0 pid=3787957)


(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:53 [default_loader.py:268] Loading weights took 0.18 seconds
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:54 [gpu_model_runner.py:2392] Model loading took 0.9266 GiB and 0.487034 seconds
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7d30b0f9d690>: Failed to establish a new connection: [Errno 111] Connection refused'))
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:58 [backends.py:539] Using cache directory: /home/wizeng/.cache/vllm/torch_compile_cache/5d31f4c583/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:58 [backends.py:550] Dynamo bytecode transform time: 3.53 s
(EngineCore_DP0 pid=3787957) INFO 10-31 17:31:59 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cac

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  81%|████████  | 54/67 [00:01<00:00, 38.49it/s]

❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7d30b0faefd0>: Failed to establish a new connection: [Errno 111] Connection refused'))


Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:01<00:00, 40.82it/s]


(EngineCore_DP0 pid=3787957) INFO 10-31 17:32:02 [gpu_model_runner.py:3118] Graph capturing finished in 2 secs, took 0.50 GiB
(EngineCore_DP0 pid=3787957) INFO 10-31 17:32:02 [gpu_worker.py:391] Free memory on device (78.59/79.19 GiB) on startup. Desired GPU memory utilization is (0.9, 71.27 GiB). Actual usage is 0.93 GiB for weight, 5.57 GiB for peak activation, 0.07 GiB for non-torch memory, and 0.5 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=68779733708` to fit into requested memory, or `--kv-cache-memory=76635612672` to fully utilize gpu memory. Current kv cache memory in use is 69478085324 bytes.
(EngineCore_DP0 pid=3787957) INFO 10-31 17:32:02 [core.py:218] init engine (profile, create kv cache, warmup model) took 8.68 seconds
INFO 10-31 17:32:04 [llm.py:295] Supported_tasks: ['generate']
INFO 10-31 17:32:04 [__init__.py:36] No IOProcessor plugins requested by the model
✅ vLLM server is healthy!


# Train the model!

By providing a custom rollout function to interact with the OpenEnv and vLLM servers, we can use trl to do agentic GRPO training. We also need to provide a reward function that processes the reward value output by the environment.
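To make the role of the reward concrete, here is a minimal sketch of the group-relative advantage that gives GRPO its name: each completion's reward is compared against the other completions sampled for the same prompt. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration; trl's actual implementation may differ in its details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, num_generations, eps=1e-4):
    """Normalize each reward against its own group (illustrative sketch).

    `rewards` is a flat list in which every consecutive block of
    `num_generations` entries belongs to the same prompt.
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("rewards must contain whole groups")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start : start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)  # assumption: population std, chosen for simplicity
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts with two completions each.
advs = group_relative_advantages([1.0, 3.0, 10.0, 10.0], num_generations=2)
print(advs)
```

In the first group the better completion gets a positive advantage and the worse one a negative advantage. In the second group both completions earned the same reward, so both advantages are zero and that group contributes no learning signal.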

The following script defines the custom rollout and reward functions and runs the trainer. We run it as a subprocess so that we can set `CUDA_VISIBLE_DEVICES` for the trainer alone, keeping it off the GPU used by the vLLM server.

In [6]:
%%writefile $tutorial_dir/train.py

import requests
from envs.echo_env import EchoEnv
from envs.echo_env.models import EchoAction

from oumi.core.configs import TrainingConfig
from oumi.core.registry import RegistryType, register
from oumi.train import train


@register("env_reward", RegistryType.REWARD_FUNCTION)
def reward_from_env(completions, **kwargs):
    """Reward function that uses the environment reward."""
    # Extract environment rewards from kwargs (propagated via extra_fields)
    env_rewards = kwargs.get("env_reward", [])
    if env_rewards:
        return [float(reward) for reward in env_rewards]
    else:
        # Fallback if env_reward is not available
        return [0.0] * len(completions)


@register("echo_env_vllm_rollout", RegistryType.ROLLOUT_FUNCTION)
def echo_env_vllm_rollout(
    prompts: list[str], args, processing_class
) -> dict[str, list]:
    """Custom rollout function that generates completions via vLLM server and computes environment rewards.

    Args:
        prompts: List of prompts to generate from
        args: GRPOConfig containing all sampling parameters
        processing_class: Tokenizer/processor for decoding completions

    Returns:
        Dict containing prompt_ids, completion_ids, logprobs, and env_reward
    """  # noqa: E501
    # 1. Generate completions via vLLM inference server (running on port 8000)
    payload = {
        "prompts": prompts,
        "n": args.num_generations,
        "temperature": args.temperature,
        "top_p": args.top_p,
        "top_k": -1 if args.top_k is None else args.top_k,
        "min_p": 0.0 if args.min_p is None else args.min_p,
        "max_tokens": args.max_completion_length,
        "repetition_penalty": args.repetition_penalty,
    }
    response = requests.post("http://0.0.0.0:8000/generate/", json=payload)

    if response.status_code != 200:
        print(f"Error response: {response.text}")

    response.raise_for_status()
    result = response.json()

    completions_text = processing_class.batch_decode(
        result["completion_ids"], skip_special_tokens=True
    )

    # 2. Step through the environment to get rewards
    client = EchoEnv(base_url="http://0.0.0.0:8001")
    env_result = client.reset()
    env_rewards = []
    for msg in completions_text:
        env_result = client.step(EchoAction(message=msg))
        env_rewards.append(env_result.reward)

    # 3. Add environment rewards as extra field
    result["env_reward"] = env_rewards

    return result


config = TrainingConfig.from_yaml("openenv_tutorial/grpo_train.yaml")
train(config)

Overwriting openenv_tutorial/train.py
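The data flow between the rollout and reward functions can be checked without any servers running. The sketch below is a simplified stand-in, not the real code path: `FakeEchoClient`, `FakeStepResult`, and `fake_rollout` are hypothetical names, and the length-based reward is an arbitrary choice for illustration rather than the Echo environment's actual scoring. The reward function is a copy of `reward_from_env` above without the Oumi registry decorator.

```python
from dataclasses import dataclass


@dataclass
class FakeStepResult:
    reward: float


class FakeEchoClient:
    """Hypothetical stand-in for the environment client (not the OpenEnv API)."""

    def reset(self):
        return FakeStepResult(reward=0.0)

    def step(self, message: str):
        # Illustrative reward only: longer messages score higher.
        return FakeStepResult(reward=0.1 * len(message))


def reward_from_env(completions, **kwargs):
    """Same logic as the registered reward function, minus the decorator."""
    env_rewards = kwargs.get("env_reward", [])
    if env_rewards:
        return [float(reward) for reward in env_rewards]
    return [0.0] * len(completions)


def fake_rollout(completions_text, client):
    """Step the environment once per completion and collect the rewards."""
    client.reset()
    rewards = [client.step(msg).reward for msg in completions_text]
    return {"env_reward": rewards}


completions = ["hi", "hello there"]
result = fake_rollout(completions, FakeEchoClient())

# The extra field from the rollout reaches the reward function as a keyword argument.
with_rewards = reward_from_env(completions, **result)
# Without it, the reward function falls back to zeros.
without_rewards = reward_from_env(completions)
print(with_rewards, without_rewards)
```

The key point is the hand-off: whatever the rollout stores under `env_reward` is what the reward function later reads from `kwargs`, one value per completion.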


Finally, we define the YAML training config, and kick off training!

To enable logging to Weights and Biases, uncomment the relevant line in the config below, and make sure to [set up wandb](https://oumi.ai/docs/en/latest/development/dev_setup.html#optional-set-up-weights-and-biases) on your machine.

In [7]:
%%writefile $tutorial_dir/grpo_train.yaml

model:
  # Use the same model that the vLLM server was started with.
  model_name: "Qwen/Qwen2.5-0.5B-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"

data:
  train:
    datasets:
      - dataset_name: "trl-lib/ultrafeedback-prompt"
        split: "train"
        sample_count: 100

training:
  trainer_type: "TRL_GRPO"
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 4

  reward_functions: ["env_reward"]

  ddp_find_unused_parameters: False
  optimizer: "adamw_torch_fused"

  grpo:
    use_vllm: True
    rollout_function: "echo_env_vllm_rollout"
    max_completion_length: 2048

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  num_train_epochs: 1
  logging_steps: 1
  log_model_summary: False
  output_dir: "openenv_tutorial/echo_grpo"
  # Uncomment to enable wandb logging
  # enable_wandb: True

Overwriting openenv_tutorial/grpo_train.yaml
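As a rough sanity check, the batch settings in this config can be related to the progress bars in the training output. The arithmetic below is a sketch inferred from that output rather than from the trainer's source: `num_generations = 8` is an assumption deduced from the log lines showing 4 requests expanding to 32 processed prompts, and `num_devices = 1` reflects the single GPU given to the trainer.

```python
sample_count = 100                # data.train.datasets[0].sample_count
per_device_train_batch_size = 8   # training.per_device_train_batch_size
gradient_accumulation_steps = 4   # training.gradient_accumulation_steps
num_devices = 1                   # the trainer only sees one GPU
num_generations = 8               # assumption: inferred from the logs (32 / 4)

# Completions consumed by one optimizer step.
completions_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Unique prompts needed to produce those completions.
prompts_per_step = completions_per_step // num_generations
# Optimizer steps in one epoch over the sampled prompts.
optimizer_steps = sample_count // prompts_per_step

print(completions_per_step, prompts_per_step, optimizer_steps)
```

Under these assumptions the numbers line up with the logs: 4 prompts are sent per step, 32 completions come back, and one epoch takes 25 steps.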


In [8]:
import os
import subprocess
import sys

env = {**os.environ, "CUDA_VISIBLE_DEVICES": "1"}
# Run the trainer as a subprocess to reinitialize CUDA with only the second GPU visible.
subprocess.run(
    [sys.executable, str(Path(tutorial_dir) / "train.py")], env=env, check=True
)

[2025-10-31 17:32:17,666][oumi][rank0][pid:3788270][MainThread][INFO]][train.py:117] Creating training.output_dir: openenv_tutorial/echo_grpo...
[2025-10-31 17:32:17,668][oumi][rank0][pid:3788270][MainThread][INFO]][train.py:119] Created training.output_dir absolute path: /home/wizeng/repos/oumi/notebooks/openenv_tutorial/echo_grpo
[2025-10-31 17:32:17,669][oumi][rank0][pid:3788270][MainThread][INFO]][train.py:117] Creating training.telemetry_dir: openenv_tutorial/echo_grpo/telemetry...
[2025-10-31 17:32:17,672][oumi][rank0][pid:3788270][MainThread][INFO]][train.py:119] Created training.telemetry_dir absolute path: /home/wizeng/repos/oumi/notebooks/openenv_tutorial/echo_grpo/telemetry
[2025-10-31 17:32:17,675][oumi][rank0][pid:3788270][MainThread][INFO]][torch_utils.py:80] Torch version: 2.8.0+cu128. NumPy version: 1.26.4
[2025-10-31 17:32:17,675][oumi][rank0][pid:3788270][MainThread][INFO]][torch_utils.py:88] CUDA version: 12.8 
[2025-10-31 17:32:17,676][oumi][rank0][pid:3788270][Main

`torch_dtype` is deprecated! Use `dtype` instead!


[2025-10-31 17:32:19,171][oumi][rank0][pid:3788270][MainThread][INFO]][torch_utils.py:288] 
Model Parameters Summary:
🔢 Total     parameters: 494,032,768
🔗 Embedding parameters: 136,134,656
🎯 Trainable parameters: 494,032,768
🔒 Frozen    parameters: 0 (0.00%)

INFO 10-31 17:32:19 [__init__.py:216] Automatically detected platform cuda.
[2025-10-31 17:32:19,942][oumi][rank0][pid:3788270][MainThread][INFO]][torch_profiler_utils.py:164] PROF: Torch Profiler disabled!


  trainer = HuggingFaceTrainer(cls(*args, **kwargs, args=hf_args), processor)
The model is already on multiple devices. Skipping the move to device specified in `args`.


INFO 10-31 17:32:20 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 10-31 17:32:20 [pynccl.py:70] vLLM is using nccl==2.27.3
(EngineCore_DP0 pid=3787957) INFO 10-31 17:32:20 [__init__.py:1433] Found nccl from library libnccl.so.2
(EngineCore_DP0 pid=3787957) INFO 10-31 17:32:20 [pynccl.py:70] vLLM is using nccl==2.27.3
[2025-10-31 17:32:21,330][oumi][rank0][pid:3788270][MainThread][INFO]][device_utils.py:343] GPU Metrics Before Training: GPU runtime info: NVidiaGpuRuntimeInfo(device_index=0, device_count=2, used_memory_mb=75593.0, temperature=33, fan_speed=None, fan_speeds=None, power_usage_watts=123.946, power_limit_watts=700.0, gpu_utilization=0, memory_utilization=0, performance_state=0, clock_speed_graphics=1980, clock_speed_sm=1980, clock_speed_memory=2619).
[2025-10-31 17:32:21,330][oumi][rank0][pid:3788270][MainThread][INFO]][train.py:553] Training init time: 3.665s
[2025-10-31 17:32:21,330][oumi][rank0][pid:3788270][MainThread][INFO]][trai

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
  0%|          | 0/25 [00:00<?, ?it/s]

(EngineCore_DP0 pid=3787957) INFO 10-31 17:32:22 [block_pool.py:292] Successfully reset prefix cache


Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1391.38it/s]
Processed prompts: 100%|██████████| 32/32 [00:01<00:00, 27.21it/s, est. speed input: 1054.58 toks/s, output: 3325.28 toks/s]
  4%|▍         | 1/25 [00:02<01:02,  2.62s/it]

{'loss': -0.3104, 'grad_norm': 5.6875, 'learning_rate': 5e-05, 'num_tokens': 5150.0, 'completions/mean_length': 122.1875, 'completions/min_length': 22.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.15625, 'completions/mean_terminated_length': 97.40740966796875, 'completions/min_terminated_length': 22.0, 'completions/max_terminated_length': 244.0, 'rewards/reward_from_env/mean': 60.24374771118164, 'rewards/reward_from_env/std': 46.80415344238281, 'reward': 60.24374771118164, 'reward_std': 20.179214477539062, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.10425987094640732, 'sampling/sampling_logp_difference/max': 1.4170303344726562, 'sampling/importance_sampling_ratio/min': 0.24243289232254028, 'sampling/importance_sampling_ratio/mean': 1.0240256786346436, 'sampling/importance_sampling_ratio/max': 1.5776448249816895, 'entropy': 1.3606750071048737, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1356.72it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.80it/s, est. speed input: 3143.41 toks/s, output: 9801.17 toks/s]
  8%|▊         | 2/25 [00:04<00:48,  2.09s/it]

{'loss': -0.088, 'grad_norm': 3.96875, 'learning_rate': 4.8e-05, 'num_tokens': 14605.0, 'completions/mean_length': 223.71875, 'completions/min_length': 15.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.8125, 'completions/mean_terminated_length': 83.83333587646484, 'completions/min_terminated_length': 15.0, 'completions/max_terminated_length': 195.0, 'rewards/reward_from_env/mean': 103.9625015258789, 'rewards/reward_from_env/std': 39.819908142089844, 'reward': 103.9625015258789, 'reward_std': 14.376317977905273, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.0843813493847847, 'sampling/sampling_logp_difference/max': 1.4582233428955078, 'sampling/importance_sampling_ratio/min': 0.23264925181865692, 'sampling/importance_sampling_ratio/mean': 1.0169103145599365, 'sampling/importance_sampling_ratio/max': 1.5537441968917847, 'entropy': 1.0307188630104065, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ra

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1658.48it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.41it/s, est. speed input: 2062.37 toks/s, output: 11055.28 toks/s]
 12%|█▏        | 3/25 [00:06<00:42,  1.92s/it]

{'loss': -0.0031, 'grad_norm': 3.40625, 'learning_rate': 4.600000000000001e-05, 'num_tokens': 24273.0, 'completions/mean_length': 254.625, 'completions/min_length': 212.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.96875, 'completions/mean_terminated_length': 212.0, 'completions/min_terminated_length': 212.0, 'completions/max_terminated_length': 212.0, 'rewards/reward_from_env/mean': 121.828125, 'rewards/reward_from_env/std': 23.6931209564209, 'reward': 121.828125, 'reward_std': 15.864302635192871, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07195073366165161, 'sampling/sampling_logp_difference/max': 1.2133426666259766, 'sampling/importance_sampling_ratio/min': 0.2972021698951721, 'sampling/importance_sampling_ratio/mean': 1.011971116065979, 'sampling/importance_sampling_ratio/max': 1.4239530563354492, 'entropy': 0.8396566212177277, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1631.87it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.91it/s, est. speed input: 3905.33 toks/s, output: 10986.34 toks/s]
 16%|█▌        | 4/25 [00:07<00:37,  1.80s/it]

{'loss': 0.0021, 'grad_norm': 3.390625, 'learning_rate': 4.4000000000000006e-05, 'num_tokens': 35377.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 124.26875305175781, 'rewards/reward_from_env/std': 22.02469825744629, 'reward': 124.26875305175781, 'reward_std': 16.020652770996094, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06552394479513168, 'sampling/sampling_logp_difference/max': 1.3148822784423828, 'sampling/importance_sampling_ratio/min': 0.26850593090057373, 'sampling/importance_sampling_ratio/mean': 1.0130901336669922, 'sampling/importance_sampling_ratio/max': 1.4909359216690063, 'entropy': 0.76953125, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1804.78it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.95it/s, est. speed input: 1632.18 toks/s, output: 10995.63 toks/s]
 20%|██        | 5/25 [00:09<00:34,  1.73s/it]

{'loss': -0.0022, 'grad_norm': 3.625, 'learning_rate': 4.2e-05, 'num_tokens': 44785.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 124.9593734741211, 'rewards/reward_from_env/std': 20.39957046508789, 'reward': 124.9593734741211, 'reward_std': 16.528228759765625, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07429219037294388, 'sampling/sampling_logp_difference/max': 1.242635726928711, 'sampling/importance_sampling_ratio/min': 0.2886224687099457, 'sampling/importance_sampling_ratio/mean': 1.0142146348953247, 'sampling/importance_sampling_ratio/max': 1.556097149848938, 'entropy': 0.8701171875, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.000732421875, 'clip_

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1583.50it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 37.20it/s, est. speed input: 2278.77 toks/s, output: 9524.26 toks/s]
 24%|██▍       | 6/25 [00:11<00:32,  1.73s/it]

{'loss': -0.0007, 'grad_norm': 3.25, 'learning_rate': 4e-05, 'num_tokens': 54937.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 131.5187530517578, 'rewards/reward_from_env/std': 11.204102516174316, 'reward': 131.5187530517578, 'reward_std': 9.789403915405273, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06402796506881714, 'sampling/sampling_logp_difference/max': 1.597865104675293, 'sampling/importance_sampling_ratio/min': 0.20232799649238586, 'sampling/importance_sampling_ratio/mean': 1.0114407539367676, 'sampling/importance_sampling_ratio/max': 1.5237730741500854, 'entropy': 0.70703125, 'clip_ratio/low_mean': 0.0003662109375, 'clip_ratio/low_min': 0.0003662109375, 'clip_ratio/high_mean': 0.0006103515625, 'clip_ra

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1428.70it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.04it/s, est. speed input: 3443.20 toks/s, output: 11018.15 toks/s]
 28%|██▊       | 7/25 [00:12<00:30,  1.70s/it]

{'loss': 0.0017, 'grad_norm': 3.046875, 'learning_rate': 3.8e-05, 'num_tokens': 65689.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 138.65936279296875, 'rewards/reward_from_env/std': 7.89592981338501, 'reward': 138.65936279296875, 'reward_std': 5.593094825744629, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06447682529687881, 'sampling/sampling_logp_difference/max': 1.380960464477539, 'sampling/importance_sampling_ratio/min': 0.2513370215892792, 'sampling/importance_sampling_ratio/mean': 1.0118381977081299, 'sampling/importance_sampling_ratio/max': 1.638052225112915, 'entropy': 0.70703125, 'clip_ratio/low_mean': 0.0003662109375, 'clip_ratio/low_min': 0.0003662109375, 'clip_ratio/high_mean': 0.000244140625, 'clip_

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1692.28it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.56it/s, est. speed input: 3054.11 toks/s, output: 10896.80 toks/s]
 32%|███▏      | 8/25 [00:14<00:28,  1.66s/it]

{'loss': 0.0014, 'grad_norm': 3.078125, 'learning_rate': 3.6e-05, 'num_tokens': 76177.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 140.49374389648438, 'rewards/reward_from_env/std': 6.9356465339660645, 'reward': 140.49374389648438, 'reward_std': 6.394322395324707, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.04982556402683258, 'sampling/sampling_logp_difference/max': 1.3971309661865234, 'sampling/importance_sampling_ratio/min': 0.2473054677248001, 'sampling/importance_sampling_ratio/mean': 1.0109286308288574, 'sampling/importance_sampling_ratio/max': 1.5759567022323608, 'entropy': 0.5390625, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.000244140625, 'cl

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1776.87it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.24it/s, est. speed input: 1556.86 toks/s, output: 11070.94 toks/s]
 36%|███▌      | 9/25 [00:16<00:27,  1.73s/it]

{'loss': -0.0002, 'grad_norm': 2.78125, 'learning_rate': 3.4000000000000007e-05, 'num_tokens': 85521.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 146.29061889648438, 'rewards/reward_from_env/std': 11.264800071716309, 'reward': 146.29061889648438, 'reward_std': 8.717338562011719, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.05180970951914787, 'sampling/sampling_logp_difference/max': 1.4854364395141602, 'sampling/importance_sampling_ratio/min': 0.22640350461006165, 'sampling/importance_sampling_ratio/mean': 1.011709451675415, 'sampling/importance_sampling_ratio/max': 1.5055581331253052, 'entropy': 0.56396484375, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1612.42it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.11it/s, est. speed input: 2358.33 toks/s, output: 10676.88 toks/s]
 40%|████      | 10/25 [00:17<00:25,  1.71s/it]

{'loss': -0.0219, 'grad_norm': 2.859375, 'learning_rate': 3.2000000000000005e-05, 'num_tokens': 95426.0, 'completions/mean_length': 253.53125, 'completions/min_length': 177.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.96875, 'completions/mean_terminated_length': 177.0, 'completions/min_terminated_length': 177.0, 'completions/max_terminated_length': 177.0, 'rewards/reward_from_env/mean': 151.10000610351562, 'rewards/reward_from_env/std': 14.319016456604004, 'reward': 151.10000610351562, 'reward_std': 11.974573135375977, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06071959435939789, 'sampling/sampling_logp_difference/max': 1.414407730102539, 'sampling/importance_sampling_ratio/min': 0.24306952953338623, 'sampling/importance_sampling_ratio/mean': 1.013283371925354, 'sampling/importance_sampling_ratio/max': 2.0, 'entropy': 0.6508394777774811, 'clip_ratio/low_mean': 0.00024903831945266575, 'clip_ratio/low_min': 0.00024903831945266575, 'clip

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1577.25it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.01it/s, est. speed input: 2677.89 toks/s, output: 10048.74 toks/s]
 44%|████▍     | 11/25 [00:19<00:23,  1.69s/it]

{'loss': -0.1194, 'grad_norm': 4.46875, 'learning_rate': 3e-05, 'num_tokens': 104893.0, 'completions/mean_length': 233.59375, 'completions/min_length': 7.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.90625, 'completions/mean_terminated_length': 17.0, 'completions/min_terminated_length': 7.0, 'completions/max_terminated_length': 37.0, 'rewards/reward_from_env/mean': 139.95623779296875, 'rewards/reward_from_env/std': 44.71029281616211, 'reward': 139.95623779296875, 'reward_std': 26.40770149230957, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07985985279083252, 'sampling/sampling_logp_difference/max': 1.7661480903625488, 'sampling/importance_sampling_ratio/min': 0.17099036276340485, 'sampling/importance_sampling_ratio/mean': 1.0169445276260376, 'sampling/importance_sampling_ratio/max': 1.905361294746399, 'entropy': 0.8642259538173676, 'clip_ratio/low_mean': 0.00013896609016228467, 'clip_ratio/low_min': 0.00013896609016228467, 'clip_ratio/hi

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1282.76it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.06it/s, est. speed input: 3722.67 toks/s, output: 10768.32 toks/s]
 48%|████▊     | 12/25 [00:21<00:21,  1.68s/it]

{'loss': -0.0009, 'grad_norm': 4.28125, 'learning_rate': 2.8000000000000003e-05, 'num_tokens': 115917.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 158.69375610351562, 'rewards/reward_from_env/std': 14.300506591796875, 'reward': 158.69375610351562, 'reward_std': 11.313620567321777, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06296081840991974, 'sampling/sampling_logp_difference/max': 2.0188417434692383, 'sampling/importance_sampling_ratio/min': 0.13280920684337616, 'sampling/importance_sampling_ratio/mean': 1.016083002090454, 'sampling/importance_sampling_ratio/max': 1.487716794013977, 'entropy': 0.671875, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.00

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1635.05it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 40.84it/s, est. speed input: 2021.76 toks/s, output: 10455.90 toks/s]
 52%|█████▏    | 13/25 [00:22<00:20,  1.67s/it]

{'loss': -0.0001, 'grad_norm': 4.71875, 'learning_rate': 2.6000000000000002e-05, 'num_tokens': 125693.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 163.359375, 'rewards/reward_from_env/std': 13.915802955627441, 'reward': 163.359375, 'reward_std': 12.928295135498047, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07757525146007538, 'sampling/sampling_logp_difference/max': 2.3781018257141113, 'sampling/importance_sampling_ratio/min': 0.09272641688585281, 'sampling/importance_sampling_ratio/mean': 1.0198965072631836, 'sampling/importance_sampling_ratio/max': 1.5324656963348389, 'entropy': 0.89453125, 'clip_ratio/low_mean': 0.0008544921875, 'clip_ratio/low_min': 0.0008544921875, 'clip_ratio/high_mean': 0.000732421875, 

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1258.51it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.02it/s, est. speed input: 3487.91 toks/s, output: 10504.36 toks/s]
 56%|█████▌    | 14/25 [00:24<00:18,  1.65s/it]

{'loss': -0.059, 'grad_norm': 4.25, 'learning_rate': 2.4e-05, 'num_tokens': 136348.0, 'completions/mean_length': 249.96875, 'completions/min_length': 63.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.96875, 'completions/mean_terminated_length': 63.0, 'completions/min_terminated_length': 63.0, 'completions/max_terminated_length': 63.0, 'rewards/reward_from_env/mean': 167.2687530517578, 'rewards/reward_from_env/std': 26.248825073242188, 'reward': 167.2687530517578, 'reward_std': 19.7154598236084, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06686343997716904, 'sampling/sampling_logp_difference/max': 1.2733135223388672, 'sampling/importance_sampling_ratio/min': 0.2799026370048523, 'sampling/importance_sampling_ratio/mean': 1.0195963382720947, 'sampling/importance_sampling_ratio/max': 1.6750993728637695, 'entropy': 0.754233181476593, 'clip_ratio/low_mean': 0.0003662109375, 'clip_ratio/low_min': 0.0003662109375, 'clip_ratio/high_mean': 0.00061

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1904.34it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.01it/s, est. speed input: 1387.31 toks/s, output: 11012.33 toks/s]
 60%|██████    | 15/25 [00:25<00:16,  1.63s/it]

{'loss': 0.0007, 'grad_norm': 3.015625, 'learning_rate': 2.2000000000000003e-05, 'num_tokens': 145572.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 173.2843780517578, 'rewards/reward_from_env/std': 9.06617546081543, 'reward': 173.2843780517578, 'reward_std': 8.840656280517578, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.05370745807886124, 'sampling/sampling_logp_difference/max': 1.5851564407348633, 'sampling/importance_sampling_ratio/min': 0.20491573214530945, 'sampling/importance_sampling_ratio/mean': 1.0140749216079712, 'sampling/importance_sampling_ratio/max': 1.6439921855926514, 'entropy': 0.6044921875, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.0

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1568.40it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.42it/s, est. speed input: 2683.29 toks/s, output: 10860.34 toks/s]
 64%|██████▍   | 16/25 [00:27<00:14,  1.62s/it]

{'loss': 0.0022, 'grad_norm': 3.6875, 'learning_rate': 2e-05, 'num_tokens': 155788.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 175.69375610351562, 'rewards/reward_from_env/std': 24.015649795532227, 'reward': 175.69375610351562, 'reward_std': 16.03937530517578, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07281577587127686, 'sampling/sampling_logp_difference/max': 1.3870906829833984, 'sampling/importance_sampling_ratio/min': 0.24980100989341736, 'sampling/importance_sampling_ratio/mean': 1.0174846649169922, 'sampling/importance_sampling_ratio/max': 1.7054287195205688, 'entropy': 0.85546875, 'clip_ratio/low_mean': 0.000244140625, 'clip_ratio/low_min': 0.000244140625, 'clip_ratio/high_mean': 0.001220703125, 'clip_

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1722.51it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.04it/s, est. speed input: 2001.76 toks/s, output: 11020.36 toks/s]
 68%|██████▊   | 17/25 [00:29<00:12,  1.61s/it]

{'loss': 0.0012, 'grad_norm': 4.25, 'learning_rate': 1.8e-05, 'num_tokens': 165468.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 180.13125610351562, 'rewards/reward_from_env/std': 18.830968856811523, 'reward': 180.13125610351562, 'reward_std': 17.099414825439453, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07025197893381119, 'sampling/sampling_logp_difference/max': 1.3835391998291016, 'sampling/importance_sampling_ratio/min': 0.2506897449493408, 'sampling/importance_sampling_ratio/mean': 1.0149098634719849, 'sampling/importance_sampling_ratio/max': 2.0, 'entropy': 0.7724609375, 'clip_ratio/low_mean': 0.0003662109375, 'clip_ratio/low_min': 0.0003662109375, 'clip_ratio/high_mean': 0.0003662109375, 'clip_ratio/high

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1262.58it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 41.95it/s, est. speed input: 3671.35 toks/s, output: 10741.24 toks/s]
 72%|███████▏  | 18/25 [00:30<00:11,  1.61s/it]

{'loss': 0.0013, 'grad_norm': 3.4375, 'learning_rate': 1.6000000000000003e-05, 'num_tokens': 176460.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 187.6062469482422, 'rewards/reward_from_env/std': 17.654661178588867, 'reward': 187.6062469482422, 'reward_std': 15.463003158569336, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06519210338592529, 'sampling/sampling_logp_difference/max': 1.522028923034668, 'sampling/importance_sampling_ratio/min': 0.21826860308647156, 'sampling/importance_sampling_ratio/mean': 1.0162379741668701, 'sampling/importance_sampling_ratio/max': 1.509002447128296, 'entropy': 0.748046875, 'clip_ratio/low_mean': 0.000244140625, 'clip_ratio/low_min': 0.000244140625, 'clip_ratio/high_mean': 0.00061

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1816.50it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.03it/s, est. speed input: 1484.77 toks/s, output: 11017.33 toks/s]
 76%|███████▌  | 19/25 [00:32<00:09,  1.59s/it]

{'loss': -0.0013, 'grad_norm': 2.859375, 'learning_rate': 1.4000000000000001e-05, 'num_tokens': 185756.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 193.078125, 'rewards/reward_from_env/std': 13.261190414428711, 'reward': 193.078125, 'reward_std': 10.372503280639648, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.05257996916770935, 'sampling/sampling_logp_difference/max': 1.1692800521850586, 'sampling/importance_sampling_ratio/min': 0.3105904757976532, 'sampling/importance_sampling_ratio/mean': 1.0121604204177856, 'sampling/importance_sampling_ratio/max': 1.7386900186538696, 'entropy': 0.6025390625, 'clip_ratio/low_mean': 0.000244140625, 'clip_ratio/low_min': 0.000244140625, 'clip_ratio/high_mean': 0.00048828125, '

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 2023.55it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.07it/s, est. speed input: 1852.04 toks/s, output: 11026.02 toks/s]
 80%|████████  | 20/25 [00:33<00:08,  1.60s/it]

{'loss': 0.0006, 'grad_norm': 2.984375, 'learning_rate': 1.2e-05, 'num_tokens': 195324.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 200.0968780517578, 'rewards/reward_from_env/std': 11.700551986694336, 'reward': 200.0968780517578, 'reward_std': 10.329671859741211, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.056813858449459076, 'sampling/sampling_logp_difference/max': 1.5537490844726562, 'sampling/importance_sampling_ratio/min': 0.21145372092723846, 'sampling/importance_sampling_ratio/mean': 1.0133092403411865, 'sampling/importance_sampling_ratio/max': 1.617047667503357, 'entropy': 0.6494140625, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0001220703125, 'clip_ratio/high_max':

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1563.00it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.59it/s, est. speed input: 3343.48 toks/s, output: 10903.49 toks/s]
 84%|████████▍ | 21/25 [00:35<00:06,  1.62s/it]

{'loss': -0.0011, 'grad_norm': 2.96875, 'learning_rate': 1e-05, 'num_tokens': 206028.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 199.984375, 'rewards/reward_from_env/std': 8.327726364135742, 'reward': 199.984375, 'reward_std': 7.488068103790283, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06027976796030998, 'sampling/sampling_logp_difference/max': 1.3746700286865234, 'sampling/importance_sampling_ratio/min': 0.25292304158210754, 'sampling/importance_sampling_ratio/mean': 1.0146417617797852, 'sampling/importance_sampling_ratio/max': 1.6927820444107056, 'entropy': 0.68359375, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.00048828125, 'clip_ratio/high_max

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1648.22it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.39it/s, est. speed input: 1897.13 toks/s, output: 10852.79 toks/s]
 88%|████████▊ | 22/25 [00:37<00:04,  1.60s/it]

{'loss': 0.0009, 'grad_norm': 2.90625, 'learning_rate': 8.000000000000001e-06, 'num_tokens': 215652.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 204.25625610351562, 'rewards/reward_from_env/std': 14.855082511901855, 'reward': 204.25625610351562, 'reward_std': 14.283214569091797, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.05508602410554886, 'sampling/sampling_logp_difference/max': 1.4210529327392578, 'sampling/importance_sampling_ratio/min': 0.2414596527814865, 'sampling/importance_sampling_ratio/mean': 1.0129733085632324, 'sampling/importance_sampling_ratio/max': 1.617936611175537, 'entropy': 0.6337890625, 'clip_ratio/low_mean': 0.0003662109375, 'clip_ratio/low_min': 0.0003662109375, 'clip_ratio/high_mean': 0.

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1764.72it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.75it/s, est. speed input: 2244.40 toks/s, output: 10944.05 toks/s]
 92%|█████████▏| 23/25 [00:38<00:03,  1.67s/it]

{'loss': 0.0024, 'grad_norm': 3.3125, 'learning_rate': 6e-06, 'num_tokens': 225524.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 194.9343719482422, 'rewards/reward_from_env/std': 9.946619033813477, 'reward': 194.9343719482422, 'reward_std': 9.08108139038086, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.048330195248126984, 'sampling/sampling_logp_difference/max': 1.457280158996582, 'sampling/importance_sampling_ratio/min': 0.23286877572536469, 'sampling/importance_sampling_ratio/mean': 1.012702226638794, 'sampling/importance_sampling_ratio/max': 1.5781943798065186, 'entropy': 0.56689453125, 'clip_ratio/low_mean': 0.0006103515625, 'clip_ratio/low_min': 0.0006103515625, 'clip_ratio/high_mean': 0.0003662109375, 'clip

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1262.58it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.00it/s, est. speed input: 2193.11 toks/s, output: 11008.46 toks/s]
 96%|█████████▌| 24/25 [00:40<00:01,  1.66s/it]

{'loss': 0.0001, 'grad_norm': 3.0, 'learning_rate': 4.000000000000001e-06, 'num_tokens': 235348.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 203.49061584472656, 'rewards/reward_from_env/std': 12.524025917053223, 'reward': 203.49061584472656, 'reward_std': 11.884631156921387, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.05112988501787186, 'sampling/sampling_logp_difference/max': 1.372579574584961, 'sampling/importance_sampling_ratio/min': 0.2534523010253906, 'sampling/importance_sampling_ratio/mean': 1.0137460231781006, 'sampling/importance_sampling_ratio/max': 1.5087206363677979, 'entropy': 0.57177734375, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.00048828125, 'clip_ratio/hi

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1398.22it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.75it/s, est. speed input: 3110.32 toks/s, output: 10944.83 toks/s]
100%|██████████| 25/25 [00:42<00:00,  1.64s/it]

{'loss': -0.0005, 'grad_norm': 3.0, 'learning_rate': 2.0000000000000003e-06, 'num_tokens': 245868.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 199.96875, 'rewards/reward_from_env/std': 10.715091705322266, 'reward': 199.96875, 'reward_std': 8.32847785949707, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.05209578573703766, 'sampling/sampling_logp_difference/max': 1.2726020812988281, 'sampling/importance_sampling_ratio/min': 0.28010183572769165, 'sampling/importance_sampling_ratio/mean': 1.0139122009277344, 'sampling/importance_sampling_ratio/max': 1.518819808959961, 'entropy': 0.5849609375, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.0001220703125, 'clip_

100%|██████████| 25/25 [00:45<00:00,  1.83s/it]


{'train_runtime': 45.7829, 'train_samples_per_second': 2.184, 'train_steps_per_second': 0.546, 'train_loss': -0.023757427856326105, 'epoch': 1.0}
[2025-10-31 17:33:07,390][oumi][rank0][pid:3788270][MainThread][INFO]][train.py:561] Training is Complete.
[2025-10-31 17:33:07,391][oumi][rank0][pid:3788270][MainThread][INFO]][device_utils.py:343] GPU Metrics After Training: GPU runtime info: NVidiaGpuRuntimeInfo(device_index=0, device_count=2, used_memory_mb=75603.0, temperature=34, fan_speed=None, fan_speeds=None, power_usage_watts=124.915, power_limit_watts=700.0, gpu_utilization=0, memory_utilization=0, performance_state=0, clock_speed_graphics=1980, clock_speed_sm=1980, clock_speed_memory=2619).
[2025-10-31 17:33:07,391][oumi][rank0][pid:3788270][MainThread][INFO]][torch_utils.py:135] Peak GPU memory usage: 10.15 GB
[2025-10-31 17:33:07,391][oumi][rank0][pid:3788270][MainThread][INFO]][train.py:568] Saving final state...
[2025-10-31 17:33:07,395][oumi][rank0][pid:3788270][MainThread][I

CompletedProcess(args=['/home/wizeng/miniconda3/envs/openenv/bin/python', 'openenv_tutorial/train.py'], returncode=0)

If you enabled wandb logging, you should get a reward graph that looks like this. Even though training ran for only 25 steps (about 46 seconds), the mean reward in the log above climbs from roughly 146 to roughly 200, so we can see that the model quickly learned to maximize the reward.


![Echo env reward graph](./assets/openenv_echo_reward.png)
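
If wandb logging was not enabled, the same trend can be recovered from the console output. The snippet below is a minimal sketch and not part of the original notebook: `REWARD_RE` and `extract_rewards` are hypothetical names, and it assumes each step's metrics are printed as a dict containing a scalar `'reward'` entry, as in the log above. The sample lines are abbreviated copies of two entries from that log.

```python
import re

# Matches the scalar 'reward' entry only; keys such as
# 'rewards/reward_from_env/mean' and 'reward_std' are not matched.
# A regex is used instead of parsing the dict because captured
# notebook output may truncate each line before the closing brace.
REWARD_RE = re.compile(r"'reward': (-?\d+(?:\.\d+)?)")


def extract_rewards(log_text: str) -> list[float]:
    """Return the mean reward logged at each step, in log order."""
    return [float(m.group(1)) for m in REWARD_RE.finditer(log_text)]


# Abbreviated sample lines (assumed format, taken from the log above).
sample_log = """
{'loss': -0.0002, 'rewards/reward_from_env/mean': 146.29061889648438, 'reward': 146.29061889648438, 'reward_std': 8.717338562011719, 'frac_rew
{'loss': -0.0005, 'rewards/reward_from_env/mean': 199.96875, 'reward': 199.96875, 'reward_std': 8.32847785949707, 'frac_rew
"""

rewards = extract_rewards(sample_log)
for step, reward in enumerate(rewards, start=1):
    print(f"logged entry {step}: mean reward = {reward:.2f}")
```

To use it on a real run, pass the saved console output (for example, the contents of a log file) to `extract_rewards` and plot the resulting list with the plotting library of your choice.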

# 🧭 What's Next?

Congrats on finishing this notebook! Feel free to check out our other [notebooks](https://github.com/oumi-ai/oumi/tree/main/notebooks) in the [Oumi GitHub](https://github.com/oumi-ai/oumi), and give us a star! You can also join the Oumi community over on [Discord](https://discord.gg/oumi).

📰 Want to keep up with news from Oumi? Subscribe to our [Substack](https://blog.oumi.ai/) and [YouTube](https://www.youtube.com/@Oumi_AI)!

⚡ Interested in building custom AI in hours, not months? Apply to get [early access](https://oumi-ai.typeform.com/early-access) to the Oumi Platform, or [chat with us](https://calendly.com/d/ctcx-nps-47m/chat-with-us-get-early-access-to-the-oumi-platform) to learn more!