oumi.datasets.grpo.rewards
GRPO rewards module.
- oumi.datasets.grpo.rewards.compute_sharp_target_token_length_reward(num_tokens: int, *, target_tokens: int)[source]
Returns the maximum reward for inputs that are exactly target_tokens long.
The reward decreases sharply as the actual number of tokens deviates from target_tokens.
The reward is computed as -|num_tokens - target_tokens|, which penalizes any token count other than target_tokens; the maximum reward of 0.0 is attained when num_tokens == target_tokens.
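A minimal sketch of this reward, assuming the formula above is the entire computation (the actual oumi implementation may differ in details):

```python
def compute_sharp_target_token_length_reward(
    num_tokens: int, *, target_tokens: int
) -> float:
    """Sketch of the sharp length reward: -|num_tokens - target_tokens|."""
    # Maximum reward of 0.0 at num_tokens == target_tokens; the reward
    # drops by 1 for every token of deviation in either direction.
    return float(-abs(num_tokens - target_tokens))


# Example: 5 tokens short of a 100-token target yields a reward of -5.0.
assert compute_sharp_target_token_length_reward(95, target_tokens=100) == -5.0
```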
- oumi.datasets.grpo.rewards.compute_soft_target_token_length_reward(num_tokens: int, *, target_tokens: int)[source]
Returns the maximum reward for inputs that are exactly target_tokens long.
The reward lies in the [0, 1] range and decreases smoothly from the maximum value of 1.0 as the actual number of tokens deviates from target_tokens.
The reward is proportional to x*exp(-x), where x := num_tokens/target_tokens. This expression peaks at x = 1 (i.e., num_tokens == target_tokens), so the proportionality constant that yields a maximum of 1.0 is e.
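A minimal sketch of this reward. The scaling constant e is inferred rather than stated in the docstring: x*exp(-x) peaks at 1/e, so a stated maximum of 1.0 requires multiplying by e.

```python
import math


def compute_soft_target_token_length_reward(
    num_tokens: int, *, target_tokens: int
) -> float:
    """Sketch of the soft length reward: e * x * exp(-x), x = num_tokens / target_tokens."""
    x = num_tokens / target_tokens
    # x * exp(-x) peaks at x = 1 with value 1/e; multiplying by e
    # (written here as x * exp(1 - x)) normalizes the peak to 1.0.
    return x * math.exp(1.0 - x)


# Example: hitting the target exactly yields the maximum reward of 1.0.
assert math.isclose(
    compute_soft_target_token_length_reward(100, target_tokens=100), 1.0
)
```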