oumi.datasets.grpo.rewards#

GRPO reward functions module.

oumi.datasets.grpo.rewards.compute_letter_count_reward(completion: str, target_count: int) → float[source]#

Computes the reward for counting the letters in a string.

Parameters:
  • completion – The completion string from the LLM.

  • target_count – The target count of letters.

Returns:

The reward value.
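A minimal sketch of what such a reward could look like, assuming the completion states its count as the last integer in the text and the reward is the negative absolute deviation from target_count; both the parsing rule and the penalty shape are assumptions for illustration, not the library's actual logic:

```python
import re


def letter_count_reward_sketch(completion: str, target_count: int) -> float:
    # Hypothetical parsing: treat the last integer in the completion as the
    # model's claimed letter count. The real function may parse differently.
    matches = re.findall(r"\d+", completion)
    if not matches:
        # Assumption: an unparseable completion receives the worst score.
        return float(-target_count)
    claimed = int(matches[-1])
    # Assumed shape: penalize the absolute deviation from the target count.
    return float(-abs(claimed - target_count))


print(letter_count_reward_sketch("The word contains 3 r's.", target_count=3))  # 0.0
```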

oumi.datasets.grpo.rewards.compute_sharp_target_token_length_reward(num_tokens: int, *, target_tokens: int)[source]#

Returns maximum reward for inputs that are target_tokens long.

The reward decreases sharply as the actual number of tokens deviates from target_tokens.

The reward is computed as: -|num_tokens - target_tokens|, which penalizes token counts not equal to target_tokens.
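A sketch of the documented formula (details such as the exact return type are assumptions):

```python
def sharp_target_token_length_reward(num_tokens: int, *, target_tokens: int) -> float:
    # Documented formula: -|num_tokens - target_tokens|.
    # Maximum reward (0) when num_tokens == target_tokens; decreases linearly otherwise.
    return float(-abs(num_tokens - target_tokens))


print(sharp_target_token_length_reward(100, target_tokens=100))  # 0.0
print(sharp_target_token_length_reward(90, target_tokens=100))   # -10.0
```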

oumi.datasets.grpo.rewards.compute_soft_target_token_length_reward(num_tokens: int, *, target_tokens: int)[source]#

Returns maximum reward for inputs that are target_tokens long.

The reward is in the [0,1] range and decreases smoothly from the maximum value of 1.0 as the actual number of tokens deviates from target_tokens.

The reward is proportional to: x*exp(-x) where x := num_tokens/target_tokens.
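Since x·exp(-x) peaks at 1/e when x = 1, scaling by e gives the stated maximum of 1.0 at num_tokens == target_tokens. A sketch under that normalization assumption:

```python
import math


def soft_target_token_length_reward(num_tokens: int, *, target_tokens: int) -> float:
    # x := num_tokens / target_tokens; reward proportional to x * exp(-x).
    x = num_tokens / target_tokens
    # Scale by e so the peak at x == 1 equals 1.0. The exact normalization is an
    # assumption consistent with the [0, 1] range described above.
    return x * math.exp(1.0 - x)


print(soft_target_token_length_reward(100, target_tokens=100))  # 1.0
print(soft_target_token_length_reward(50, target_tokens=100))   # ~0.82
```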

oumi.datasets.grpo.rewards.countdown_reward(data_source: str, solution_str: str, ground_truth: dict[str, Any], extra_info: dict[str, Any], format_score=0.0, score=1.0) → float[source]#

Custom reward function for the Countdown task.

Currently, this function only works with the VERL_GRPO trainer.

Parameters:
  • data_source – The data source.

  • solution_str – The response from the LLM.

  • ground_truth – Dictionary containing the target number and available numbers.

  • extra_info – Extra information about the sample.

  • format_score – The score for correct format but wrong answer.

  • score – The score for the correct answer.

Returns:

score if the equation is valid and correct; format_score if the answer was parsed properly but the equation is incorrect; 0 if the answer was not parsed properly.
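A hypothetical invocation; the ground_truth key names ("target", "numbers"), the answer tag format, and the data_source label are assumptions for illustration, not documented here:

```python
from oumi.datasets.grpo.rewards import countdown_reward

# Hypothetical Countdown sample: reach 14 using the numbers 2, 3, and 4.
reward = countdown_reward(
    data_source="countdown",                    # assumed data source label
    solution_str="<answer>2 + 3 * 4</answer>",  # assumed answer format
    ground_truth={"target": 14, "numbers": [2, 3, 4]},  # assumed key names
    extra_info={},
    format_score=0.1,
    score=1.0,
)
# Expected: 1.0 if the equation parses and is correct, 0.1 if it parses but is
# wrong, and 0.0 if no answer could be parsed.
```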