oumi.analyze.analyzers#

Analyzer implementations and result models.

This module contains concrete analyzer implementations that inherit from the base analyzer classes and return typed result models. Each analyzer file contains both the analyzer class and its result model for better cohesion.

class oumi.analyze.analyzers.LengthAnalyzer(tokenizer: Tokenizer | None = None)[source]#

Bases: ConversationAnalyzer[LengthMetrics]

Analyzer for computing token length metrics of conversations.

Computes token counts for conversations using a provided tokenizer. Provides both conversation-level totals and per-message breakdowns.

Example

>>> from oumi.analyze.analyzers.length import LengthAnalyzer, default_tokenizer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = LengthAnalyzer(tokenizer=default_tokenizer())
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="Hello, how are you?"),
...     Message(role=Role.ASSISTANT, content="I'm doing well, thanks!"),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(f"Total tokens: {result.total_tokens}")
Total tokens: 12

Parameters:

tokenizer – Tokenizer instance for token counting. Must have an encode(text) -> list method. Use default_tokenizer() for tiktoken, or pass a HuggingFace tokenizer for model-specific counts.

analyze(conversation: Conversation) → LengthMetrics[source]#

Analyze token length metrics for a conversation.

Parameters:

conversation – The conversation to analyze.

Returns:

LengthMetrics containing token counts.

analyze_text(text: str) → LengthMetrics[source]#

Analyze token length metrics for a single text string.

Convenience method for analyzing text without creating a Conversation.

Parameters:

text – The text to analyze.

Returns:

LengthMetrics for the text (treated as a single message).

class oumi.analyze.analyzers.LengthMetrics(*, total_tokens: int, rendered_tokens: int | None = None, avg_tokens_per_message: float, message_token_counts: list[int], num_messages: int, user_total_tokens: int = 0, assistant_total_tokens: int = 0, system_total_tokens: int = 0, tool_total_tokens: int = 0)[source]#

Bases: BaseModel

Result model for length analysis of conversations.

Example

>>> result = LengthMetrics(
...     total_tokens=25,
...     avg_tokens_per_message=12.5,
...     message_token_counts=[10, 15],
...     num_messages=2,
... )
>>> print(result.total_tokens)
25

assistant_total_tokens: int#
avg_tokens_per_message: float#
message_token_counts: list[int]#
model_config = {}#

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

num_messages: int#
rendered_tokens: int | None#
system_total_tokens: int#
tool_total_tokens: int#
total_tokens: int#
user_total_tokens: int#
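The aggregate fields above are simple functions of the per-message counts. A minimal sketch of those relationships in plain Python (illustrative only; the real model is a pydantic BaseModel):

```python
# Illustrative sketch of how LengthMetrics' aggregate fields relate
# to message_token_counts (plain Python, not the pydantic model itself).
message_token_counts = [10, 15]

total_tokens = sum(message_token_counts)              # 25
num_messages = len(message_token_counts)              # 2
avg_tokens_per_message = total_tokens / num_messages  # 12.5
```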
class oumi.analyze.analyzers.Tokenizer(*args, **kwargs)[source]#

Bases: Protocol

Protocol for tokenizers used by LengthAnalyzer.

encode(text: str) → list[int][source]#

Encode text to token IDs.
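Because Tokenizer is a Protocol, any object with a compatible encode method can be passed to LengthAnalyzer; no subclassing is needed. A minimal sketch (the whitespace splitter below is purely illustrative and will not match real BPE token counts):

```python
# A structural match for the Tokenizer protocol: any object with an
# encode(text) -> list[int] method qualifies; no inheritance required.
class WhitespaceTokenizer:
    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def encode(self, text: str) -> list[int]:
        # Assign a fresh ID to each previously unseen whitespace token.
        return [self.vocab.setdefault(t, len(self.vocab)) for t in text.split()]

tok = WhitespaceTokenizer()
ids = tok.encode("Hello, how are you?")  # 4 whitespace-separated tokens
```

An instance like this could then be passed as LengthAnalyzer(tokenizer=WhitespaceTokenizer()).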

class oumi.analyze.analyzers.TurnStatsAnalyzer[source]#

Bases: ConversationAnalyzer[TurnStatsMetrics]

Analyzer for computing turn statistics of conversations.

Computes turn counts and per-role statistics to help understand conversation structure and balance.

Example

>>> from oumi.analyze.analyzers.turn_stats import TurnStatsAnalyzer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = TurnStatsAnalyzer()
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="What is Python?"),
...     Message(
...         role=Role.ASSISTANT,
...         content="Python is a programming language.",
...     ),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(f"Turns: {result.num_turns}")
Turns: 2

analyze(conversation: Conversation) → TurnStatsMetrics[source]#

Analyze turn statistics for a conversation.

Parameters:

conversation – The conversation to analyze.

Returns:

TurnStatsMetrics containing turn counts and statistics.

class oumi.analyze.analyzers.TurnStatsMetrics(*, num_turns: int, num_user_turns: int, num_assistant_turns: int, num_tool_turns: int = 0, has_system_message: bool, first_turn_role: str | None = None, last_turn_role: str | None = None)[source]#

Bases: BaseModel

Result model for turn statistics analysis of conversations.

Example

>>> result = TurnStatsMetrics(
...     num_turns=4,
...     num_user_turns=2,
...     num_assistant_turns=2,
...     has_system_message=False,
...     first_turn_role="user",
...     last_turn_role="assistant",
... )
>>> print(result.num_turns)
4
first_turn_role: str | None#
has_system_message: bool#
last_turn_role: str | None#
model_config = {}#

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

num_assistant_turns: int#
num_tool_turns: int#
num_turns: int#
num_user_turns: int#
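The per-role fields can be pictured as straightforward tallies over the message roles. A hedged sketch in plain Python (whether num_turns counts the system message is an assumption here, suggested by the separate has_system_message field; the real analyzer defines this):

```python
# Illustrative tally of TurnStatsMetrics-style fields from role strings
# (not the real analyzer; excluding "system" from turns is an assumption).
roles = ["system", "user", "assistant", "user", "assistant"]

has_system_message = "system" in roles
turns = [r for r in roles if r != "system"]
num_turns = len(turns)                          # 4
num_user_turns = turns.count("user")            # 2
num_assistant_turns = turns.count("assistant")  # 2
first_turn_role = turns[0] if turns else None   # "user"
last_turn_role = turns[-1] if turns else None   # "assistant"
```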
oumi.analyze.analyzers.default_tokenizer(encoding: str = 'cl100k_base') → Encoding[source]#

Get the default tiktoken tokenizer.

Parameters:

encoding – Tiktoken encoding name. Defaults to “cl100k_base” (GPT-4).

Returns:

Tiktoken encoder instance.
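For reference, a sketch of the equivalent direct tiktoken call (assumes tiktoken is installed; guarded so the snippet degrades gracefully without it):

```python
# default_tokenizer() wraps tiktoken's encoding lookup; this is the
# equivalent direct call. Requires `pip install tiktoken`.
try:
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 encoding
    n_tokens = len(enc.encode("Hello, how are you?"))
except ImportError:  # tiktoken not installed
    enc = None
    n_tokens = None
```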