oumi.core.analyze#

Sample analyzer plugin system for Oumi.

This package provides a plugin-based architecture for analyzing conversation data with different types of sample analyzers (length, safety, etc.).

class oumi.core.analyze.DatasetAnalyzer(config: AnalyzeConfig, dataset: BaseMapDataset | None = None)[source]#

Bases: object

Orchestrates the analysis of datasets using multiple sample analyzers.

property analysis_df: DataFrame | None#

Get the merged analysis DataFrame with both message and conversation metrics.

Returns:

DataFrame with columns prefixed by message_ and conversation_ for each analyzer

Raises:

RuntimeError – If analysis has not been run yet.

property analysis_results: DatasetAnalysisResult | None#

Get the analysis results if available.

Returns:

DatasetAnalysisResult if analysis has been run, None otherwise

property analysis_summary: dict[str, Any]#

Get the comprehensive analysis summary.

Returns:

Dictionary containing comprehensive dataset analysis summary

Raises:

RuntimeError – If analysis has not been run yet.

analyze_dataset() None[source]#

Analyze the dataset and store results internally.

This method performs both message-level and conversation-level analysis using the configured sample analyzers. Each analyzer processes entire conversations and returns metrics for both individual messages and conversations as a whole. Results are stored internally and can be accessed via the query() method.

Raises:

ValueError – If no analyzers are configured for analysis.

property conversation_df: DataFrame | None#

Get the conversation-level analysis DataFrame.

Returns:

DataFrame with conversation-level metrics prefixed by conversation_

Raises:

RuntimeError – If analysis has not been run yet.

filter(query_expression: str) BaseMapDataset[source]#

Filter the original dataset based on analysis results.

This method uses analysis results to filter the original dataset, returning a new dataset object containing only the conversations that match the query.

Parameters:

query_expression – Pandas query expression to filter analysis results

Returns:

A new dataset object containing only the filtered conversations

Raises:

RuntimeError – If analysis has not been run yet.

Examples

# Filter for conversations with short messages
short_dataset = analyzer.filter("length_word_count < 10")

# Filter for conversations with assistant messages
assistant_dataset = analyzer.filter("role == 'assistant'")

# Filter for conversations with long user messages
long_user_dataset = analyzer.filter(
    "role == 'user' and length_word_count > 100"
)
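Conceptually, filter() selects message-level rows with a pandas query and then returns the conversations those rows belong to. A standalone sketch of that selection logic, using a hypothetical stand-in DataFrame (the column names and values below are illustrative, not produced by Oumi):

```python
import pandas as pd

# Stand-in for message-level analysis results; columns follow the
# analyzer-prefixed naming convention described above (values are made up).
message_df = pd.DataFrame({
    "conversation_id": [0, 0, 1, 1, 2],
    "role": ["user", "assistant", "user", "assistant", "user"],
    "length_word_count": [5, 120, 8, 3, 150],
})

# Mirror filter("role == 'user' and length_word_count > 100"):
# find conversations containing at least one long user message.
matches = message_df.query("role == 'user' and length_word_count > 100")
matching_ids = sorted(matches["conversation_id"].unique())
print(matching_ids)  # [2] -- only conversation 2 has a long user message
```

The real method returns a BaseMapDataset restricted to those conversations rather than a DataFrame, but the row selection follows the same pandas query semantics.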

property message_df: DataFrame | None#

Get the message-level analysis DataFrame.

Returns:

DataFrame with message-level metrics prefixed by message_

Raises:

RuntimeError – If analysis has not been run yet.

query(query_expression: str) DataFrame[source]#

Query the analysis results using pandas query syntax.

Parameters:

query_expression – Pandas query expression (e.g., "char_count > 10")

Returns:

DataFrame containing rows that match the query expression

Raises:

RuntimeError – If analysis has not been run yet.
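Since query() uses pandas query syntax, any expression valid for DataFrame.query works here. A self-contained sketch against a hypothetical stand-in DataFrame (column names are illustrative, following the message_/conversation_ prefix convention):

```python
import pandas as pd

# Stand-in for the merged analysis DataFrame (values are made up).
analysis_df = pd.DataFrame({
    "message_length_char_count": [4, 52, 180],
    "conversation_length_token_count": [40, 40, 900],
})

# Standard pandas query syntax: comparison operators, and/or, parentheses.
result = analysis_df.query("message_length_char_count > 10")
print(len(result))  # 2 -- two rows exceed 10 characters
```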

query_conversations(query_expression: str) DataFrame[source]#

Query conversation-level analysis results using pandas query expression.

Parameters:

query_expression – Pandas query expression to filter conversation analysis results

Returns:

DataFrame with filtered conversation analysis results

Raises:

RuntimeError – If analysis has not been run yet.

Examples

# Filter for long conversations
long_conversations = analyzer.query_conversations(
    "length_token_count > 1000"
)

class oumi.core.analyze.LengthAnalyzer(*, char_count: bool = True, word_count: bool = True, sentence_count: bool = True, token_count: bool = False, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, include_special_tokens: bool = True)[source]#

Bases: SampleAnalyzer

Analyzer that computes various length metrics for text content.

analyze_sample(conversation: Conversation, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None) tuple[list[MessageAnalysisResult], ConversationAnalysisResult][source]#

Analyze a conversation sample and return comprehensive length metrics.

  1. Analyzes each message individually for message-level metrics

  2. Computes conversation-level metrics by:

     • Aggregating message-level char, word, and sentence counts

     • Using dataset tokenization for the conversation-level token count

Parameters:
  • conversation – The conversation object to analyze

  • tokenizer – Optional tokenizer to use for token counting

Returns:

  • List of MessageAnalysisResult objects for each message

  • ConversationAnalysisResult for the conversation as a whole

Return type:

Tuple containing

compute_conversation_metrics(conversation: Conversation, message_results: list[MessageAnalysisResult] | None = None) ConversationAnalysisResult[source]#

Compute conversation-level length metrics for the entire conversation.

Parameters:
  • conversation – The conversation object to analyze

  • message_results – Optional pre-computed message results for aggregation

Returns:

ConversationAnalysisResult containing conversation-level metrics

compute_length_metrics(text_content: str) dict[str, Any][source]#

Compute length metrics for a single text content.

This is a helper function that can be used by both message-level and conversation-level analysis.

Parameters:
  • text_content – The text content to analyze

Returns:

Dictionary containing requested length metrics
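To make the metric names concrete, here is one plausible standalone implementation of such a helper; this is a sketch, not necessarily Oumi's exact counting logic (in particular, the naive regex-based sentence split is an assumption):

```python
import re

def compute_length_metrics_sketch(text_content: str) -> dict:
    """Plausible char/word/sentence counts; illustrative only."""
    words = text_content.split()
    # Naive sentence split on ., !, ? followed by whitespace or end of text.
    sentences = [
        s for s in re.split(r"[.!?]+(?:\s+|$)", text_content) if s.strip()
    ]
    return {
        "char_count": len(text_content),
        "word_count": len(words),
        "sentence_count": len(sentences),
    }

metrics = compute_length_metrics_sketch("Hello world. How are you?")
print(metrics)  # {'char_count': 25, 'word_count': 5, 'sentence_count': 2}
```

Token counting is omitted here because it requires the configured tokenizer; the real analyzer gates each metric on its constructor flags (char_count, word_count, etc.).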

compute_message_metrics(conversation: Conversation) list[MessageAnalysisResult][source]#

Compute message-level length metrics for each message in the conversation.

Parameters:
  • conversation – The conversation object to analyze

Returns:

List of MessageAnalysisResult objects, one for each message

class oumi.core.analyze.SampleAnalyzer[source]#

Bases: ABC

Base class for sample analyzer plugins that analyze individual samples.

abstractmethod analyze_sample(conversation: Conversation, tokenizer: Any | None = None) tuple[list[MessageAnalysisResult], ConversationAnalysisResult][source]#

Analyze a conversation sample and return comprehensive analysis results.

This method analyzes a conversation and returns metrics for both individual messages and the conversation as a whole. Each analyzer can decide its own strategy for computing conversation-level metrics (e.g., aggregating message metrics or implementing custom conversation-level analysis).

Parameters:
  • conversation – The conversation object to analyze

  • tokenizer – Optional tokenizer to use for tokenization-based analysis

Returns:

  • List of MessageAnalysisResult objects for each message

  • ConversationAnalysisResult for the conversation as a whole

Return type:

Tuple containing
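A custom analyzer plugs into the system by subclassing SampleAnalyzer and implementing analyze_sample. The sketch below uses minimal stand-in dataclasses in place of Oumi's Conversation, MessageAnalysisResult, and ConversationAnalysisResult (the real classes carry more fields), so it runs standalone; the metric itself is a toy example:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Minimal stand-ins for Oumi's conversation and result types.
@dataclass
class Message:
    role: str
    content: str

@dataclass
class Conversation:
    messages: list

@dataclass
class MessageAnalysisResult:
    message_index: int
    metrics: dict

@dataclass
class ConversationAnalysisResult:
    metrics: dict

class SampleAnalyzer(ABC):
    @abstractmethod
    def analyze_sample(self, conversation, tokenizer=None):
        ...

class UppercaseRatioAnalyzer(SampleAnalyzer):
    """Toy analyzer: fraction of uppercase characters per message."""

    def analyze_sample(self, conversation, tokenizer=None):
        message_results = []
        for i, msg in enumerate(conversation.messages):
            upper = sum(c.isupper() for c in msg.content)
            ratio = upper / len(msg.content) if msg.content else 0.0
            message_results.append(
                MessageAnalysisResult(message_index=i,
                                      metrics={"upper_ratio": ratio})
            )
        # Conversation-level strategy chosen here: mean of message ratios.
        ratios = [r.metrics["upper_ratio"] for r in message_results]
        conv_ratio = sum(ratios) / len(ratios) if ratios else 0.0
        return message_results, ConversationAnalysisResult(
            metrics={"upper_ratio": conv_ratio}
        )

conv = Conversation(messages=[Message("user", "HI"), Message("assistant", "ok")])
msg_results, conv_result = UppercaseRatioAnalyzer().analyze_sample(conv)
print(conv_result.metrics["upper_ratio"])  # 0.5: "HI" -> 1.0, "ok" -> 0.0
```

The choice of conversation-level aggregation (here, the mean) is left to each analyzer, which is exactly the flexibility the abstract method's contract describes.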