oumi.core.analyze#
Sample analyzer plugin system for Oumi.
This package provides a plugin-based architecture for analyzing conversation data with different types of sample analyzers (length, safety, etc.).
- class oumi.core.analyze.DatasetAnalyzer(config: AnalyzeConfig, dataset: BaseMapDataset | None = None)[source]#
Bases:
object
Orchestrates the analysis of datasets using multiple sample analyzers.
- property analysis_df: DataFrame | None#
Get the merged analysis DataFrame with both message and conversation metrics.
- Returns:
DataFrame with columns prefixed by message_ and conversation_ for each analyzer
- Raises:
RuntimeError – If analysis has not been run yet.
- property analysis_results: DatasetAnalysisResult | None#
Get the analysis results if available.
- Returns:
DatasetAnalysisResult if analysis has been run, None otherwise
- property analysis_summary: dict[str, Any]#
Get the comprehensive analysis summary.
- Returns:
Dictionary containing comprehensive dataset analysis summary
- Raises:
RuntimeError – If analysis has not been run yet.
- analyze_dataset() None [source]#
Analyze the dataset and store results internally.
This method performs both message-level and conversation-level analysis using the configured sample analyzers. Each analyzer processes entire conversations and returns metrics for both individual messages and conversations as a whole. Results are stored internally and can be accessed via the query() method.
- Raises:
ValueError – If no analyzers are configured for analysis.
- property conversation_df: DataFrame | None#
Get the conversation-level analysis DataFrame.
- Returns:
DataFrame with conversation-level metrics prefixed by conversation_
- Raises:
RuntimeError – If analysis has not been run yet.
- filter(query_expression: str) BaseMapDataset [source]#
Filter the original dataset based on analysis results.
This method uses analysis results to filter the original dataset, returning a new dataset object containing only the conversations that match the query.
- Parameters:
query_expression – Pandas query expression to filter analysis results
- Returns:
A new dataset object containing only the filtered conversations
- Raises:
RuntimeError – If analysis has not been run yet.
Examples
# Filter for conversations with short messages
short_dataset = analyzer.filter("length_word_count < 10")

# Filter for conversations with assistant messages
assistant_dataset = analyzer.filter("role == 'assistant'")

# Filter for conversations with long user messages
long_user_dataset = analyzer.filter(
    "role == 'user' and length_word_count > 100"
)
- property message_df: DataFrame | None#
Get the message-level analysis DataFrame.
- Returns:
DataFrame with message-level metrics prefixed by message_
- Raises:
RuntimeError – If analysis has not been run yet.
- query(query_expression: str) DataFrame [source]#
Query the analysis results using pandas query syntax.
- Parameters:
query_expression – Pandas query expression (e.g., "char_count > 10")
- Returns:
DataFrame containing rows that match the query expression
- Raises:
RuntimeError – If analysis has not been run yet.
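Because query() uses pandas query syntax, the expression semantics can be illustrated directly on a small DataFrame. The rows and column names below are hypothetical, following the message_ prefix convention described for analysis_df; the actual columns depend on which analyzers are configured:

```python
import pandas as pd

# Hypothetical analysis rows mirroring the message_-prefixed columns
# produced by DatasetAnalyzer; real columns depend on the analyzers.
df = pd.DataFrame({
    "role": ["user", "assistant", "user"],
    "message_length_char_count": [5, 42, 120],
})

# The same pandas query syntax that analyzer.query() accepts
matches = df.query("message_length_char_count > 10")
print(len(matches))  # → 2 (the rows with counts 42 and 120)
```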
- query_conversations(query_expression: str) DataFrame [source]#
Query conversation-level analysis results using pandas query expression.
- Parameters:
query_expression – Pandas query expression to filter conversation analysis results
- Returns:
DataFrame with filtered conversation analysis results
- Raises:
RuntimeError – If analysis has not been run yet.
Examples
# Filter for long conversations
long_conversations = analyzer.query_conversations(
    "length_token_count > 1000"
)
- class oumi.core.analyze.LengthAnalyzer(*, char_count: bool = True, word_count: bool = True, sentence_count: bool = True, token_count: bool = False, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, include_special_tokens: bool = True)[source]#
Bases:
SampleAnalyzer
Analyzer that computes various length metrics for text content.
- analyze_sample(conversation: Conversation, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None) tuple[list[MessageAnalysisResult], ConversationAnalysisResult] [source]#
Analyze a conversation sample and return comprehensive length metrics.
Analyzes each message individually for message-level metrics.
Computes conversation-level metrics by:
- Aggregating message-level char, word, and sentence counts
- Using dataset tokenization for the conversation-level token count
- Parameters:
conversation – The conversation object to analyze
tokenizer – Optional tokenizer to use for token counting
- Returns:
A tuple containing:
- a list of MessageAnalysisResult objects, one for each message
- a ConversationAnalysisResult for the conversation as a whole
- compute_conversation_metrics(conversation: Conversation, message_results: list[MessageAnalysisResult] | None = None) ConversationAnalysisResult [source]#
Compute conversation-level length metrics for the entire conversation.
- Parameters:
conversation – The conversation object to analyze
message_results – Optional pre-computed message results for aggregation
- Returns:
ConversationAnalysisResult containing conversation-level metrics
- compute_length_metrics(text_content: str) dict[str, Any] [source]#
Compute length metrics for a single text content.
This is a helper function that can be used by both message-level and conversation-level analysis.
- Parameters:
text_content – The text content to analyze
- Returns:
Dictionary containing requested length metrics
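As an illustration of what such a metrics dictionary might contain, here is a minimal stand-in computation. The key names mirror the analyzer's constructor flags, but the actual implementation (in particular its sentence-splitting rules) may differ:

```python
import re

def length_metrics(text: str) -> dict:
    """Illustrative approximation of per-text length metrics."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        # Naive sentence split on ., !, ? terminators
        "sentence_count": len(
            [s for s in re.split(r"[.!?]+", text) if s.strip()]
        ),
    }

metrics = length_metrics("Hello world. How are you?")
# → {'char_count': 25, 'word_count': 5, 'sentence_count': 2}
```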
- compute_message_metrics(conversation: Conversation) list[MessageAnalysisResult] [source]#
Compute message-level length metrics for each message in the conversation.
- Parameters:
conversation – The conversation object to analyze
- Returns:
List of MessageAnalysisResult objects, one for each message
- class oumi.core.analyze.SampleAnalyzer[source]#
Bases:
ABC
Base class for sample analyzer plugins that analyze individual samples.
- abstractmethod analyze_sample(conversation: Conversation, tokenizer: Any | None = None) tuple[list[MessageAnalysisResult], ConversationAnalysisResult] [source]#
Analyze a conversation sample and return comprehensive analysis results.
This method analyzes a conversation and returns metrics for both individual messages and the conversation as a whole. Each analyzer can decide its own strategy for computing conversation-level metrics (e.g., aggregating message metrics or implementing custom conversation-level analysis).
- Parameters:
conversation – The conversation object to analyze
tokenizer – Optional tokenizer to use for tokenization-based analysis
- Returns:
A tuple containing:
- a list of MessageAnalysisResult objects, one for each message
- a ConversationAnalysisResult for the conversation as a whole
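The plugin contract above can be sketched with stand-in types. The dataclasses below only mirror the interface documented here; the real Conversation, MessageAnalysisResult, and ConversationAnalysisResult classes live in oumi and carry more fields:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

# Stand-ins for oumi's Conversation and result types (illustration only)
@dataclass
class Message:
    role: str
    content: str

@dataclass
class Conversation:
    messages: list[Message] = field(default_factory=list)

@dataclass
class MessageAnalysisResult:
    metrics: dict[str, Any]

@dataclass
class ConversationAnalysisResult:
    metrics: dict[str, Any]

class SampleAnalyzer(ABC):
    """Mirrors the documented plugin interface."""
    @abstractmethod
    def analyze_sample(self, conversation, tokenizer=None):
        ...

class CharCountAnalyzer(SampleAnalyzer):
    """Toy plugin: per-message char counts, summed at conversation level."""
    def analyze_sample(self, conversation, tokenizer=None):
        msg_results = [
            MessageAnalysisResult({"char_count": len(m.content)})
            for m in conversation.messages
        ]
        total = sum(r.metrics["char_count"] for r in msg_results)
        return msg_results, ConversationAnalysisResult({"char_count": total})
```

Here the conversation-level metric is a simple aggregation of the message-level results, one of the two strategies the interface description mentions.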