oumi.core.analyze#
Sample analyzer plugin system for Oumi.
This package provides a plugin-based architecture for analyzing conversation data with different types of sample analyzers (length, safety, etc.).
- class oumi.core.analyze.DatasetAnalyzer(config: AnalyzeConfig, dataset: BaseMapDataset | None = None)[source]#
Bases: object
Orchestrates the analysis of datasets using multiple sample analyzers.
- property analysis_df: DataFrame | None#
Get the merged analysis DataFrame with both message and conversation metrics.
- Returns:
DataFrame with columns prefixed by message_ and conversation_ for each analyzer.
- Raises:
RuntimeError – If analysis has not been run yet.
- property analysis_results: DatasetAnalysisResult | None#
Get the analysis results if available.
- Returns:
DatasetAnalysisResult if analysis has been run, None otherwise
- property analysis_summary: dict[str, Any]#
Get the comprehensive analysis summary.
- Returns:
Dictionary containing comprehensive dataset analysis summary
- Raises:
RuntimeError – If analysis has not been run yet.
- analyze_dataset() None[source]#
Analyze the dataset and store results internally.
This method performs both message-level and conversation-level analysis using the configured sample analyzers. Each analyzer processes entire conversations and returns metrics for both individual messages and conversations as a whole. Results are stored internally and can be accessed via the query() method.
- Raises:
ValueError – If no analyzers are configured for analysis.
- property conversation_df: DataFrame | None#
Get the conversation-level analysis DataFrame.
- Returns:
DataFrame with conversation-level metrics prefixed by conversation_.
- Raises:
RuntimeError – If analysis has not been run yet.
- filter(query_expression: str) BaseMapDataset | BaseIterableDataset[source]#
Filter the original dataset based on analysis results.
This method uses analysis results to filter the original dataset, returning a new dataset object containing only the conversations that match the query.
- Parameters:
query_expression – Pandas query expression to filter analysis results
- Returns:
A new dataset object containing only the filtered conversations
- Raises:
RuntimeError – If analysis has not been run yet.
Examples:
# Filter for conversations with short messages
short_dataset = analyzer.filter("length_word_count < 10")

# Filter for conversations with assistant messages
assistant_dataset = analyzer.filter("role == 'assistant'")

# Filter for conversations with long user messages
long_user_dataset = analyzer.filter(
    "role == 'user' and length_word_count > 100"
)
- get_schema() dict[source]#
Get the schema for the analysis results.
- Returns:
Dictionary containing the schema for the merged DataFrame, combining schemas from all input DataFrames including analyzer-generated columns.
- Raises:
RuntimeError – If analysis has not been run yet.
- property message_df: DataFrame | None#
Get the message-level analysis DataFrame.
- Returns:
DataFrame with message-level metrics prefixed by
message_- Raises:
RuntimeError – If analysis has not been run yet.
- query(query_expression: str) DataFrame[source]#
Query the analysis results using pandas query syntax.
- Parameters:
query_expression – Pandas query expression (e.g., “char_count > 10”)
- Returns:
DataFrame containing rows that match the query expression
- Raises:
RuntimeError – If analysis has not been run yet.
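Since query() uses pandas query syntax over the merged analysis DataFrame, its behavior can be previewed with plain pandas. A minimal sketch (the column names message_char_count and role are assumptions for the demo, not guaranteed analyzer output):

```python
import pandas as pd

# Toy DataFrame standing in for analyzer.analysis_df.
df = pd.DataFrame(
    {
        "role": ["user", "assistant", "user"],
        "message_char_count": [5, 120, 48],
    }
)

# query() accepts standard pandas query expressions, evaluated as
# DataFrame.query would evaluate them on the merged analysis DataFrame.
matches = df.query("message_char_count > 10 and role == 'user'")
print(matches["message_char_count"].tolist())  # [48]
```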
- query_conversations(query_expression: str) DataFrame[source]#
Query conversation-level analysis results using pandas query expression.
- Parameters:
query_expression – Pandas query expression to filter conversation analysis results
- Returns:
DataFrame with filtered conversation analysis results
- Raises:
RuntimeError – If analysis has not been run yet.
Examples:
# Filter for long conversations
long_conversations = analyzer.query_conversations(
    "length_token_count > 1000"
)
- class oumi.core.analyze.LengthAnalyzer(*, char_count: bool = True, word_count: bool = True, sentence_count: bool = True, token_count: bool = False, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, include_special_tokens: bool = True)[source]#
Bases:
SampleAnalyzerAnalyzer that computes various length metrics for text content.
- analyze_sample(df: DataFrame, schema: dict | None = None) tuple[DataFrame, dict][source]#
Analyze text fields and return metrics.
- Parameters:
df – Input DataFrame with text fields
schema – Column schema dict to identify text fields
- Returns:
Tuple of (DataFrame with added field-level analysis columns, generated column schema dict)
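A simplified sketch of the kind of metrics LengthAnalyzer adds, using plain pandas. The column names and counting rules here are illustrative assumptions; the real analyzer's exact rules (and its tokenizer-based token_count) may differ:

```python
import pandas as pd

df = pd.DataFrame({"text_content": ["Hello world.", "One two three four."]})

# Simplified stand-ins for char_count and word_count metrics.
out = df.copy()
out["length_char_count"] = out["text_content"].str.len()
out["length_word_count"] = out["text_content"].str.split().str.len()

print(out["length_char_count"].tolist())  # [12, 19]
print(out["length_word_count"].tolist())  # [2, 4]
```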
- class oumi.core.analyze.SampleAnalyzer[source]#
Bases:
ABCBase class for sample analyzer plugins that analyze individual samples.
All analyzers work with pandas DataFrames for efficient processing.
- abstractmethod analyze_sample(df: DataFrame, schema: dict | None = None) tuple[DataFrame, dict][source]#
Analyze text fields and return analysis results.
This method performs analysis on the input DataFrame and returns the DataFrame with added analysis columns along with schema information for the generated columns. All analyzers must implement this method.
- Parameters:
df – Input DataFrame with text fields
schema – Column schema dict to identify text fields
- Returns:
Tuple of (DataFrame with added analysis columns, generated column schema dict). The schema dict maps column names to their schema config with keys: ‘type’, ‘content_type’, ‘description’.