oumi.core.tokenizers

Tokenizers module for the Oumi (Open Universal Machine Intelligence) library.

This module provides base classes for tokenizers used in the Oumi framework. These base classes serve as foundations for creating custom tokenizers for various natural language processing tasks.

oumi.core.tokenizers.BaseTokenizer

alias of PreTrainedTokenizerBase

oumi.core.tokenizers.get_default_special_tokens(tokenizer: PreTrainedTokenizerBase | None) → SpecialTokensMixin

Returns the default special tokens for the given tokenizer.

Parameters:

tokenizer – The tokenizer to get special tokens for.

Returns:

The special tokens mixin for the tokenizer.

Description:

This function looks up the special tokens for the provided tokenizer in a list of known models. If the tokenizer is not recognized, it returns an empty special tokens mixin. It serves as a fallback when a special token is required but is not provided in the tokenizer’s configuration. The primary use case is retrieving the padding token (pad_token), which is often missing from the tokenizer’s configuration even when it exists in the tokenizer’s vocabulary.
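
The fallback pattern described above can be sketched as follows. This is a minimal illustration, not Oumi's implementation: resolve_pad_token is a hypothetical helper, the SimpleNamespace objects stand in for a real tokenizer and for the mixin returned by get_default_special_tokens, and the "<pad>" value is made up for the example.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_pad_token(tokenizer: Any, defaults: Any) -> Optional[str]:
    """Hypothetical helper: prefer the tokenizer's own pad token.

    Falls back to the pad token on `defaults` only when the tokenizer has
    none. Returns None if neither object defines one, which mirrors the
    "empty special tokens mixin" case for unrecognized tokenizers.
    """
    current = getattr(tokenizer, "pad_token", None)
    if current is not None:
        return current
    return getattr(defaults, "pad_token", None)


# Stand-ins for illustration only. With Oumi installed, `defaults` would
# presumably come from the documented function instead:
#   from oumi.core.tokenizers import get_default_special_tokens
#   defaults = get_default_special_tokens(tokenizer)
tokenizer = SimpleNamespace(pad_token=None)   # config omits pad_token
defaults = SimpleNamespace(pad_token="<pad>")  # illustrative default value
unknown = SimpleNamespace(pad_token=None)      # unrecognized tokenizer case

pad = resolve_pad_token(tokenizer, defaults)
if pad is not None:
    tokenizer.pad_token = pad  # only assign when a fallback was found
```

Checking for None before assigning keeps a pad token that is already configured and leaves the tokenizer untouched when the lookup comes back empty.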