ITokenizerService
The ITokenizerService provides a standardized interface for operations involving text tokens, such as counting, splitting, or truncating text based on tokenization.
This service is used internally whenever an adapter or tool needs to limit a block of text to a specific number of tokens, or to count the tokens it contains. The built-in service is based on OpenAI's tiktoken tokenization scheme, using o200k_base as the default encoding.
You can either replace the service entirely or simply change the default model. Since the internal implementation leverages the public TextTokenizer class, you can also use this helper directly in your code. Note, however, that both the service and the helper are designed specifically for tokenizing and truncating, not for encoding and decoding. If your application requires more advanced tokenization features, consider using the tiktoken library directly.
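As a rough sketch of using the helper class directly, the snippet below counts and truncates a block of text. The constructor argument and the member names (CountTokens, Truncate) are illustrative assumptions, not the documented API:

```csharp
// Hedged sketch: the constructor signature and member names are assumptions.
var tokenizer = new TextTokenizer("cl100k_base"); // select a non-default encoding

int count = tokenizer.CountTokens(text);          // count the tokens in a block of text
string clipped = tokenizer.Truncate(text, 512);   // keep at most 512 tokens
```

Consult the class reference for the actual members before relying on these names.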
The code below provides an example of how Wisej.AI utilizes the tokenization service within the document search tool to ensure that the returned context remains below the maximum allowed number of tokens.
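A hedged sketch of what that usage might look like follows; the service resolution call and the member names (CountTokens, Truncate) are assumptions for illustration only:

```csharp
// Illustrative sketch only: member names are assumed, not the documented API.
var tokenizer = App.Services.GetService<ITokenizerService>();

// Clip the retrieved context so it stays within the allowed token budget.
if (tokenizer.CountTokens(context) > maxContextTokens)
    context = tokenizer.Truncate(context, maxContextTokens);
```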
The TextTokenizer class offers several encoders embedded in the Wisej.AI assembly: p50k_base, o200k_base, and cl100k_base. This functionality is built on the tiktoken library developed by tryAGI, which is available under the MIT license.
The default implementation of the ITokenizerService interface uses the TextTokenizer class to efficiently split text into tokens.
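Replacing the default implementation might look like the sketch below. The interface members shown and the registration call are assumptions made for illustration; check the actual ITokenizerService definition before implementing it:

```csharp
// Hedged sketch: interface members and registration API are assumptions.
public class MyTokenizerService : ITokenizerService
{
    public int CountTokens(string text)
    {
        // Custom token counting goes here.
        return 0;
    }

    public string Truncate(string text, int maxTokens)
    {
        // Custom truncation goes here.
        return text;
    }
}

// Register the custom service in place of the built-in one at startup.
App.Services.AddService(typeof(ITokenizerService), new MyTokenizerService());
```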