ITokenizerService

Overview

The ITokenizerService provides a standardized interface for token-based text operations, such as counting, splitting, and truncating text.

This service is used internally whenever an adapter or tool needs to count the tokens in a block of text or limit a block of text to a specific number of tokens. The built-in service is based on a reimplementation of OpenAI's tiktoken library and uses o200k_base as the default encoding.

You can either replace the service entirely or simply change the default model. Since the internal implementation is built on the public Wisej.AI.Helpers.TextTokenizer class, you can also use this helper directly in your code. Note, however, that both the service and the helper are designed specifically for tokenizing and truncating, not for encoding and decoding. If your application requires more advanced tokenization, consider using the tiktoken library directly or the Microsoft.ML.Tokenizers library.
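
For example, the sketch below shows a full encode/decode round trip with Microsoft.ML.Tokenizers. It assumes the Microsoft.ML.Tokenizers NuGet package is referenced; this package is separate from Wisej.AI.

using Microsoft.ML.Tokenizers;

// Create a tokenizer for the o200k_base encoding, the same encoding
// the built-in service uses by default.
var tokenizer = TiktokenTokenizer.CreateForEncoding("o200k_base");

// Encode and decode, operations that ITokenizerService does not expose.
var ids = tokenizer.EncodeToIds("Hello from Wisej.AI!");
var roundTrip = tokenizer.Decode(ids);

// Token counting is also available here.
var count = tokenizer.CountTokens(roundTrip);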

The code below shows how Wisej.AI uses the tokenizer service within the document search tool to ensure that the returned context stays below the maximum allowed number of tokens.

[SmartTool.Tool]
[Description("[DocumentSearchTools.search_documents]")]
protected virtual async Task<string> search_documents(
    [Description("[DocumentSearchTools.search_documents.question]")]
    string question)
{
    // Embed the question and retrieve the most similar document chunks.
    var query = await EmbedQuestionAsync(question);
    var documents = await this.EmbeddingStorageService.QueryAsync(
        _collectionName,
        query?.Vectors[0], 100, this.MinSimilarity, _filter);

    // Concatenate up to 100 matching chunks into a single context block.
    var count = 0;
    var sb = new StringBuilder();
    foreach (var document in documents)
    {
        sb.Append(
@$"
Name:'{document.Name}'
{document.Metadata}
===
{document.GetChunks()[0]}
===
");
        if (++count >= 100)
            break;
    }

    // Trim the assembled context to the maximum number of tokens allowed.
    return this.TokenizerService.TruncateContent(sb.ToString(), this.MaxContextTokens);
}
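
Note the pattern: the tool first assembles the complete context from the best-matching chunks and then trims the result once with TruncateContent. The token budget is enforced in a single pass over the final string, at token granularity, rather than chunk by chunk.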

Default Implementation

The default implementation of the ITokenizerService interface is provided by the DefaultTokenizerService class. This implementation utilizes the Wisej.AI.Helpers.TextTokenizer class to efficiently split text into tokens.

The TextTokenizer class offers several encoders embedded in the Wisej.AI assembly: p50k_base, o200k_base, and cl100k_base. This functionality is built on the tiktoken library developed by tryAGI, which is available under the MIT license.
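
As a purely illustrative sketch, the snippet below shows how the helper might be used directly. Only the type name is confirmed by this page; the constructor argument and the CountTokens/Truncate members are hypothetical placeholders, so check the TextTokenizer class for the actual signatures.

using Wisej.AI.Helpers;

// Hypothetical sketch: the member names below are illustrative
// assumptions, not the documented TextTokenizer API.
var someText = new string('a', 50_000);
var tokenizer = new TextTokenizer("cl100k_base");  // assumed: select an embedded encoder
var tokens = tokenizer.CountTokens(someText);      // assumed: count tokens
var trimmed = tokenizer.Truncate(someText, 4096);  // assumed: truncate to a token budget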
