TextTokenizer

Wisej.AI.Helpers.TextTokenizer

Namespace: Wisej.AI.Helpers

Assembly: Wisej.AI (3.5.0.0)

Provides methods for tokenizing text content using a specified encoder.

public class TextTokenizer

Supports the following encoders:

  • o200K_base: used by gpt-4o and gpt-4o-mini.

  • cl100k_base: used by gpt-4-turbo, gpt-4, gpt-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large.

  • p50k_base: used by text-davinci-002 and text-davinci-003.
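When the model name is only known at run time, the encoder name can be looked up from the list above. The following is a minimal sketch of such a lookup; the EncoderLookup class and its TryGetEncoder method are hypothetical helpers written for illustration and are not part of Wisej.AI. The encoder names are spelled exactly as they appear on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Wisej.AI): maps a model name to the
// encoder name listed for it in the table of supported encoders.
public static class EncoderLookup
{
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["gpt-4o"] = "o200K_base",
            ["gpt-4o-mini"] = "o200K_base",
            ["gpt-4-turbo"] = "cl100k_base",
            ["gpt-4"] = "cl100k_base",
            ["gpt-3.5-turbo"] = "cl100k_base",
            ["text-embedding-ada-002"] = "cl100k_base",
            ["text-embedding-3-small"] = "cl100k_base",
            ["text-embedding-3-large"] = "cl100k_base",
            ["text-davinci-002"] = "p50k_base",
            ["text-davinci-003"] = "p50k_base",
        };

    // Returns true and sets encoder when the model appears in the table;
    // returns false for any model that is not listed.
    public static bool TryGetEncoder(string model, out string encoder)
    {
        return Map.TryGetValue(model, out encoder);
    }
}
```

The resulting string is intended to be passed as the name argument of the methods documented below. Returning false for unlisted models leaves the fallback decision to the caller rather than silently picking an encoder.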

Methods

CountTokens(content, name)

Counts the number of tokens in the specified text content using the given encoding model.

Parameter   Type     Description

content     String   The text content to analyze.

name        String   The name of the encoder to use for counting tokens. Default is "o200K_base".

Returns: Int32. The number of tokens in the content.

The result depends on the encoder, so pass the encoder name that matches the target model. When name is omitted, "o200K_base" is used.


int tokenCount = TextTokenizer.CountTokens("Sample text");
Console.WriteLine($"Number of tokens: {tokenCount}");
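A common use of a token count is keeping text within a model's limit. The sketch below greedily packs paragraphs into chunks that stay within a token budget. TokenChunker and Pack are hypothetical names, not part of Wisej.AI, and the counting function is passed in as a delegate so that the example does not depend on any particular tokenizer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Wisej.AI): greedily packs paragraphs into
// chunks whose token count stays within maxTokens.
public static class TokenChunker
{
    public static List<string> Pack(
        IEnumerable<string> paragraphs, int maxTokens, Func<string, int> countTokens)
    {
        var chunks = new List<string>();
        var current = "";

        foreach (var paragraph in paragraphs)
        {
            // Candidate chunk if this paragraph were appended to the current one.
            var candidate = current.Length == 0
                ? paragraph
                : current + "\n\n" + paragraph;

            if (current.Length > 0 && countTokens(candidate) > maxTokens)
            {
                // Appending would exceed the budget: close the current chunk
                // and start a new one with this paragraph.
                chunks.Add(current);
                current = paragraph;
            }
            else
            {
                current = candidate;
            }
        }

        // Flush the last chunk, if any.
        if (current.Length > 0)
            chunks.Add(current);

        return chunks;
    }
}
```

With Wisej.AI referenced, the delegate could be supplied as text => TextTokenizer.CountTokens(text, "cl100k_base"). Note that a single paragraph larger than maxTokens still becomes its own oversized chunk; this sketch does not split inside a paragraph.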

Throws:

Tokenize(content, name)

Tokenizes the specified text content using the given encoder.

Parameter   Type     Description

content     String   The text content to tokenize.

name        String   The name of the encoder to use for tokenization. Default is "o200K_base".

Returns: IReadOnlyCollection<String>. A read-only collection of tokens extracted from the content.

The content is split into tokens according to the specified encoder. When name is omitted, "o200K_base" is used.


var tokens = TextTokenizer.Tokenize("Sample text");
foreach (var token in tokens)
{
    Console.WriteLine(token);
}
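Because the method returns the tokens as strings, the result can be inspected with ordinary collection code. As an illustration, the sketch below counts how often each token occurs. TokenStats and Frequencies are hypothetical names, not part of Wisej.AI, and the method accepts any sequence of strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Wisej.AI): counts occurrences of each token.
public static class TokenStats
{
    public static Dictionary<string, int> Frequencies(IEnumerable<string> tokens)
    {
        // Ordinal comparison: tokens that differ in case or whitespace stay distinct.
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);

        foreach (var token in tokens)
        {
            // seen is 0 when the token has not been encountered yet.
            counts.TryGetValue(token, out var seen);
            counts[token] = seen + 1;
        }

        return counts;
    }
}
```

With Wisej.AI referenced, the input could be the collection returned by TextTokenizer.Tokenize(text, "cl100k_base").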

Throws:
