TextTokenizer

Wisej.AI.Helpers.TextTokenizer

Namespace: Wisej.AI.Helpers

Assembly: Wisej.AI (3.5.0.0)

Provides methods for tokenizing text content using a specified encoder.

public class TextTokenizer

Public Class TextTokenizer

Supports the following encoders:

o200K_base: used by gpt-4o and gpt-4o-mini.
cl100k_base: used by gpt-4-turbo, gpt-4, gpt-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large.
p50k_base: used by text-davinci-002 and text-davinci-003.

Methods

CountTokens(content, name)

Counts the number of tokens in the specified text content using the given encoder.

Parameter  Type    Description
content    String  The text content to analyze.
name       String  The name of the encoder to use for counting tokens. Default is "o200K_base".

Returns: Int32. The number of tokens in the content.

This method calculates the total number of tokens present in the content based on the specified encoding model.


int tokenCount = TextTokenizer.CountTokens("Sample text");
Console.WriteLine($"Number of tokens: {tokenCount}");
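The example above relies on the default encoder. As a minimal sketch of passing the encoder name explicitly, the snippet below checks a prompt against a token budget. It assumes CountTokens is static and accepts the encoder name as a second positional argument, as the parameter list suggests. The class name TokenBudgetExample, the method FitsBudget, and the limit MaxPromptTokens are hypothetical names introduced for illustration and are not part of Wisej.AI.

```csharp
using System;
using Wisej.AI.Helpers;

public static class TokenBudgetExample
{
    // Hypothetical limit chosen for illustration; not a Wisej.AI constant.
    private const int MaxPromptTokens = 4000;

    public static bool FitsBudget(string prompt)
    {
        // Pass the encoder name explicitly instead of relying on the default
        // (assumes the second parameter is the encoder name, as documented).
        int count = TextTokenizer.CountTokens(prompt, "cl100k_base");

        Console.WriteLine($"Prompt uses {count} of {MaxPromptTokens} tokens.");
        return count <= MaxPromptTokens;
    }
}
```

Choosing the encoder that matches the target model matters because the same text can produce different token counts under different encoders.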

Throws:

Tokenize(content, name)

Tokenizes the specified text content using the given encoder.

Parameter  Type    Description
content    String  The text content to tokenize.
name       String  The name of the encoder to use for tokenization. Default is "o200K_base".

Returns: IReadOnlyCollection<String>. A read-only collection of tokens extracted from the content.

This method uses the specified encoding model to break down the content into individual tokens.


var tokens = TextTokenizer.Tokenize("Sample text");
foreach (var token in tokens)
{
    Console.WriteLine(token);
}
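As a further sketch, the snippet below tokenizes text with an explicitly named encoder and prints each token with its position. It assumes Tokenize is static and takes the encoder name as a second positional argument, in line with the parameter list. The class TokenizeExample and the method PrintTokens are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using Wisej.AI.Helpers;

public static class TokenizeExample
{
    public static void PrintTokens(string text)
    {
        // Request the p50k_base encoder instead of the default
        // (assumes the second parameter is the encoder name, as documented).
        IReadOnlyCollection<string> tokens = TextTokenizer.Tokenize(text, "p50k_base");

        Console.WriteLine($"{tokens.Count} tokens:");

        // Quote each token so leading or trailing whitespace stays visible.
        int index = 0;
        foreach (string token in tokens)
        {
            Console.WriteLine($"[{index++}] \"{token}\"");
        }
    }
}
```

Quoting the tokens helps when inspecting output, since tokens may carry leading spaces that are otherwise easy to miss.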

Throws:
