Usage Metrics

How to keep track of token usage

Overview

Most AI cloud providers charge based on the number of tokens used. They differentiate between input tokens (the tokens you send) and output tokens (the tokens generated).

Wisej.AI tracks InputTokens, OutputTokens, and CallCount in the Usage property across multiple levels, providing detailed insights into how resources are being utilized:

| Component | Usage Description |
| --- | --- |
| SmartPrompt | The total usage related to the SmartPrompt instance, allowing for a comprehensive analysis of resource consumption tied specifically to that instance. |
| SmartSession | The total usage for the lifetime of the session, offering detailed insight into resource utilization for the entire duration of the session. |
| SmartSession.Message | The usage consumed by the individual message. Only the ToolCall and Assistant message roles carry these metrics, as they represent the responses from the LLM, which is what provides the metrics data. |
| SmartEndpoint | The total usage for the lifetime of the endpoint instance, collected continuously, providing a complete view of resource consumption throughout the instance's duration. |
| SmartHub | The total usage related to the hub instance, collected continuously; it does not reset even if the endpoint associated with the hub changes. |

Please note that input tokens aren't limited to just the tokens sent with the question. Because Wisej.AI offers a higher layer of functionality, any AI request may include additional tokens for elements such as the system prompt, tools, and internal agentic loops.

Metrics

All the components mentioned above expose the Usage property of type Metrics. Wisej.AI ensures that this property is kept up-to-date at all times. The usage data is parsed directly from the LLM response using the ReadUsage method in the SmartEndpoint implementation, and it is then tallied across the different layers.
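The snippet below is a minimal sketch of reading these metrics from code. It assumes a SmartHub named smartHub1 and a SmartSession named session are already configured in the application; the Usage property and its CallCount, InputTokens, and OutputTokens members come from the table above, while the surrounding wiring is illustrative.

```csharp
using System.Diagnostics;

// Minimal sketch: inspecting the accumulated usage after a chat exchange.
// smartHub1 and session are assumed to be existing components; only the
// Usage property and its members are taken from the documentation above.
private void LogUsage()
{
    var hubUsage = this.smartHub1.Usage;
    var sessionUsage = this.session.Usage;

    // CallCount, InputTokens and OutputTokens accumulate automatically;
    // Wisej.AI parses them from each LLM response via ReadUsage.
    Debug.WriteLine(
        $"Hub: {hubUsage.CallCount} calls, " +
        $"{hubUsage.InputTokens} in / {hubUsage.OutputTokens} out");
    Debug.WriteLine(
        $"Session: {sessionUsage.CallCount} calls, " +
        $"{sessionUsage.InputTokens} in / {sessionUsage.OutputTokens} out");
}
```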

For example, if you use the SmartChatBoxAdapter without any tools and engage in a simple chat, the usage would look like this:

> User: Hello
> Assistant: Hello! How can I assist you today?
| Component | CallCount | InputTokens | OutputTokens |
| --- | --- | --- | --- |
| SmartPrompt | 1 | 648 | 9 |
| SmartSession | 1 | 648 | 9 |
| SmartSession.Message | 1 | 648 | 9 |
| SmartEndpoint | 1 | 648 | 9 |
| SmartHub | 1 | 648 | 9 |

We made one call by sending "Hello," which is 1 token, and received a response of 9 tokens. However, the AI provider logged 648 input tokens! This is because we didn't just send "Hello"; we also included the built-in Wisej.AI system prompt, which has a minimum size of 648 tokens. If you incorporated some tools, you could easily exceed 2,000 tokens.

> User: Tell me something nice
> Assistant: You are capable of achieving great things, and your 
             potential is limitless. Remember, every day is a new opportunity 
             to grow and make a positive impact. Keep shining bright!
| Component | CallCount | InputTokens | OutputTokens |
| --- | --- | --- | --- |
| SmartPrompt | 2 | 1312 | 43 |
| SmartSession | 2 | 1312 | 43 |
| SmartSession.Message | 1 | 664 | 34 |
| SmartEndpoint | 2 | 1312 | 43 |
| SmartHub | 2 | 1312 | 43 |

The cost with OpenAI for the queries above is calculated by totaling the input and output tokens and applying OpenAI's per-token pricing:

Input Tokens: $2.50 / 1M tokens * 1312 = $0.00328

Output Tokens: $10.00 / 1M tokens * 43 = $0.00043

Total Cost: $0.00371

If we repeated this scenario 1,000 times, the cost would be approximately $3.71, based on the number of tokens used in each request and OpenAI's pricing model.
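The same arithmetic can be expressed as a small helper. This is an illustrative sketch: the per-million-token prices are values you supply from your provider's price list, and only the token counts come from the Usage metrics.

```csharp
// Illustrative cost estimate from accumulated usage metrics. The prices per
// million tokens are parameters you supply; nothing here is read from the
// Wisej.AI API except the token counts taken from a Usage object.
static decimal EstimateCost(
    long inputTokens, long outputTokens,
    decimal inputPricePerMillion, decimal outputPricePerMillion)
{
    var inputCost = inputTokens / 1_000_000m * inputPricePerMillion;
    var outputCost = outputTokens / 1_000_000m * outputPricePerMillion;
    return inputCost + outputCost;
}

// Using the numbers from the tables above:
// EstimateCost(1312, 43, 2.50m, 10m) == 0.00328m + 0.00043m == 0.00371m
```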

Local Hosting

Local hosting operates under a different cost structure. If you're using a hosted virtual machine (VM), the pricing is typically charged by the hour. In contrast, if you're using owned hardware, the costs involve the purchase of the hardware and its housing, either on-premises or in a data center.

The granular approach Wisej.AI employs to track usage is beneficial in this scenario as well, enabling you to compare the total cost of ownership when using local hosting against the usage cost of an AI cloud provider.

Context Overflow

Wisej.AI utilizes the ContextWindow property value configured on the endpoint in use, along with the current usage metrics, to proactively optimize and trim messages before submitting them to the model.

When Wisej.AI detects that the payload about to be submitted exceeds the allowed context window size in tokens, it employs the ISessionTrimmingService to trim the messages within the session. This process aims to maximize the preservation of memory for the model. The trimming strategy used is governed by the service and is influenced by the values of the TrimmingStrategy and TrimmingPercentage properties.
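The following is a hedged configuration sketch. Only ContextWindow, TrimmingStrategy, and TrimmingPercentage are named on this page; where the last two live (assumed here to be on the default trimming service), how the service is obtained, and the exact enum values are assumptions.

```csharp
// Hedged sketch: tuning overflow handling. Property placement, the resolver,
// and the enum values are assumptions; only the three property names come
// from the documentation above.
endpoint.ContextWindow = 128000;          // token budget of the model in use

var trimming = GetSessionTrimmingService();                 // hypothetical resolver
trimming.TrimmingStrategy = TrimmingStrategy.RollingWindow; // or Summarization
trimming.TrimmingPercentage = 50;         // portion of the history to reclaim
```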

RollingWindow Approach:

In this method, Wisej.AI reduces the entire message history by half. Crucially, it preserves the System Prompt at the top of the message list. It then methodically removes tool calls and tool responses in pairs to maintain a balanced history. If further reduction is necessary, it starts removing user and assistant messages from the top of the list.
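To make the order of removal concrete, here is a simplified sketch of the rolling-window idea over a plain message list. It is not the Wisej.AI implementation; Message, Role, and the list layout are stand-in types for illustration.

```csharp
using System.Collections.Generic;

// Simplified illustration of the rolling-window order of removal described
// above. Message and Role are stand-in types, not Wisej.AI API.
enum Role { System, User, Assistant, ToolCall, Tool }
record Message(Role Role, string Content);

static List<Message> RollingWindowTrim(List<Message> messages)
{
    var target = messages.Count / 2;            // aim to halve the history
    var result = new List<Message>(messages);

    // 1. The System Prompt at index 0 is never removed.
    // 2. Remove tool calls and tool responses in pairs first.
    for (int i = 1; i < result.Count - 1 && result.Count > target; )
    {
        if (result[i].Role == Role.ToolCall && result[i + 1].Role == Role.Tool)
            result.RemoveRange(i, 2);
        else
            i++;
    }

    // 3. If the history is still too long, drop the oldest user/assistant
    //    messages right after the System Prompt.
    while (result.Count > target && result.Count > 1)
        result.RemoveAt(1);

    return result;
}
```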

Summarization Approach:

In the summarization method, Wisej.AI constructs a summarization payload that includes half of all the messages, except for the System Prompt at the top, which remains intact. This payload is sent to a summarization prompt (specified under the "[SmartSession.Summarization]" key). The outcome of this process is that the messages are replaced with a single assistant message containing the summary of that half of the history.

Custom Approach:

To implement your own trimming strategy, you need to create a class that implements the ISessionTrimmingService interface and replace the default service with your implementation. Within this class, you can utilize other models or libraries to efficiently manage the overflow of session messages.
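A skeletal starting point is sketched below, heavily hedged: the members of ISessionTrimmingService are not documented on this page, so the Trim method and the way the default service is replaced are assumptions; only the ISessionTrimmingService and SmartSession names come from the text above.

```csharp
// Skeleton for a custom trimming strategy. The Trim signature and the
// registration mechanism are assumptions; only ISessionTrimmingService and
// SmartSession are taken from the documentation above.
public class MyTrimmingService : ISessionTrimmingService
{
    // Hypothetical member: invoked when the session payload would exceed
    // the endpoint's ContextWindow.
    public void Trim(SmartSession session)
    {
        // Apply any strategy here: drop the oldest non-system messages,
        // summarize them with a cheaper model, persist them to storage, etc.
    }
}
```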
