Usage Metrics
How to keep track of the tokens
Most AI cloud providers charge based on the number of tokens processed. They differentiate between input tokens (the tokens you send) and output tokens (the tokens generated).
Wisej.AI tracks InputTokens, OutputTokens, and CallCount in the Usage property across multiple levels, providing detailed insights into how resources are being utilized:
SmartPrompt: The total usage related to the SmartPrompt instance is collected, allowing for a comprehensive analysis of resource consumption tied specifically to that instance.
SmartSession: The total usage for the lifetime of the session is collected, offering detailed insights into resource utilization for the entire duration of the session.
SmartSession.Message: The usage incurred by the individual message. Only the ToolCall and Assistant message roles carry these metrics, since they represent the responses from the LLM, which is the source of the usage data.
SmartEndpoint: The total usage for the lifetime of the endpoint instance is collected continuously, providing a complete view of resource consumption throughout the instance's duration.
SmartHub: The total usage related to the hub instance is collected continuously and does not reset even if the endpoint associated with the hub changes.
All the components mentioned above expose the Usage property. Wisej.AI ensures that this property is kept up to date at all times. The usage data is parsed directly from the LLM response by the SmartEndpoint implementation and is then tallied across the different layers.
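In application code, these counters can be read directly, as in the sketch below. How the SmartSession and SmartHub instances are obtained is application-specific and only assumed here; the Usage, InputTokens, OutputTokens, and CallCount member names are the ones described above.

// Minimal sketch: reading the same metrics at two different levels.
// Obtaining the SmartSession and SmartHub instances depends on your application.
void PrintUsage(SmartSession session, SmartHub hub)
{
    Console.WriteLine(
        $"Session: {session.Usage.CallCount} calls, " +
        $"{session.Usage.InputTokens} in / {session.Usage.OutputTokens} out");

    Console.WriteLine(
        $"Hub lifetime: {hub.Usage.CallCount} calls, " +
        $"{hub.Usage.InputTokens} in / {hub.Usage.OutputTokens} out");
}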
For example, if you use the SmartChatBoxAdapter without any tools and engage in a simple chat, the usage would look like this:
[Per-level usage snapshots: SmartPrompt, SmartSession, SmartSession.Message, SmartEndpoint, SmartHub]
We made one call by sending "Hello," which is 1 token, and received a response of 9 tokens. However, the AI provider logged 648 input tokens! This is because we didn't just send "Hello"; we also included the built-in Wisej.AI system prompt, which has a minimum size of 648 tokens. If you incorporated some tools, you could easily exceed 2,000 tokens.
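Concretely, the counters recorded for this single exchange read as follows (an illustrative rendering; the values are the ones from the run just described):

// What the Usage property reports after this one exchange:
// CallCount    == 1
// InputTokens  == 648   (1 token for "Hello" plus the built-in system prompt)
// OutputTokens == 9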
[Per-level usage snapshots after further exchanges: SmartPrompt, SmartSession, SmartSession.Message, SmartEndpoint, SmartHub; by this point the totals have reached 1,312 input tokens and 43 output tokens]
The cost with OpenAI for the queries described above is calculated by totaling the input and output tokens used and applying OpenAI's pricing model:
Input Tokens: $2.50 / 1M tokens × 1,312 = $0.00328
Output Tokens: $10.00 / 1M tokens × 43 = $0.00043
Total Cost: $0.00371
If we repeated the scenario 1,000 times, the cost would be approximately $3.71 under the same pricing model.
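As a rough sketch, the same arithmetic can be wrapped in a small helper. The function below is illustrative and not part of Wisej.AI; the rates are the example OpenAI prices used above.

// Illustrative helper, not part of Wisej.AI: estimates the cost of a
// request from token counts and per-million-token prices.
static decimal EstimateCost(
    long inputTokens, long outputTokens,
    decimal inputPricePerMillion, decimal outputPricePerMillion)
{
    return inputTokens * inputPricePerMillion / 1_000_000m
         + outputTokens * outputPricePerMillion / 1_000_000m;
}

// Example from this section: EstimateCost(1312, 43, 2.50m, 10m) returns 0.00371.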
Local hosting operates under a different cost structure. If you're using a hosted virtual machine (VM), the pricing is typically charged by the hour. In contrast, if you're using owned hardware, the costs involve the purchase of the hardware and its housing, either on-premises or in a data center.
The granular approach Wisej.AI employs to track usage is beneficial in this scenario as well, enabling you to compare the total cost of ownership when using local hosting against the usage cost of an AI cloud provider.
When Wisej.AI detects that the payload about to be submitted exceeds the allowed context window size in tokens, it employs the ISessionTrimmingService to trim the messages within the session, preserving as much of the model's memory as possible. The trimming strategy is governed by the service and is influenced by the values of the TrimmingStrategy and TrimmingPercentage properties.
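As a quick sketch, selecting a strategy could look like this; where the properties live (here, on the trimming service instance) and the exact enum member names are assumptions derived from the strategy names described below.

// Illustrative configuration; the host object and enum names are assumed.
trimmingService.TrimmingStrategy = TrimmingStrategy.RollingWindow;
trimmingService.TrimmingPercentage = 50; // portion of the history to trim (assumed semantics)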
RollingWindow Approach:
In this method, Wisej.AI reduces the entire message history by half. Crucially, it preserves the System Prompt at the top of the message list. It then methodically removes tool calls and tool responses in pairs to maintain a balanced history. If further reduction is necessary, it starts removing user and assistant messages from the top of the list.
Summarization Approach:
In the summarization method, Wisej.AI constructs a summarization payload that includes half of all the messages, leaving the System Prompt at the top intact. This payload is sent to a summarization prompt (specified under the "[SmartSession.Summarization]" key), and the messages included in the payload are then replaced with a single assistant message containing their summary.
Custom:
To implement your own trimming strategy, create a class that implements the ISessionTrimmingService interface and replace the default service with your implementation. Within this class, you can use other models or libraries to efficiently manage the overflow of session messages.
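A minimal sketch of such a class follows. The exact contract of ISessionTrimmingService is not shown in this section, so the single Trim method and the session's Messages collection are assumptions for illustration; consult the interface definition for the real signature, and register the service however your application replaces Wisej.AI defaults.

using System.Linq;

// Minimal sketch of a custom strategy that keeps the System Prompt and
// the most recent half of the conversation. The method name and the
// Messages collection are assumptions for illustration only.
public class KeepRecentTrimmingService : ISessionTrimmingService
{
    public void Trim(SmartSession session)
    {
        var messages = session.Messages.ToList();
        var systemPrompt = messages.First();           // always preserved
        var rest = messages.Skip(1).ToList();
        var keep = rest.Skip(rest.Count / 2).ToList(); // newest half

        session.Messages.Clear();
        session.Messages.Add(systemPrompt);
        keep.ForEach(m => session.Messages.Add(m));
    }
}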
Wisej.AI uses the context window size configured on the endpoint in use, along with the current usage metrics, to proactively optimize and trim messages before submitting them to the model.