General Concepts
Introduction to LLM Concepts in Wisej.AI
In this section, we will explore the foundational concepts of Large Language Models (LLMs) as they are utilized within Wisej.AI. This primer is designed to provide software developers with a comprehensive understanding of how LLMs work. We will cover the basic principles, architectural frameworks, and applications of LLMs, ensuring that you have the necessary background to effectively leverage these advanced models in your development projects through Wisej.AI and Wisej.NET.
Large Language Models are a type of artificial intelligence designed to understand, generate, and manipulate human language. These models are trained on vast datasets and use deep learning techniques to achieve natural language processing tasks with impressive accuracy and fluency. Some key attributes of LLMs include:
Contextual Understanding: LLMs comprehend the context of the information they process, enabling them to generate responses that are relevant and coherent within the given framework.
Adaptability: These models can be fine-tuned for a plethora of applications, ranging from content generation to automated customer support, and, in the case of Wisej.AI, augmenting user interface components and widgets.
Scalability: The architecture of LLMs allows them to handle concurrent requests efficiently, which is crucial for real-time interaction in browser-based applications.
LLMs (Large Language Models) are a component of Generative AI, which is built upon the foundational principles of traditional Machine Learning (ML) architecture.
Fundamentally, LLMs are straightforward to use. They operate by predicting the next token in a sequence and continue this process until they reach a specified stopping point or condition, such as the end of a sentence or a predefined token limit.
For example, if the prompt is "Hello, how are you?" the likelihood of the next token being "banana" is very low. In contrast, the probability of the sequence continuing with "Very," followed by "good," then ", thank," "you," "for," and "asking" is much higher.
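To make the idea concrete, here is a toy illustration in which a hand-written lookup table stands in for the neural network; it is purely conceptual and not how a real model is queried:

```csharp
using System;
using System.Collections.Generic;

// Toy next-token prediction: the dictionary plays the role of the model,
// mapping the sequence so far to its most likely continuation.
var next = new Dictionary<string, string>
{
    ["Hello, how are you?"] = " Very",
    ["Hello, how are you? Very"] = " good",
    ["Hello, how are you? Very good"] = ", thank you for asking."
};

var sequence = "Hello, how are you?";

// Keep appending the most likely token until there is no continuation,
// the equivalent of reaching a stop condition or token limit.
while (next.TryGetValue(sequence, out var token))
    sequence += token;

Console.WriteLine(sequence);
// Hello, how are you? Very good, thank you for asking.
```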
The complexity of the prompt that a model can handle effectively typically depends on the size of its Artificial Neural Network (ANN). When reviewing a model's description, you might encounter a number like 90B next to its name. This number represents the model's parameters, which are the weights of the ANN, expressed as single-precision floating-point numbers (float32). Consequently, a model with 90 billion parameters consists of 90 billion float32 numbers, resulting in an approximate size of 360 GB.
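The arithmetic behind that estimate is straightforward:

```csharp
// 90 billion parameters × 4 bytes per float32 parameter ≈ 360 GB.
long parameters = 90_000_000_000;
long bytes = parameters * sizeof(float);
Console.WriteLine($"{bytes / 1_000_000_000} GB");   // 360 GB
```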
The "Hello" example is quite trivial. To better illustrate how Wisej.AI can leverage LLMs to deliver valuable functionality to applications—primarily without engaging in chat—consider the following simplified example:
The example above demonstrates how Wisej.AI dynamically generates a prompt by analyzing the code to identify the fields required by a .NET class. The agent message, which is the LLM's response, is formatted as valid JSON. This approach enables the system to read the response, parse it, and populate the appropriate fields in the application. This process is accomplished without engaging in chat and without handling "conversational" responses.
In other words, we are utilizing the LLM in a manner similar to how one would use a SQL database or a cloud service.
This is just the first crucial step in understanding the Wisej.AI architecture. Once this concept is clear, you can appreciate how Wisej.AI leverages this highly structured request-response approach to seamlessly integrate AI features into virtually any aspect of a Wisej.NET application.
The simplest way to understand tools is through an example: if you ask any model, "What time is it?" the model will typically respond with something like, "I'm sorry, but I don't have the capability to provide real-time information such as the current time."
If you ask ChatGPT "What time is it?", it will respond correctly, because it used a tool. Ask the same question of any model directly, without tools, and it will tell you it doesn't know.
If we simulate the request-response cycle managed by the Wisej.AI internal agent and have provided the get_current_time tool to the AI, the query would look like this (the content is very simplified):
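(The roles, layout, and tool-call format below are illustrative and do not mirror the exact wire format.)

```csharp
// Simplified transcript of the request-response cycle with a tool available.
var messages = new[]
{
    ("system",    "You may call the tool get_current_time() when you need the current time."),
    ("user",      "What time is it?"),

    // First response: instead of answering, the model asks for the tool to be invoked.
    ("assistant", "{ \"tool\": \"get_current_time\", \"arguments\": { } }"),

    // The internal agent executes the tool and appends the result to the conversation.
    ("tool",      "2025-01-15T16:32:00Z"),

    // Second response: the model now has the information it needs to answer.
    ("assistant", "It is 4:32 PM (UTC).")
};
```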
Now imagine equipping the LLM with a variety of tools: one for web searches, another for querying your company's database, another for scanning through documents, one for building charts, and yet another for sending emails, among others. These tools effectively expand the AI's capabilities. Consequently, when an application uses Wisej.AI to extract data from a PDF document, it can also fill in missing information, check for anomalies, send alerts via email, and correlate internal documents with public information—all through a single message to the AI provider via Wisej.AI.
LLMs cannot learn or retain information; they have no memory or state. Each request is independent and must include all the necessary information for the LLM to generate the correct subsequent tokens (the response) to the prompt.
A common use case might involve a request to create a table listing all legal deadlines, along with summaries, from 8,000 documents in a folder. It is entirely impractical to concatenate the text from all these documents into a single prompt. Assuming an average of 20 pages per document and 400 words per page, you would end up with approximately 64 million words, which far exceeds the typical limit of 128K tokens.
Indeed, just as an AI (or a person, for that matter) cannot feasibly read billions of web pages but can submit a query to a search engine to receive a list of relevant pages, RAG is used to provide pertinent context to the AI. It does this by extracting the few chunks of text that are relevant to a query from a large collection of data.
For instance, if we want to create a chart of the "5-year cumulative returns" from an 80-page PDF 10-K filing from Apple, we would need to extract and process the relevant data from the document, as it cannot be included in its entirety within the prompt.
Another example would be asking the AI to classify a line item in an invoice according to a list of codes in a chart of accounts stored in a database. Since we cannot submit the entire database, we need to extract the most relevant data to include with the prompt, providing the AI with sufficient information to complete the task accurately.
Embeddings are vector representations of text that position the specific text in relation to other texts within a multidimensional space.
A vector representation is an array of floating-point numbers that indicates the text's position within the space, with the array's length corresponding to the number of dimensions. For instance, a vector with 1536 values places the text within a 1536-dimensional space.
This representation enables applications to efficiently filter large volumes of data by performing similarity searches, which involve calculating the distance between two vectors in the multidimensional space.
Before the introduction of embeddings, a developer would typically split the query into individual words, then iterate through each text chunk, perhaps counting how many keywords appeared in each. They might also have used traditional fuzzy string-matching algorithms (such as edit distance) to account for similar words, and possibly the relative positions of the keywords. This approach was highly inaccurate, as it only considered character matches and sequences, ignoring conceptual similarity. For instance, the character "5" and the word "five" would never match in such a search.
Using embeddings and various methods to calculate the distance between vectors is a straightforward mathematical operation. It allows applications to process vast amounts of data efficiently and quickly.
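For example, cosine similarity, one of the most common measures, boils down to a few lines of code:

```csharp
using System;

// Cosine similarity between two embedding vectors: the dot product divided by
// the product of the vector magnitudes. Values near 1 indicate very similar
// text; values near 0 indicate unrelated text.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```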
Wisej.AI offers a range of services and methods to generate embeddings, with options to store them in memory, on the file system, or in efficient vector databases such as Azure AI Search, Pinecone, or Chroma.
Embeddings stored in memory or the file system should be used solely for development purposes. For deployment, it is recommended to use a vector database for optimal performance and scalability.
When you type "Hello, how are you?" (4 words, 3 spaces, and 2 punctuation marks), you get 6 tokens. But if you type "Dearest, how are you?", you get 7 tokens.
There are various types of tokenizers, and different models may be trained using distinct tokenizers. On average, a token typically consists of about 3.5 characters.
When a request to an LLM exceeds the maximum allowed tokens in the context window, the request will typically fail, resulting in an error. To resolve this, you may need to shorten the input or split it into multiple requests that fit within the context token limit.
Wisej.AI automatically manages content overflow before any errors occur by utilizing the SmartEndpoint.ContextWindow property as the token limit. When this limit is exceeded, the SmartSession object optimizes the history using various techniques to ensure smooth operation.
Reranking refers to the process of reordering a list of items, typically search results or recommendations, to improve their relevance or quality with respect to a specific criterion. This is often achieved by applying machine learning algorithms that re-evaluate and adjust the initial ranking, utilizing additional information or more sophisticated models.
When utilizing one of the built-in implementations of the IEmbeddingStorageService service, the order of relevance is initially determined by the similarity score of the embedding vectors. However, there may be instances where a more precise ordering is desired. Reranking is a common technique used to achieve this. It involves submitting the results of the vector search to a more advanced model, which then reorders the text chunks based on their relevancy to the query. This approach leverages the enhanced capabilities of the model to provide a more accurate order of results.
Utilizing a model typically involves sending a question and receiving a response. There are numerous examples available that demonstrate this process using Semantic Kernel or LangChain. Additionally, you will find many samples of workflows or pipelines that take a response and use it as a new request.
The process can be visualized as follows: Start with the question "What is the capital of China?" which results in the response "Beijing." This response is then used in a follow-up question, "How is the weather in Beijing?" leading to the response "Rainy."
All of this is based on the concept of "conversational AI."
However, Wisej.AI does not employ LLMs in this manner. Instead, it consistently supports context management, tools, and adapters through a composition pattern. Therefore, the smallest unit of AI utilization in Wisej.AI is always an agent.
In turn, since Wisej.AI agents are utilized by adapters operating at the next layer, and these adapters have specific tasks and can incorporate multiple decisions and interactions, we can refer to them as super-agents.
Building on the concept of "harnessing" the AI model, Wisej.AI develops agents and super-agents, along with tools that enhance the model's capabilities. For instance, while a model is inherently limited to appending words (or tokens, to be precise) to a prompt, Wisej.AI constructs structured prompts with specific instructions for obtaining additional information. By parsing these responses and invoking code iteratively, Wisej.AI creates agents that deliver significantly more value to the system.
Tools are essentially functions that the LLM can "invoke" within its response. See the documentation for details on how Wisej.AI defines tools.
If you've read that LLMs can process vast amounts of data in a single request, that's a misconception. In reality, the opposite is true. The combined size of requests and responses is quite limited. While the model itself is likely trained on a vast amount of information, its capacity to generate responses is constrained by what is known as the context window. This is the maximum number of tokens (for simplicity, you can think of a token as a word) that can be submitted to the ANN at once. It encompasses everything from the system prompt to the final question being submitted. You can think of it as a large string, similar to a lengthy SQL statement. A context window of 128,000 tokens is considered large with the current technology.
We accomplish this by splitting the document into chunks and generating an embedding vector for each chunk. Then, we generate an embedding vector for the query (or a version of the query optimized by the LLM), select the top N chunks based on relevance, and submit only those chunks for processing.
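A sketch of that selection step, reusing the CosineSimilarity helper shown earlier; the Chunk type and method names are illustrative and are not Wisej.AI APIs:

```csharp
using System.Collections.Generic;
using System.Linq;

// Each chunk carries its text and its precomputed embedding vector.
record Chunk(string Text, float[] Vector);

// Keep only the N chunks closest to the query embedding; only these few
// chunks are appended to the prompt that is sent to the LLM.
static IEnumerable<string> SelectTopChunks(float[] queryVector, IEnumerable<Chunk> chunks, int n) =>
    chunks
        .OrderByDescending(c => CosineSimilarity(queryVector, c.Vector))
        .Take(n)
        .Select(c => c.Text);
```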
In essence, RAG (Retrieval-Augmented Generation) leverages traditional coding methods, which are capable of processing unlimited amounts of data, to extract a small, relevant subset of information for submission to the LLM (Large Language Model). Utilizing embeddings is just one of the many techniques available to accomplish this task.
All models have a limited context window, which is measured in tokens. Tokens are not the same as words, but for simplicity, they can be considered as such. To gain a better understanding of tokens in the context of LLMs, you can try using the online tokenizer from OpenAI.
Wisej.AI includes the ITokenizerService and offers a default implementation based on OpenAI's tokenizer. This service can be accessed directly by the Wisej.NET application to count tokens, truncate strings, or split a string into tokens.
Semantic Kernel provides its own mechanism for this; Wisej.AI does it automatically. See the related documentation for more information.
For a more detailed description of the automatic trimming employed by Wisej.AI, refer to the corresponding section under Usage Metrics.
Wisej.AI offers the IRerankingService and an overridable method named RerankAsync within the DocumentTools and DocumentSearchTools classes. The default implementation uses the IRerankingService to reorganize the RAG documents.
Refer to the corresponding section of the documentation to see straightforward examples of this approach.