General Concepts
Introduction to LLM Concepts in Wisej.AI
In this section, we will explore the foundational concepts of Large Language Models (LLMs) as they are utilized within Wisej.AI. This primer is designed to provide software developers with a comprehensive understanding of how LLMs work. We will cover the basic principles, architectural frameworks, and applications of LLMs, ensuring that you have the necessary background to effectively leverage these advanced models in your development projects through Wisej.AI and Wisej.NET.
Large Language Models are a type of artificial intelligence designed to understand, generate, and manipulate human language. These models are trained on vast datasets and use deep learning techniques to achieve natural language processing tasks with impressive accuracy and fluency. Some key attributes of LLMs include:
Contextual Understanding: LLMs comprehend the context of the information they process, enabling them to generate responses that are relevant and coherent within the given framework.
Adaptability: These models can be fine-tuned for a plethora of applications, ranging from content generation to automated customer support, and, in the case of Wisej.AI, augmenting user interface components and widgets.
Scalability: The architecture of LLMs allows them to handle concurrent requests efficiently, which is crucial for real-time interaction in browser-based applications.
LLMs (Large Language Models) are a component of Generative AI, which is built upon the foundational principles of traditional Machine Learning (ML) architecture.
Fundamentally, LLMs are straightforward to use. They operate by predicting the next token in a sequence and continue this process until they reach a specified stopping point or condition, such as the end of a sentence or a predefined token limit.
For example, if the prompt is "Hello, how are you?" the likelihood of the next token being "banana" is very low. In contrast, the probability of the sequence continuing with "Very," followed by "good," then ", thank," "you," "for," and "asking" is much higher.
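To make the idea concrete, here is a toy illustration in which a hand-written lookup table stands in for the neural network; it is purely conceptual and not how a real model is queried:

```csharp
using System;
using System.Collections.Generic;

// Toy next-token prediction: the dictionary plays the role of the model,
// mapping the sequence so far to its most likely continuation.
var next = new Dictionary<string, string>
{
    ["Hello, how are you?"] = " Very",
    ["Hello, how are you? Very"] = " good",
    ["Hello, how are you? Very good"] = ", thank you for asking."
};

var sequence = "Hello, how are you?";

// Keep appending the most likely token until there is no continuation,
// the equivalent of reaching a stop condition or token limit.
while (next.TryGetValue(sequence, out var token))
    sequence += token;

Console.WriteLine(sequence);
// Hello, how are you? Very good, thank you for asking.
```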
The complexity of the prompt that a model can handle effectively typically depends on the size of its Artificial Neural Network (ANN). When reviewing a model's description, you might encounter a number like 90B next to its name. This number represents the model's parameters, which are the weights of the ANN, expressed as single-precision floating-point numbers (float32). Consequently, a model with 90 billion parameters consists of 90 billion float32 numbers, resulting in an approximate size of 360 GB.
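The arithmetic behind that estimate is straightforward:

```csharp
// 90 billion parameters × 4 bytes per float32 parameter ≈ 360 GB.
long parameters = 90_000_000_000;
long bytes = parameters * sizeof(float);
Console.WriteLine($"{bytes / 1_000_000_000} GB");   // 360 GB
```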
The "Hello" example is quite trivial. To better illustrate how Wisej.AI can leverage LLMs to deliver valuable functionality to applications—primarily without engaging in chat—consider the following simplified example:
The example above demonstrates how Wisej.AI dynamically generates a prompt by analyzing the code to identify the fields required by a .NET class. The agent message, which is the LLM's response, is formatted as valid JSON. This approach enables the system to read the response, parse it, and populate the appropriate fields in the application. This process is accomplished without engaging in chat and without handling "conversational" responses.
In other words, we are utilizing the LLM in a manner similar to how one would use a SQL database or a cloud service.
This is just the first crucial step in understanding the Wisej.AI architecture. Once this concept is clear, you can appreciate how Wisej.AI leverages this highly structured request-response approach to seamlessly integrate AI features into virtually any aspect of a Wisej.NET application.
The simplest way to understand tools is through an example: if you ask any model, "What time is it?" the model will typically respond with something like, "I'm sorry, but I don't have the capability to provide real-time information such as the current time."
If you ask ChatGPT "What time is it?", it will respond correctly, because it used a tool. Ask the same question of any model directly, without tools, and it will tell you it doesn't know.
If we simulate the request-response cycle managed by the Wisej.AI internal agent and have provided the get_current_time tool to the AI, the query would look like this (the content is very simplified):
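(The roles, layout, and tool-call format below are illustrative and do not mirror the exact wire format.)

```csharp
// Simplified transcript of the request-response cycle with a tool available.
var messages = new[]
{
    ("system",    "You may call the tool get_current_time() when you need the current time."),
    ("user",      "What time is it?"),

    // First response: instead of answering, the model asks for the tool to be invoked.
    ("assistant", "{ \"tool\": \"get_current_time\", \"arguments\": { } }"),

    // The internal agent executes the tool and appends the result to the conversation.
    ("tool",      "2025-01-15T16:32:00Z"),

    // Second response: the model now has the information it needs to answer.
    ("assistant", "It is 4:32 PM (UTC).")
};
```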
Now imagine equipping the LLM with a variety of tools: one for web searches, another for querying your company's database, another for scanning through documents, one for building charts, and yet another for sending emails, among others. These tools effectively expand the AI's capabilities. Consequently, when an application uses Wisej.AI to extract data from a PDF document, it can also fill in missing information, check for anomalies, send alerts via email, and correlate internal documents with public information—all through a single message to the AI provider via Wisej.AI.
LLMs cannot learn or retain information; they have no memory or state. Each request is independent and must include all the necessary information for the LLM to generate the correct subsequent tokens (the response) to the prompt.
A common use case might involve a request to create a table listing all legal deadlines, along with summaries, from 8,000 documents in a folder. It is entirely impractical to concatenate the text from all these documents into a single prompt. Assuming an average of 20 pages per document and 400 words per page, you would end up with approximately 64 million words, which far exceeds the typical limit of 128K tokens.
Indeed, just as an AI (or a person, for that matter) cannot feasibly read billions of web pages but can submit a query to a search engine to receive a list of relevant pages, RAG is used to provide pertinent context to the AI. It does this by extracting the few chunks of text that are relevant to a query from a large collection of data.
For instance, if we want to create a chart of the "5-year cumulative returns" from an 80-page PDF 10-K filing from Apple, we would need to extract and process the relevant data from the document, as it cannot be included in its entirety within the prompt.
Another example would be asking the AI to classify a line item in an invoice according to a list of codes in a chart of accounts stored in a database. Since we cannot submit the entire database, we need to extract the most relevant data to include with the prompt, providing the AI with sufficient information to complete the task accurately.
Embeddings are vector representations of text that position the specific text in relation to other texts within a multidimensional space.
A vector representation is an array of floating-point numbers that indicates the text's position within the space, with the array's length corresponding to the number of dimensions. For instance, a vector with 1536 values places the text within a 1536-dimensional space.
This representation enables applications to efficiently filter large volumes of data by performing similarity searches, which involve calculating the distance between two vectors in the multidimensional space.
Before the introduction of embeddings, a developer would typically split the query into individual words, then iterate through each text chunk, perhaps counting how many keywords appeared in each. They might also have used traditional fuzzy string-matching algorithms (such as edit distance) to account for similar words, and possibly the relative positions of the keywords. This approach was highly inaccurate, as it only considered character matches and sequences, ignoring conceptual similarity. For instance, the character "5" and the word "five" would never match in such a search.
Using embeddings and various methods to calculate the distance between vectors is a straightforward mathematical operation. It allows applications to process vast amounts of data efficiently and quickly.
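For example, cosine similarity, one of the most common measures, boils down to a few lines of code:

```csharp
using System;

// Cosine similarity between two embedding vectors: the dot product divided by
// the product of the vector magnitudes. Values near 1 indicate very similar
// text; values near 0 indicate unrelated text.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```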
Wisej.AI offers a range of services and methods to generate embeddings, with options to store them in memory, on the file system, or in efficient vector databases such as Azure AI Search, Pinecone, or Chroma.
Embeddings stored in memory or the file system should be used solely for development purposes. For deployment, it is recommended to use a vector database for optimal performance and scalability.
When you type "Hello, how are you?" (4 words, 3 spaces, and 2 punctuation marks), you get 6 tokens. But if you type "Dearest, how are you?", you get 7 tokens.
There are various types of tokenizers, and different models may be trained using distinct tokenizers. On average, a token typically consists of about 3.5 characters.
When a request to an LLM exceeds the maximum allowed tokens in the context window, the request will typically fail, resulting in an error. To resolve this, you may need to shorten the input or split it into multiple requests that fit within the context token limit.
Wisej.AI automatically manages content overflow before any errors occur by utilizing the SmartEndpoint.ContextWindow property as the token limit. When this limit is exceeded, the SmartSession object optimizes the history using various techniques to ensure smooth operation.
Reranking refers to the process of reordering a list of items, typically search results or recommendations, to improve their relevance or quality with respect to a specific criterion. This is often achieved by applying machine learning algorithms that re-evaluate and adjust the initial ranking, utilizing additional information or more sophisticated models.
When utilizing one of the built-in implementations of the IEmbeddingStorageService service, the order of relevance is initially determined by the similarity score of the embedding vectors. However, there may be instances where a more precise ordering is desired. Reranking is a common technique used to achieve this. It involves submitting the results of the vector search to a more advanced model, which then reorders the text chunks based on their relevancy to the query. This approach leverages the enhanced capabilities of the model to provide a more accurate order of results.
Utilizing a model typically involves sending a question and receiving a response. There are numerous examples available that demonstrate this process using Semantic Kernel or LangChain. Additionally, you will find many samples of workflows or pipelines that take a response and use it as a new request.
The process can be visualized as follows: Start with the question "What is the capital of China?" which results in the response "Beijing." This response is then used in a follow-up question, "How is the weather in Beijing?" leading to the response "Rainy."
All of this is based on the concept of "conversational AI."
However, Wisej.AI does not employ LLMs in this manner. Instead, it consistently supports context management, tools, and adapters through a composition pattern. Therefore, the smallest unit of AI utilization in Wisej.AI is always an agent.
In turn, since Wisej.AI agents are utilized by adapters operating at the next layer, and these adapters have specific tasks and can incorporate multiple decisions and interactions, we can refer to them as super-agents.
Building on the concept of "harnessing" the AI model, Wisej.AI develops agents and super-agents, along with tools that enhance the model's capabilities. For instance, while a model is inherently limited to appending words (or tokens, to be precise) to a prompt, Wisej.AI constructs structured prompts with specific instructions for obtaining additional information. By parsing these responses and invoking code iteratively, Wisej.AI creates agents that deliver significantly more value to the system.
Tools are essentially functions that the LLM can "invoke" within its response. See the documentation for details on how Wisej.AI defines tools.
If you've read that LLMs can process vast amounts of data in a single request, that's a misconception. In reality, the opposite is true. The combined size of requests and responses is quite limited. While the model itself is likely trained on a vast amount of information, its capacity to generate responses is constrained by what is known as the context window. This is the maximum number of tokens (for simplicity, you can think of a token as a word) that can be submitted to the ANN at once. It encompasses everything from the system prompt to the final question being submitted. You can think of it as a large string, similar to a lengthy SQL statement. A context window of 128,000 tokens is considered large with the current technology.
We accomplish this by splitting the document into chunks and generating an embedding vector for each chunk. Then, we generate an embedding vector for the query (or a version of the query optimized by the LLM), select the top N chunks based on relevance, and submit only those chunks for processing.
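A sketch of that selection step, reusing the CosineSimilarity helper shown earlier; the Chunk type and method names are illustrative and are not Wisej.AI APIs:

```csharp
using System.Collections.Generic;
using System.Linq;

// Each chunk carries its text and its precomputed embedding vector.
record Chunk(string Text, float[] Vector);

// Keep only the N chunks closest to the query embedding; only these few
// chunks are appended to the prompt that is sent to the LLM.
static IEnumerable<string> SelectTopChunks(float[] queryVector, IEnumerable<Chunk> chunks, int n) =>
    chunks
        .OrderByDescending(c => CosineSimilarity(queryVector, c.Vector))
        .Take(n)
        .Select(c => c.Text);
```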
In essence, RAG (Retrieval-Augmented Generation) leverages traditional coding methods, which are capable of processing unlimited amounts of data, to extract a small, relevant subset of information for submission to the LLM (Large Language Model). Utilizing embeddings is just one of the many techniques available to accomplish this task.
All models have a limited context window, which is measured in tokens. Tokens are not the same as words, but for simplicity, they can be considered as such. To gain a better understanding of tokens in the context of LLMs, you can try using the online tokenizer from OpenAI.
Wisej.AI includes the ITokenizerService and offers a default implementation based on OpenAI's tokenizer. This service can be accessed directly by the Wisej.NET application to count tokens, truncate strings, or split a string into tokens.
Semantic Kernel provides its own mechanism for this; Wisej.AI does it automatically. See the related documentation for more information.
For a more detailed description of the automatic trimming employed by Wisej.AI, refer to the corresponding section under Usage Metrics.
Wisej.AI offers the IRerankingService and an overridable method named RerankAsync within the DocumentTools and DocumentSearchTools classes. The default implementation uses the IRerankingService to reorganize the RAG documents.
Refer to the corresponding section of the documentation to see straightforward examples of this approach.