DocumentSearchTools

Overview

The DocumentSearchTools module lists, queries, and summarizes documents stored in a vector database. It uses the IEmbeddingStorageService to build Retrieval-Augmented Generation (RAG) context for the AI.

The preconfigured prompt is this:

#
# DocumentSearchTools
#
[DocumentSearchTools]
Provides tools to search a set of documents to extract the content needed to complete your tasks.
Unless instructed otherwise, use this tool before other tools.

Instructions:
- When referring to a document always use the full name including the path, i.e. "My documents/report.pdf"

The document name must include the path.

[DocumentSearchTools.query_all_documents]
Queries all documents to extract the most relevant content to answer the user’s question.

[DocumentSearchTools.query_all_documents.question]
Rewrite the user’s question to enhance the search using RAG and embeddings.

[DocumentSearchTools.list_all_documents]
Returns the names of all available documents.

[DocumentSearchTools.query_single_document]
Extracts the most relevant content from a specific document.

[DocumentSearchTools.query_single_document.document_name]
Exact name of the document.

[DocumentSearchTools.query_single_document.question]
Rewrite the user’s question to enhance the search using RAG and embeddings.

[DocumentSearchTools.read_documents_metadata]
Returns the metadata for multiple documents.
Metadata include the title, original file path, document size, number of pages, created date, last modified date, and others.

[DocumentSearchTools.read_documents_metadata.document_names]
Exact names of the documents.

[DocumentSearchTools.summarize_document]
Summarizes a specific document.

[DocumentSearchTools.summarize_document.document_name]
Exact name of the document.

The DocumentSearchTools class exposes these methods as tools that the AI can call to search a virtually unlimited store of unstructured data and accomplish a variety of objectives.

Using DocumentSearchTools

To enable DocumentSearchTools, add it to a SmartHub, SmartAdapter, SmartSession, or SmartPrompt.

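// Search only the documents in the "Financials" collection.
this.smartChatBoxAdapter
    .UseTools(
        new DocumentSearchTools("Financials"));

// or: search the default collection.

this.smartChatBoxAdapter
    .UseTools(
        new DocumentSearchTools());

// or: search only the PDF documents in the "Reports" collection.

this.smartChatBoxAdapter
    .UseTools(
        new DocumentSearchTools("Reports", (f) => f.Name.EndsWith(".pdf")));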

When creating a DocumentSearchTools instance, you can specify a collection name, which works like a folder. You can also provide a filter callback, which the tool uses to decide which documents it considers.

You can also set or override several properties that define the limits DocumentSearchTools applies when retrieving content from the vector storage. These properties, listed under Properties, let you control the scope and scale of document retrieval.
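As a rough illustration of setting these limits, the fragment below configures an instance with an object initializer before passing it to UseTools. The property names come from the Properties table on this page; the values are arbitrary examples, and the fragment assumes the properties have public setters and that the required Wisej.AI namespaces are already imported.

```csharp
// Sketch only: values are illustrative, not recommendations.
// Assumes TopN, MinSimilarity and MaxContextTokens expose public setters.
var tools = new DocumentSearchTools("Financials")
{
    TopN = 5,                // use at most 5 qualifying chunks
    MinSimilarity = 0.4f,    // stricter than the 0.25f default
    MaxContextTokens = 2048  // smaller RAG context than the 4096 default
};

this.smartChatBoxAdapter.UseTools(tools);
```

Because the properties are also overridable, the same limits could instead be fixed in a derived class.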

Reranking

To rerank the chunks retrieved by the vector search, override the RerankAsync method. Your implementation can use any reranking approach, which lets you tune the ordering to improve the relevance of the search results.
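This page does not show the signature of RerankAsync, so the sketch below does not override it. It is a standalone illustration of one simple reranking approach (keyword overlap with the question) that such an override might delegate to. The ScoredChunk and KeywordReranker types and the method shape are hypothetical and are not part of Wisej.AI.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for a chunk returned by the vector search.
public sealed class ScoredChunk
{
    public ScoredChunk(string text, float similarity)
    {
        Text = text;
        Similarity = similarity;
    }

    public string Text { get; }
    public float Similarity { get; }
}

public static class KeywordReranker
{
    // Re-orders chunks by how many distinct question terms they contain,
    // using the original similarity score to break ties.
    public static Task<List<ScoredChunk>> RerankAsync(
        string question, IEnumerable<ScoredChunk> chunks)
    {
        var terms = question
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.ToLowerInvariant())
            .Distinct()
            .ToArray();

        var ordered = chunks
            .OrderByDescending(c => terms.Count(
                t => c.Text.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0))
            .ThenByDescending(c => c.Similarity)
            .ToList();

        return Task.FromResult(ordered);
    }
}
```

In practice a cross-encoder model or a hosted reranking API would usually replace the keyword count; the ordering step stays the same.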

Properties

Name

Description

CollectionName

Read-write and overridable. Default is null. CollectionName organizes documents in the vector storage database into distinct groups, similar to virtual folders. A vector query is restricted to the specified collection. When CollectionName is null, the default collection is used, so documents are always associated with a collection even if one is not explicitly defined.

TopN

Read-write and overridable. Default is 10. TopN is the maximum number of qualifying chunks used to build the RAG (Retrieval-Augmented Generation) context. A chunk qualifies if its similarity score exceeds the threshold defined by MinSimilarity. Adjust TopN to balance performance against accuracy for your application.

MaxClusters

Read-write and overridable. Default is 5. MaxClusters is the upper limit on the number of clusters generated by the summarization function, which uses the K-means clustering algorithm.

MinSimilarity

Read-write and overridable. Default is 0.25f. MinSimilarity is the minimum similarity threshold used to select qualifying chunks from documents. Chunks whose similarity score does not exceed this threshold are excluded, which keeps the selected content relevant to the query.

MaxContextTokens

Read-write and overridable. Default is 4096. MaxContextTokens is the maximum number of tokens returned to the AI in the Retrieval-Augmented Generation (RAG) context string. Limiting the token count keeps the context provided to the AI concise and manageable.

MaxDocumentsSearch

Read-write and overridable. Default is 100. This setting limits the number of documents returned by a generic search query. For example, if a search for documents related to quantum computing research identifies 10,000 potential matches, only a subset of at most this size is retrieved. Capping the document count keeps the handling and analysis of search results efficient.

RerankingEnabled

Read-write and overridable. Default is false. Enables reranking of the vector search results through the IRerankingService.
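To make the interplay of TopN, MinSimilarity, and MaxContextTokens more concrete, the standalone sketch below applies the three limits in sequence: keep chunks whose score exceeds the threshold, take the best TopN, then stop adding chunks once the token budget would be exceeded. This is an illustration of the documented behavior, not the Wisej.AI implementation. The RagContextSketch class is hypothetical, the order in which the library applies the limits is an assumption, and the word-count token estimate is a placeholder for a real tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RagContextSketch
{
    // Hypothetical token estimate: one token per whitespace-separated word.
    // A real tokenizer would produce different counts.
    private static int EstimateTokens(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

    // Defaults mirror the default values listed in the Properties table.
    public static List<string> SelectChunks(
        IEnumerable<(string Text, float Similarity)> chunks,
        int topN = 10,
        float minSimilarity = 0.25f,
        int maxContextTokens = 4096)
    {
        var selected = new List<string>();
        var usedTokens = 0;

        var qualifying = chunks
            .Where(c => c.Similarity > minSimilarity)   // must exceed the threshold
            .OrderByDescending(c => c.Similarity)       // best matches first
            .Take(topN);                                // at most TopN chunks

        foreach (var chunk in qualifying)
        {
            var tokens = EstimateTokens(chunk.Text);
            if (usedTokens + tokens > maxContextTokens)
                break;                                  // token budget reached

            selected.Add(chunk.Text);
            usedTokens += tokens;
        }

        return selected;
    }
}
```

Raising MinSimilarity or lowering TopN shrinks the candidate set before the token budget is applied, so MaxContextTokens acts as the final cap in this sketch.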

Services

This tool relies on several services, many of which are pre-configured by default:

ITokenizerService
IEmbeddingStorageService
IEmbeddingGenerationService
IRerankingService

Please note that the default IEmbeddingStorageService is FileSystemEmbeddingStorageService, which is intended solely for testing purposes. Before deploying this tool in an application, choose an appropriate vector storage solution and register the corresponding service, as demonstrated below:

Application.Services
    .AddOrReplaceService<IEmbeddingStorageService>(
        new PineconeEmbeddingStorageService(endpoint));

We recommend registering your own IDocumentConversionService and using a professional library such as Aspose for document-to-text conversion. The built-in converter currently uses PdfPig and OpenXML, the same tools used by Semantic Kernel. These tools have limitations and may not accurately process complex documents that contain tables and images.
