ITextSplitterService
The ITextSplitterService is used by Wisej.AI components and tools to divide large blocks of text into smaller chunks that are suitable for vectorization. At present, this service is primarily used by the SmartHub.IngestDocument() methods.
The default implementation of the ITextSplitterService is a class derived from the base TextSplitterServiceBase class. It is a slightly modified version of an existing RecursiveCharacterTextSplitter implementation.
The RecursiveCharacterTextSplitter approach breaks large blocks of text into smaller chunks by recursively splitting the text at specified character boundaries (separators). Text that is still too large after splitting on one separator is split again on the next one, until every chunk fits within the configured size.
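The recursive splitting described above can be sketched in a few lines. This is an illustration of the general technique only, written in Python for brevity; it is not the Wisej.AI source code, and the function names recursive_split and merge_pieces are hypothetical.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into pieces no longer than chunk_size characters.

    Separators are tried in order; a piece that is still too long after
    splitting on one separator is split again on the next one.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, remaining))
    return pieces


def merge_pieces(pieces, chunk_size, joiner=" "):
    """Greedily pack small pieces back together without exceeding chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta\n\ngamma delta epsilon\nzeta"
pieces = recursive_split(text, 12, ["\n\n", "\n", " "])
chunks = merge_pieces(pieces, 12)
```

Splitting on the coarsest separator first keeps paragraphs and lines together whenever they fit, and only falls back to word-level or fixed-width cuts when they do not.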
A key feature of this approach is that consecutive chunks can overlap. The overlap preserves important contextual information across chunk boundaries, which improves the accuracy and coherence of subsequent text processing tasks.
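To make the overlap concrete, here is a minimal sketch that uses a plain fixed-width sliding window instead of separator-aware splitting. It is a simplification for illustration, not the Wisej.AI implementation; the function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Return fixed-width chunks where each chunk repeats the last
    chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
```

In this example each chunk starts with the final two characters of the previous chunk, so a phrase that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.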
In the default implementation, the chunkSize is set to 1000 characters, and the chunkOverlap is set to 200 characters. The text is split using the separators "\n\n", "\n", and " ". These settings ensure efficient text division while maintaining context across chunks. However, you can easily modify these default settings to suit your specific needs or take full control of the service to customize its behavior entirely.
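As a rough guide to what these defaults mean in practice, the sketch below estimates how many chunks a document of a given length produces. It assumes fixed-width chunks that advance by chunkSize minus chunkOverlap characters; the real splitter cuts at separators, so actual counts will differ somewhat. The constant and function names are hypothetical and are not part of the Wisej.AI API.

```python
import math

# Default values quoted in the text above.
DEFAULT_CHUNK_SIZE = 1000
DEFAULT_CHUNK_OVERLAP = 200
DEFAULT_SEPARATORS = ["\n\n", "\n", " "]


def estimate_chunk_count(text_length,
                         chunk_size=DEFAULT_CHUNK_SIZE,
                         chunk_overlap=DEFAULT_CHUNK_OVERLAP):
    """Approximate the number of chunks for a text of text_length characters,
    assuming every chunk after the first advances by chunk_size - chunk_overlap."""
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


count_5000 = estimate_chunk_count(5000)
count_10000 = estimate_chunk_count(10000)
```

With the defaults, each new chunk contributes about 800 new characters, so a larger overlap preserves more context at the cost of more chunks to vectorize.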
If you choose to implement the ITextSplitterService yourself, you can derive your implementation from the TextSplitterServiceBase class and build on its built-in services, which simplifies development. Alternatively, you are free to implement any kind of text chunking that your application requires.