IDocumentConversionService

Overview

The IDocumentConversionService is used by Wisej.AI components that need to convert documents into text.

Components rely on this service to process and analyze text extracted from non-text documents. This lets Wisej.NET controls and third-party widgets handle and interpret a wider range of document formats.

Default Implementation

The default implementation, DefaultDocumentConversionService, supports formats such as PDF, HTML, and DOCX. It relies on open-source libraries, including PdfPig and OpenXML, the same libraries used by SemanticKernel. However, these libraries may not offer the same level of performance or features as commercial alternatives like Aspose.

Supported conversions:

Format

Description

PDF

The DefaultDocumentConversionService implementation uses PdfPig to convert PDF documents into pages of text. For improved results, especially when dealing with documents that contain tables, it is recommended to use Aspose or other commercial libraries.

Returns an array of pages and fills the Metadata object from the metadata stored in the document with these values: "Pages", "Title", "Author", "Subject".

Currently, the built-in converter does not support images embedded in PDF documents.

HTML

Uses HtmlAgilityPack to read HTML content and rebuild it with simplified tags that represent tables and paragraphs. The result is a leaner, more consistent representation of the original HTML structure.

Returns the entire content in the first string of the return array and reads the following metadata: "Title", "Author", "Subject", "Description".

Note that the converter does not execute JavaScript when reading HTML, so content generated by scripts is not included.

DOCX

Uses OpenXML to read and extract text from the paragraphs of Word documents. Word documents do not inherently store page numbers or page breaks, because these are determined when the document is rendered. To convert a Word document page by page, use a commercial library such as Aspose, which can handle page-specific conversions.

Returns an array of paragraphs and fills the Metadata object from the metadata stored in the XML with the following values: "Title", "Author", "Subject", "Description".

Images, charts, and other embedded objects are not supported.

XLSX

Uses OpenXML to read and extract worksheets, rows and cells from the Excel document.

Returns an array of worksheets and fills the Metadata object from the metadata stored in the XML with the following values: "Title", "Author", "Subject", "Description".

Images, charts, and other embedded objects are not supported.

TXT, CSV

Text and CSV content is read as-is and returned in the first string of the return array.
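
The table above implies that the meaning of each string in the returned array depends on the source format. The following is a minimal, language-neutral sketch of that convention, written in Python for brevity. It does not use the Wisej.AI API: the names CHUNK_UNIT, METADATA_KEYS, and label_chunks are hypothetical, and the lookup values are only a restatement of the table.

```python
from __future__ import annotations

# What each string in the returned array represents, per the table above.
# All names here are illustrative assumptions, not Wisej.AI identifiers.
CHUNK_UNIT = {
    "pdf": "page",
    "html": "document",   # entire content in the first string
    "docx": "paragraph",
    "xlsx": "worksheet",
    "txt": "document",    # content returned as-is in the first string
    "csv": "document",
}

# Metadata values the table lists for each format.
METADATA_KEYS = {
    "pdf": ("Pages", "Title", "Author", "Subject"),
    "html": ("Title", "Author", "Subject", "Description"),
    "docx": ("Title", "Author", "Subject", "Description"),
    "xlsx": ("Title", "Author", "Subject", "Description"),
    "txt": (),            # none listed in the table
    "csv": (),            # none listed in the table
}


def label_chunks(file_type: str, chunks: list[str]) -> list[tuple[str, str]]:
    """Pair each converted string with a label such as 'page 2'."""
    key = file_type.lower().lstrip(".")
    unit = CHUNK_UNIT.get(key)
    if unit is None:
        raise ValueError(f"unsupported file type: {file_type!r}")
    return [(f"{unit} {i}", text) for i, text in enumerate(chunks, start=1)]


if __name__ == "__main__":
    for label, text in label_chunks(".pdf", ["First page text", "Second page text"]):
        print(label, "->", text)
```

A consumer that stores or indexes the converted text can use a mapping like this to record whether a given string came from a page, a paragraph, or a worksheet, since the array alone does not carry that information.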

