Description
It appears that MANSPIDER (or Extractous) fails to extract text from certain files, producing the following errors:
For .xlsx:
[-] Error extracting text from {REDACTED}.xlsx: ParseError("Parse error occurred : Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@f595efa")
For .txt:
[-] Error extracting text from {REDACTED}.txt: ParseError("Parse error occurred : TIKA-198: Illegal IOException from org.apache.tika.parser.image.JpegParser@2bd2d48e")
For .xml:
[-] Error extracting text from {REDACTED}.xml: ParseError("Parse error occurred : Unexpected RuntimeException from org.apache.tika.parser.xml.DcXMLParser@45d8c372")
For .pdf:
[-] Error extracting text from {REDACTED}.pdf: ParseError("Parse error occurred : TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@2e4f9d21")
For .docx:
[-] Error extracting text from {REDACTED}.docx: ParseError("Parse error occurred : TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4861ee7d")
I have tested this issue both by installing MANSPIDER via pipx and by running the latest source code directly from the repository. The issue is present in both cases.
I have manually verified that none of these files are corrupt and that all of them can be downloaded.
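In case it helps with triage, here is a rough sketch for grouping these errors by file extension and Tika parser class, which makes mismatches such as a .txt file being handed to JpegParser easier to spot. The regular expression is inferred only from the log lines quoted above, and `summarize` is a hypothetical helper name, not part of MANSPIDER or Extractous; the real log format may vary.

```python
import re
from collections import Counter

# Assumed log format, inferred from the error lines quoted in this issue.
LINE_RE = re.compile(
    r'Error extracting text from (?P<file>.+?): ParseError\("Parse error occurred : '
    r'(?P<reason>.*?) from (?P<parser>[\w.]+)@[0-9a-f]+"\)'
)


def summarize(log_lines):
    """Count failures per (file extension, parser class name) pair."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an extraction error line
        # Extension is whatever follows the last dot in the file name.
        extension = match.group("file").rsplit(".", 1)[-1].lower()
        # Keep only the short class name, e.g. "JpegParser".
        parser = match.group("parser").rsplit(".", 1)[-1]
        counts[(extension, parser)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '[-] Error extracting text from {REDACTED}.txt: ParseError("Parse error '
        'occurred : TIKA-198: Illegal IOException from '
        'org.apache.tika.parser.image.JpegParser@2bd2d48e")',
    ]
    for (extension, parser), count in summarize(sample).items():
        print(f".{extension} -> {parser}: {count}")
```

Feeding the full MANSPIDER output through `summarize` should show whether the failures cluster around a particular parser or around files whose detected type does not match their extension.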