mirror of
https://github.com/LogicLabs-OU/OpenArchiver.git
synced 2026-04-06 00:31:57 +02:00
This commit introduces the capability to perform Optical Character Recognition (OCR) on PDF files that consist of images, such as scanned documents. Previously, the system only extracted existing text layers from PDFs, meaning content from scanned documents was not indexed.

The text extraction logic is now updated to first check for a text layer. If none is found, it converts the PDF pages to PNG images and runs them through the Tesseract OCR engine.

Key changes:
- Add `pdf-to-png-converter` dependency to handle PDF-to-image conversion.
- Update the text extraction workflow to trigger OCR for textless PDFs.
- Add `image/webp` to the list of supported OCR mime types.
- Standardize the internal Tesseract data path to `/opt/open-archiver/tessdata` in the Docker configuration and environment variables for consistency.
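
The fallback order described above (text layer first, OCR only when no text is found) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `PdfOcrDeps`, `extractPdfText`, and the three injected functions are hypothetical names, and the real calls into `pdf-to-png-converter` and Tesseract are deliberately left behind those interfaces rather than guessed at.

```typescript
// Hypothetical interface: none of these names come from the OpenArchiver codebase.
interface PdfOcrDeps {
  // Returns the embedded text layer, or an empty string when the PDF has none.
  extractTextLayer(pdf: Uint8Array): Promise<string>;
  // Renders every page to a PNG image (the step the commit assigns to pdf-to-png-converter).
  renderPagesToPng(pdf: Uint8Array): Promise<Uint8Array[]>;
  // Runs OCR on a single image (the step the commit assigns to Tesseract).
  ocrImage(png: Uint8Array): Promise<string>;
}

// Prefer the existing text layer; fall back to per-page OCR for textless PDFs.
async function extractPdfText(pdf: Uint8Array, deps: PdfOcrDeps): Promise<string> {
  const layerText = (await deps.extractTextLayer(pdf)).trim();
  if (layerText.length > 0) {
    return layerText; // Text layer present: no OCR needed.
  }

  // No text layer: rasterize the pages and OCR them one by one, in page order.
  const pages = await deps.renderPagesToPng(pdf);
  const pageTexts: string[] = [];
  for (const png of pages) {
    pageTexts.push((await deps.ocrImage(png)).trim());
  }

  // Drop pages where OCR found nothing and join the rest with newlines.
  return pageTexts.filter((text) => text.length > 0).join("\n");
}
```

Injecting the three steps keeps the sketch independent of any particular library signature and makes the fallback easy to exercise with stubs. Pages are processed sequentially here for simplicity; whether the actual implementation does the same is not stated in the commit message.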