OpenArchiver

mirror of https://github.com/LogicLabs-OU/OpenArchiver.git synced 2026-04-06 00:31:57 +02:00

Files

Wayne e1a3886431 Add OCR support for image-based PDFs

This commit introduces the capability to perform Optical Character Recognition (OCR) on PDF files that consist of images, such as scanned documents.

Previously, the system only extracted existing text layers from PDFs, meaning content from scanned documents was not indexed. The text extraction logic is now updated to first check for a text layer. If none is found, it converts the PDF pages to PNG images and runs them through the Tesseract OCR engine.
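The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `PdfOcrDeps`, `hasTextLayer`, and `extractPdfText` are hypothetical names, and the three injected functions stand in for whatever text-layer extractor, `pdf-to-png-converter` call, and Tesseract wrapper the backend really uses. The rule that a whitespace-only text layer counts as "no text layer" is also an assumption.

```typescript
// Hypothetical dependency interface: the real backend would wire in its own
// text-layer extractor, PDF-to-PNG renderer, and Tesseract OCR wrapper here.
interface PdfOcrDeps {
  extractTextLayer(pdf: Uint8Array): Promise<string>;
  renderPagesToPng(pdf: Uint8Array): Promise<Uint8Array[]>;
  ocrImage(png: Uint8Array): Promise<string>;
}

// Assumed rule: a text layer is "present" if it has any non-whitespace content.
function hasTextLayer(text: string): boolean {
  return text.trim().length > 0;
}

// Try the embedded text layer first; fall back to per-page OCR only if it is empty.
async function extractPdfText(pdf: Uint8Array, deps: PdfOcrDeps): Promise<string> {
  const layer = await deps.extractTextLayer(pdf);
  if (hasTextLayer(layer)) {
    return layer; // fast path: no rendering or OCR needed
  }

  // Scanned PDF: render each page to PNG and OCR the pages in order.
  const pages = await deps.renderPagesToPng(pdf);
  const pageTexts: string[] = [];
  for (const png of pages) {
    pageTexts.push(await deps.ocrImage(png));
  }
  return pageTexts.join("\n");
}
```

Injecting the three steps as functions keeps the control flow testable with stubs and avoids running OCR on PDFs that already carry text.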

Key changes:
- Add `pdf-to-png-converter` dependency to handle PDF-to-image conversion.
- Update the text extraction workflow to trigger OCR for textless PDFs.
- Add `image/webp` to the list of supported OCR MIME types.
- Standardize the internal Tesseract data path to `/opt/open-archiver/tessdata` in the Docker configuration and environment variables for consistency.
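As a rough illustration of the last two items, the sketch below shows one way a MIME-type allowlist and a standardized tessdata path could be expressed. Only `image/webp` and the `/opt/open-archiver/tessdata` path come from the commit message; the other MIME types in the set, the `OCR_TESSDATA_PATH` variable name, and both function names are assumptions made for the example.

```typescript
// Default path taken from the commit message.
const DEFAULT_TESSDATA_PATH = "/opt/open-archiver/tessdata";

// Only image/webp is confirmed by the commit; the other entries are assumed.
const OCR_MIME_TYPES: ReadonlySet<string> = new Set([
  "image/png",  // assumed
  "image/jpeg", // assumed
  "image/webp", // added in this commit
]);

// Normalize a Content-Type value (drop parameters, lowercase) before the lookup.
function isOcrMimeType(mimeType: string): boolean {
  const base = mimeType.split(";")[0].trim().toLowerCase();
  return OCR_MIME_TYPES.has(base);
}

// OCR_TESSDATA_PATH is a hypothetical variable name; fall back to the default path.
function resolveTessdataPath(env: Record<string, string | undefined>): string {
  const configured = env["OCR_TESSDATA_PATH"];
  return configured && configured.trim().length > 0
    ? configured
    : DEFAULT_TESSDATA_PATH;
}
```

Passing the environment in as a plain object, rather than reading a global, makes it easy to check that the Docker configuration and the application agree on one path.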

2025-09-07 14:55:35 +03:00

backend

Add OCR support for image-based PDFs

2025-09-07 14:55:35 +03:00

frontend

Docs: code formatting (#92)

2025-09-06 18:06:59 +03:00

types

Feat: Implement API key authentication (#84)

2025-09-04 15:07:53 +03:00