OCR support for attachment indexing #242

Closed
opened 2026-04-05 16:17:10 +02:00 by MrUnknownDE · 0 comments
Owner

Originally created by @wayneshn on 9/6/2025

The goal is to enhance the indexing process so that it extracts text from a wider range of email attachments. This involves integrating an Optical Character Recognition (OCR) library to handle image-based files, improve text recovery from scanned or image-only PDF documents, and support additional formats such as TIFF.

OCR Library: tesseract.js

tesseract.js is a JavaScript library that wraps a WebAssembly build of the Tesseract OCR engine. It runs directly in a Node.js environment, so no native Tesseract binary is required.

Plans

  1. Create a centralized OcrService

A singleton service will be created to manage a persistent pool of Tesseract workers for the lifetime of the indexing.worker process.
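A minimal sketch of what such a service could look like, under stated assumptions: `OcrWorker`, `WorkerFactory`, `getInstance`, `recognize`, and `shutdown` are hypothetical names, and the worker factory is injected so the pool logic does not depend on a specific tesseract.js version. A real factory would presumably wrap tesseract.js's `createWorker` and adapt its result to this interface.

```typescript
// Hypothetical worker shape; a real one would adapt a tesseract.js worker.
interface OcrWorker {
  recognize(image: Uint8Array): Promise<string>;
  terminate(): Promise<void>;
}

type WorkerFactory = () => Promise<OcrWorker>;

class OcrService {
  private static instance: OcrService | null = null;

  private readonly factory: WorkerFactory;
  private readonly poolSize: number;
  private readonly all: OcrWorker[] = [];   // every worker created so far
  private readonly idle: OcrWorker[] = [];  // workers not running a job
  private readonly waiters: Array<(w: OcrWorker) => void> = [];
  private created = 0;                      // counts workers being created too

  private constructor(factory: WorkerFactory, poolSize: number) {
    this.factory = factory;
    this.poolSize = poolSize;
  }

  // One shared instance per process; later calls ignore their arguments.
  static getInstance(factory: WorkerFactory, poolSize = 2): OcrService {
    if (OcrService.instance === null) {
      OcrService.instance = new OcrService(factory, poolSize);
    }
    return OcrService.instance;
  }

  private async acquire(): Promise<OcrWorker> {
    const free = this.idle.pop();
    if (free !== undefined) return free;
    if (this.created < this.poolSize) {
      this.created += 1; // reserve the slot before awaiting
      try {
        const worker = await this.factory();
        this.all.push(worker);
        return worker;
      } catch (err) {
        this.created -= 1;
        throw err;
      }
    }
    // Pool is full: wait until a busy worker is released.
    return new Promise<OcrWorker>((resolve) => this.waiters.push(resolve));
  }

  private release(worker: OcrWorker): void {
    const next = this.waiters.shift();
    if (next !== undefined) next(worker);
    else this.idle.push(worker);
  }

  async recognize(image: Uint8Array): Promise<string> {
    const worker = await this.acquire();
    try {
      return await worker.recognize(image);
    } finally {
      this.release(worker);
    }
  }

  // Terminates every worker; assumes no jobs are still in flight.
  async shutdown(): Promise<void> {
    const workers = this.all.splice(0);
    this.idle.length = 0;
    this.created = 0;
    await Promise.all(workers.map((w) => w.terminate()));
    OcrService.instance = null;
  }
}
```

Creating workers lazily up to `poolSize` and reusing them avoids paying the engine start-up cost for every attachment, which is the main reason to keep the pool alive for the lifetime of the process.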

  2. Update textExtractor.ts to support more file types

The extractText function will be updated to handle a wider range of file types that can benefit from OCR.
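One way the dispatch inside extractText could be organized is sketched below. The function names, the MIME type list, and the 20-character threshold are illustrative assumptions, not taken from the existing code. Note that TIFF images and PDF pages may first need converting to a raster format the OCR engine accepts.

```typescript
type ExtractionStrategy = "ocr" | "text-then-ocr" | "text";

// Hypothetical set of image types to route straight to OCR.
const OCR_IMAGE_TYPES = new Set<string>([
  "image/png",
  "image/jpeg",
  "image/tiff",
  "image/bmp",
  "image/webp",
]);

// Decide how an attachment should be processed, based on its MIME type.
function chooseStrategy(mimeType: string): ExtractionStrategy {
  // Drop parameters such as "; charset=binary" and normalize case.
  const normalized = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  if (OCR_IMAGE_TYPES.has(normalized)) return "ocr";
  // PDFs may carry a text layer; try that first, fall back to OCR.
  if (normalized === "application/pdf") return "text-then-ocr";
  return "text";
}

// Heuristic: a PDF whose text layer is (almost) empty is probably a scan.
function looksScanned(extractedText: string, minChars = 20): boolean {
  return extractedText.replace(/\s+/g, "").length < minChars;
}
```

With this split, OCR runs only on pure images and on PDFs whose embedded text layer turns out to be empty, so ordinary text PDFs keep their faster extraction path.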

  3. Integrate OCR service into the indexing worker

Modify packages/backend/src/workers/indexing.worker.ts to include graceful shutdown for the OcrService.
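The shutdown wiring might look roughly like the following sketch. `Closable`, `SignalSource`, and `registerGracefulShutdown` are hypothetical names; in the worker, `source` would be Node's `process` and `exit` would wrap `process.exit`.

```typescript
// Anything that can be shut down asynchronously, e.g. the OCR service.
interface Closable {
  shutdown(): Promise<void>;
}

// Minimal view of an object that emits signals; Node's `process` fits it.
interface SignalSource {
  once(signal: string, handler: () => void): unknown;
}

function registerGracefulShutdown(
  source: SignalSource,
  resources: Closable[],
  exit: (code: number) => void,
  signals: string[] = ["SIGTERM", "SIGINT"],
): () => Promise<void> {
  let pending: Promise<void> | null = null;

  // Runs at most once, even if several signals arrive.
  const run = (): Promise<void> => {
    if (pending === null) {
      pending = (async () => {
        let code = 0;
        try {
          // Shut resources down in the order given, one at a time.
          for (const resource of resources) {
            await resource.shutdown();
          }
        } catch {
          code = 1;
        }
        exit(code);
      })();
    }
    return pending;
  };

  for (const signal of signals) {
    source.once(signal, () => {
      void run();
    });
  }
  return run;
}
```

In the worker this could be called as `registerGracefulShutdown(process, [queueWorker, ocrService], (code) => process.exit(code))`, where `queueWorker` stands for whatever object consumes indexing jobs. Listing it before the OCR service lets in-flight jobs finish before their Tesseract workers are terminated.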

  4. Install language packs upon Docker build

Modify packages/backend/Dockerfile to install wget and download language packs during the build.
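On the runtime side, the workers would then need to load the pre-downloaded packs instead of fetching them. A hedged sketch follows: the directory `/opt/tessdata` and both helper functions are hypothetical, and the option names (`langPath`, `gzip`, `cacheMethod`) follow the tesseract.js worker options as I understand them, so they should be checked against the installed version.

```typescript
// Hypothetical directory where the Docker build stored *.traineddata files.
const TESSDATA_DIR = "/opt/tessdata";

interface LocalLangOptions {
  langPath: string;      // where to look for language data
  gzip: boolean;         // false: files are plain .traineddata, not .gz
  cacheMethod: "none";   // do not write a second cached copy
}

// Options intended to be passed to the OCR worker at creation time.
function localLanguageOptions(dir: string = TESSDATA_DIR): LocalLangOptions {
  return { langPath: dir, gzip: false, cacheMethod: "none" };
}

// Report which requested languages have no pack in the directory listing,
// so the worker can fail fast at start-up instead of at first OCR job.
function missingLanguages(wanted: string[], filesInDir: string[]): string[] {
  const present = new Set(filesInDir);
  return wanted.filter((lang) => !present.has(`${lang}.traineddata`));
}
```

Baking the packs into the image keeps the indexing worker from downloading language data at runtime, which matters for deployments without outbound network access.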

Reference: github/OpenArchiver#242