OCR support for attachment indexing #242

MrUnknownDE · 2026-04-05T16:17:10+02:00

Originally created by @wayneshn on 9/6/2025

The goal is to enhance the indexing process so that text can be extracted from a wider range of email attachments. This involves integrating a robust Optical Character Recognition (OCR) library to handle image-based files, improve text recovery from scanned or image-based PDF documents, and support other formats such as TIFF.

OCR Library: `tesseract.js`

`tesseract.js` is a JavaScript library that wraps a WebAssembly build of the Tesseract OCR engine, so it runs directly in a Node.js environment without a native Tesseract install.

Plans

Create a centralized `OcrService`

A singleton service will be created to manage a persistent pool of Tesseract workers for the lifetime of the `indexing.worker` process.
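
A minimal sketch of what such a singleton pool could look like. The `OcrWorker` interface, the `WorkerFactory` type, and the pool size are assumptions made for illustration and are not taken from the project. With `tesseract.js`, the factory would presumably wrap the object returned by `createWorker()`, which exposes `recognize()` and `terminate()`; the library also ships its own scheduler via `createScheduler()`, which might make a hand-written pool unnecessary.

```typescript
// Hypothetical worker shape; a real factory would adapt a tesseract.js worker to it.
interface OcrWorker {
  recognize(image: Uint8Array): Promise<string>;
  terminate(): Promise<void>;
}

type WorkerFactory = () => Promise<OcrWorker>;

class OcrService {
  private static instance: OcrService | null = null;
  private all: OcrWorker[] = [];
  private idle: OcrWorker[] = [];
  private waiters: Array<(worker: OcrWorker) => void> = [];
  private ready: Promise<void>;

  private constructor(factory: WorkerFactory, size: number) {
    // Start all workers once; they stay alive until shutdown() is called.
    this.ready = Promise.all(
      Array.from({ length: size }, () => factory())
    ).then((workers) => {
      this.all = workers;
      this.idle = workers.slice();
    });
  }

  // The factory and size are only used on the first call.
  static getInstance(factory: WorkerFactory, size = 2): OcrService {
    if (OcrService.instance === null) {
      OcrService.instance = new OcrService(factory, size);
    }
    return OcrService.instance;
  }

  // Borrow a worker, run OCR, and always hand the worker back.
  async recognize(image: Uint8Array): Promise<string> {
    await this.ready;
    const worker = await this.acquire();
    try {
      return await worker.recognize(image);
    } finally {
      this.release(worker);
    }
  }

  private acquire(): Promise<OcrWorker> {
    const worker = this.idle.pop();
    if (worker !== undefined) return Promise.resolve(worker);
    // No idle worker: queue until release() hands one over.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  private release(worker: OcrWorker): void {
    const next = this.waiters.shift();
    if (next !== undefined) next(worker);
    else this.idle.push(worker);
  }

  // Terminates every worker; this sketch does not wait for in-flight jobs.
  async shutdown(): Promise<void> {
    await this.ready;
    await Promise.all(this.all.map((w) => w.terminate()));
    this.all = [];
    this.idle = [];
    OcrService.instance = null;
  }
}
```

Injecting the factory keeps the pool independent of the OCR library, which also makes it easy to exercise with a fake worker in tests.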

Update `textExtractor.ts` to support more file types

The `extractText` function will be updated to handle a wider range of file types that can benefit from OCR.
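
One way the routing inside `extractText` might be sketched. The strategy names, the MIME type list, the 20-character threshold, and the helper names `chooseStrategy`, `needsOcrFallback`, and `extractWithOcr` are all hypothetical. The idea is that images go straight to OCR, PDFs are parsed first and only fall back to OCR when the text layer is nearly empty (which suggests a scan), and everything else keeps using the existing parsers.

```typescript
type ExtractionStrategy = "ocr" | "parser" | "parser-then-ocr";

// Assumed list of image types to route to OCR; some formats may need
// conversion before the OCR engine can read them.
const OCR_IMAGE_TYPES = new Set([
  "image/png",
  "image/jpeg",
  "image/tiff",
  "image/bmp",
  "image/webp",
]);

function chooseStrategy(mimeType: string): ExtractionStrategy {
  // Drop parameters such as "; charset=..." and normalize case.
  const type = mimeType.toLowerCase().split(";")[0].trim();
  if (OCR_IMAGE_TYPES.has(type)) return "ocr";
  if (type === "application/pdf") return "parser-then-ocr";
  return "parser"; // existing non-OCR extraction path
}

// A PDF whose text layer is almost empty is probably a scanned document.
function needsOcrFallback(parsedText: string, minChars = 20): boolean {
  return parsedText.replace(/\s+/g, "").length < minChars;
}

// `parse` and `ocr` are injected so the routing logic stays library-agnostic.
// Rendering PDF pages to images before OCR is left out of this sketch.
async function extractWithOcr(
  mimeType: string,
  parse: () => Promise<string>,
  ocr: () => Promise<string>
): Promise<string> {
  const strategy = chooseStrategy(mimeType);
  if (strategy === "ocr") return ocr();
  const parsed = await parse();
  if (strategy === "parser-then-ocr" && needsOcrFallback(parsed)) {
    return ocr();
  }
  return parsed;
}
```

Trying the parser first for PDFs avoids paying the OCR cost for documents that already carry a usable text layer.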

Integrate OCR service into the indexing worker

Modify `packages/backend/src/workers/indexing.worker.ts` to include graceful shutdown for the `OcrService`.
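
A rough sketch of how the shutdown hook could be wired. The function name `registerGracefulShutdown` and the `SignalSource` and `Closable` interfaces are hypothetical; the issue does not say which signals the worker handles, so `SIGTERM` and `SIGINT` are assumed here. The signal source is passed in rather than referenced globally so the logic can be tested without sending real signals.

```typescript
// Minimal structural types so the sketch does not depend on Node typings.
// In the worker, Node's global `process` is what would be passed as the source.
interface SignalSource {
  once(event: string, handler: () => void): unknown;
}

interface Closable {
  shutdown(): Promise<void>;
}

function registerGracefulShutdown(
  source: SignalSource,
  resources: Closable[],
  onDone: (exitCode: number) => void
): void {
  let shuttingDown = false;

  const handler = (): void => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    (async () => {
      let exitCode = 0;
      // Close resources one by one, in the order given.
      for (const resource of resources) {
        try {
          await resource.shutdown();
        } catch (err) {
          exitCode = 1; // keep going, but report the failure
        }
      }
      onDone(exitCode);
    })();
  };

  source.once("SIGTERM", handler);
  source.once("SIGINT", handler);
}

// Possible wiring in the worker entry point (assumed, not from the source):
//   registerGracefulShutdown(process, [ocrService], (code) => process.exit(code));
```

Terminating the OCR workers before exit gives them a chance to release their resources instead of being killed along with the process.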

Install language packs upon Docker build

Modify `packages/backend/Dockerfile` to install `wget` and download language packs during the build.
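
Since the language packs would be baked into the image, the service could verify at startup that they are actually present. The sketch below is only an illustration: the helper names, the directory `/opt/tessdata`, and the language codes are assumptions, as the issue does not name the languages or the download location. Tesseract language data files are named `<lang>.traineddata`, and `tesseract.js` accepts a `langPath` option when creating a worker, which is presumably where the downloaded directory would be passed.

```typescript
// Build the expected path of a language pack inside the data directory.
function languagePackPath(dir: string, lang: string): string {
  const base = dir.endsWith("/") ? dir.slice(0, -1) : dir;
  return `${base}/${lang}.traineddata`;
}

// Return the languages whose pack is missing. `exists` is injected;
// in Node it could be `fs.existsSync`.
function findMissingLanguagePacks(
  dir: string,
  langs: string[],
  exists: (path: string) => boolean
): string[] {
  return langs.filter((lang) => !exists(languagePackPath(dir, lang)));
}

// Possible startup check (directory and languages are hypothetical):
//   const missing = findMissingLanguagePacks("/opt/tessdata", ["eng", "deu"], fs.existsSync);
//   if (missing.length > 0) throw new Error(`Missing language packs: ${missing.join(", ")}`);
```

Failing fast on a missing pack surfaces a broken Docker build immediately, rather than at the first attachment that needs that language.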

MrUnknownDE closed this issue

2026-04-05 16:17:10 +02:00

Reference: github/OpenArchiver#242
