Feat: Tika Integration and Batch Indexing #203

Closed
opened 2026-04-05 16:17:02 +02:00 by MrUnknownDE · 0 comments
Owner

Originally created by @wayneshn on 9/24/2025

This PR introduces two features to enhance Open Archiver's capabilities: Apache Tika integration for text extraction from most attachment types and batch indexing for improved performance in Meilisearch.


Key Features

1. Apache Tika Integration for OCR

  • Enhanced Text Extraction: We've integrated Apache Tika to extract text and metadata from a wide range of file types, including PDFs, Office documents, and image-based files. This improves search by making the content of attachments searchable.
  • New OcrService: A new OcrService has been implemented to handle OCR operations. It includes:
    • Caching: A simple LRU cache for Tika results to reduce redundant processing and improve performance.
    • Semaphore: A semaphore to manage concurrent Tika requests, preventing resource exhaustion.
    • Health Check: A health check for the Tika server on startup.
  • Configuration: The Tika integration is enabled by setting the TIKA_URL environment variable.
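The caching and semaphore pieces described above could look roughly like the following. This is a minimal sketch, not the PR's actual `OcrService`: the names `LruCache`, `Semaphore`, and `createCachedExtractor` are hypothetical, the cache key is assumed to be a content hash of the attachment, and the HTTP call to the Tika server is represented by an injected `extract` function so the sketch stays self-contained. The startup health check is not shown.

```typescript
// Hypothetical sketch: none of these names are taken from the PR.

/** Minimal LRU cache that relies on Map preserving insertion order. */
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}

/** Counting semaphore that caps the number of concurrent operations. */
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No permit free: wait until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // pass the permit directly to the next waiter
    else this.available++;
  }
}

/** Stand-in for the request that sends a file to Tika and returns its text. */
type Extractor = (file: Uint8Array) => Promise<string>;

/** Wrap an extractor with a result cache and a concurrency limit. */
function createCachedExtractor(extract: Extractor, cacheSize: number, maxConcurrent: number) {
  const cache = new LruCache<string>(cacheSize);
  const semaphore = new Semaphore(maxConcurrent);

  return async (contentHash: string, file: Uint8Array): Promise<string> => {
    const cached = cache.get(contentHash);
    if (cached !== undefined) return cached; // cache hit: no extraction needed

    await semaphore.acquire();
    try {
      const text = await extract(file);
      cache.set(contentHash, text);
      return text;
    } finally {
      semaphore.release(); // always free the permit, even if extraction fails
    }
  };
}
```

In this sketch, two simultaneous requests for the same uncached attachment would both reach the extractor; whether the real service deduplicates in-flight requests is not stated in the PR.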

2. Batch Indexing for Meilisearch

  • Improved Indexing Performance: The indexing process now supports batching, so documents are sent to Meilisearch in groups rather than one at a time. This speeds up the ingestion and indexing of large volumes of emails.
  • Configurable Batch Size: The batch size can be configured using the MEILI_INDEXING_BATCH environment variable, allowing administrators to tune the indexing performance based on their hardware and workload.
  • Refactored Indexing Logic: The IndexingService and related processors have been updated to support the new batch indexing workflow.
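As an illustration of the batching workflow, the sketch below parses a batch size, splits pending documents into chunks, and sends each chunk through an injected function. It is written under assumptions rather than taken from the PR: `parseBatchSize`, `chunk`, `indexInBatches`, and the `sendBatch` callback are hypothetical names, and the fallback value is arbitrary.

```typescript
// Hypothetical sketch: function names and the fallback are assumptions.

/** Parse a batch size setting, falling back when it is missing or invalid. */
function parseBatchSize(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? '', 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

/** Split items into consecutive chunks of at most `size` elements. */
function chunk<T>(items: readonly T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/** Send documents one batch at a time and return the number of requests made. */
async function indexInBatches<T>(
  docs: readonly T[],
  batchSize: number,
  sendBatch: (batch: T[]) => Promise<void>,
): Promise<number> {
  let requests = 0;
  for (const batch of chunk(docs, batchSize)) {
    await sendBatch(batch);
    requests++;
  }
  return requests;
}
```

A caller might obtain the size with `parseBatchSize(process.env.MEILI_INDEXING_BATCH, 500)` and pass a `sendBatch` that forwards each chunk to the search index, for example via the Meilisearch JavaScript client's `addDocuments` method. With a batch size of 2, five documents result in three requests instead of five.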

Other Changes

  • Configuration: Added new environment variables for Tika and batch indexing to .env.example and the configuration files.
  • Docker Compose: Added a Tika service to the docker-compose.yml file.
  • Refactoring: Minor refactoring in the IngestionService and other related services to support the new features.
  • Types: Added new types for pending emails to be indexed.
Reference: github/OpenArchiver#203