OpenArchiver/docker-compose.yml
Wei S. d372ef7566 Feat: Tika Integration and Batch Indexing (#132)
* Feat/tika integration (#94)

* feat(Tika) Integration of Tika for text extraction

* feat(Tika) Integration of Apache Tika for text extraction

* feat(Tika): Complete Tika integration with text extraction and docker-compose setup

- Add Tika service to docker-compose.yml
- Implement text sanitization and document validation
- Improve batch processing with concurrency control

* fix(comments) translated comments into English
fix(docker) removed ports (only used for testing)

* feat(indexing): Implement batch indexing for Meilisearch

This change introduces batch processing for indexing emails into Meilisearch to significantly improve performance and throughput during ingestion. This change is based on the batch processing method previously contributed by @axeldunkel.

Previously, each email was indexed individually, resulting in a high number of separate API calls. This approach was inefficient, especially for large mailboxes.

The `processMailbox` queue worker now accumulates emails into a batch before sending them to the `IndexingService`. The service then uses the `addDocuments` Meilisearch API endpoint to index the entire batch in a single request, reducing network overhead and improving indexing speed.
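
The accumulation step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual worker code: `BatchAccumulator` and its methods are hypothetical names, and the point where a batch would be handed to `addDocuments` is only marked with a comment.

```typescript
// Hypothetical helper (not from the repository): buffers items and returns
// a full batch once `batchSize` items have accumulated.
class BatchAccumulator<T> {
  private pending: T[] = [];
  private readonly batchSize: number;

  constructor(batchSize: number) {
    if (!Number.isInteger(batchSize) || batchSize < 1) {
      throw new RangeError('batchSize must be a positive integer');
    }
    this.batchSize = batchSize;
  }

  // Returns a full batch when the threshold is reached, otherwise null.
  add(item: T): T[] | null {
    this.pending.push(item);
    return this.pending.length >= this.batchSize ? this.drain() : null;
  }

  // Returns whatever is buffered (possibly empty) and resets the buffer.
  drain(): T[] {
    const batch = this.pending;
    this.pending = [];
    return batch;
  }
}

// Usage: three emails with a batch size of 2 produce two batches.
// In a real worker, each batch would go to one `addDocuments` request.
const acc = new BatchAccumulator<{ id: string }>(2);
const sent: { id: string }[][] = [];
for (const id of ['a', 'b', 'c']) {
  const full = acc.add({ id });
  if (full) sent.push(full);
}
const rest = acc.drain(); // flush the final partial batch
if (rest.length > 0) sent.push(rest);
```

Draining the remainder at the end matters: without it, the last partial batch of a mailbox would never be indexed.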

A new environment variable, `MEILI_INDEXING_BATCH`, has been added to make the batch size configurable, with a default of 500.
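
One way the batch size might be resolved from the environment is sketched below. The function name `resolveBatchSize` is hypothetical, and falling back to the default for missing, non-numeric, or non-positive values is an assumption; the commit only states that the default is 500.

```typescript
const DEFAULT_BATCH_SIZE = 500; // default stated in the commit message

// Hypothetical helper: reads MEILI_INDEXING_BATCH from an env-like object.
// Assumption: invalid or non-positive values fall back to the default.
function resolveBatchSize(env: Record<string, string | undefined>): number {
  const raw = env['MEILI_INDEXING_BATCH'];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_BATCH_SIZE;
}
```

In a Node.js process this would typically be called as `resolveBatchSize(process.env)`.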

Additionally, this commit includes minor refactoring:
- The `TikaService` has been moved to its own dedicated file.
- The `PendingEmail` type has been moved to the shared `@open-archiver/types` package.

* chore(jobs): make continuous sync job scheduling idempotent

Adds a static `jobId` to the repeatable 'schedule-continuous-sync' job.

This prevents duplicate jobs from being scheduled if the server restarts. By providing a unique ID, the queue will update the existing repeatable job instead of creating a new one, ensuring the sync runs only at the configured frequency.
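
The effect of a static `jobId` can be illustrated with a toy in-memory model. `RepeatableRegistry` below is a hypothetical stand-in, not the queue library the project uses, and real queue semantics may differ in detail; it only shows why a fixed key turns repeated scheduling into an update rather than an insert.

```typescript
interface RepeatableJob {
  name: string;
  everyMs: number;
}

// Toy stand-in for a job queue (hypothetical, for illustration only).
class RepeatableRegistry {
  private jobs = new Map<string, RepeatableJob>();
  private anonymousCounter = 0;

  // With a jobId the entry is upserted; without one, every call adds a new entry.
  schedule(job: RepeatableJob, jobId?: string): void {
    const key = jobId ?? `auto-${this.anonymousCounter++}`;
    this.jobs.set(key, job);
  }

  count(): number {
    return this.jobs.size;
  }
}

const job: RepeatableJob = { name: 'schedule-continuous-sync', everyMs: 60_000 };

// Simulate three server starts without a jobId: three duplicate entries.
const withoutId = new RepeatableRegistry();
for (let i = 0; i < 3; i++) withoutId.schedule(job);

// Simulate three server starts with a static jobId: a single entry.
const withId = new RepeatableRegistry();
for (let i = 0; i < 3; i++) withId.schedule(job, 'schedule-continuous-sync');
```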

---------

Co-authored-by: axeldunkel <53174090+axeldunkel@users.noreply.github.com>
Co-authored-by: Wayne <5291640+ringoinca@users.noreply.github.com>
2025-09-26 11:34:32 +02:00

75 lines
1.8 KiB
YAML

version: '3.8'

services:
    open-archiver:
        image: logiclabshq/open-archiver:latest
        container_name: open-archiver
        restart: unless-stopped
        ports:
            - '3000:3000' # Frontend
        env_file:
            - .env
        volumes:
            - archiver-data:/var/data/open-archiver
        depends_on:
            - postgres
            - valkey
            - meilisearch
        networks:
            - open-archiver-net

    postgres:
        image: postgres:17-alpine
        container_name: postgres
        restart: unless-stopped
        environment:
            POSTGRES_DB: ${POSTGRES_DB:-open_archive}
            POSTGRES_USER: ${POSTGRES_USER:-admin}
            POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-password}
        volumes:
            - pgdata:/var/lib/postgresql/data
        networks:
            - open-archiver-net

    valkey:
        image: valkey/valkey:8-alpine
        container_name: valkey
        restart: unless-stopped
        command: valkey-server --requirepass ${REDIS_PASSWORD}
        volumes:
            - valkeydata:/data
        networks:
            - open-archiver-net

    meilisearch:
        image: getmeili/meilisearch:v1.15
        container_name: meilisearch
        restart: unless-stopped
        environment:
            MEILI_MASTER_KEY: ${MEILI_MASTER_KEY:-aSampleMasterKey}
        volumes:
            - meilidata:/meili_data
        networks:
            - open-archiver-net

    tika:
        image: apache/tika:3.2.2.0-full
        container_name: tika
        restart: always
        networks:
            - open-archiver-net

volumes:
    pgdata:
        driver: local
    valkeydata:
        driver: local
    meilidata:
        driver: local
    archiver-data:
        driver: local

networks:
    open-archiver-net:
        driver: bridge