Files
OpenArchiver/.env.example
Wei S. d372ef7566 Feat: Tika Integration and Batch Indexing (#132)
* Feat/tika integration (#94)

* feat(Tika) Integration von Tika zur Textextraktion

* feat(Tika) Integration of Apache Tika for text extraction

* feat(Tika): Complete Tika integration with text extraction and docker-compose setup

- Add Tika service to docker-compose.yml
- Implement text sanitization and document validation
- Improve batch processing with concurrency control

* fix(comments) translated comments into english
fix(docker) removed ports (only used for testing)

* feat(indexing): Implement batch indexing for Meilisearch

This change introduces batch processing for indexing emails into Meilisearch to significantly improve performance and throughput during ingestion. This change is based on the batch processing method previously contributed by @axeldunkel.

Previously, each email was indexed individually, resulting in a high number of separate API calls. This approach was inefficient, especially for large mailboxes.

The `processMailbox` queue worker now accumulates emails into a batch before sending them to the `IndexingService`. The service then uses the `addDocuments` Meilisearch API endpoint to index the entire batch in a single request, reducing network overhead and improving indexing speed.

A new environment variable, `MEILI_INDEXING_BATCH`, has been added to make the batch size configurable, with a default of 500.

Additionally, this commit includes minor refactoring:
- The `TikaService` has been moved to its own dedicated file.
- The `PendingEmail` type has been moved to the shared `@open-archiver/types` package.

* chore(jobs): make continuous sync job scheduling idempotent

Adds a static `jobId` to the repeatable 'schedule-continuous-sync' job.

This prevents duplicate jobs from being scheduled if the server restarts. By providing a unique ID, the queue will update the existing repeatable job instead of creating a new one, ensuring the sync runs only at the configured frequency.

---------

Co-authored-by: axeldunkel <53174090+axeldunkel@users.noreply.github.com>
Co-authored-by: Wayne <5291640+ringoinca@users.noreply.github.com>
2025-09-26 11:34:32 +02:00

82 lines
2.9 KiB
Plaintext

# --- Application Settings ---
# Set to 'production' for production environments
NODE_ENV=development
PORT_BACKEND=4000
PORT_FRONTEND=3000
# The frequency of continuous email syncing. Default is every minutes, but you can change it to another value based on your needs.
SYNC_FREQUENCY='* * * * *'
# --- Docker Compose Service Configuration ---
# These variables are used by docker-compose.yml to configure the services. Leave them unchanged if you use Docker services for Postgresql, Valkey (Redis) and Meilisearch. If you decide to use your own instances of these services, you can substitute them with your own connection credentials.
# PostgreSQL
POSTGRES_DB=open_archive
POSTGRES_USER=admin
POSTGRES_PASSWORD=password
DATABASE_URL="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}"
# Meilisearch
MEILI_MASTER_KEY=aSampleMasterKey
MEILI_HOST=http://meilisearch:7700
# The number of emails to batch together for indexing. Defaults to 500.
MEILI_INDEXING_BATCH=500
# Redis (We use Valkey, which is Redis-compatible and open source)
REDIS_HOST=valkey
REDIS_PORT=6379
REDIS_PASSWORD=defaultredispassword
# If you run Valkey service from Docker Compose, set the REDIS_TLS_ENABLED variable to false.
REDIS_TLS_ENABLED=false
# --- Storage Settings ---
# Choose your storage backend. Valid options are 'local' or 's3'.
STORAGE_TYPE=local
# The maximum request body size to accept in bytes including while streaming. The body size can also be specified with a unit suffix for kilobytes (K), megabytes (M), or gigabytes (G). For example, 512K or 1M. Defaults to 512kb. Or the value of Infinity if you don't want any upload limit.
BODY_SIZE_LIMIT=100M
# --- Local Storage Settings ---
# The path inside the container where files will be stored.
# This is mapped to a Docker volume for persistence.
# This is only used if STORAGE_TYPE is 'local'.
STORAGE_LOCAL_ROOT_PATH=/var/data/open-archiver
# --- S3-Compatible Storage Settings ---
# These are only used if STORAGE_TYPE is 's3'.
STORAGE_S3_ENDPOINT=
STORAGE_S3_BUCKET=
STORAGE_S3_ACCESS_KEY_ID=
STORAGE_S3_SECRET_ACCESS_KEY=
STORAGE_S3_REGION=
# Set to 'true' for MinIO and other non-AWS S3 services
STORAGE_S3_FORCE_PATH_STYLE=false
# --- Security & Authentication ---
# Rate Limiting
# The window in milliseconds for which API requests are checked. Defaults to 60000 (1 minute).
RATE_LIMIT_WINDOW_MS=60000
# The maximum number of API requests allowed from an IP within the window. Defaults to 100.
RATE_LIMIT_MAX_REQUESTS=100
# JWT
# IMPORTANT: Change this to a long, random, and secret string in your .env file
JWT_SECRET=a-very-secret-key-that-you-should-change
JWT_EXPIRES_IN="7d"
# Master Encryption Key for sensitive data (Such as Ingestion source credentials and passwords)
# IMPORTANT: Generate a secure, random 32-byte hex string for this
# You can use `openssl rand -hex 32` to generate a key.
ENCRYPTION_KEY=
# Apache Tika Integration
# ONLY active if TIKA_URL is set
TIKA_URL=http://tika:9998