add OCR docs

2026-04-06 00:31:57 +02:00 · 2025-09-26 12:08:34 +02:00
parent d372ef7566
commit b49d8a78ce
4 changed files with 116 additions and 301 deletions
--- a/docs/.vitepress/config.mts
+++ b/docs/.vitepress/config.mts
@@ -100,6 +100,7 @@ export default defineConfig({
 				items: [
 					{ text: 'Overview', link: '/services/' },
 					{ text: 'Storage Service', link: '/services/storage-service' },
+					{ text: 'OCR Service', link: '/services/ocr-service' },
 					{
 						text: 'IAM Service',
 						items: [{ text: 'IAM Policies', link: '/services/iam-service/iam-policy' }],
--- a/docs/services/iam-service.md
+++ b/docs/services/iam-service.md
@@ -1,289 +0,0 @@
-# IAM Policies
-
-This document provides a guide to creating and managing IAM policies in Open Archiver. It is intended for developers and administrators who need to configure granular access control for users and roles.
-
-## Policy Structure
-
-IAM policies are defined as an array of JSON objects, where each object represents a single permission rule. The structure of a policy object is as follows:
-
-```json
-{
-	"action": "read" OR ["read", "create"],
-	"subject": "ingestion" OR ["ingestion", "dashboard"],
-	"conditions": {
-		"field_name": "value"
-	},
-	"inverted": false OR true,
-}
-```
-
- `action`: The action(s) to be performed on the subject. Can be a single string or an array of strings.
- `subject`: The resource(s) or entity on which the action is to be performed. Can be a single string or an array of strings.
- `conditions`: (Optional) A set of conditions that must be met for the permission to be granted.
- `inverted`: (Optional) When set to `true`, this inverts the rule, turning it from a "can" rule into a "cannot" rule. This is useful for creating exceptions to broader permissions.
-
-## Actions
-
-The following actions are available for use in IAM policies:
-
- `manage`: A wildcard action that grants all permissions on a subject (`create`, `read`, `update`, `delete`, `search`, `sync`).
- `create`: Allows the user to create a new resource.
- `read`: Allows the user to view a resource.
- `update`: Allows the user to modify an existing resource.
- `delete`: Allows the user to delete a resource.
- `search`: Allows the user to search for resources.
- `sync`: Allows the user to synchronize a resource.
-
-## Subjects
-
-The following subjects are available for use in IAM policies:
-
- `all`: A wildcard subject that represents all resources.
- `archive`: Represents archived emails.
- `ingestion`: Represents ingestion sources.
- `settings`: Represents system settings.
- `users`: Represents user accounts.
- `roles`: Represents user roles.
- `dashboard`: Represents the dashboard.
-
-## Advanced Conditions with MongoDB-Style Queries
-
-Conditions are the key to creating fine-grained access control rules. They are defined as a JSON object where each key represents a field on the subject, and the value defines the criteria for that field.
-
-All conditions within a single rule are implicitly joined with an **AND** logic. This means that for a permission to be granted, the resource must satisfy _all_ specified conditions.
-
-The power of this system comes from its use of a subset of [MongoDB's query language](https://www.mongodb.com/docs/manual/), which provides a flexible and expressive way to define complex rules. These rules are translated into native queries for both the PostgreSQL database (via Drizzle ORM) and the Meilisearch engine.
-
-### Supported Operators and Examples
-
-Here is a detailed breakdown of the supported operators with examples.
-
-#### `$eq` (Equal)
-
-This is the default operator. If you provide a simple key-value pair, it is treated as an equality check.
-
-```json
-// This rule...
-{ "status": "active" }
-
-// ...is equivalent to this:
-{ "status": { "$eq": "active" } }
-```
-
-**Use Case**: Grant access to an ingestion source only if its status is `active`.
-
-#### `$ne` (Not Equal)
-
-Matches documents where the field value is not equal to the specified value.
-
-```json
-{ "provider": { "$ne": "pst_import" } }
-```
-
-**Use Case**: Allow a user to see all ingestion sources except for PST imports.
-
-#### `$in` (In Array)
-
-Matches documents where the field value is one of the values in the specified array.
-
-```json
-{
-	"id": {
-		"$in": ["INGESTION_ID_1", "INGESTION_ID_2"]
-	}
-}
-```
-
-**Use Case**: Grant an auditor access to a specific list of ingestion sources.
-
-#### `$nin` (Not In Array)
-
-Matches documents where the field value is not one of the values in the specified array.
-
-```json
-{ "provider": { "$nin": ["pst_import", "eml_import"] } }
-```
-
-**Use Case**: Hide all manual import sources from a specific user role.
-
-#### `$lt` / `$lte` (Less Than / Less Than or Equal)
-
-Matches documents where the field value is less than (`$lt`) or less than or equal to (`$lte`) the specified value. This is useful for numeric or date-based comparisons.
-
-```json
-{ "sentAt": { "$lt": "2024-01-01T00:00:00.000Z" } }
-```
-
-#### `$gt` / `$gte` (Greater Than / Greater Than or Equal)
-
-Matches documents where the field value is greater than (`$gt`) or greater than or equal to (`$gte`) the specified value.
-
-```json
-{ "sentAt": { "$lt": "2024-01-01T00:00:00.000Z" } }
-```
-
-#### `$exists`
-
-Matches documents that have (or do not have) the specified field.
-
-```json
-// Grant access only if a 'lastSyncStatusMessage' exists
-{ "lastSyncStatusMessage": { "$exists": true } }
-```
-
-## Inverted Rules: Creating Exceptions with `cannot`
-
-By default, all rules are "can" rules, meaning they grant permissions. However, you can create a "cannot" rule by adding `"inverted": true` to a policy object. This is extremely useful for creating exceptions to broader permissions.
-
-A common pattern is to grant broad access and then use an inverted rule to carve out a specific restriction.
-
-**Use Case**: Grant a user access to all ingestion sources _except_ for one specific source.
-
-This is achieved with two rules:
-
-1.  A "can" rule that grants `read` access to the `ingestion` subject.
-2.  An inverted "cannot" rule that denies `read` access for the specific ingestion `id`.
-
-```json
-[
-	{
-		"action": "read",
-		"subject": "ingestion"
-	},
-	{
-		"inverted": true,
-		"action": "read",
-		"subject": "ingestion",
-		"conditions": {
-			"id": "SPECIFIC_INGESTION_ID_TO_EXCLUDE"
-		}
-	}
-]
-```
-
-## Policy Evaluation Logic
-
-The system evaluates policies by combining all relevant rules for a user. The logic is simple:
-
- A user has permission if at least one `can` rule allows it.
- A permission is denied if a `cannot` (`"inverted": true`) rule explicitly forbids it, even if a `can` rule allows it. `cannot` rules always take precedence.
-
-### Dynamic Policies with Placeholders
-
-To create dynamic policies that are specific to the current user, you can use the `${user.id}` placeholder in the `conditions` object. This placeholder will be replaced with the ID of the current user at runtime.
-
-## Special Permissions for User and Role Management
-
-It is important to note that while `read` access to `users` and `roles` can be granted granularly, any actions that modify these resources (`create`, `update`, `delete`) are restricted to Super Admins.
-
-A user must have the `{ "action": "manage", "subject": "all" }` permission (Typically a Super Admin role) to manage users and roles. This is a security measure to prevent unauthorized changes to user accounts and permissions.
-
-## Policy Examples
-
-Here are several examples based on the default roles in the system, demonstrating how to combine actions, subjects, and conditions to achieve specific access control scenarios.
-
-### Administrator
-
-This policy grants a user full access to all resources using wildcards.
-
-```json
-[
-	{
-		"action": "manage",
-		"subject": "all"
-	}
-]
-```
-
-### End-User
-
-This policy allows a user to view the dashboard, create new ingestion sources, and fully manage the ingestion sources they own.
-
-```json
-[
-	{
-		"action": "read",
-		"subject": "dashboard"
-	},
-	{
-		"action": "create",
-		"subject": "ingestion"
-	},
-	{
-		"action": "manage",
-		"subject": "ingestion",
-		"conditions": {
-			"userId": "${user.id}"
-		}
-	},
-	{
-		"action": "manage",
-		"subject": "archive",
-		"conditions": {
-			"ingestionSource.userId": "${user.id}" // also needs to give permission to archived emails created by the user
-		}
-	}
-]
-```
-
-### Global Read-Only Auditor
-
-This policy grants read and search access across most of the application's resources, making it suitable for an auditor who needs to view data without modifying it.
-
-```json
-[
-	{
-		"action": ["read", "search"],
-		"subject": ["ingestion", "archive", "dashboard", "users", "roles"]
-	}
-]
-```
-
-### Ingestion Admin
-
-This policy grants full control over all ingestion sources and archives, but no other resources.
-
-```json
-[
-	{
-		"action": "manage",
-		"subject": "ingestion"
-	}
-]
-```
-
-### Auditor for Specific Ingestion Sources
-
-This policy demonstrates how to grant access to a specific list of ingestion sources using the `$in` operator.
-
-```json
-[
-	{
-		"action": ["read", "search"],
-		"subject": "ingestion",
-		"conditions": {
-			"id": {
-				"$in": ["INGESTION_ID_1", "INGESTION_ID_2"]
-			}
-		}
-	}
-]
-```
-
-### Limit Access to a Specific Mailbox
-
-This policy grants a user access to a specific ingestion source, but only allows them to see emails belonging to a single user within that source.
-
-This is achieved by defining two specific `can` rules: The rule grants `read` and `search` access to the `archive` subject, but the `userEmail` must match.
-
-```json
-[
-	{
-		"action": ["read", "search"],
-		"subject": "archive",
-		"conditions": {
-			"userEmail": "user1@example.com"
-		}
-	}
-]
-```
--- a/docs/services/ocr-service.md
+++ b/docs/services/ocr-service.md
@@ -0,0 +1,96 @@
+# OCR Service
+
+The OCR (Optical Character Recognition) and text extraction service is responsible for extracting plain text content from various file formats, such as PDFs, Office documents, and more. This is a crucial component for making email attachments searchable.
+
+## Overview
+
+The system employs a two-pronged approach for text extraction:
+
+1.  **Primary Extractor (Apache Tika)**: A powerful and versatile toolkit that can extract text from a wide variety of file formats. It is the recommended method for its superior performance and format support.
+2.  **Legacy Extractor**: A fallback mechanism that uses a combination of libraries (`pdf2json`, `mammoth`, `xlsx`) for common file types like PDF, DOCX, and XLSX. This is used when Apache Tika is not configured.
+
+The main logic resides in `packages/backend/src/helpers/textExtractor.ts`, which decides which extraction method to use based on the application's configuration.
+
+## Configuration
+
+To enable the primary text extraction method, you must configure the URL of an Apache Tika server instance in your environment variables.
+
+In your `.env` file, set the `TIKA_URL`:
+
+```env
+# .env.example
+
+# Apache Tika Integration
+# ONLY active if TIKA_URL is set
+TIKA_URL=http://tika:9998
+```
+
+If `TIKA_URL` is not set, the system will automatically fall back to the legacy extraction methods. The service performs a health check on startup to verify connectivity with the Tika server.
+
+## File Size Limits
+
+To prevent excessive memory usage and processing time, the service limits the size of files submitted for text extraction. Files larger than the applicable limit below are skipped.
+
+- **With Apache Tika**: The maximum file size is **100MB**.
+- **With Legacy Fallback**: The maximum file size is **50MB**.
+
+## Supported File Formats
+
+The service's ability to extract text depends on whether it's using Apache Tika or the legacy fallback methods.
+
+### With Apache Tika
+
+When `TIKA_URL` is configured, the service can process a vast range of file formats. Apache Tika is designed for broad compatibility and supports hundreds of file types, including but not limited to:
+
+- Portable Document Format (PDF)
+- Microsoft Office formats (DOC, DOCX, PPT, PPTX, XLS, XLSX)
+- OpenDocument Formats (ODT, ODS, ODP)
+- Rich Text Format (RTF)
+- Plain Text (TXT, CSV, JSON, XML, HTML)
+- Image formats with OCR capabilities (PNG, JPEG, TIFF)
+- Archive formats (ZIP, TAR, GZ)
+- Email formats (EML, MSG)
+
+For a complete and up-to-date list, please refer to the official [Apache Tika documentation](https://tika.apache.org/3.2.3/formats.html).
+
+### With Legacy Fallback
+
+When Tika is not configured, text extraction is limited to the following formats:
+
+- `application/pdf` (PDF)
+- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX)
+- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX)
+- Plain text formats such as `text/*`, `application/json`, and `application/xml`.
+
+## Features of the Tika Integration (`OcrService`)
+
+The `OcrService` (`packages/backend/src/services/OcrService.ts`) provides several enhancements to make text extraction efficient and robust.
+
+### Caching
+
+To avoid redundant processing of the same file, the service implements a simple LRU (Least Recently Used) cache.
+
+- **Cache Key**: A SHA-256 hash of the file's buffer is used as the cache key.
+- **Functionality**: If a file with the same hash is processed again, the text content is served directly from the cache, saving significant processing time.
+- **Statistics**: The service keeps track of cache hits, misses, and the hit rate for performance monitoring.
+
+### Concurrency Management (Semaphore)
+
+Extracting text from large files can be resource-intensive. To prevent the Tika server from being overwhelmed by multiple requests for the _same file_ simultaneously (e.g., during a large import), a semaphore mechanism is used.
+
+- **Functionality**: If a request for a specific file (identified by its hash) is already in progress, any subsequent requests for the same file will wait for the first one to complete and then use its result.
+- **Benefit**: This deduplicates parallel processing efforts and reduces unnecessary load on the Tika server.
+
+### Health Check and DNS Fallback
+
+- **Availability Check**: The service includes a `checkTikaAvailability` method to verify that the Tika server is reachable and operational. This check is performed on application startup.
+- **DNS Fallback**: For convenience in Docker environments, if the Tika URL uses the hostname `tika` (e.g., `http://tika:9998`), the service will automatically attempt a fallback to `localhost` if the initial connection fails.
+
+## Legacy Fallback Methods
+
+When Tika is not available, the `extractTextLegacy` function in `textExtractor.ts` handles extraction for a limited set of MIME types:
+
+- `application/pdf`: Processed using `pdf2json`. Includes a 50MB size limit and a 5-second timeout to prevent memory issues.
+- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX): Processed using `mammoth`.
+- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX): Processed using `xlsx`.
+- Plain text formats (`text/*`, `application/json`, `application/xml`): Converted directly from the buffer.
--- a/docs/user-guides/installation.md
+++ b/docs/user-guides/installation.md
@@ -76,18 +76,19 @@ Here is a complete list of environment variables available for configuration:

 These variables are used by `docker-compose.yml` to configure the services.

-| Variable            | Description                                     | Default Value                                            |
-| ------------------- | ----------------------------------------------- | -------------------------------------------------------- |
-| `POSTGRES_DB`       | The name of the PostgreSQL database.            | `open_archive`                                           |
-| `POSTGRES_USER`     | The username for the PostgreSQL database.       | `admin`                                                  |
-| `POSTGRES_PASSWORD` | The password for the PostgreSQL database.       | `password`                                               |
-| `DATABASE_URL`      | The connection URL for the PostgreSQL database. | `postgresql://admin:password@postgres:5432/open_archive` |
-| `MEILI_MASTER_KEY`  | The master key for Meilisearch.                 | `aSampleMasterKey`                                       |
-| `MEILI_HOST`        | The host for the Meilisearch service.           | `http://meilisearch:7700`                                |
-| `REDIS_HOST`        | The host for the Valkey (Redis) service.        | `valkey`                                                 |
-| `REDIS_PORT`        | The port for the Valkey (Redis) service.        | `6379`                                                   |
-| `REDIS_PASSWORD`    | The password for the Valkey (Redis) service.    | `defaultredispassword`                                   |
-| `REDIS_TLS_ENABLED` | Enable or disable TLS for Redis.                | `false`                                                  |
+| Variable               | Description                                          | Default Value                                            |
+| ---------------------- | ---------------------------------------------------- | -------------------------------------------------------- |
+| `POSTGRES_DB`          | The name of the PostgreSQL database.                 | `open_archive`                                           |
+| `POSTGRES_USER`        | The username for the PostgreSQL database.            | `admin`                                                  |
+| `POSTGRES_PASSWORD`    | The password for the PostgreSQL database.            | `password`                                               |
+| `DATABASE_URL`         | The connection URL for the PostgreSQL database.      | `postgresql://admin:password@postgres:5432/open_archive` |
+| `MEILI_MASTER_KEY`     | The master key for Meilisearch.                      | `aSampleMasterKey`                                       |
+| `MEILI_HOST`           | The host for the Meilisearch service.                | `http://meilisearch:7700`                                |
+| `MEILI_INDEXING_BATCH` | The number of emails to batch together for indexing. | `500`                                                    |
+| `REDIS_HOST`           | The host for the Valkey (Redis) service.             | `valkey`                                                 |
+| `REDIS_PORT`           | The port for the Valkey (Redis) service.             | `6379`                                                   |
+| `REDIS_PASSWORD`       | The password for the Valkey (Redis) service.         | `defaultredispassword`                                   |
+| `REDIS_TLS_ENABLED`    | Enable or disable TLS for Redis.                     | `false`                                                  |

 #### Storage Settings

@@ -114,6 +115,12 @@ These variables are used by `docker-compose.yml` to configure the services.
 | `RATE_LIMIT_MAX_REQUESTS`        | The maximum number of API requests allowed from an IP within the window.                                                                       | `100`                                      |
 | `ENCRYPTION_KEY`                 | A 32-byte hex string for encrypting sensitive data in the database.                                                                            |                                            |

+#### Apache Tika Integration
+
+| Variable   | Description                                                                                                                                                                          | Default Value      |
+| ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------ |
+| `TIKA_URL` | Optional. The URL of an Apache Tika server for advanced text extraction from attachments. If not set, the application falls back to built-in parsers for PDF, Word, and Excel files. | `http://tika:9998` |
+
 ## 3. Run the Application

 Once you have configured your `.env` file, you can start all the services using Docker Compose: