diff --git a/docs/.vitepress/config.mts b/docs/.vitepress/config.mts index b8c6848..92d951a 100644 --- a/docs/.vitepress/config.mts +++ b/docs/.vitepress/config.mts @@ -100,6 +100,7 @@ export default defineConfig({ items: [ { text: 'Overview', link: '/services/' }, { text: 'Storage Service', link: '/services/storage-service' }, + { text: 'OCR Service', link: '/services/ocr-service' }, { text: 'IAM Service', items: [{ text: 'IAM Policies', link: '/services/iam-service/iam-policy' }], diff --git a/docs/services/iam-service.md b/docs/services/iam-service.md deleted file mode 100644 index 24f9c0b..0000000 --- a/docs/services/iam-service.md +++ /dev/null @@ -1,289 +0,0 @@ -# IAM Policies - -This document provides a guide to creating and managing IAM policies in Open Archiver. It is intended for developers and administrators who need to configure granular access control for users and roles. - -## Policy Structure - -IAM policies are defined as an array of JSON objects, where each object represents a single permission rule. The structure of a policy object is as follows: - -```json -{ - "action": "read" OR ["read", "create"], - "subject": "ingestion" OR ["ingestion", "dashboard"], - "conditions": { - "field_name": "value" - }, - "inverted": false OR true, -} -``` - -- `action`: The action(s) to be performed on the subject. Can be a single string or an array of strings. -- `subject`: The resource(s) or entity on which the action is to be performed. Can be a single string or an array of strings. -- `conditions`: (Optional) A set of conditions that must be met for the permission to be granted. -- `inverted`: (Optional) When set to `true`, this inverts the rule, turning it from a "can" rule into a "cannot" rule. This is useful for creating exceptions to broader permissions. 
- -## Actions - -The following actions are available for use in IAM policies: - -- `manage`: A wildcard action that grants all permissions on a subject (`create`, `read`, `update`, `delete`, `search`, `sync`). -- `create`: Allows the user to create a new resource. -- `read`: Allows the user to view a resource. -- `update`: Allows the user to modify an existing resource. -- `delete`: Allows the user to delete a resource. -- `search`: Allows the user to search for resources. -- `sync`: Allows the user to synchronize a resource. - -## Subjects - -The following subjects are available for use in IAM policies: - -- `all`: A wildcard subject that represents all resources. -- `archive`: Represents archived emails. -- `ingestion`: Represents ingestion sources. -- `settings`: Represents system settings. -- `users`: Represents user accounts. -- `roles`: Represents user roles. -- `dashboard`: Represents the dashboard. - -## Advanced Conditions with MongoDB-Style Queries - -Conditions are the key to creating fine-grained access control rules. They are defined as a JSON object where each key represents a field on the subject, and the value defines the criteria for that field. - -All conditions within a single rule are implicitly joined with an **AND** logic. This means that for a permission to be granted, the resource must satisfy _all_ specified conditions. - -The power of this system comes from its use of a subset of [MongoDB's query language](https://www.mongodb.com/docs/manual/), which provides a flexible and expressive way to define complex rules. These rules are translated into native queries for both the PostgreSQL database (via Drizzle ORM) and the Meilisearch engine. - -### Supported Operators and Examples - -Here is a detailed breakdown of the supported operators with examples. - -#### `$eq` (Equal) - -This is the default operator. If you provide a simple key-value pair, it is treated as an equality check. - -```json -// This rule... 
-{ "status": "active" } - -// ...is equivalent to this: -{ "status": { "$eq": "active" } } -``` - -**Use Case**: Grant access to an ingestion source only if its status is `active`. - -#### `$ne` (Not Equal) - -Matches documents where the field value is not equal to the specified value. - -```json -{ "provider": { "$ne": "pst_import" } } -``` - -**Use Case**: Allow a user to see all ingestion sources except for PST imports. - -#### `$in` (In Array) - -Matches documents where the field value is one of the values in the specified array. - -```json -{ - "id": { - "$in": ["INGESTION_ID_1", "INGESTION_ID_2"] - } -} -``` - -**Use Case**: Grant an auditor access to a specific list of ingestion sources. - -#### `$nin` (Not In Array) - -Matches documents where the field value is not one of the values in the specified array. - -```json -{ "provider": { "$nin": ["pst_import", "eml_import"] } } -``` - -**Use Case**: Hide all manual import sources from a specific user role. - -#### `$lt` / `$lte` (Less Than / Less Than or Equal) - -Matches documents where the field value is less than (`$lt`) or less than or equal to (`$lte`) the specified value. This is useful for numeric or date-based comparisons. - -```json -{ "sentAt": { "$lt": "2024-01-01T00:00:00.000Z" } } -``` - -#### `$gt` / `$gte` (Greater Than / Greater Than or Equal) - -Matches documents where the field value is greater than (`$gt`) or greater than or equal to (`$gte`) the specified value. - -```json -{ "sentAt": { "$lt": "2024-01-01T00:00:00.000Z" } } -``` - -#### `$exists` - -Matches documents that have (or do not have) the specified field. - -```json -// Grant access only if a 'lastSyncStatusMessage' exists -{ "lastSyncStatusMessage": { "$exists": true } } -``` - -## Inverted Rules: Creating Exceptions with `cannot` - -By default, all rules are "can" rules, meaning they grant permissions. However, you can create a "cannot" rule by adding `"inverted": true` to a policy object. 
This is extremely useful for creating exceptions to broader permissions. - -A common pattern is to grant broad access and then use an inverted rule to carve out a specific restriction. - -**Use Case**: Grant a user access to all ingestion sources _except_ for one specific source. - -This is achieved with two rules: - -1. A "can" rule that grants `read` access to the `ingestion` subject. -2. An inverted "cannot" rule that denies `read` access for the specific ingestion `id`. - -```json -[ - { - "action": "read", - "subject": "ingestion" - }, - { - "inverted": true, - "action": "read", - "subject": "ingestion", - "conditions": { - "id": "SPECIFIC_INGESTION_ID_TO_EXCLUDE" - } - } -] -``` - -## Policy Evaluation Logic - -The system evaluates policies by combining all relevant rules for a user. The logic is simple: - -- A user has permission if at least one `can` rule allows it. -- A permission is denied if a `cannot` (`"inverted": true`) rule explicitly forbids it, even if a `can` rule allows it. `cannot` rules always take precedence. - -### Dynamic Policies with Placeholders - -To create dynamic policies that are specific to the current user, you can use the `${user.id}` placeholder in the `conditions` object. This placeholder will be replaced with the ID of the current user at runtime. - -## Special Permissions for User and Role Management - -It is important to note that while `read` access to `users` and `roles` can be granted granularly, any actions that modify these resources (`create`, `update`, `delete`) are restricted to Super Admins. - -A user must have the `{ "action": "manage", "subject": "all" }` permission (Typically a Super Admin role) to manage users and roles. This is a security measure to prevent unauthorized changes to user accounts and permissions. - -## Policy Examples - -Here are several examples based on the default roles in the system, demonstrating how to combine actions, subjects, and conditions to achieve specific access control scenarios. 
- -### Administrator - -This policy grants a user full access to all resources using wildcards. - -```json -[ - { - "action": "manage", - "subject": "all" - } -] -``` - -### End-User - -This policy allows a user to view the dashboard, create new ingestion sources, and fully manage the ingestion sources they own. - -```json -[ - { - "action": "read", - "subject": "dashboard" - }, - { - "action": "create", - "subject": "ingestion" - }, - { - "action": "manage", - "subject": "ingestion", - "conditions": { - "userId": "${user.id}" - } - }, - { - "action": "manage", - "subject": "archive", - "conditions": { - "ingestionSource.userId": "${user.id}" // also needs to give permission to archived emails created by the user - } - } -] -``` - -### Global Read-Only Auditor - -This policy grants read and search access across most of the application's resources, making it suitable for an auditor who needs to view data without modifying it. - -```json -[ - { - "action": ["read", "search"], - "subject": ["ingestion", "archive", "dashboard", "users", "roles"] - } -] -``` - -### Ingestion Admin - -This policy grants full control over all ingestion sources and archives, but no other resources. - -```json -[ - { - "action": "manage", - "subject": "ingestion" - } -] -``` - -### Auditor for Specific Ingestion Sources - -This policy demonstrates how to grant access to a specific list of ingestion sources using the `$in` operator. - -```json -[ - { - "action": ["read", "search"], - "subject": "ingestion", - "conditions": { - "id": { - "$in": ["INGESTION_ID_1", "INGESTION_ID_2"] - } - } - } -] -``` - -### Limit Access to a Specific Mailbox - -This policy grants a user access to a specific ingestion source, but only allows them to see emails belonging to a single user within that source. - -This is achieved by defining two specific `can` rules: The rule grants `read` and `search` access to the `archive` subject, but the `userEmail` must match. 
- -```json -[ - { - "action": ["read", "search"], - "subject": "archive", - "conditions": { - "userEmail": "user1@example.com" - } - } -] -``` diff --git a/docs/services/ocr-service.md b/docs/services/ocr-service.md new file mode 100644 index 0000000..f719ffa --- /dev/null +++ b/docs/services/ocr-service.md @@ -0,0 +1,96 @@ +# OCR Service + +The OCR (Optical Character Recognition) and text extraction service is responsible for extracting plain text content from various file formats, such as PDFs, Office documents, and more. This is a crucial component for making email attachments searchable. + +## Overview + +The system employs a two-pronged approach for text extraction: + +1. **Primary Extractor (Apache Tika)**: A powerful and versatile toolkit that can extract text from a wide variety of file formats. It is the recommended method for its superior performance and format support. +2. **Legacy Extractor**: A fallback mechanism that uses a combination of libraries (`pdf2json`, `mammoth`, `xlsx`) for common file types like PDF, DOCX, and XLSX. This is used when Apache Tika is not configured. + +The main logic resides in `packages/backend/src/helpers/textExtractor.ts`, which decides which extraction method to use based on the application's configuration. + +## Configuration + +To enable the primary text extraction method, you must configure the URL of an Apache Tika server instance in your environment variables. + +In your `.env` file, set the `TIKA_URL`: + +```env +# .env.example + +# Apache Tika Integration +# ONLY active if TIKA_URL is set +TIKA_URL=http://tika:9998 +``` + +If `TIKA_URL` is not set, the system will automatically fall back to the legacy extraction methods. The service performs a health check on startup to verify connectivity with the Tika server. + +## File Size Limits + +To prevent excessive memory usage and processing time, the service imposes a general size limit on files submitted for text extraction. 
Files larger than the configured limit will be skipped. + +- **With Apache Tika**: The maximum file size is **100MB**. +- **With Legacy Fallback**: The maximum file size is **50MB**. + +## Supported File Formats + +The service's ability to extract text depends on whether it's using Apache Tika or the legacy fallback methods. + +### With Apache Tika + +When `TIKA_URL` is configured, the service can process a vast range of file formats. Apache Tika is designed for broad compatibility and supports hundreds of file types, including but not limited to: + +- Portable Document Format (PDF) +- Microsoft Office formats (DOC, DOCX, PPT, PPTX, XLS, XLSX) +- OpenDocument Formats (ODT, ODS, ODP) +- Rich Text Format (RTF) +- Plain Text (TXT, CSV, JSON, XML, HTML) +- Image formats with OCR capabilities (PNG, JPEG, TIFF) +- Archive formats (ZIP, TAR, GZ) +- Email formats (EML, MSG) + +For a complete and up-to-date list, please refer to the official [Apache Tika documentation](https://tika.apache.org/3.2.3/formats.html). + +### With Legacy Fallback + +When Tika is not configured, text extraction is limited to the following formats: + +- `application/pdf` (PDF) +- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX) +- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX) +- Plain text formats such as `text/*`, `application/json`, and `application/xml`. + +## Features of the Tika Integration (`OcrService`) + +The `OcrService` (`packages/backend/src/services/OcrService.ts`) provides several enhancements to make text extraction efficient and robust. + +### Caching + +To avoid redundant processing of the same file, the service implements a simple LRU (Least Recently Used) cache. + +- **Cache Key**: A SHA-256 hash of the file's buffer is used as the cache key. +- **Functionality**: If a file with the same hash is processed again, the text content is served directly from the cache, saving significant processing time. 
+- **Statistics**: The service keeps track of cache hits, misses, and the hit rate for performance monitoring. + +### Concurrency Management (Semaphore) + +Extracting text from large files can be resource-intensive. To prevent the Tika server from being overwhelmed by multiple requests for the _same file_ simultaneously (e.g., during a large import), a semaphore mechanism is used. + +- **Functionality**: If a request for a specific file (identified by its hash) is already in progress, any subsequent requests for the same file will wait for the first one to complete and then use its result. +- **Benefit**: This deduplicates parallel processing efforts and reduces unnecessary load on the Tika server. + +### Health Check and DNS Fallback + +- **Availability Check**: The service includes a `checkTikaAvailability` method to verify that the Tika server is reachable and operational. This check is performed on application startup. +- **DNS Fallback**: For convenience in Docker environments, if the Tika URL uses the hostname `tika` (e.g., `http://tika:9998`), the service will automatically attempt a fallback to `localhost` if the initial connection fails. + +## Legacy Fallback Methods + +When Tika is not available, the `extractTextLegacy` function in `textExtractor.ts` handles extraction for a limited set of MIME types: + +- `application/pdf`: Processed using `pdf2json`. Includes a 50MB size limit and a 5-second timeout to prevent memory issues. +- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX): Processed using `mammoth`. +- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX): Processed using `xlsx`. +- Plain text formats (`text/*`, `application/json`, `application/xml`): Converted directly from the buffer. 
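The caching and same-file deduplication described under "Features of the Tika Integration" can be sketched roughly as follows. This is a minimal illustration, not the code in `OcrService.ts`: the class name `DedupingTextCache`, the `Extractor` type, and the `maxEntries` default are hypothetical, and the real service may structure its cache, semaphore, and statistics differently. The sketch assumes a Node.js runtime and relies on `Map` preserving insertion order to approximate LRU eviction.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical stand-in for whatever actually talks to the Tika server.
type Extractor = (file: Buffer) => Promise<string>;

class DedupingTextCache {
	// Map keeps insertion order, so the first key is the least recently used.
	private cache = new Map<string, string>();
	// Extractions currently running, keyed by file hash.
	private inFlight = new Map<string, Promise<string>>();
	hits = 0;
	misses = 0;

	constructor(
		private readonly extract: Extractor,
		private readonly maxEntries = 50 // assumed default, not from the source
	) {}

	getText(file: Buffer): Promise<string> {
		// Cache key: SHA-256 hash of the file's buffer.
		const key = createHash('sha256').update(file).digest('hex');

		const cached = this.cache.get(key);
		if (cached !== undefined) {
			this.hits++;
			// Re-insert so this entry becomes the most recently used.
			this.cache.delete(key);
			this.cache.set(key, cached);
			return Promise.resolve(cached);
		}

		// Same file already being processed: wait for that result instead.
		// (In this sketch, waiters count as neither a hit nor a miss.)
		const pending = this.inFlight.get(key);
		if (pending) return pending;

		this.misses++;
		const job = this.extract(file)
			.then((text) => {
				this.store(key, text);
				return text;
			})
			.finally(() => this.inFlight.delete(key));
		this.inFlight.set(key, job);
		return job;
	}

	private store(key: string, text: string): void {
		this.cache.set(key, text);
		if (this.cache.size > this.maxEntries) {
			// Evict the least recently used entry.
			const oldest = this.cache.keys().next().value;
			if (oldest !== undefined) this.cache.delete(oldest);
		}
	}

	get hitRate(): number {
		const total = this.hits + this.misses;
		return total === 0 ? 0 : this.hits / total;
	}
}
```

Keying both maps by the content hash means two attachments with identical bytes share one extraction even if their filenames differ. A failed extraction is not cached here, so a later request for the same file retries.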
diff --git a/docs/user-guides/installation.md b/docs/user-guides/installation.md index abb342e..2d34b5d 100644 --- a/docs/user-guides/installation.md +++ b/docs/user-guides/installation.md @@ -76,18 +76,19 @@ Here is a complete list of environment variables available for configuration: These variables are used by `docker-compose.yml` to configure the services. -| Variable | Description | Default Value | -| ------------------- | ----------------------------------------------- | -------------------------------------------------------- | -| `POSTGRES_DB` | The name of the PostgreSQL database. | `open_archive` | -| `POSTGRES_USER` | The username for the PostgreSQL database. | `admin` | -| `POSTGRES_PASSWORD` | The password for the PostgreSQL database. | `password` | -| `DATABASE_URL` | The connection URL for the PostgreSQL database. | `postgresql://admin:password@postgres:5432/open_archive` | -| `MEILI_MASTER_KEY` | The master key for Meilisearch. | `aSampleMasterKey` | -| `MEILI_HOST` | The host for the Meilisearch service. | `http://meilisearch:7700` | -| `REDIS_HOST` | The host for the Valkey (Redis) service. | `valkey` | -| `REDIS_PORT` | The port for the Valkey (Redis) service. | `6379` | -| `REDIS_PASSWORD` | The password for the Valkey (Redis) service. | `defaultredispassword` | -| `REDIS_TLS_ENABLED` | Enable or disable TLS for Redis. | `false` | +| Variable | Description | Default Value | +| ---------------------- | ---------------------------------------------------- | -------------------------------------------------------- | +| `POSTGRES_DB` | The name of the PostgreSQL database. | `open_archive` | +| `POSTGRES_USER` | The username for the PostgreSQL database. | `admin` | +| `POSTGRES_PASSWORD` | The password for the PostgreSQL database. | `password` | +| `DATABASE_URL` | The connection URL for the PostgreSQL database. | `postgresql://admin:password@postgres:5432/open_archive` | +| `MEILI_MASTER_KEY` | The master key for Meilisearch. 
| `aSampleMasterKey` | +| `MEILI_HOST` | The host for the Meilisearch service. | `http://meilisearch:7700` | +| `MEILI_INDEXING_BATCH` | The number of emails to batch together for indexing. | `500` | +| `REDIS_HOST` | The host for the Valkey (Redis) service. | `valkey` | +| `REDIS_PORT` | The port for the Valkey (Redis) service. | `6379` | +| `REDIS_PASSWORD` | The password for the Valkey (Redis) service. | `defaultredispassword` | +| `REDIS_TLS_ENABLED` | Enable or disable TLS for Redis. | `false` | #### Storage Settings @@ -114,6 +115,12 @@ These variables are used by `docker-compose.yml` to configure the services. | `RATE_LIMIT_MAX_REQUESTS` | The maximum number of API requests allowed from an IP within the window. | `100` | | `ENCRYPTION_KEY` | A 32-byte hex string for encrypting sensitive data in the database. | | +#### Apache Tika Integration + +| Variable | Description | Default Value | +| ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------ | +| `TIKA_URL` | Optional. The URL of an Apache Tika server for advanced text extraction from attachments. If not set, the application falls back to built-in parsers for PDF, Word, and Excel files. | `http://tika:9998` | + ## 3. Run the Application Once you have configured your `.env` file, you can start all the services using Docker Compose: