mirror of
https://github.com/LogicLabs-OU/OpenArchiver.git
synced 2026-04-06 00:31:57 +02:00
add OCR docs
@@ -100,6 +100,7 @@ export default defineConfig({
|
||||
items: [
|
||||
{ text: 'Overview', link: '/services/' },
|
||||
{ text: 'Storage Service', link: '/services/storage-service' },
|
||||
{ text: 'OCR Service', link: '/services/ocr-service' },
|
||||
{
|
||||
text: 'IAM Service',
|
||||
items: [{ text: 'IAM Policies', link: '/services/iam-service/iam-policy' }],
|
||||
|
||||
@@ -1,289 +0,0 @@
|
||||
# IAM Policies
|
||||
|
||||
This document provides a guide to creating and managing IAM policies in Open Archiver. It is intended for developers and administrators who need to configure granular access control for users and roles.
|
||||
|
||||
## Policy Structure
|
||||
|
||||
IAM policies are defined as an array of JSON objects, where each object represents a single permission rule. The structure of a policy object is as follows:
|
||||
|
||||
```json
{
    "action": "read",
    "subject": ["ingestion", "dashboard"],
    "conditions": {
        "field_name": "value"
    },
    "inverted": false
}
```
|
||||
|
||||
- `action`: The action(s) to be performed on the subject. Can be a single string or an array of strings.
|
||||
- `subject`: The resource(s) or entity on which the action is to be performed. Can be a single string or an array of strings.
|
||||
- `conditions`: (Optional) A set of conditions that must be met for the permission to be granted.
|
||||
- `inverted`: (Optional) When set to `true`, this inverts the rule, turning it from a "can" rule into a "cannot" rule. This is useful for creating exceptions to broader permissions.
|
||||
|
||||
## Actions
|
||||
|
||||
The following actions are available for use in IAM policies:
|
||||
|
||||
- `manage`: A wildcard action that grants all permissions on a subject (`create`, `read`, `update`, `delete`, `search`, `sync`).
|
||||
- `create`: Allows the user to create a new resource.
|
||||
- `read`: Allows the user to view a resource.
|
||||
- `update`: Allows the user to modify an existing resource.
|
||||
- `delete`: Allows the user to delete a resource.
|
||||
- `search`: Allows the user to search for resources.
|
||||
- `sync`: Allows the user to synchronize a resource.
|
||||
|
||||
## Subjects
|
||||
|
||||
The following subjects are available for use in IAM policies:
|
||||
|
||||
- `all`: A wildcard subject that represents all resources.
|
||||
- `archive`: Represents archived emails.
|
||||
- `ingestion`: Represents ingestion sources.
|
||||
- `settings`: Represents system settings.
|
||||
- `users`: Represents user accounts.
|
||||
- `roles`: Represents user roles.
|
||||
- `dashboard`: Represents the dashboard.
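
For readers who write policies from code, the structure, actions, and subjects listed above can be captured in a small TypeScript sketch. The type names `Action`, `Subject`, and `PolicyRule` and the helper `toArray` are hypothetical and not taken from the Open Archiver codebase; only the field names and allowed values come from this document.

```typescript
// Hypothetical type names; the string values are the actions and subjects listed above.
type Action = "manage" | "create" | "read" | "update" | "delete" | "search" | "sync";
type Subject = "all" | "archive" | "ingestion" | "settings" | "users" | "roles" | "dashboard";

interface PolicyRule {
    action: Action | Action[];
    subject: Subject | Subject[];
    conditions?: Record<string, unknown>; // optional field criteria
    inverted?: boolean; // true turns the rule into a "cannot" rule
}

// Normalize the "single string or array" form into an array.
function toArray<T>(value: T | T[]): T[] {
    return Array.isArray(value) ? value : [value];
}

const example: PolicyRule = {
    action: ["read", "create"],
    subject: "ingestion",
    inverted: false,
};
```

Typing the rule this way lets the compiler reject misspelled actions or subjects before a policy is ever saved.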
|
||||
|
||||
## Advanced Conditions with MongoDB-Style Queries
|
||||
|
||||
Conditions are the key to creating fine-grained access control rules. They are defined as a JSON object where each key represents a field on the subject, and the value defines the criteria for that field.
|
||||
|
||||
All conditions within a single rule are implicitly joined with an **AND** logic. This means that for a permission to be granted, the resource must satisfy _all_ specified conditions.
|
||||
|
||||
The power of this system comes from its use of a subset of [MongoDB's query language](https://www.mongodb.com/docs/manual/), which provides a flexible and expressive way to define complex rules. These rules are translated into native queries for both the PostgreSQL database (via Drizzle ORM) and the Meilisearch engine.
|
||||
|
||||
### Supported Operators and Examples
|
||||
|
||||
Here is a detailed breakdown of the supported operators with examples.
|
||||
|
||||
#### `$eq` (Equal)
|
||||
|
||||
This is the default operator. If you provide a simple key-value pair, it is treated as an equality check.
|
||||
|
||||
```json
|
||||
// This rule...
|
||||
{ "status": "active" }
|
||||
|
||||
// ...is equivalent to this:
|
||||
{ "status": { "$eq": "active" } }
|
||||
```
|
||||
|
||||
**Use Case**: Grant access to an ingestion source only if its status is `active`.
|
||||
|
||||
#### `$ne` (Not Equal)
|
||||
|
||||
Matches documents where the field value is not equal to the specified value.
|
||||
|
||||
```json
|
||||
{ "provider": { "$ne": "pst_import" } }
|
||||
```
|
||||
|
||||
**Use Case**: Allow a user to see all ingestion sources except for PST imports.
|
||||
|
||||
#### `$in` (In Array)
|
||||
|
||||
Matches documents where the field value is one of the values in the specified array.
|
||||
|
||||
```json
|
||||
{
|
||||
"id": {
|
||||
"$in": ["INGESTION_ID_1", "INGESTION_ID_2"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Use Case**: Grant an auditor access to a specific list of ingestion sources.
|
||||
|
||||
#### `$nin` (Not In Array)
|
||||
|
||||
Matches documents where the field value is not one of the values in the specified array.
|
||||
|
||||
```json
|
||||
{ "provider": { "$nin": ["pst_import", "eml_import"] } }
|
||||
```
|
||||
|
||||
**Use Case**: Hide all manual import sources from a specific user role.
|
||||
|
||||
#### `$lt` / `$lte` (Less Than / Less Than or Equal)
|
||||
|
||||
Matches documents where the field value is less than (`$lt`) or less than or equal to (`$lte`) the specified value. This is useful for numeric or date-based comparisons.
|
||||
|
||||
```json
|
||||
{ "sentAt": { "$lt": "2024-01-01T00:00:00.000Z" } }
|
||||
```
|
||||
|
||||
#### `$gt` / `$gte` (Greater Than / Greater Than or Equal)
|
||||
|
||||
Matches documents where the field value is greater than (`$gt`) or greater than or equal to (`$gte`) the specified value.
|
||||
|
||||
```json
|
||||
{ "sentAt": { "$gt": "2024-01-01T00:00:00.000Z" } }
|
||||
```
|
||||
|
||||
#### `$exists`
|
||||
|
||||
Matches documents that have (or do not have) the specified field.
|
||||
|
||||
```json
|
||||
// Grant access only if a 'lastSyncStatusMessage' exists
|
||||
{ "lastSyncStatusMessage": { "$exists": true } }
|
||||
```
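
As a rough illustration of how these operators behave, the following TypeScript sketch evaluates a conditions object against a plain in-memory record. It is not how Open Archiver applies them (as noted above, rules are translated into PostgreSQL and Meilisearch queries). The function names `matches`, `matchesField`, and `compare` are hypothetical, and comparing ISO 8601 date strings lexicographically assumes they all share the same format and time zone.

```typescript
type Conditions = Record<string, unknown>;

// An operator object is a non-empty plain object whose keys all start with "$".
function isOperatorObject(value: unknown): value is Record<string, unknown> {
    if (typeof value !== "object" || value === null || Array.isArray(value)) return false;
    const keys = Object.keys(value);
    return keys.length > 0 && keys.every((k) => k.startsWith("$"));
}

// Returns a negative, zero, or positive number; NaN when the values are not comparable.
function compare(a: unknown, b: unknown): number {
    if (typeof a === "number" && typeof b === "number") return a - b;
    if (typeof a === "string" && typeof b === "string") return a < b ? -1 : a > b ? 1 : 0;
    return NaN;
}

function matchesField(actual: unknown, criterion: unknown): boolean {
    // A bare value is shorthand for $eq.
    if (!isOperatorObject(criterion)) return actual === criterion;
    return Object.entries(criterion).every(([op, expected]) => {
        switch (op) {
            case "$eq": return actual === expected;
            case "$ne": return actual !== expected;
            case "$in": return Array.isArray(expected) && expected.includes(actual);
            case "$nin": return Array.isArray(expected) && !expected.includes(actual);
            case "$lt": return compare(actual, expected) < 0;
            case "$lte": return compare(actual, expected) <= 0;
            case "$gt": return compare(actual, expected) > 0;
            case "$gte": return compare(actual, expected) >= 0;
            case "$exists": return (actual !== undefined) === expected;
            default: return false; // unknown operators never match in this sketch
        }
    });
}

// All conditions in a rule are joined with AND.
function matches(resource: Record<string, unknown>, conditions: Conditions): boolean {
    return Object.entries(conditions).every(([field, criterion]) =>
        matchesField(resource[field], criterion),
    );
}
```

For example, `matches({ provider: "imap" }, { provider: { $ne: "pst_import" } })` evaluates to `true`, while the same conditions against `{ provider: "pst_import" }` evaluate to `false`.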
|
||||
|
||||
## Inverted Rules: Creating Exceptions with `cannot`
|
||||
|
||||
By default, all rules are "can" rules, meaning they grant permissions. However, you can create a "cannot" rule by adding `"inverted": true` to a policy object. This is extremely useful for creating exceptions to broader permissions.
|
||||
|
||||
A common pattern is to grant broad access and then use an inverted rule to carve out a specific restriction.
|
||||
|
||||
**Use Case**: Grant a user access to all ingestion sources _except_ for one specific source.
|
||||
|
||||
This is achieved with two rules:
|
||||
|
||||
1. A "can" rule that grants `read` access to the `ingestion` subject.
|
||||
2. An inverted "cannot" rule that denies `read` access for the specific ingestion `id`.
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"action": "read",
|
||||
"subject": "ingestion"
|
||||
},
|
||||
{
|
||||
"inverted": true,
|
||||
"action": "read",
|
||||
"subject": "ingestion",
|
||||
"conditions": {
|
||||
"id": "SPECIFIC_INGESTION_ID_TO_EXCLUDE"
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
## Policy Evaluation Logic
|
||||
|
||||
The system evaluates policies by combining all relevant rules for a user. The logic is simple:
|
||||
|
||||
- A user has permission if at least one `can` rule allows it.
|
||||
- A permission is denied if a `cannot` (`"inverted": true`) rule explicitly forbids it, even if a `can` rule allows it. `cannot` rules always take precedence.
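
The two bullets above can be sketched in TypeScript as follows. This is a minimal illustration rather than the system's actual evaluator: the names `Rule`, `applies`, and `isAllowed` are hypothetical, conditions are reduced to plain equality checks, and the handling of the `manage` and `all` wildcards is inferred from their descriptions in this document.

```typescript
interface Rule {
    action: string | string[];
    subject: string | string[];
    conditions?: Record<string, unknown>;
    inverted?: boolean;
}

function asList(value: string | string[]): string[] {
    return Array.isArray(value) ? value : [value];
}

// A rule applies when its action, subject, and (equality-only) conditions all match.
function applies(rule: Rule, action: string, subject: string, resource: Record<string, unknown>): boolean {
    const actions = asList(rule.action);
    const subjects = asList(rule.subject);
    const actionOk = actions.includes(action) || actions.includes("manage");
    const subjectOk = subjects.includes(subject) || subjects.includes("all");
    const conditionsOk = Object.entries(rule.conditions ?? {}).every(
        ([field, expected]) => resource[field] === expected,
    );
    return actionOk && subjectOk && conditionsOk;
}

// Allowed only if at least one "can" rule applies and no "cannot" rule applies.
function isAllowed(
    rules: Rule[],
    action: string,
    subject: string,
    resource: Record<string, unknown> = {},
): boolean {
    const relevant = rules.filter((rule) => applies(rule, action, subject, resource));
    const denied = relevant.some((rule) => rule.inverted === true);
    const granted = relevant.some((rule) => rule.inverted !== true);
    return granted && !denied;
}

// The "all ingestion sources except one" policy from the previous section.
const policy: Rule[] = [
    { action: "read", subject: "ingestion" },
    {
        inverted: true,
        action: "read",
        subject: "ingestion",
        conditions: { id: "SPECIFIC_INGESTION_ID_TO_EXCLUDE" },
    },
];
```

With this policy, `isAllowed(policy, "read", "ingestion", { id: "OTHER_ID" })` is `true`, while the same call with the excluded ID is `false` because the inverted rule takes precedence.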
|
||||
|
||||
### Dynamic Policies with Placeholders
|
||||
|
||||
To create dynamic policies that are specific to the current user, you can use the `${user.id}` placeholder in the `conditions` object. This placeholder will be replaced with the ID of the current user at runtime.
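
One way to picture the substitution is a recursive walk over the policy that replaces the placeholder in every string value. The sketch below is an assumption about the mechanism, not the project's implementation; `resolvePlaceholders` and the `Json` type are hypothetical names.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Replace every occurrence of the literal text "${user.id}" with the given user ID.
function resolvePlaceholders(value: Json, userId: string): Json {
    if (typeof value === "string") {
        return value.split("${user.id}").join(userId);
    }
    if (Array.isArray(value)) {
        return value.map((item) => resolvePlaceholders(item, userId));
    }
    if (value !== null && typeof value === "object") {
        const resolved: { [key: string]: Json } = {};
        for (const [key, inner] of Object.entries(value)) {
            resolved[key] = resolvePlaceholders(inner, userId);
        }
        return resolved;
    }
    return value; // numbers, booleans, and null pass through unchanged
}

const rule: Json = {
    action: "manage",
    subject: "ingestion",
    conditions: { userId: "${user.id}" },
};
const resolvedRule = resolvePlaceholders(rule, "u-42");
```

Here `resolvedRule` carries `"userId": "u-42"` in its `conditions`, while `rule` itself is left untouched.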
|
||||
|
||||
## Special Permissions for User and Role Management
|
||||
|
||||
While `read` access to `users` and `roles` can be granted granularly, any action that modifies these resources (`create`, `update`, `delete`) is restricted to Super Admins.
|
||||
|
||||
A user must have the `{ "action": "manage", "subject": "all" }` permission (typically granted through a Super Admin role) to manage users and roles. This is a security measure to prevent unauthorized changes to user accounts and permissions.
|
||||
|
||||
## Policy Examples
|
||||
|
||||
Here are several examples based on the default roles in the system, demonstrating how to combine actions, subjects, and conditions to achieve specific access control scenarios.
|
||||
|
||||
### Administrator
|
||||
|
||||
This policy grants a user full access to all resources using wildcards.
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"action": "manage",
|
||||
"subject": "all"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### End-User
|
||||
|
||||
This policy allows a user to view the dashboard, create new ingestion sources, and fully manage the ingestion sources they own.
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"action": "read",
|
||||
"subject": "dashboard"
|
||||
},
|
||||
{
|
||||
"action": "create",
|
||||
"subject": "ingestion"
|
||||
},
|
||||
{
|
||||
"action": "manage",
|
||||
"subject": "ingestion",
|
||||
"conditions": {
|
||||
"userId": "${user.id}"
|
||||
}
|
||||
},
|
||||
{
|
||||
"action": "manage",
|
||||
"subject": "archive",
|
||||
"conditions": {
|
||||
"ingestionSource.userId": "${user.id}" // also needs to give permission to archived emails created by the user
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Global Read-Only Auditor
|
||||
|
||||
This policy grants read and search access across most of the application's resources, making it suitable for an auditor who needs to view data without modifying it.
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"action": ["read", "search"],
|
||||
"subject": ["ingestion", "archive", "dashboard", "users", "roles"]
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Ingestion Admin
|
||||
|
||||
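This policy grants full control over all ingestion sources, but no other resources. As the End-User example shows, access to archived emails is granted through a separate rule on the `archive` subject.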
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"action": "manage",
|
||||
"subject": "ingestion"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Auditor for Specific Ingestion Sources
|
||||
|
||||
This policy demonstrates how to grant access to a specific list of ingestion sources using the `$in` operator.
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"action": ["read", "search"],
|
||||
"subject": "ingestion",
|
||||
"conditions": {
|
||||
"id": {
|
||||
"$in": ["INGESTION_ID_1", "INGESTION_ID_2"]
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Limit Access to a Specific Mailbox
|
||||
|
||||
This policy grants a user access to a specific ingestion source, but only allows them to see emails belonging to a single user within that source.
|
||||
|
||||
This is achieved with a single `can` rule: it grants `read` and `search` access to the `archive` subject, but only for emails whose `userEmail` matches the specified address.
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"action": ["read", "search"],
|
||||
"subject": "archive",
|
||||
"conditions": {
|
||||
"userEmail": "user1@example.com"
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
96
docs/services/ocr-service.md
Normal file
@@ -0,0 +1,96 @@
|
||||
# OCR Service
|
||||
|
||||
The OCR (Optical Character Recognition) and text extraction service is responsible for extracting plain text content from various file formats, such as PDFs, Office documents, and more. This is a crucial component for making email attachments searchable.
|
||||
|
||||
## Overview
|
||||
|
||||
The system employs a two-pronged approach for text extraction:
|
||||
|
||||
1. **Primary Extractor (Apache Tika)**: A powerful and versatile toolkit that can extract text from a wide variety of file formats. It is the recommended method for its superior performance and format support.
|
||||
2. **Legacy Extractor**: A fallback mechanism that uses a combination of libraries (`pdf2json`, `mammoth`, `xlsx`) for common file types like PDF, DOCX, and XLSX. This is used when Apache Tika is not configured.
|
||||
|
||||
The main logic resides in `packages/backend/src/helpers/textExtractor.ts`, which decides which extraction method to use based on the application's configuration.
|
||||
|
||||
## Configuration
|
||||
|
||||
To enable the primary text extraction method, you must configure the URL of an Apache Tika server instance in your environment variables.
|
||||
|
||||
In your `.env` file, set the `TIKA_URL`:
|
||||
|
||||
```env
|
||||
# .env.example
|
||||
|
||||
# Apache Tika Integration
|
||||
# ONLY active if TIKA_URL is set
|
||||
TIKA_URL=http://tika:9998
|
||||
```
|
||||
|
||||
If `TIKA_URL` is not set, the system will automatically fall back to the legacy extraction methods. The service performs a health check on startup to verify connectivity with the Tika server.
|
||||
|
||||
## File Size Limits
|
||||
|
||||
To prevent excessive memory usage and processing time, the service imposes a general size limit on files submitted for text extraction. Files larger than the configured limit will be skipped.
|
||||
|
||||
- **With Apache Tika**: The maximum file size is **100MB**.
|
||||
- **With Legacy Fallback**: The maximum file size is **50MB**.
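
The configuration rule and the two limits can be summarized in a short TypeScript sketch. The names `selectExtractor` and `shouldSkip` are hypothetical, treating an empty `TIKA_URL` as unset is an assumption, and so is reading 1 MB as 1024 × 1024 bytes.

```typescript
// Limits from the list above; 1 MB = 1024 * 1024 bytes is an assumption.
const MB = 1024 * 1024;
const MAX_BYTES = { tika: 100 * MB, legacy: 50 * MB } as const;

type Extractor = keyof typeof MAX_BYTES;

// `env` is passed in (for example process.env) so the function is easy to test.
function selectExtractor(env: Record<string, string | undefined>): Extractor {
    const url = env["TIKA_URL"];
    // Assumption: an empty or whitespace-only value counts as "not set".
    return url !== undefined && url.trim() !== "" ? "tika" : "legacy";
}

// Files above the limit of the active extractor are skipped.
function shouldSkip(env: Record<string, string | undefined>, sizeBytes: number): boolean {
    return sizeBytes > MAX_BYTES[selectExtractor(env)];
}
```

Under these assumptions a 60 MB attachment is extracted when `TIKA_URL` is set and skipped when it is not.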
|
||||
|
||||
## Supported File Formats
|
||||
|
||||
The service's ability to extract text depends on whether it's using Apache Tika or the legacy fallback methods.
|
||||
|
||||
### With Apache Tika
|
||||
|
||||
When `TIKA_URL` is configured, the service can process a vast range of file formats. Apache Tika is designed for broad compatibility and supports hundreds of file types, including but not limited to:
|
||||
|
||||
- Portable Document Format (PDF)
|
||||
- Microsoft Office formats (DOC, DOCX, PPT, PPTX, XLS, XLSX)
|
||||
- OpenDocument Formats (ODT, ODS, ODP)
|
||||
- Rich Text Format (RTF)
|
||||
- Plain Text (TXT, CSV, JSON, XML, HTML)
|
||||
- Image formats with OCR capabilities (PNG, JPEG, TIFF)
|
||||
- Archive formats (ZIP, TAR, GZ)
|
||||
- Email formats (EML, MSG)
|
||||
|
||||
For a complete and up-to-date list, please refer to the official [Apache Tika documentation](https://tika.apache.org/3.2.3/formats.html).
|
||||
|
||||
### With Legacy Fallback
|
||||
|
||||
When Tika is not configured, text extraction is limited to the following formats:
|
||||
|
||||
- `application/pdf` (PDF)
|
||||
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX)
|
||||
- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX)
|
||||
- Plain text formats such as `text/*`, `application/json`, and `application/xml`.
|
||||
|
||||
## Features of the Tika Integration (`OcrService`)
|
||||
|
||||
The `OcrService` (`packages/backend/src/services/OcrService.ts`) provides several enhancements to make text extraction efficient and robust.
|
||||
|
||||
### Caching
|
||||
|
||||
To avoid redundant processing of the same file, the service implements a simple LRU (Least Recently Used) cache.
|
||||
|
||||
- **Cache Key**: A SHA-256 hash of the file's buffer is used as the cache key.
|
||||
- **Functionality**: If a file with the same hash is processed again, the text content is served directly from the cache, saving significant processing time.
|
||||
- **Statistics**: The service keeps track of cache hits, misses, and the hit rate for performance monitoring.
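
A minimal sketch of such a cache is shown below, assuming a Node.js runtime. The class name `TextCache`, its method names, and the use of a `Map` to track recency are illustrative choices, not the contents of `OcrService.ts`.

```typescript
import { createHash } from "node:crypto";

class TextCache {
    private readonly entries = new Map<string, string>();
    private readonly maxEntries: number;
    private hits = 0;
    private misses = 0;

    constructor(maxEntries: number) {
        this.maxEntries = maxEntries;
    }

    // Cache key: hex-encoded SHA-256 of the file contents.
    static keyFor(file: Uint8Array): string {
        return createHash("sha256").update(file).digest("hex");
    }

    get(key: string): string | undefined {
        const text = this.entries.get(key);
        if (text === undefined) {
            this.misses++;
            return undefined;
        }
        // Re-insert so the entry becomes the most recently used (Map keeps insertion order).
        this.entries.delete(key);
        this.entries.set(key, text);
        this.hits++;
        return text;
    }

    set(key: string, text: string): void {
        this.entries.delete(key);
        this.entries.set(key, text);
        if (this.entries.size > this.maxEntries) {
            // The first key in iteration order is the least recently used one.
            const oldest = this.entries.keys().next().value;
            if (oldest !== undefined) this.entries.delete(oldest);
        }
    }

    stats(): { hits: number; misses: number; hitRate: number } {
        const total = this.hits + this.misses;
        return { hits: this.hits, misses: this.misses, hitRate: total === 0 ? 0 : this.hits / total };
    }
}
```

Hashing the buffer rather than the file name means two attachments with identical bytes share one cache entry, regardless of what they are called.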
|
||||
|
||||
### Concurrency Management (Semaphore)
|
||||
|
||||
Extracting text from large files can be resource-intensive. To prevent the Tika server from being overwhelmed by multiple requests for the _same file_ simultaneously (e.g., during a large import), a semaphore mechanism is used.
|
||||
|
||||
- **Functionality**: If a request for a specific file (identified by its hash) is already in progress, any subsequent requests for the same file will wait for the first one to complete and then use its result.
|
||||
- **Benefit**: This deduplicates parallel processing efforts and reduces unnecessary load on the Tika server.
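
The behavior described above can be approximated with a map of in-flight promises keyed by file hash. This is one common way to get the effect and is offered as a sketch only; the document calls the mechanism a semaphore, and the real `OcrService` may be built differently. The names `inFlight` and `extractOnce` are hypothetical.

```typescript
// Pending extractions, keyed by the file's hash.
const inFlight = new Map<string, Promise<string>>();

// Runs `extract` at most once per hash at a time; concurrent callers share the same promise.
function extractOnce(hash: string, extract: () => Promise<string>): Promise<string> {
    const pending = inFlight.get(hash);
    if (pending !== undefined) return pending;

    const started = extract().finally(() => {
        // Forget the entry once the extraction has settled, whether it succeeded or failed.
        inFlight.delete(hash);
    });
    inFlight.set(hash, started);
    return started;
}

// Usage: both calls below share one extraction.
let extractionCalls = 0;
const fakeExtract = async (): Promise<string> => {
    extractionCalls++;
    return "extracted text";
};
const first = extractOnce("hash-1", fakeExtract);
const second = extractOnce("hash-1", fakeExtract);
```

In the usage lines, `fakeExtract` runs once and `first` and `second` are the same promise, so the second caller simply waits for the first result.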
|
||||
|
||||
### Health Check and DNS Fallback
|
||||
|
||||
- **Availability Check**: The service includes a `checkTikaAvailability` method to verify that the Tika server is reachable and operational. This check is performed on application startup.
|
||||
- **DNS Fallback**: For convenience in Docker environments, if the Tika URL uses the hostname `tika` (e.g., `http://tika:9998`), the service will automatically attempt a fallback to `localhost` if the initial connection fails.
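
The hostname rewrite behind the DNS fallback could look roughly like this. The function name `fallbackUrl` is hypothetical, and the sketch covers only the URL rewrite, not the connection attempt or the health check itself.

```typescript
// Returns the localhost variant of a Tika URL, or null when no fallback applies.
function fallbackUrl(tikaUrl: string): string | null {
    const url = new URL(tikaUrl);
    if (url.hostname !== "tika") return null;
    url.hostname = "localhost"; // scheme, port, and path are preserved
    return url.toString();
}

const dockerFallback = fallbackUrl("http://tika:9998");
const noFallback = fallbackUrl("http://tika.internal.example:9998");
```

A caller would try the configured URL first and only then the fallback, which helps when the backend runs on the host while Tika runs in a container with a published port.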
|
||||
|
||||
## Legacy Fallback Methods
|
||||
|
||||
When Tika is not available, the `extractTextLegacy` function in `textExtractor.ts` handles extraction for a limited set of MIME types:
|
||||
|
||||
- `application/pdf`: Processed using `pdf2json`. Includes a 50MB size limit and a 5-second timeout to prevent memory issues.
|
||||
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX): Processed using `mammoth`.
|
||||
- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX): Processed using `xlsx`.
|
||||
- Plain text formats (`text/*`, `application/json`, `application/xml`): Converted directly from the buffer.
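
The dispatch described in this list can be sketched as a lookup from MIME type to handler. The function name `legacyHandlerFor` and the `"plain-text"` label are hypothetical, and stripping parameters such as `; charset=utf-8` before matching is an assumption. The sketch names the libraries but does not call them.

```typescript
type LegacyHandler = "pdf2json" | "mammoth" | "xlsx" | "plain-text";

// Maps a MIME type to the legacy handler listed above, or null if unsupported.
function legacyHandlerFor(mimeType: string): LegacyHandler | null {
    // Assumption: parameters after ";" are ignored and matching is case-insensitive.
    const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
    switch (base) {
        case "application/pdf":
            return "pdf2json";
        case "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            return "mammoth";
        case "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
            return "xlsx";
        case "application/json":
        case "application/xml":
            return "plain-text";
        default:
            return base.startsWith("text/") ? "plain-text" : null;
    }
}
```

Anything that maps to `null`, such as `image/png`, yields no extracted text without Tika.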
|
||||
@@ -76,18 +76,19 @@ Here is a complete list of environment variables available for configuration:
|
||||
|
||||
These variables are used by `docker-compose.yml` to configure the services.
|
||||
|
||||
| Variable            | Description                                     | Default Value                                            |
| ------------------- | ----------------------------------------------- | -------------------------------------------------------- |
| `POSTGRES_DB`       | The name of the PostgreSQL database.            | `open_archive`                                           |
| `POSTGRES_USER`     | The username for the PostgreSQL database.       | `admin`                                                  |
| `POSTGRES_PASSWORD` | The password for the PostgreSQL database.       | `password`                                               |
| `DATABASE_URL`      | The connection URL for the PostgreSQL database. | `postgresql://admin:password@postgres:5432/open_archive` |
| `MEILI_MASTER_KEY`  | The master key for Meilisearch.                 | `aSampleMasterKey`                                       |
| `MEILI_HOST`        | The host for the Meilisearch service.           | `http://meilisearch:7700`                                |
| `REDIS_HOST`        | The host for the Valkey (Redis) service.        | `valkey`                                                 |
| `REDIS_PORT`        | The port for the Valkey (Redis) service.        | `6379`                                                   |
| `REDIS_PASSWORD`    | The password for the Valkey (Redis) service.    | `defaultredispassword`                                   |
| `REDIS_TLS_ENABLED` | Enable or disable TLS for Redis.                | `false`                                                  |
||||
| Variable               | Description                                          | Default Value                                            |
| ---------------------- | ---------------------------------------------------- | -------------------------------------------------------- |
| `POSTGRES_DB`          | The name of the PostgreSQL database.                 | `open_archive`                                           |
| `POSTGRES_USER`        | The username for the PostgreSQL database.            | `admin`                                                  |
| `POSTGRES_PASSWORD`    | The password for the PostgreSQL database.            | `password`                                               |
| `DATABASE_URL`         | The connection URL for the PostgreSQL database.      | `postgresql://admin:password@postgres:5432/open_archive` |
| `MEILI_MASTER_KEY`     | The master key for Meilisearch.                      | `aSampleMasterKey`                                       |
| `MEILI_HOST`           | The host for the Meilisearch service.                | `http://meilisearch:7700`                                |
| `MEILI_INDEXING_BATCH` | The number of emails to batch together for indexing. | `500`                                                    |
| `REDIS_HOST`           | The host for the Valkey (Redis) service.             | `valkey`                                                 |
| `REDIS_PORT`           | The port for the Valkey (Redis) service.             | `6379`                                                   |
| `REDIS_PASSWORD`       | The password for the Valkey (Redis) service.         | `defaultredispassword`                                   |
| `REDIS_TLS_ENABLED`    | Enable or disable TLS for Redis.                     | `false`                                                  |
|
||||
|
||||
#### Storage Settings
|
||||
|
||||
@@ -114,6 +115,12 @@ These variables are used by `docker-compose.yml` to configure the services.
|
||||
| `RATE_LIMIT_MAX_REQUESTS` | The maximum number of API requests allowed from an IP within the window. | `100` |
|
||||
| `ENCRYPTION_KEY` | A 32-byte hex string for encrypting sensitive data in the database. | |
|
||||
|
||||
#### Apache Tika Integration
|
||||
|
||||
| Variable   | Description | Default Value      |
| ---------- | ----------- | ------------------ |
| `TIKA_URL` | Optional. The URL of an Apache Tika server for advanced text extraction from attachments. If not set, the application falls back to built-in parsers for PDF, Word, and Excel files. | `http://tika:9998` |
|
||||
|
||||
## 3. Run the Application
|
||||
|
||||
Once you have configured your `.env` file, you can start all the services using Docker Compose:
|
||||
|
||||