mirror of
https://github.com/LogicLabs-OU/OpenArchiver.git
synced 2026-04-06 00:31:57 +02:00
97 lines
5.2 KiB
Markdown
97 lines
5.2 KiB
Markdown
# OCR Service
|
|
|
|
The OCR (Optical Character Recognition) and text extraction service is responsible for extracting plain text content from various file formats, such as PDFs, Office documents, and more. This is a crucial component for making email attachments searchable.
|
|
|
|
## Overview
|
|
|
|
The system employs a two-pronged approach for text extraction:
|
|
|
|
1. **Primary Extractor (Apache Tika)**: A powerful and versatile toolkit that can extract text from a wide variety of file formats. It is the recommended method for its superior performance and format support.
|
|
2. **Legacy Extractor**: A fallback mechanism that uses a combination of libraries (`pdf2json`, `mammoth`, `xlsx`) for common file types like PDF, DOCX, and XLSX. This is used when Apache Tika is not configured.
|
|
|
|
The main logic resides in `packages/backend/src/helpers/textExtractor.ts`, which decides which extraction method to use based on the application's configuration.
|
|
|
|
## Configuration
|
|
|
|
To enable the primary text extraction method, you must configure the URL of an Apache Tika server instance in your environment variables.
|
|
|
|
In your `.env` file, set the `TIKA_URL`:
|
|
|
|
```env
|
|
# .env.example
|
|
|
|
# Apache Tika Integration
|
|
# ONLY active if TIKA_URL is set
|
|
TIKA_URL=http://tika:9998
|
|
```
|
|
|
|
If `TIKA_URL` is not set, the system will automatically fall back to the legacy extraction methods. The service performs a health check on startup to verify connectivity with the Tika server.
|
|
|
|
## File Size Limits
|
|
|
|
To prevent excessive memory usage and processing time, the service imposes a general size limit on files submitted for text extraction. Files larger than the configured limit will be skipped.
|
|
|
|
- **With Apache Tika**: The maximum file size is **100MB**.
|
|
- **With Legacy Fallback**: The maximum file size is **50MB**.
|
|
|
|
## Supported File Formats
|
|
|
|
The service's ability to extract text depends on whether it's using Apache Tika or the legacy fallback methods.
|
|
|
|
### With Apache Tika
|
|
|
|
When `TIKA_URL` is configured, the service can process a vast range of file formats. Apache Tika is designed for broad compatibility and supports hundreds of file types, including but not limited to:
|
|
|
|
- Portable Document Format (PDF)
|
|
- Microsoft Office formats (DOC, DOCX, PPT, PPTX, XLS, XLSX)
|
|
- OpenDocument Formats (ODT, ODS, ODP)
|
|
- Rich Text Format (RTF)
|
|
- Plain Text (TXT, CSV, JSON, XML, HTML)
|
|
- Image formats with OCR capabilities (PNG, JPEG, TIFF)
|
|
- Archive formats (ZIP, TAR, GZ)
|
|
- Email formats (EML, MSG)
|
|
|
|
For a complete and up-to-date list, please refer to the official [Apache Tika documentation](https://tika.apache.org/3.2.3/formats.html).
|
|
|
|
### With Legacy Fallback
|
|
|
|
When Tika is not configured, text extraction is limited to the following formats:
|
|
|
|
- `application/pdf` (PDF)
|
|
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX)
|
|
- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX)
|
|
- Plain text formats such as `text/*`, `application/json`, and `application/xml`.
|
|
|
|
## Features of the Tika Integration (`OcrService`)
|
|
|
|
The `OcrService` (`packages/backend/src/services/OcrService.ts`) provides several enhancements to make text extraction efficient and robust.
|
|
|
|
### Caching
|
|
|
|
To avoid redundant processing of the same file, the service implements a simple LRU (Least Recently Used) cache.
|
|
|
|
- **Cache Key**: A SHA-256 hash of the file's buffer is used as the cache key.
|
|
- **Functionality**: If a file with the same hash is processed again, the text content is served directly from the cache, saving significant processing time.
|
|
- **Statistics**: The service keeps track of cache hits, misses, and the hit rate for performance monitoring.
|
|
|
|
### Concurrency Management (Semaphore)
|
|
|
|
Extracting text from large files can be resource-intensive. To prevent the Tika server from being overwhelmed by multiple requests for the _same file_ simultaneously (e.g., during a large import), a semaphore mechanism is used.
|
|
|
|
- **Functionality**: If a request for a specific file (identified by its hash) is already in progress, any subsequent requests for the same file will wait for the first one to complete and then use its result.
|
|
- **Benefit**: This deduplicates parallel processing efforts and reduces unnecessary load on the Tika server.
|
|
|
|
### Health Check and DNS Fallback
|
|
|
|
- **Availability Check**: The service includes a `checkTikaAvailability` method to verify that the Tika server is reachable and operational. This check is performed on application startup.
|
|
- **DNS Fallback**: For convenience in Docker environments, if the Tika URL uses the hostname `tika` (e.g., `http://tika:9998`), the service will automatically attempt a fallback to `localhost` if the initial connection fails.
|
|
|
|
## Legacy Fallback Methods
|
|
|
|
When Tika is not available, the `extractTextLegacy` function in `textExtractor.ts` handles extraction for a limited set of MIME types:
|
|
|
|
- `application/pdf`: Processed using `pdf2json`. Includes a 50MB size limit and a 5-second timeout to prevent memory issues.
|
|
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX): Processed using `mammoth`.
|
|
- `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` (XLSX): Processed using `xlsx`.
|
|
- Plain text formats (`text/*`, `application/json`, `application/xml`): Converted directly from the buffer.
|