feat(ingestion): add local file path support and optimize EML processing

- Frontend: Updated IngestionSourceForm to allow toggling between "Upload File" and "Local File Path" for PST, EML, and Mbox providers.
- Frontend: Added logic to clear irrelevant form data when switching import methods.
- Frontend: Added English translations for new form fields.
- Backend: Refactored EMLConnector to read ZIP entries one at a time with yauzl instead of extracting the full archive to disk first, which avoids writing a second copy of every file and walking the extracted directory tree for large archives.
- Docs: Updated API documentation and User Guides (PST, EML, Mbox) to clarify "Local File Path" usage, specifically within Docker environments.
wayneshn
2026-02-23 20:35:51 +01:00
parent 7551d4d7c7
commit 5409dc771e
8 changed files with 304 additions and 122 deletions

@@ -53,6 +53,11 @@ interface CreateIngestionSourceDto {
**Note:** When using `localFilePath`, the file will not be deleted after import. When using `uploadedFilePath` (via the upload API), the file will be automatically deleted after import. The same applies to `pst_import` and `eml_import` providers.
**Important regarding `localFilePath`:** When running OpenArchiver in a Docker container (which is the standard deployment), `localFilePath` refers to the path **inside the Docker container**, not on the host machine.
To use a local file:
1. **Recommended:** Place your file inside the directory defined by `STORAGE_LOCAL_ROOT_PATH` (e.g., inside a `temp` folder). Since this directory is already mounted as a volume, the file will be accessible at the same path inside the container.
2. **Alternative:** Mount a specific directory containing your files as a volume in `docker-compose.yml`. For example, add `- /path/to/my/files:/imports` to the `volumes` section and use `/imports/myfile.pst` as the `localFilePath`.
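As a rough illustration of the two options above, the sketch below builds a request body that references a file by its path inside the container. The `LocalImportPayload` type, the `buildLocalImportPayload` helper, the `name`/`provider`/`providerConfig` layout, and the absolute-path check are all assumptions made for this example; the `CreateIngestionSourceDto` definition remains the authoritative shape.

```typescript
// Hypothetical payload shape for illustration; not the project's actual DTO.
type FileImportProvider = 'pst_import' | 'eml_import' | 'mbox_import';

interface LocalImportPayload {
    name: string;
    provider: FileImportProvider;
    providerConfig: { localFilePath: string };
}

// Builds a payload that references a file by its path inside the container.
function buildLocalImportPayload(
    name: string,
    provider: FileImportProvider,
    containerPath: string
): LocalImportPayload {
    // Assumption of this sketch: the container path is given as an absolute path.
    if (!containerPath.startsWith('/')) {
        throw new Error(`Expected an absolute container path, got "${containerPath}"`);
    }
    return { name, provider, providerConfig: { localFilePath: containerPath } };
}

// Matches the "Alternative" option: /path/to/my/files is mounted at /imports.
const payload = buildLocalImportPayload('Old mailbox', 'pst_import', '/imports/myfile.pst');
console.log(JSON.stringify(payload));
```

Because the file is referenced through `localFilePath` rather than `uploadedFilePath`, it would be left in place after the import, as the note above describes.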
#### Responses
- **201 Created:** The newly created ingestion source.

@@ -30,7 +30,14 @@ archive.zip
2. Click the **Create New** button.
3. Select **EML Import** as the provider.
4. Enter a name for the ingestion source.
5. Click the **Choose File** button and select the zip archive containing your EML files.
5. **Choose Import Method:**
* **Upload File:** Click **Choose File** and select the zip archive containing your EML files. (Best for smaller archives)
* **Local Path:** Enter the path to the zip file **inside the container**. (Best for large archives)
> **Note on Local Path:** When using Docker, the "Local Path" refers to a path on the container's filesystem, not the host's.
> * **Recommended:** Place your zip file in a `temp` folder inside your configured storage directory (`STORAGE_LOCAL_ROOT_PATH`), which is already mounted into the container. For example, if your storage path is `/data`, save the file as `/data/temp/emails.zip` and enter `/data/temp/emails.zip` as the path.
> * **Alternative:** Mount a separate volume in `docker-compose.yml` (e.g., `- /host/path:/container/path`) and use the container path.
6. Click the **Submit** button.
OpenArchiver will then start importing the EML files from the zip archive. The ingestion process may take some time, depending on the size of the archive.
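The host-to-container translation described in step 5 can be sketched as a small helper. This is only an illustration: `BindMount`, `parseBindMount`, `toContainerPath`, and the `/srv/mail-exports` host directory are hypothetical names, and the parsing assumes Linux-style paths in a `- /host/path:/container/path` volume entry.

```typescript
// A bind mount as written in docker-compose.yml: "- /host/path:/container/path".
interface BindMount {
    hostPath: string;
    containerPath: string;
}

// Parses a "host:container" volume entry (any third ":ro"-style option is ignored).
function parseBindMount(spec: string): BindMount {
    const [hostPath, containerPath] = spec.split(':');
    if (!hostPath || !containerPath) {
        throw new Error(`Not a host:container volume entry: "${spec}"`);
    }
    return { hostPath, containerPath };
}

// Returns the container-side path for a host file, or null if no mount covers it.
function toContainerPath(hostFile: string, mounts: BindMount[]): string | null {
    for (const { hostPath, containerPath } of mounts) {
        const prefix = hostPath.endsWith('/') ? hostPath : hostPath + '/';
        if (hostFile.startsWith(prefix)) {
            const base = containerPath.endsWith('/') ? containerPath : containerPath + '/';
            return base + hostFile.slice(prefix.length);
        }
    }
    return null;
}

const mounts = [parseBindMount('/srv/mail-exports:/imports')];
console.log(toContainerPath('/srv/mail-exports/emails.zip', mounts)); // "/imports/emails.zip"
console.log(toContainerPath('/home/me/emails.zip', mounts)); // null
```

A `null` result corresponds to the case where the file is not visible to the container at all, so entering its host path as the "Local Path" would not work.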

@@ -17,7 +17,13 @@ Once you have your `.mbox` file, you can upload it to OpenArchiver through the w
1. Navigate to the **Ingestion** page.
2. Click on the **New Ingestion** button.
3. Select **Mbox** as the source type.
4. Upload your `.mbox` file.
4. **Choose Import Method:**
* **Upload File:** Upload your `.mbox` file.
* **Local Path:** Enter the path to the mbox file **inside the container**.
> **Note on Local Path:** When using Docker, the "Local Path" refers to a path on the container's filesystem, not the host's.
> * **Recommended:** Place your mbox file in a `temp` folder inside your configured storage directory (`STORAGE_LOCAL_ROOT_PATH`), which is already mounted into the container. For example, if your storage path is `/data`, save the file as `/data/temp/emails.mbox` and enter `/data/temp/emails.mbox` as the path.
> * **Alternative:** Mount a separate volume in `docker-compose.yml` (e.g., `- /host/path:/container/path`) and use the container path.
## 3. Folder Structure

@@ -15,7 +15,14 @@ To ensure a successful import, you should prepare your PST file according to the
2. Click the **Create New** button.
3. Select **PST Import** as the provider.
4. Enter a name for the ingestion source.
5. Click the **Choose File** button and select the PST file.
5. **Choose Import Method:**
* **Upload File:** Click **Choose File** and select the PST file from your computer. (Best for smaller files)
* **Local Path:** Enter the path to the PST file **inside the container**. (Best for large files)
> **Note on Local Path:** When using Docker, the "Local Path" refers to a path on the container's filesystem, not the host's.
> * **Recommended:** Place your file in a `temp` folder inside your configured storage directory (`STORAGE_LOCAL_ROOT_PATH`), which is already mounted into the container. For example, if your storage path is `/data`, save the file as `/data/temp/archive.pst` and enter `/data/temp/archive.pst` as the path.
> * **Alternative:** Mount a separate volume in `docker-compose.yml` (e.g., `- /host/path:/container/path`) and use the container path.
6. Click the **Submit** button.
OpenArchiver will then start importing the emails from the PST file. The ingestion process may take some time, depending on the size of the file.

@@ -1,6 +1,6 @@
{
"name": "open-archiver",
"version": "0.4.1",
"version": "0.4.2",
"private": true,
"license": "SEE LICENSE IN LICENSE file",
"scripts": {

@@ -104,8 +104,6 @@ export class EMLConnector implements IEmailConnector {
): AsyncGenerator<EmailObject | null> {
const fileStream = await this.getFileStream();
const tempDir = await fs.mkdtemp(join('/tmp', `eml-import-${new Date().getTime()}`));
const unzippedPath = join(tempDir, 'unzipped');
await fs.mkdir(unzippedPath);
const zipFilePath = join(tempDir, 'eml.zip');
try {
@@ -116,34 +114,7 @@ export class EMLConnector implements IEmailConnector {
dest.on('error', reject);
});
await this.extract(zipFilePath, unzippedPath);
const files = await this.getAllFiles(unzippedPath);
for (const file of files) {
if (file.endsWith('.eml')) {
try {
// logger.info({ file }, 'Processing EML file.');
const stream = createReadStream(file);
const content = await streamToBuffer(stream);
// logger.info({ file, size: content.length }, 'Read file to buffer.');
let relativePath = file.substring(unzippedPath.length + 1);
if (dirname(relativePath) === '.') {
relativePath = '';
} else {
relativePath = dirname(relativePath);
}
const emailObject = await this.parseMessage(content, relativePath);
// logger.info({ file, messageId: emailObject.id }, 'Parsed email message.');
yield emailObject;
} catch (error) {
logger.error(
{ error, file },
'Failed to process a single EML file. Skipping.'
);
}
}
}
yield* this.processZipEntries(zipFilePath);
} catch (error) {
logger.error({ error }, 'Failed to fetch email.');
throw error;
@@ -162,55 +133,131 @@ export class EMLConnector implements IEmailConnector {
}
}
private extract(zipFilePath: string, dest: string): Promise<void> {
return new Promise((resolve, reject) => {
private async *processZipEntries(zipFilePath: string): AsyncGenerator<EmailObject | null> {
// Open the ZIP file.
// Note: yauzl requires random access, so we must use the file on disk.
const zipfile = await new Promise<yauzl.ZipFile>((resolve, reject) => {
yauzl.open(zipFilePath, { lazyEntries: true, decodeStrings: false }, (err, zipfile) => {
if (err) reject(err);
zipfile.on('error', reject);
zipfile.readEntry();
zipfile.on('entry', (entry) => {
const fileName = entry.fileName.toString('utf8');
// Ignore macOS-specific metadata files.
if (fileName.startsWith('__MACOSX/')) {
zipfile.readEntry();
return;
}
const entryPath = join(dest, fileName);
if (/\/$/.test(fileName)) {
fs.mkdir(entryPath, { recursive: true })
.then(() => zipfile.readEntry())
.catch(reject);
} else {
zipfile.openReadStream(entry, (err, readStream) => {
if (err) reject(err);
const writeStream = createWriteStream(entryPath);
readStream.pipe(writeStream);
writeStream.on('finish', () => zipfile.readEntry());
writeStream.on('error', reject);
});
}
});
zipfile.on('end', () => resolve());
if (err || !zipfile) return reject(err);
resolve(zipfile);
});
});
}
private async getAllFiles(dirPath: string, arrayOfFiles: string[] = []): Promise<string[]> {
const files = await fs.readdir(dirPath);
// Create an async iterator for zip entries
const entryIterator = this.zipEntryGenerator(zipfile);
for (const file of files) {
const fullPath = join(dirPath, file);
if ((await fs.stat(fullPath)).isDirectory()) {
await this.getAllFiles(fullPath, arrayOfFiles);
} else {
arrayOfFiles.push(fullPath);
for await (const { entry, openReadStream } of entryIterator) {
const fileName = entry.fileName.toString();
if (fileName.startsWith('__MACOSX/') || /\/$/.test(fileName)) {
continue;
}
if (fileName.endsWith('.eml')) {
try {
const readStream = await openReadStream();
const relativePath = dirname(fileName) === '.' ? '' : dirname(fileName);
const emailObject = await this.parseMessage(readStream, relativePath);
yield emailObject;
} catch (error) {
logger.error(
{ error, file: fileName },
'Failed to process a single EML file from zip. Skipping.'
);
}
}
}
return arrayOfFiles;
}
private async parseMessage(emlBuffer: Buffer, path: string): Promise<EmailObject> {
private async *zipEntryGenerator(
    zipfile: yauzl.ZipFile
): AsyncGenerator<{ entry: yauzl.Entry; openReadStream: () => Promise<Readable> }> {
    // With `lazyEntries: true`, yauzl emits one 'entry' event per readEntry() call.
    let pending: {
        resolve: (entry: yauzl.Entry | null) => void;
        reject: (reason: unknown) => void;
    } | null = null;
    let finished = false;
    let failure: unknown = null; // Error raised while no consumer was waiting
    const queue: yauzl.Entry[] = [];

    zipfile.on('entry', (entry: yauzl.Entry) => {
        if (pending) {
            const { resolve } = pending;
            pending = null;
            resolve(entry);
        } else {
            queue.push(entry);
        }
    });
    zipfile.on('end', () => {
        finished = true;
        if (pending) {
            const { resolve } = pending;
            pending = null;
            resolve(null); // Signal end
        }
    });
    zipfile.on('error', (err) => {
        finished = true;
        if (pending) {
            const { reject } = pending;
            pending = null;
            reject(err);
        } else {
            failure = err; // Rethrown by the loop below instead of being dropped
        }
    });

    zipfile.readEntry();

    while (true) {
        let entry: yauzl.Entry | null = queue.shift() ?? null;
        if (!entry) {
            if (failure) throw failure;
            if (finished) break;
            entry = await new Promise<yauzl.Entry | null>((resolve, reject) => {
                pending = { resolve, reject };
            });
            if (!entry) break; // End of zip
        }
        const current: yauzl.Entry = entry;
        yield {
            entry: current,
            openReadStream: () =>
                new Promise<Readable>((resolve, reject) => {
                    zipfile.openReadStream(current, (err, stream) => {
                        if (err || !stream) {
                            return reject(err ?? new Error('Failed to open zip entry stream.'));
                        }
                        resolve(stream);
                    });
                }),
        };
        zipfile.readEntry(); // Request the next entry only after the consumer resumes
    }
    if (failure) throw failure;
}
private async parseMessage(
input: Buffer | Readable,
path: string
): Promise<EmailObject> {
let emlBuffer: Buffer;
if (Buffer.isBuffer(input)) {
emlBuffer = input;
} else {
emlBuffer = await streamToBuffer(input);
}
const parsedEmail: ParsedMail = await simpleParser(emlBuffer);
const attachments = parsedEmail.attachments.map((attachment: Attachment) => ({
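The entry filtering and folder derivation performed in `processZipEntries` can be looked at in isolation. The sketch below is a simplified stand-in, not the connector's code: `emlFolderFor` is a hypothetical helper, and it uses plain string operations in place of `dirname`, so edge cases such as names with a leading slash may be handled differently.

```typescript
// Returns the folder an .eml entry belongs to, or null when the entry should be skipped.
function emlFolderFor(entryName: string): string | null {
    // Skip macOS metadata entries and directory entries (names ending in "/").
    if (entryName.startsWith('__MACOSX/') || entryName.endsWith('/')) {
        return null;
    }
    // Only .eml files are parsed; everything else in the archive is ignored.
    if (!entryName.endsWith('.eml')) {
        return null;
    }
    // ZIP entry names use "/" separators; a name without one sits at the archive root.
    const lastSlash = entryName.lastIndexOf('/');
    return lastSlash === -1 ? '' : entryName.slice(0, lastSlash);
}

const names = ['inbox/', 'inbox/2024/a.eml', 'b.eml', '__MACOSX/inbox/._a.eml', 'readme.txt'];
for (const name of names) {
    console.log(name, '->', JSON.stringify(emlFolderFor(name)));
}
```

Working from the entry name alone is what lets the connector drop the extracted directory tree: the folder path comes from the archive's own listing rather than from files on disk.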

@@ -7,6 +7,7 @@
import { Label } from '$lib/components/ui/label';
import * as Select from '$lib/components/ui/select';
import * as Alert from '$lib/components/ui/alert/index.js';
import * as RadioGroup from '$lib/components/ui/radio-group/index.js';
import { Textarea } from '$lib/components/ui/textarea/index.js';
import { setAlert } from '$lib/components/custom/alert/alert-state.svelte';
import { api } from '$lib/api.client';
@@ -70,6 +71,27 @@
let fileUploading = $state(false);
let importMethod = $state<'upload' | 'local'>(
source?.credentials && 'localFilePath' in source.credentials && source.credentials.localFilePath
? 'local'
: 'upload'
);
$effect(() => {
if (importMethod === 'upload') {
if ('localFilePath' in formData.providerConfig) {
delete formData.providerConfig.localFilePath;
}
} else {
if ('uploadedFilePath' in formData.providerConfig) {
delete formData.providerConfig.uploadedFilePath;
}
if ('uploadedFileName' in formData.providerConfig) {
delete formData.providerConfig.uploadedFileName;
}
}
});
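The field clearing done by the `$effect` above can also be expressed as a pure function, which may be easier to reason about and test outside a component. This is a sketch under assumptions: `FileImportConfig` and `pruneForMethod` are hypothetical names, and unlike the effect it returns a copy instead of mutating the form state.

```typescript
type ImportMethod = 'upload' | 'local';

// Hypothetical subset of the provider config fields touched by the form.
interface FileImportConfig {
    localFilePath?: string;
    uploadedFilePath?: string;
    uploadedFileName?: string;
}

// Returns a copy of the config with the fields of the inactive method removed.
function pruneForMethod(config: FileImportConfig, method: ImportMethod): FileImportConfig {
    const next: FileImportConfig = { ...config };
    if (method === 'upload') {
        delete next.localFilePath;
    } else {
        delete next.uploadedFilePath;
        delete next.uploadedFileName;
    }
    return next;
}

const draft: FileImportConfig = {
    localFilePath: '/data/temp/emails.zip',
    uploadedFilePath: 'uploads/emails.zip',
    uploadedFileName: 'emails.zip',
};
console.log(pruneForMethod(draft, 'local'));
console.log(pruneForMethod(draft, 'upload'));
```

Either way, the goal is the same: a submitted source should carry only the path fields that belong to the selected import method.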
const handleSubmit = async (event: Event) => {
event.preventDefault();
isSubmitting = true;
@@ -236,59 +258,143 @@
/>
</div>
{:else if formData.provider === 'pst_import'}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="pst-file" class="text-left"
>{$t('app.components.ingestion_source_form.pst_file')}</Label
>
<div class="col-span-3 flex flex-row items-center space-x-2">
<Input
id="pst-file"
type="file"
class=""
accept=".pst"
onchange={handleFileChange}
/>
{#if fileUploading}
<span class=" text-primary animate-spin"><Loader2 /></span>
{/if}
</div>
<div class="grid grid-cols-4 items-start gap-4">
<Label class="text-left pt-2">{$t('app.components.ingestion_source_form.import_method')}</Label>
<RadioGroup.Root bind:value={importMethod} class="col-span-3 flex flex-col space-y-1">
<div class="flex items-center space-x-2">
<RadioGroup.Item value="upload" id="pst-upload" />
<Label for="pst-upload">{$t('app.components.ingestion_source_form.upload_file')}</Label>
</div>
<div class="flex items-center space-x-2">
<RadioGroup.Item value="local" id="pst-local" />
<Label for="pst-local">{$t('app.components.ingestion_source_form.local_path')}</Label>
</div>
</RadioGroup.Root>
</div>
{#if importMethod === 'upload'}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="pst-file" class="text-left"
>{$t('app.components.ingestion_source_form.pst_file')}</Label
>
<div class="col-span-3 flex flex-row items-center space-x-2">
<Input
id="pst-file"
type="file"
class=""
accept=".pst"
onchange={handleFileChange}
/>
{#if fileUploading}
<span class=" text-primary animate-spin"><Loader2 /></span>
{/if}
</div>
</div>
{:else}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="pst-local-path" class="text-left"
>{$t('app.components.ingestion_source_form.local_file_path')}</Label
>
<Input
id="pst-local-path"
bind:value={formData.providerConfig.localFilePath}
placeholder="/path/to/file.pst"
class="col-span-3"
/>
</div>
{/if}
{:else if formData.provider === 'eml_import'}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="eml-file" class="text-left"
>{$t('app.components.ingestion_source_form.eml_file')}</Label
>
<div class="col-span-3 flex flex-row items-center space-x-2">
<Input
id="eml-file"
type="file"
class=""
accept=".zip"
onchange={handleFileChange}
/>
{#if fileUploading}
<span class=" text-primary animate-spin"><Loader2 /></span>
{/if}
</div>
<div class="grid grid-cols-4 items-start gap-4">
<Label class="text-left pt-2">{$t('app.components.ingestion_source_form.import_method')}</Label>
<RadioGroup.Root bind:value={importMethod} class="col-span-3 flex flex-col space-y-1">
<div class="flex items-center space-x-2">
<RadioGroup.Item value="upload" id="eml-upload" />
<Label for="eml-upload">{$t('app.components.ingestion_source_form.upload_file')}</Label>
</div>
<div class="flex items-center space-x-2">
<RadioGroup.Item value="local" id="eml-local" />
<Label for="eml-local">{$t('app.components.ingestion_source_form.local_path')}</Label>
</div>
</RadioGroup.Root>
</div>
{#if importMethod === 'upload'}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="eml-file" class="text-left"
>{$t('app.components.ingestion_source_form.eml_file')}</Label
>
<div class="col-span-3 flex flex-row items-center space-x-2">
<Input
id="eml-file"
type="file"
class=""
accept=".zip"
onchange={handleFileChange}
/>
{#if fileUploading}
<span class=" text-primary animate-spin"><Loader2 /></span>
{/if}
</div>
</div>
{:else}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="eml-local-path" class="text-left"
>{$t('app.components.ingestion_source_form.local_file_path')}</Label
>
<Input
id="eml-local-path"
bind:value={formData.providerConfig.localFilePath}
placeholder="/path/to/file.zip"
class="col-span-3"
/>
</div>
{/if}
{:else if formData.provider === 'mbox_import'}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="mbox-file" class="text-left"
>{$t('app.components.ingestion_source_form.mbox_file')}</Label
>
<div class="col-span-3 flex flex-row items-center space-x-2">
<Input
id="mbox-file"
type="file"
class=""
accept=".mbox"
onchange={handleFileChange}
/>
{#if fileUploading}
<span class=" text-primary animate-spin"><Loader2 /></span>
{/if}
</div>
<div class="grid grid-cols-4 items-start gap-4">
<Label class="text-left pt-2">{$t('app.components.ingestion_source_form.import_method')}</Label>
<RadioGroup.Root bind:value={importMethod} class="col-span-3 flex flex-col space-y-1">
<div class="flex items-center space-x-2">
<RadioGroup.Item value="upload" id="mbox-upload" />
<Label for="mbox-upload">{$t('app.components.ingestion_source_form.upload_file')}</Label>
</div>
<div class="flex items-center space-x-2">
<RadioGroup.Item value="local" id="mbox-local" />
<Label for="mbox-local">{$t('app.components.ingestion_source_form.local_path')}</Label>
</div>
</RadioGroup.Root>
</div>
{#if importMethod === 'upload'}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="mbox-file" class="text-left"
>{$t('app.components.ingestion_source_form.mbox_file')}</Label
>
<div class="col-span-3 flex flex-row items-center space-x-2">
<Input
id="mbox-file"
type="file"
class=""
accept=".mbox"
onchange={handleFileChange}
/>
{#if fileUploading}
<span class=" text-primary animate-spin"><Loader2 /></span>
{/if}
</div>
</div>
{:else}
<div class="grid grid-cols-4 items-center gap-4">
<Label for="mbox-local-path" class="text-left"
>{$t('app.components.ingestion_source_form.local_file_path')}</Label
>
<Input
id="mbox-local-path"
bind:value={formData.providerConfig.localFilePath}
placeholder="/path/to/file.mbox"
class="col-span-3"
/>
</div>
{/if}
{/if}
{#if formData.provider === 'google_workspace' || formData.provider === 'microsoft_365'}
<Alert.Root>

@@ -199,6 +199,10 @@
"provider_eml_import": "EML Import",
"provider_mbox_import": "Mbox Import",
"select_provider": "Select a provider",
"import_method": "Import Method",
"upload_file": "Upload File",
"local_path": "Local Path (Recommended for large files)",
"local_file_path": "Local File Path",
"service_account_key": "Service Account Key (JSON)",
"service_account_key_placeholder": "Paste your service account key JSON content",
"impersonated_admin_email": "Impersonated Admin Email",