mirror of
https://github.com/LogicLabs-OU/OpenArchiver.git
synced 2026-04-06 00:31:57 +02:00
MBOX import: global cross-source deduplication causes incomplete per-mailbox archives and silent data omission #6
Open
opened 2026-04-05 16:16:08 +02:00 by MrUnknownDE
Originally created by @thisisamateurhour on 3/26/2026
Describe the bug
When importing MBOX files from multiple separate Gmail accounts as individual ingestion sources, three related issues emerge that compromise archive integrity:
Issue 1: Emails silently skipped during import
During MBOX import, emails are silently skipped with only an INFO-level log message (Skipping duplicate email). There is no summary, warning, or UI indication that emails were excluded from the import. The user has no way to know their archive is incomplete without manually comparing email counts against the source. For an archiving tool, particularly one marketed for legal compliance, silent data omission without user awareness is a critical integrity problem.
Issue 2: Global cross-source deduplication
The duplicate detection appears to operate globally across all ingestion sources rather than being scoped per-source. When the same message_id_header exists in multiple mailboxes (e.g., a company-wide email received by multiple accounts), only the first-imported copy is stored. Subsequent imports of other mailboxes skip the email entirely.
This breaks per-mailbox archive integrity. In a corporate archiving scenario, if both Employee A and Employee B receive the same company-wide email, the archive must show it in both mailboxes independently. Attributing it only to whichever mailbox was imported first is factually and legally incorrect.
The provider_msg_source_idx composite index on (provider_message_id, ingestion_source_id) suggests per-source dedup may have been intended, but the application-level dedup logic appears to check globally.
Issue 3: Search results do not indicate source inbox
When searching for an email that was deduplicated via Issue 2, the search results show the email with its To/From headers but do not indicate which ingestion source it belongs to. The result is that:
- An email sent to account-b@example.com appears in search results
- The account-b ingestion source shows it as missing
- The email is stored under account-a's ingestion source (whichever was imported first)
This creates a confusing experience where search says the email exists, but the per-inbox archive view says it doesn't.
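One way to picture the missing attribution is a small post-processing step that labels each search hit with the name of the ingestion source that owns it. The sketch below is only an illustration: the SearchHit and IngestionSource shapes and the attributeHits function are hypothetical and are not taken from the OpenArchiver codebase.

```typescript
// Hypothetical shapes; the real OpenArchiver search and source types may differ.
interface SearchHit {
  emailId: string;
  subject: string;
  to: string[];
  ingestionSourceId: string;
}

interface IngestionSource {
  id: string;
  name: string;
}

interface AttributedHit extends SearchHit {
  sourceName: string;
}

// Attach the owning ingestion source's name to every hit so the UI can show
// which archive the stored copy actually lives in.
function attributeHits(
  hits: SearchHit[],
  sources: IngestionSource[],
): AttributedHit[] {
  const nameById = new Map<string, string>();
  for (const source of sources) {
    nameById.set(source.id, source.name);
  }
  return hits.map((hit) => ({
    ...hit,
    sourceName: nameById.get(hit.ingestionSourceId) ?? "(unknown source)",
  }));
}

// Example: an email addressed to account-b that is stored under account-a.
const exampleSources: IngestionSource[] = [
  { id: "src-a", name: "account-a" },
  { id: "src-b", name: "account-b" },
];
const exampleHits: SearchHit[] = [
  {
    emailId: "e1",
    subject: "Company update",
    to: ["account-b@example.com"],
    ingestionSourceId: "src-a",
  },
];
const attributed = attributeHits(exampleHits, exampleSources);
```

With a label like this in the result row, a user could see that the hit addressed to account-b is stored under account-a and would know where to look for it.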
To Reproduce
1. Add a first ingestion source (account-a@example.com) via MBOX import containing an email with a specific Message-ID
2. Add a second ingestion source (account-b@example.com) via MBOX import where the same email (same Message-ID) was also received
3. Observe Skipping duplicate email in logs with no summary or count
Expected behavior
Import transparency: When emails are skipped during import, a summary should be provided (e.g., "Imported X emails, skipped Y duplicates"). Ideally, skipped emails should be logged with enough detail (Message-ID, subject) to allow the user to verify the skip was correct.
Per-source deduplication: The duplicate check should be scoped to (message_id_header, ingestion_source_id). This prevents duplicates within a single source (important for Gmail Takeout where label copies produce multiple MBOX entries for the same email) while allowing the same email to exist across multiple sources as independent archive entries.
Search source attribution: Search results should indicate which ingestion source an email belongs to, so users can understand why an email appears in search but not in a specific inbox's archive view.
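The first two expectations above can be sketched together as an in-memory model: the duplicate key combines the ingestion source ID with the Message-ID, and every import returns a summary that counts and lists what was skipped. All names here (ParsedEmail, ImportSummary, dedupKey, importMailbox, formatSummary) are hypothetical, and a Set stands in for whatever database lookup the real importer performs.

```typescript
// Hypothetical types for illustration; not the OpenArchiver data model.
interface ParsedEmail {
  messageId: string;
  subject: string;
}

interface ImportSummary {
  sourceId: string;
  imported: number;
  skipped: ParsedEmail[];
}

// Duplicate key scoped to (ingestion source, Message-ID). JSON encoding
// avoids collisions that a plain string separator could cause.
function dedupKey(sourceId: string, messageId: string): string {
  return JSON.stringify([sourceId, messageId]);
}

// Import one mailbox. `seen` stands in for the persistent store of keys
// that have already been archived.
function importMailbox(
  sourceId: string,
  emails: ParsedEmail[],
  seen: Set<string>,
): ImportSummary {
  const summary: ImportSummary = { sourceId, imported: 0, skipped: [] };
  for (const email of emails) {
    const key = dedupKey(sourceId, email.messageId);
    if (seen.has(key)) {
      // Record enough detail for the user to verify the skip later.
      summary.skipped.push(email);
      continue;
    }
    seen.add(key);
    summary.imported += 1;
  }
  return summary;
}

function formatSummary(summary: ImportSummary): string {
  return `Imported ${summary.imported} emails, skipped ${summary.skipped.length} duplicates`;
}

// Example: account-a contains a label copy of <m1>; account-b also received <m1>.
const seenKeys = new Set<string>();
const summaryA = importMailbox(
  "account-a",
  [
    { messageId: "<m1@example.com>", subject: "Company update" },
    { messageId: "<m1@example.com>", subject: "Company update" },
    { messageId: "<m2@example.com>", subject: "Invoice" },
  ],
  seenKeys,
);
const summaryB = importMailbox(
  "account-b",
  [{ messageId: "<m1@example.com>", subject: "Company update" }],
  seenKeys,
);
```

In this model the label copy inside account-a is skipped and reported, while account-b still receives its own copy of the shared message.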
Database evidence
The email was sent TO account-b but attributed to account-a because account-a was imported first and the global dedup skipped it during account-b's import.
Additional context: user_email derivation
A related observation: user_email is derived from the MBOX filename or ingestion source name (resulting in values like account-a@example.com@mbox.local) rather than from actual email headers (To, Delivered-To, X-Delivered-To). This means the user_email field does not reflect the actual recipient of the email, only the import source. Combined with Issue 2, this causes emails to have incorrect recipient attribution.
System:
Relevant logs:
No error, warning, or summary is produced — the skips are logged only at INFO level with no detail about which email was skipped or why.
Suggested fix
Scope the duplicate check to (message_id_header, ingestion_source_id) instead of message_id_header alone.