mirror of
https://github.com/databasus/databasus.git
synced 2026-04-06 00:32:03 +02:00
Infinite pg_basebackup loop due to hardcoded WAL timeline ID (timeline=1) #10
Originally created by @raflol33 on 4/1/2026
Description
On PostgreSQL clusters where the timeline ID is greater than 1 (e.g., Patroni High Availability setups that have undergone failovers/switchovers), the Databasus agent enters an infinite loop of creating full backups due to a spurious wal_chain_broken error.
Steps to Reproduce
Set up a PostgreSQL database with a timeline > 1 (e.g., promote a replica to master or trigger a failover in Patroni).
Configure the Databasus agent to back up this database and stream WAL files.
Observe the agent behavior and backend logs.
Expected Behavior
The agent should perform a single successful pg_basebackup and subsequently stream incremental WAL segments correctly, keeping the WAL chain valid.
Actual Behavior (The Bug)
The agent performs endless full backups in a loop.
The agent successfully completes pg_basebackup.
The agent parses the start/stop LSN from pg_basebackup's stderr but hardcodes the timeline ID to 1 in agent/internal/features/full_backup/stderr_parser.go.
The Databasus database registers the full backup's underlying WAL segments as starting with 00000001....
The wal_streamer runs in parallel, correctly grabbing and uploading the actual WAL segments generated by PostgreSQL, which have the real timeline ID prefix (e.g., 0000001A... for timeline 26).
The backend's WAL chain validator compares the segments. Because 0000001A does not sequentially follow 00000001, the backend detects a gap and returns wal_chain_broken.
The agent receives this error and automatically triggers a new full backup, causing an endless loop.
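To make the mismatch concrete, here is a minimal sketch of how a WAL segment file name is derived from an LSN and a timeline, following PostgreSQL's standard 24-hex-digit naming (timeline, log number, segment number, 8 digits each). The function segmentNameSketch is a hypothetical stand-in written for this issue, not the agent's actual LSNToSegmentName(), and the 16 MiB segment size is an assumption (the PostgreSQL default).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// segmentNameSketch is a hypothetical illustration, NOT the agent's real
// LSNToSegmentName(). It maps an LSN such as "1D2/4A000028" plus a timeline
// to a 24-hex-digit WAL segment name using PostgreSQL's naming scheme.
func segmentNameSketch(lsn string, timeline uint32, segSize uint64) (string, error) {
	hiStr, loStr, ok := strings.Cut(lsn, "/")
	if !ok {
		return "", fmt.Errorf("malformed LSN %q", lsn)
	}
	hi, err := strconv.ParseUint(hiStr, 16, 32)
	if err != nil {
		return "", fmt.Errorf("parse LSN high half: %w", err)
	}
	lo, err := strconv.ParseUint(loStr, 16, 32)
	if err != nil {
		return "", fmt.Errorf("parse LSN low half: %w", err)
	}
	if segSize == 0 || segSize > 0x100000000 {
		return "", fmt.Errorf("invalid segment size %d", segSize)
	}
	segNo := (hi<<32 | lo) / segSize           // absolute segment number
	segsPerLog := uint64(0x100000000) / segSize // segments per 4 GiB "log"
	return fmt.Sprintf("%08X%08X%08X", timeline, segNo/segsPerLog, segNo%segsPerLog), nil
}

func main() {
	const segSize = 16 * 1024 * 1024 // assumed default WAL segment size
	for _, tl := range []uint32{1, 26} {
		name, err := segmentNameSketch("1D2/4A000028", tl, segSize)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("timeline %d -> %s\n", tl, name)
	}
}
```

With the hardcoded timeline the name starts with 00000001, while the streamed segment for the same LSN on timeline 26 starts with 0000001A, so the two can never line up in a chain check.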
Root Cause
In agent/internal/features/full_backup/stderr_parser.go, the timeline argument to LSNToSegmentName() is hardcoded to 1:
go
// Line ~28
startSegment, err = LSNToSegmentName(startMatch[1], 1, defaultWalSegmentSize)
// ...
// Line ~33
stopSegment, err = LSNToSegmentName(stopMatch[1], 1, defaultWalSegmentSize)
However, pg_basebackup already reports the active timeline in the start-point line it writes to stderr:
text
pg_basebackup: write-ahead log start point: 1D2/4A000028 on timeline 26
Proposed Solution
Update ParseBasebackupStderr to extract the timeline ID from the pg_basebackup stderr output using a regex (e.g., on timeline (\d+)) rather than hardcoding 1. Pass this dynamically parsed value into LSNToSegmentName().
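A rough sketch of the parsing step, assuming the start-point line always has the shape shown above. The names startPointRe and parseStartPointSketch are hypothetical and do not mirror the existing ParseBasebackupStderr signature. The parsed timeline would then replace the literal 1 in both LSNToSegmentName() calls.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// startPointRe (hypothetical) captures the start LSN and the decimal timeline
// from the "write-ahead log start point" line printed by pg_basebackup.
var startPointRe = regexp.MustCompile(
	`write-ahead log start point: ([0-9A-Fa-f]+/[0-9A-Fa-f]+) on timeline (\d+)`)

// parseStartPointSketch is an illustrative helper, not the agent's real code.
// It returns the start LSN and timeline, or an error if the line is missing.
func parseStartPointSketch(stderr string) (string, uint32, error) {
	m := startPointRe.FindStringSubmatch(stderr)
	if m == nil {
		return "", 0, fmt.Errorf("start point line with timeline not found")
	}
	tl, err := strconv.ParseUint(m[2], 10, 32)
	if err != nil {
		return "", 0, fmt.Errorf("parse timeline %q: %w", m[2], err)
	}
	if tl == 0 {
		return "", 0, fmt.Errorf("timeline must be at least 1")
	}
	return m[1], uint32(tl), nil
}

func main() {
	stderr := "pg_basebackup: write-ahead log start point: 1D2/4A000028 on timeline 26\n"
	lsn, tl, err := parseStartPointSketch(stderr)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("start LSN %s, timeline %d (hex prefix %08X)\n", lsn, tl, tl)
}
```

Returning an error when the timeline cannot be found, instead of silently falling back to 1, would surface a changed pg_basebackup output format early rather than reintroducing the loop.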
Note: The backend WalCalculator already handles and preserves hexadecimal timelines when validating segment names, so the fix can be confined to the agent code without breaking restore logic.
logfile.txt