Infinite pg_basebackup loop due to hardcoded WAL timeline ID (timeline=1) #10

Open
opened 2026-04-05 16:15:35 +02:00 by MrUnknownDE · 0 comments

Originally created by @raflol33 on 4/1/2026

Description
On PostgreSQL clusters where the timeline ID is greater than 1 (e.g., Patroni High Availability setups that have undergone failovers/switchovers), the Databasus agent enters an infinite loop of creating full backups due to a spurious wal_chain_broken error.

Steps to Reproduce
1. Set up a PostgreSQL database with a timeline > 1 (e.g., promote a replica to master or trigger a failover in Patroni).
2. Configure the Databasus agent to back up this database and stream WAL files.
3. Observe the agent behavior and backend logs.

Expected Behavior
The agent should perform a single successful pg_basebackup and then stream incremental WAL segments, keeping the WAL chain valid.

Actual Behavior (The Bug)
The agent performs endless full backups in a loop:

1. The agent successfully completes pg_basebackup.
2. The agent parses the start/stop LSN from pg_basebackup's stderr but hardcodes the timeline ID to 1 in agent/internal/features/full_backup/stderr_parser.go.
3. The Databasus database registers the full backup's underlying WAL segments as starting with 00000001....
4. The wal_streamer runs in parallel, correctly grabbing and uploading the actual WAL segments generated by PostgreSQL, which have the real timeline ID prefix (e.g., 0000001A... for timeline 26).
5. The backend's WAL chain validator compares the segments. Because 0000001A does not sequentially follow 00000001, the backend detects a gap and returns wal_chain_broken.
6. The agent receives this error and automatically triggers a new full backup, causing an endless loop.
Root Cause
In agent/internal/features/full_backup/stderr_parser.go, the timeline argument to LSNToSegmentName() is hardcoded to 1:

go
// Line ~28
startSegment, err = LSNToSegmentName(startMatch[1], 1, defaultWalSegmentSize)
// ...
// Line ~33
stopSegment, err = LSNToSegmentName(stopMatch[1], 1, defaultWalSegmentSize)
However, pg_basebackup reports the active timeline on the start-point line of its stderr output:

text
pg_basebackup: write-ahead log start point: 1D2/4A000028 on timeline 26
Proposed Solution
Update ParseBasebackupStderr to extract the timeline ID from the pg_basebackup stderr output using a regex (e.g., on timeline (\d+)) rather than hardcoding 1. Pass this dynamically parsed value into LSNToSegmentName().

Note: The backend WalCalculator already handles and preserves hexadecimal timelines when validating segment names, so the fix can be confined to the agent code without breaking restore logic.

logfile.txt (https://github.com/user-attachments/files/26405851/logfile.txt)
MrUnknownDE added the bug label 2026-04-05 16:15:35 +02:00
Reference: github/databasus#10