Health check status not invalidated when Newt site goes offline #74

Closed
opened 2026-04-05 17:00:48 +02:00 by MrUnknownDE · 0 comments

Originally created by @strausmann on 3/24/2026

Description

When a Newt agent disconnects (site goes offline), the health check status of all targets routed through that site remains "healthy" in the dashboard. Pangolin correctly detects the site as offline, but does not invalidate the cached health check results for targets on that site.

This causes Pangolin to continue routing traffic to targets through a dead tunnel, resulting in timeouts for users.

Steps to Reproduce

  1. Configure a resource with multiple targets across different sites (e.g., 3 targets via Site A, 1 target via Site B)
  2. Enable health checks on all targets
  3. Verify all targets show "healthy"
  4. Stop the Newt agent on Site A (e.g., docker stop pangolin-newt)
  5. Observe: Site A shows "Offline" in the Sites dashboard
  6. Observe: All targets via Site A still show "healthy" in the resource configuration

Expected Behavior

When a site goes offline, all targets routed through that site should immediately transition to "unhealthy" or "unknown" status. Pangolin should not route traffic to targets on offline sites.
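For illustration, the expected routing rule could look roughly like the sketch below: a target is only eligible when its site is online and its last check was healthy. The `Target` shape and `selectRoutableTargets` are hypothetical names made up for this example, not taken from the Pangolin codebase.

```typescript
// Hypothetical types for illustration only; not Pangolin's actual schema.
type HealthStatus = "healthy" | "unhealthy" | "unknown";

interface Target {
  id: number;
  siteId: number;
  health: HealthStatus;
}

// Return only targets that are safe to route to: the site must be online
// AND the last recorded health check must be "healthy".
function selectRoutableTargets(
  targets: Target[],
  onlineSiteIds: ReadonlySet<number>
): Target[] {
  return targets.filter(
    (t) => onlineSiteIds.has(t.siteId) && t.health === "healthy"
  );
}
```

With this rule, a stale "healthy" value on an offline site would no longer be enough to keep a target in rotation.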

Actual Behavior

  • Site correctly shows "Offline"
  • Target health check status retains the last known value ("healthy")
  • Pangolin continues to route traffic through the dead tunnel
  • Users experience sporadic timeouts (requests randomly hit the dead route)

Root Cause Analysis

Based on log analysis:

  1. Health checks run through the Newt tunnel (Pangolin → WebSocket → Newt → HTTP → target)
  2. When Newt disconnects, no new health check results arrive
  3. The last-known-good status stays in the database and is displayed as current
  4. Additionally, the newt/disconnecting message type is not handled. It throws an exception instead of triggering state cleanup:
    Unsupported message type: newt/disconnecting
    
  5. Pangolin continues sending health check requests to the disconnected Newt (phantom checks)
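To make point 4 concrete, here is a minimal sketch of a message dispatcher where `newt/disconnecting` has a registered handler that invalidates the site's targets, while genuinely unknown types still raise the error seen in the logs. Everything here (`NewtMessage`, `handlers`, `dispatch`, `invalidateSiteTargets`) is an assumed name for illustration; Pangolin's real handler registration may look quite different.

```typescript
// Hypothetical sketch; none of these names come from the Pangolin codebase.
type HealthStatus = "healthy" | "unhealthy" | "unknown";

interface Target {
  id: number;
  siteId: number;
  health: HealthStatus;
}

interface NewtMessage {
  type: string;
  siteId: number;
}

type Handler = (msg: NewtMessage, targets: Target[]) => void;

// Mark every target behind the given site as "unknown".
// Returns how many targets actually changed status.
function invalidateSiteTargets(siteId: number, targets: Target[]): number {
  let changed = 0;
  for (const t of targets) {
    if (t.siteId === siteId && t.health !== "unknown") {
      t.health = "unknown";
      changed++;
    }
  }
  return changed;
}

// Handler table: "newt/disconnecting" now triggers cleanup instead of
// falling through to the "unsupported" branch.
const handlers: Record<string, Handler> = {
  "newt/disconnecting": (msg, targets) => {
    invalidateSiteTargets(msg.siteId, targets);
  },
};

function dispatch(msg: NewtMessage, targets: Target[]): void {
  const handler = handlers[msg.type];
  if (!handler) {
    throw new Error(`Unsupported message type: ${msg.type}`);
  }
  handler(msg, targets);
}
```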

Environment

  • Pangolin: Enterprise Edition (PostgreSQL)
  • Newt: v1.10.3
  • Setup: 4 targets for Proxmox VE (172.16.50.8:8006) across 4 sites, 1 site taken offline

Suggested Fix

When a Newt disconnect is detected:

  1. Set all target health checks on that site to "unknown" or "unhealthy"
  2. Handle the newt/disconnecting message type (currently throws an exception)
  3. Stop sending health check requests to disconnected sites
  4. When Newt reconnects, resume health checks and let them naturally transition back to "healthy"
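The steps above could be sketched roughly as follows, assuming an in-memory tracker rather than the real database layer. `SiteHealthTracker` and its methods are hypothetical names invented for this example. The tracker resets status on disconnect, excludes offline sites from scheduling (no phantom checks), drops late results from offline sites, and lets targets return to "healthy" only through fresh results after reconnect.

```typescript
// Hypothetical sketch; persistence and the real scheduler are out of scope.
type HealthStatus = "healthy" | "unhealthy" | "unknown";

interface Target {
  id: number;
  siteId: number;
  health: HealthStatus;
}

class SiteHealthTracker {
  private readonly offlineSites = new Set<number>();
  private readonly targets: Target[];

  constructor(targets: Target[]) {
    this.targets = targets;
  }

  // Steps 1 and 2: on disconnect, remember the site is offline and
  // reset every target behind it to "unknown".
  onSiteDisconnected(siteId: number): void {
    this.offlineSites.add(siteId);
    for (const t of this.targets) {
      if (t.siteId === siteId) t.health = "unknown";
    }
  }

  // Step 4: on reconnect, only re-enable scheduling. Status stays
  // "unknown" until a fresh check result arrives.
  onSiteReconnected(siteId: number): void {
    this.offlineSites.delete(siteId);
  }

  // Step 3: targets on offline sites are never scheduled.
  targetsDueForCheck(): Target[] {
    return this.targets.filter((t) => !this.offlineSites.has(t.siteId));
  }

  // Results for unknown targets or offline sites are ignored, so a late
  // reply cannot flip a target back to "healthy" after a disconnect.
  recordResult(targetId: number, ok: boolean): void {
    const t = this.targets.find((x) => x.id === targetId);
    if (!t || this.offlineSites.has(t.siteId)) return;
    t.health = ok ? "healthy" : "unhealthy";
  }
}
```

Using "unknown" rather than "unhealthy" on disconnect is a design choice in this sketch: it keeps "the tunnel is down" distinguishable from "the target itself failed a check".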

Reference: github/pangolin#74