Health check status not invalidated when Newt site goes offline #74

MrUnknownDE · 2026-04-05T17:00:48+02:00

Originally created by @strausmann on 3/24/2026

Description

When a Newt agent disconnects (site goes offline), the health check status of all targets routed through that site remains "healthy" in the dashboard. Pangolin correctly detects the site as offline, but does not invalidate the cached health check results for targets on that site.

This causes Pangolin to continue routing traffic to targets through a dead tunnel, resulting in timeouts for users.

Steps to Reproduce

1. Configure a resource with multiple targets across different sites (e.g., 3 targets via Site A, 1 target via Site B)
2. Enable health checks on all targets
3. Verify all targets show "healthy"
4. Stop the Newt agent on Site A (e.g., `docker stop pangolin-newt`)
5. Observe: Site A shows "Offline" in the Sites dashboard
6. Observe: All targets via Site A **still show "healthy"** in the resource configuration

Expected Behavior

When a site goes offline, all targets routed through that site should immediately transition to "unhealthy" or "unknown" status. Pangolin should not route traffic to targets on offline sites.
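The expected routing rule can be sketched as a small filter. This is only an illustration under assumed types: `Site`, `Target`, `effectiveHealth`, and `selectRoutableTargets` are hypothetical names, not Pangolin's actual schema or API. The idea is that a stored health result is trusted only while the site it was collected through is online.

```typescript
// Hypothetical types for illustration; not Pangolin's actual schema.
type HealthStatus = "healthy" | "unhealthy" | "unknown";

interface Site {
  id: string;
  online: boolean;
}

interface Target {
  id: string;
  siteId: string;
  health: HealthStatus; // last stored health check result
}

// A stored result is only trusted while the site it came through is online.
function effectiveHealth(target: Target, sites: Map<string, Site>): HealthStatus {
  const site = sites.get(target.siteId);
  if (site === undefined || !site.online) {
    return "unknown";
  }
  return target.health;
}

// Only targets whose effective status is "healthy" are eligible for routing.
function selectRoutableTargets(targets: Target[], sites: Map<string, Site>): Target[] {
  return targets.filter((t) => effectiveHealth(t, sites) === "healthy");
}

// Mirrors the reproduction steps: three targets via Site A, one via Site B.
const sites = new Map<string, Site>([
  ["site-a", { id: "site-a", online: false }], // Newt agent stopped
  ["site-b", { id: "site-b", online: true }],
]);

const targets: Target[] = [
  { id: "t1", siteId: "site-a", health: "healthy" },
  { id: "t2", siteId: "site-a", health: "healthy" },
  { id: "t3", siteId: "site-a", health: "healthy" },
  { id: "t4", siteId: "site-b", health: "healthy" },
];

const routable = selectRoutableTargets(targets, sites);
console.log(routable.map((t) => t.id)); // only "t4" remains
```

Deriving the status at read time means the dashboard and the router cannot disagree with the site state, even if the stored value is never rewritten.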

Actual Behavior

- Site correctly shows "Offline"
- Target health check status retains the last known value ("healthy")
- Pangolin continues to route traffic through the dead tunnel
- Users experience sporadic timeouts (requests randomly hit the dead route)

Root Cause Analysis

Based on log analysis:

1. Health checks run **through the Newt tunnel** (Pangolin → WebSocket → Newt → HTTP → target)
2. When Newt disconnects, no new health check results arrive
3. The last-known-good status stays in the database and is displayed as current
4. Additionally: the `newt/disconnecting` message type throws an exception instead of triggering state cleanup:
```
Unsupported message type: newt/disconnecting
```
5. Pangolin continues sending health check requests to the disconnected Newt (phantom checks)
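To illustrate item 4, here is a minimal sketch of a message dispatcher that fails on unregistered types. It assumes a simple handler registry; `MessageRouter`, `NewtMessage`, and the `siteId` field are hypothetical and not taken from Pangolin's code. It shows how a missing handler would produce the logged error, and how registering one turns the same message into a cleanup trigger.

```typescript
// Hypothetical message shape and registry; not Pangolin's actual code.
interface NewtMessage {
  type: string;
  siteId: string;
}

type MessageHandler = (msg: NewtMessage) => void;

class MessageRouter {
  private handlers = new Map<string, MessageHandler>();

  register(type: string, handler: MessageHandler): void {
    this.handlers.set(type, handler);
  }

  dispatch(msg: NewtMessage): void {
    const handler = this.handlers.get(msg.type);
    if (handler === undefined) {
      // Same wording as the log line quoted in the report.
      throw new Error(`Unsupported message type: ${msg.type}`);
    }
    handler(msg);
  }
}

const router = new MessageRouter();
const cleanedUpSites: string[] = [];

// Without a handler, the disconnect notice is rejected and no cleanup runs.
let errorMessage = "";
try {
  router.dispatch({ type: "newt/disconnecting", siteId: "site-a" });
} catch (err) {
  errorMessage = err instanceof Error ? err.message : String(err);
}

// With a handler registered, the same message triggers state cleanup.
router.register("newt/disconnecting", (msg) => {
  cleanedUpSites.push(msg.siteId); // stand-in for real cleanup logic
});
router.dispatch({ type: "newt/disconnecting", siteId: "site-a" });

console.log(errorMessage);
console.log(cleanedUpSites);
```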

Environment

- Pangolin: Enterprise Edition (PostgreSQL)
- Newt: v1.10.3
- Setup: 4 targets for Proxmox VE (172.16.50.8:8006) across 4 sites, 1 site taken offline

Suggested Fix

When a Newt disconnect is detected:

1. Set all target health checks on that site to `"unknown"` or `"unhealthy"`
2. Handle the `newt/disconnecting` message type (currently throws exception)
3. Stop sending health check requests to disconnected sites
4. When Newt reconnects, resume health checks and let them naturally transition back to "healthy"
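The disconnect and reconnect lifecycle described above could look roughly like the following in-memory sketch. `SiteHealthTracker` and all of its methods are hypothetical names chosen for illustration; a real fix would operate on Pangolin's database and scheduler, which are not shown here.

```typescript
type HealthStatus = "healthy" | "unhealthy" | "unknown";

interface TargetState {
  siteId: string;
  health: HealthStatus;
}

// Hypothetical in-memory tracker; a stand-in for database and scheduler state.
class SiteHealthTracker {
  private targets = new Map<string, TargetState>();
  private pausedSites = new Set<string>();

  addTarget(targetId: string, siteId: string): void {
    this.targets.set(targetId, { siteId, health: "unknown" });
  }

  // Invalidate stored results and pause checks for the site.
  onSiteDisconnected(siteId: string): void {
    this.pausedSites.add(siteId);
    this.targets.forEach((state) => {
      if (state.siteId === siteId) {
        state.health = "unknown";
      }
    });
  }

  // Resume checks; statuses stay "unknown" until a fresh result arrives.
  onSiteReconnected(siteId: string): void {
    this.pausedSites.delete(siteId);
  }

  // The scheduler would consult this before sending a check through the tunnel.
  shouldSendCheck(targetId: string): boolean {
    const state = this.targets.get(targetId);
    return state !== undefined && !this.pausedSites.has(state.siteId);
  }

  // Results for a paused site are ignored so late replies cannot revive a stale status.
  recordResult(targetId: string, health: HealthStatus): void {
    const state = this.targets.get(targetId);
    if (state === undefined || this.pausedSites.has(state.siteId)) {
      return;
    }
    state.health = health;
  }

  healthOf(targetId: string): HealthStatus {
    return this.targets.get(targetId)?.health ?? "unknown";
  }
}

const tracker = new SiteHealthTracker();
tracker.addTarget("pve-a", "site-a");
tracker.recordResult("pve-a", "healthy");

tracker.onSiteDisconnected("site-a");
const statusWhileOffline = tracker.healthOf("pve-a");
const checkWhileOffline = tracker.shouldSendCheck("pve-a");

tracker.onSiteReconnected("site-a");
tracker.recordResult("pve-a", "healthy");
const statusAfterReconnect = tracker.healthOf("pve-a");

console.log(statusWhileOffline, checkWhileOffline, statusAfterReconnect);
```

Using "unknown" rather than "unhealthy" on disconnect keeps the distinction between "the target failed a check" and "no check could be run", which may matter for alerting.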


MrUnknownDE closed this issue

2026-04-05 17:00:48 +02:00

MrUnknownDE referenced this issue

2026-04-05 19:42:49 +02:00

fix: add missing `await` when verifying pincode #1858


Reference: github/pangolin#74