Docker containers losing network connectivity amongst themselves #746

Closed
opened 2026-04-05 17:38:00 +02:00 by MrUnknownDE · 0 comments

Originally created by @clanger81 on 10/20/2025

Describe the Bug

Sorry, I didn't really know how to title this problem accurately. I've been having this issue intermittently for a while now and it's becoming pretty frustrating. From the logs of each container, it looks like both Gerbil and Traefik lose their connection to the Pangolin container. There doesn't appear to be a specific trigger; it seems to happen completely at random. Before running Pangolin, I had a single WireGuard tunnel back to my local server VLAN, with Traefik configured manually using rules from a file provider. I also had an instance of CrowdSec running alongside. I'd really like to continue using Pangolin, as it makes setting up sub-domains significantly easier than doing it all by hand.

Ultimately Gerbil logs will spit out the following at some stage:
INFO: 2025/10/20 10:44:55 Failed to report peer bandwidth: API returned non-OK status: 408 Request Timeout
INFO: 2025/10/20 10:46:52 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 10:49:00 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 10:49:47 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin on 127.0.0.11:53: read udp 127.0.0.1:56696->127.0.0.11:53: i/o timeout
INFO: 2025/10/20 10:50:47 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 10:52:45 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 10:54:14 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 10:57:52 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 11:00:14 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 11:06:44 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 11:11:17 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout
INFO: 2025/10/20 11:20:15 Fetching remote config from http://pangolin:3001/api/v1/gerbil/get-config
ERROR: 2025/10/20 11:20:15 Error fetching remote config http://pangolin:3001/api/v1/gerbil/get-config: Post "http://pangolin:3001/api/v1/gerbil/get-config": dial tcp 172.18.0.3:3001: connect: connection refused
ERROR: 2025/10/20 11:20:15 Failed to load configuration: Post "http://pangolin:3001/api/v1/gerbil/get-config": dial tcp 172.18.0.3:3001: connect: connection refused
INFO: 2025/10/20 11:20:20 Fetching remote config from http://pangolin:3001/api/v1/gerbil/get-config
ERROR: 2025/10/20 11:20:20 Error fetching remote config http://pangolin:3001/api/v1/gerbil/get-config: Post "http://pangolin:3001/api/v1/gerbil/get-config": dial tcp 172.18.0.3:3001: connect: connection refused
ERROR: 2025/10/20 11:20:20 Failed to load configuration: Post "http://pangolin:3001/api/v1/gerbil/get-config": dial tcp 172.18.0.3:3001: connect: connection refused
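
The Gerbil excerpt above seems to pass through three distinct stages: one HTTP 408, then repeated DNS lookup timeouts for `pangolin` (one of them naming Docker's embedded resolver at 127.0.0.11), and finally `connection refused` once the name resolves again. A rough sketch for tallying those stages from a saved log follows. The category names and the `summarize` helper are made up for illustration; only the matched substrings come from the log lines themselves.

```python
import re
from collections import Counter

# Substrings copied from the Gerbil log lines; the category names are hypothetical.
PATTERNS = [
    ("http_408", "408 Request Timeout"),
    ("dns_timeout", "lookup pangolin"),
    ("connection_refused", "connection refused"),
]

# Gerbil timestamps look like "2025/10/20 10:44:55".
TIMESTAMP = re.compile(r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}")


def classify(line: str) -> str:
    """Return the first matching failure category, or 'other'."""
    for name, needle in PATTERNS:
        if needle in line:
            return name
    return "other"


def summarize(lines):
    """Count lines per category and record when each category first appears."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        kind = classify(line)
        counts[kind] += 1
        match = TIMESTAMP.search(line)
        if match and kind not in first_seen:
            first_seen[kind] = match.group(0)
    return counts, first_seen


# Three lines taken verbatim from the excerpt above.
SAMPLE = [
    'INFO: 2025/10/20 10:44:55 Failed to report peer bandwidth: API returned non-OK status: 408 Request Timeout',
    'INFO: 2025/10/20 10:46:52 Failed to report peer bandwidth: failed to send bandwidth data: Post "http://pangolin:3001/api/v1/gerbil/receive-bandwidth": dial tcp: lookup pangolin: i/o timeout',
    'ERROR: 2025/10/20 11:20:15 Error fetching remote config http://pangolin:3001/api/v1/gerbil/get-config: Post "http://pangolin:3001/api/v1/gerbil/get-config": dial tcp 172.18.0.3:3001: connect: connection refused',
]

if __name__ == "__main__":
    counts, first_seen = summarize(SAMPLE)
    for kind, count in counts.items():
        print(kind, count, first_seen.get(kind))
```

Feeding it the full output of `docker logs` for the Gerbil container instead of `SAMPLE` should show when each stage begins, which may help narrow down whether name resolution or the Pangolin process itself fails first.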

And Traefik logs spit this out:
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": dial tcp 172.18.0.4:3001: connect: connection refused","time":"2025-10-20T05:40:10Z","message":"Provider error, retrying in 695.480607ms"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": dial tcp 172.18.0.4:3001: connect: connection refused","time":"2025-10-20T05:40:11Z","message":"Provider error, retrying in 856.717352ms"}
ERROR: CrowdsecBouncerTraefikPlugin: 2025/10/20 05:40:11 handleMetricsTicker:reportMetrics reportMetrics:query crowdsecQuery:unreachable url:http://crowdsec:8080/v1/usage-metrics Post "http://crowdsec:8080/v1/usage-metrics": dial tcp 172.18.0.3:8080: connect: connection refused
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": dial tcp 172.18.0.4:3001: connect: connection refused","time":"2025-10-20T05:40:12Z","message":"Provider error, retrying in 1.576982697s"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": dial tcp 172.18.0.4:3001: connect: connection refused","time":"2025-10-20T05:40:13Z","message":"Provider error, retrying in 2.509794383s"}
ERROR: CrowdsecBouncerTraefikPlugin: 2025/10/20 05:40:23 appsecQuery:unreachable
ERROR: CrowdsecBouncerTraefikPlugin: 2025/10/20 05:40:23 appsecQuery:unreachable
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-10-20T10:40:37Z","message":"Provider error, retrying in 424.080591ms"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-10-20T10:40:53Z","message":"Provider error, retrying in 818.761466ms"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-10-20T10:41:16Z","message":"Provider error, retrying in 701.622439ms"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-10-20T10:43:16Z","message":"Provider error, retrying in 262.435427ms"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-10-20T10:44:36Z","message":"Provider error, retrying in 313.481504ms"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-10-20T10:49:43Z","message":"Provider error, retrying in 634.789881ms"}
{"level":"error","providerName":"http","error":"cannot fetch configuration data: do fetch request: Get \"http://pangolin:3001/api/v1/traefik-config\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","time":"2025-10-20T10:58:08Z","message":"Provider error, retrying in 257.667696ms"}

As far as I can tell, the Pangolin logs are completely clean. I've had them on debug for the past couple of weeks to troubleshoot this, but nothing in them points to a problem.

A few things about my setup:

  • My instance is behind the Cloudflare proxy. I have not yet tried exposing it directly to see whether that makes a difference. Certs are being pulled through Cloudflare.
  • I am running the CrowdSec plugin. Those logs also look clean, with no errors from what I can tell.
  • Local network IPs are whitelisted in my config, along with Cloudflare DNS IPs set up for trusted header forwarding.
  • I'm using Defguard as my OpenID auth provider.

Environment

  • OS Type & Version: Ubuntu 24.04
  • Pangolin Version: 1.9.4
  • Gerbil Version: 1.2.2
  • Traefik Version: 3.5.3
  • Newt Version: 1.5.2 (on two nodes)

To Reproduce

No idea; the issue is completely random as far as I can tell. The only way I know it's down is that my sub-domains stop working when I try to access them. Restarting the whole stack resolves the problem. I've had it go down after an hour or two, though lately it generally stays stable for two to three days. I'll note that this has been a consistent issue across several versions of Pangolin to date.
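
Since the failure currently only gets noticed when a sub-domain stops loading, a small probe run from inside one of the affected containers (or any container on the same Docker network) could timestamp the moment things break and say whether name resolution or the TCP connection fails. This is only a sketch under assumptions: the `probe` and `watch` helpers are hypothetical, and the host `pangolin` and port 3001 are taken from the URLs in the logs. `socket.getaddrinfo` has no timeout argument of its own, so the DNS step is bounded by the system resolver's settings.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Return (stage, detail): stage is 'dns', 'tcp', or 'ok'."""
    try:
        # Step 1: name resolution (inside a container on a user-defined
        # network this goes through Docker's embedded DNS).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return "dns", str(exc)

    ip = infos[0][4][0]
    try:
        # Step 2: plain TCP connect to the first resolved address.
        with socket.create_connection((ip, port), timeout=timeout):
            pass
    except OSError as exc:
        return "tcp", f"{ip}: {exc}"
    return "ok", ip


def watch(host, port, interval=30.0, rounds=10):
    """Probe a fixed number of times and print one timestamped line per round."""
    for _ in range(rounds):
        stage, detail = probe(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host}:{port} stage={stage} detail={detail}")
        time.sleep(interval)
```

Calling `watch("pangolin", 3001)` would print ten lines, one every 30 seconds. A `dns` stage would match the `lookup pangolin: i/o timeout` lines from Gerbil, while a `tcp` stage would match the `connection refused` lines.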

Expected Behavior

The Pangolin stack stays stable without needing a restart of the whole stack.

Reference: github/pangolin#746