[Feature Request] Implement OpenTelemetry Metrics in Pangolin #950

MrUnknownDE · 2026-04-05T18:02:51+02:00

Originally created by @marcschaeferger on 9/7/2025

Add OpenTelemetry-based observability to Pangolin

Summary / Goal

Instrument Pangolin with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:

Metrics are emitted using the OpenTelemetry JS SDK (vendor-neutral API).
Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
Semantic conventions, base units (seconds and bytes, with _seconds and _bytes name suffixes), and low‑cardinality labels are enforced.
The focus is on metrics first; the design should allow adding traces and logs later.
An out‑of‑the‑box /metrics endpoint for Prometheus scraping (Prometheus exporter) and an example OTel Collector config for production pipelines are provided.

Why OpenTelemetry (OTel)

OTel is a CNCF project and the de facto vendor‑neutral standard for multi‑signal observability (metrics, traces, logs).
Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
Use the OTel Collector for attribute enrichment, normalization, batching, and remote_write.

Requirements & Constraints

Use the OpenTelemetry JavaScript/TypeScript SDKs and official instrumentation packages.
Provide a /metrics endpoint in Prometheus exposition format via the OTel Prometheus exporter.
All durations are in seconds and all sizes are in bytes. Metric names carry units where applicable (_seconds, _bytes), and counters end in _total.
Enforce low‑cardinality label design (e.g., site_id, resource_id). Do not use per‑request unique values as labels.
All exporters configurable at runtime through environment variables or configuration (no code change to switch exporter).
Provide an example OTel Collector configuration that demonstrates OTLP ingestion, attribute promotion and prometheusremotewrite usage.
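The naming rules above could be checked mechanically, for example in a unit test over the metric catalog. The following is a minimal sketch, not part of the proposal itself: `checkMetricName`, `MetricKind`, and `MetricUnit` are hypothetical names, and the `pangolin_` prefix check is inferred from the metric table rather than stated as a requirement.

```typescript
// Hypothetical helper: returns a list of naming problems (empty when the name is fine).
type MetricKind = "counter" | "gauge" | "histogram";
type MetricUnit = "seconds" | "bytes" | "none";

function checkMetricName(name: string, kind: MetricKind, unit: MetricUnit): string[] {
  const problems: string[] = [];
  // snake_case: lowercase letters and digits separated by single underscores.
  if (!/^[a-z][a-z0-9]*(_[a-z0-9]+)*$/.test(name)) {
    problems.push("name is not snake_case");
  }
  // Assumption: every Pangolin metric shares the pangolin_ prefix, as in the table.
  if (!name.startsWith("pangolin_")) {
    problems.push("name lacks the pangolin_ prefix");
  }
  if (kind === "counter" && !name.endsWith("_total")) {
    problems.push("counter name should end in _total");
  }
  if (kind !== "counter" && name.endsWith("_total")) {
    problems.push("only counters should end in _total");
  }
  // Strip a trailing _total first so pangolin_tunnel_bytes_total still counts as _bytes.
  const base = name.endsWith("_total") ? name.slice(0, -"_total".length) : name;
  if (unit !== "none" && !base.endsWith("_" + unit)) {
    problems.push(`name should carry the _${unit} unit suffix`);
  }
  return problems;
}
```

Running such a check over every catalog entry in CI would catch a metric that drifts from the convention before it reaches a dashboard.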

Recommended Pangolin Metrics (TypeScript implementation)

Use snake_case names and include unit and _total suffixes where applicable.

Category	Metric Name	Type	Labels	Unit / Notes
Site / Global	`pangolin_site_active_sites`	Gauge	`site_id`, `region`	count
	`pangolin_site_online`	Gauge 0/1	`site_id`, `transport`	bool
	`pangolin_site_bandwidth_bytes_total`	Counter	`site_id`, `direction`, `protocol`	bytes
	`pangolin_site_uptime_seconds_total`	Counter	`site_id`	seconds
	`pangolin_site_connection_drops_total`	Counter	`site_id`	count
	`pangolin_site_handshake_latency_seconds`	Histogram	`site_id`, `transport`	seconds
Resource / App	`pangolin_resource_requests_total`	Counter	`site_id`,`resource_id`,`backend`,`method`,`status`	count
	`pangolin_resource_request_duration_seconds`	Histogram	`site_id`,`resource_id`,`backend`,`method`	seconds
	`pangolin_resource_active_connections`	Gauge	`site_id`,`resource_id`,`protocol`	count
	`pangolin_resource_errors_total`	Counter	`site_id`,`resource_id`,`backend`,`error_type`	count
	`pangolin_resource_bandwidth_bytes_total`	Counter	`site_id`,`resource_id`,`direction`	bytes
Tunnel / Transport	`pangolin_tunnel_up`	Gauge 0/1	`site_id`,`transport`	bool
	`pangolin_tunnel_reconnects_total`	Counter	`site_id`,`transport`,`reason`	count
	`pangolin_tunnel_latency_seconds`	Histogram	`site_id`,`transport`	seconds
	`pangolin_tunnel_bytes_total`	Counter	`site_id`,`transport`,`direction`	bytes
	`pangolin_wg_handshake_total`	Counter	`site_id`,`result`	count
Backend	`pangolin_backend_health_status`	Gauge 0/1	`backend`,`site_id`	bool
	`pangolin_backend_connection_errors_total`	Counter	`backend`,`site_id`,`error_type`	count
	`pangolin_backend_response_size_bytes`	Histogram	`backend`,`site_id`	bytes
Auth / Identity	`pangolin_auth_requests_total`	Counter	`site_id`,`auth_method`,`result`	count
	`pangolin_auth_request_duration_seconds`	Histogram	`auth_method`,`result`	seconds
	`pangolin_auth_active_users`	Gauge	`site_id`,`auth_method`	count
	`pangolin_auth_failure_reasons_total`	Counter	`site_id`,`reason`,`auth_method`	count
Tokens / Sessions	`pangolin_token_issued_total`	Counter	`site_id`,`auth_method`	count
	`pangolin_token_revoked_total`	Counter	`reason`	count
	`pangolin_token_refresh_total`	Counter	`site_id`,`result`	count
UI / API	`pangolin_ui_requests_total`	Counter	`endpoint`,`method`,`status`	count
	`pangolin_ui_active_sessions`	Gauge		count
Operational	`pangolin_config_reloads_total`	Counter	`result`	count
	`pangolin_restart_count_total`	Counter		count
	`pangolin_background_jobs_total`	Counter	`job_type`,`status`	count
	`pangolin_certificates_expiry_days`	Gauge	`site_id`,`resource_id`	days

Label guidelines: prefer site_id/resource_id. Avoid per‑request unique labels (user IDs, full URLs). Use enums and stable identifiers.
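One way to enforce the label guidelines is to pass every label set through an allowlist before it reaches an instrument. This is a sketch under assumptions: `sanitizeLabels` and both constants are hypothetical, the key list is copied from the table above, and collapsing unknown HTTP methods to `OTHER` is an illustrative choice rather than something the proposal specifies.

```typescript
type Labels = Record<string, string>;

// Label keys taken from the metric table; the allowlist mechanism itself is an assumption.
const ALLOWED_LABEL_KEYS: ReadonlySet<string> = new Set([
  "site_id", "resource_id", "backend", "method", "status", "transport",
  "direction", "protocol", "reason", "result", "auth_method", "error_type",
  "region", "endpoint", "job_type",
]);

const KNOWN_HTTP_METHODS: ReadonlySet<string> = new Set([
  "GET", "HEAD", "POST", "PUT", "PATCH", "DELETE", "OPTIONS",
]);

function sanitizeLabels(raw: Record<string, string | number | undefined>): Labels {
  const out: Labels = {};
  for (const [key, value] of Object.entries(raw)) {
    // Drop unknown keys (user_id, url, ...) so they can never become label dimensions.
    if (value === undefined || !ALLOWED_LABEL_KEYS.has(key)) continue;
    let text = String(value);
    if (key === "method") {
      // Keep the method label an enum: arbitrary client-supplied verbs collapse to OTHER.
      text = text.toUpperCase();
      if (!KNOWN_HTTP_METHODS.has(text)) text = "OTHER";
    }
    out[key] = text;
  }
  return out;
}
```

For example, `sanitizeLabels({ site_id: "s1", method: "get", user_id: "42" })` keeps `site_id`, normalizes `method` to `GET`, and silently drops `user_id`.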

Implementation Plan

Dependencies (example packages)
- Add OpenTelemetry JS packages to the Node app (install via npm/yarn):
  - @opentelemetry/api
  - @opentelemetry/sdk-metrics (or current stable metrics SDK)
  - @opentelemetry/exporter-prometheus
  - @opentelemetry/exporter-metrics-otlp-http (or OTLP exporter variant)
  - @opentelemetry/instrumentation-http
  - Framework instrumentation if used (e.g., @opentelemetry/instrumentation-express, Next.js instrumentation patterns)
  - ...
Central metrics module
- Create src/metrics/ (or server/metrics/) that:
  - Initializes the OTel MeterProvider.
  - Registers the Prometheus exporter (when enabled) and exposes the exporter handler on /metrics (or mounts it on an existing server route).
  - Optionally registers the OTLP exporter when configured via env vars.
  - Exposes a singleton metrics object with helper functions:
    - inc(name, labels), observe(name, value, labels), setGauge(name, value, labels) — mapped to pre-registered instruments.
  - Provides shutdown() to flush metrics.
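The helper surface described above could look roughly like this. It is a sketch, not a definitive design: the `Metrics` class, `InstrumentFactory`, `MetricCatalog`, and `recordingFactory` are hypothetical, and the small interfaces stand in for OpenTelemetry instruments so the example runs without any packages installed. As far as I can tell, a real `Meter` from `@opentelemetry/api` fits `InstrumentFactory` structurally, but `createGauge` only exists in newer API releases, so that part should be verified against the installed version.

```typescript
type Attrs = Record<string, string>;

// Structural stand-ins: an OTel Counter exposes add(value, attributes),
// a Histogram (and a synchronous Gauge) exposes record(value, attributes).
interface CounterLike { add(value: number, attributes?: Attrs): void; }
interface RecorderLike { record(value: number, attributes?: Attrs): void; }

interface InstrumentFactory {
  createCounter(name: string): CounterLike;
  createHistogram(name: string): RecorderLike;
  createGauge(name: string): RecorderLike;
}

interface MetricCatalog { counters: string[]; histograms: string[]; gauges: string[]; }

class Metrics {
  private readonly counters = new Map<string, CounterLike>();
  private readonly histograms = new Map<string, RecorderLike>();
  private readonly gauges = new Map<string, RecorderLike>();

  constructor(factory: InstrumentFactory, catalog: MetricCatalog) {
    // Pre-register every instrument once so call sites cannot create ad-hoc metrics.
    for (const name of catalog.counters) this.counters.set(name, factory.createCounter(name));
    for (const name of catalog.histograms) this.histograms.set(name, factory.createHistogram(name));
    for (const name of catalog.gauges) this.gauges.set(name, factory.createGauge(name));
  }

  inc(name: string, labels: Attrs = {}): void {
    this.lookup(this.counters, name).add(1, labels);
  }

  observe(name: string, value: number, labels: Attrs = {}): void {
    this.lookup(this.histograms, name).record(value, labels);
  }

  setGauge(name: string, value: number, labels: Attrs = {}): void {
    this.lookup(this.gauges, name).record(value, labels);
  }

  // Fail loudly on typos instead of silently creating a new time series.
  private lookup<T>(registry: Map<string, T>, name: string): T {
    const instrument = registry.get(name);
    if (instrument === undefined) throw new Error(`metric not registered: ${name}`);
    return instrument;
  }
}

// In-memory factory for tests and local experiments; it logs calls instead of exporting.
function recordingFactory(log: string[]): InstrumentFactory {
  const make = (name: string): CounterLike & RecorderLike => ({
    add: (value, attrs) => { log.push(`${name} add ${value} ${JSON.stringify(attrs ?? {})}`); },
    record: (value, attrs) => { log.push(`${name} record ${value} ${JSON.stringify(attrs ?? {})}`); },
  });
  return { createCounter: make, createHistogram: make, createGauge: make };
}
```

In the real module the factory would be a meter obtained from the configured MeterProvider, and `shutdown()` would presumably delegate to the provider's own shutdown so pending metrics are flushed.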
Instrumentation approach
- HTTP: use @opentelemetry/instrumentation-http plus a manual wrapper to label proxied requests with site_id, resource_id, backend, etc.
- Proxy logic: instrument where Pangolin forwards requests to backends; record counts, statuses and latencies.
- Auth: instrument login/logout flows, failed attempts, active sessions gauge.
- Tunnel events: instrument connect/disconnect/reconnect and throughput/latency where Pangolin has visibility.
- Background jobs, config reloads, cert expiry checks: instrument events and counters.
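For the proxy path, the manual wrapper could be as small as a timer that is started when a request is forwarded and stopped when the backend responds. This is a sketch with hypothetical names (`startRequestTimer`, `RequestSink`, `ProxyLabels`); the sink would be backed by the `pangolin_resource_requests_total` counter and the `pangolin_resource_request_duration_seconds` histogram. The clock is injectable for testing and defaults to `Date.now`, which is not monotonic, so a production version would probably prefer a monotonic source.

```typescript
// Labels for a proxied request, matching the table (status is added on completion).
type ProxyLabels = { site_id: string; resource_id: string; backend: string; method: string };

interface RequestSink {
  count(labels: Record<string, string>): void;                       // requests_total
  duration(seconds: number, labels: Record<string, string>): void;   // request_duration_seconds
}

function startRequestTimer(
  sink: RequestSink,
  labels: ProxyLabels,
  nowMs: () => number = Date.now,
): (status: number) => void {
  const startedAt = nowMs();
  let finished = false;
  return (status: number): void => {
    if (finished) return; // guard against double reporting from error and close handlers
    finished = true;
    // Durations are reported in seconds, per the unit rules.
    const seconds = (nowMs() - startedAt) / 1000;
    // The histogram carries no status label, matching the metric table.
    sink.duration(seconds, { ...labels });
    sink.count({ ...labels, status: String(status) });
  };
}
```

A caller would do `const stop = startRequestTimer(sink, labels)` before forwarding and `stop(response.statusCode)` in the completion handler.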
Histograms & buckets
- Configure histogram buckets per spec (duration buckets and byte-size buckets).
- Use seconds for durations; bytes for sizes.
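The issue does not list concrete bucket boundaries, so the values below are illustrative only. The duration list mirrors the default buckets commonly used by Prometheus client libraries, and the size list is generated by a hypothetical `exponentialBuckets` helper. In recent OpenTelemetry JS versions, boundaries like these can be supplied through a View or through the histogram's `advice.explicitBucketBoundaries` option; which one applies depends on the SDK version in use.

```typescript
// Hypothetical helper: `count` upper bounds starting at `start`, each `factor` times the previous.
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  if (start <= 0 || factor <= 1 || count < 1) {
    throw new RangeError("start must be > 0, factor > 1 and count >= 1");
  }
  const bounds: number[] = [];
  let bound = start;
  for (let i = 0; i < count; i++) {
    bounds.push(bound);
    bound *= factor;
  }
  return bounds;
}

// Illustrative duration boundaries in seconds (5 ms to 10 s); not taken from the issue.
const DURATION_BUCKETS_SECONDS: number[] = [
  0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
];

// Illustrative size boundaries in bytes: 1 KiB up to 16 MiB in steps of 4x.
const SIZE_BUCKETS_BYTES: number[] = exponentialBuckets(1024, 4, 8);
```

Keeping the boundary lists in one module makes it easier to use the same buckets for every `_seconds` and `_bytes` histogram, which keeps quantile queries comparable across metrics.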
Exporter configuration (runtime)
- Environment variables (suggested defaults):
  - PANGOLIN_METRICS_PROMETHEUS_ENABLED=true
  - PANGOLIN_METRICS_OTLP_ENABLED=false
  - OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
  - OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
  - OTEL_SERVICE_NAME=pangolin
  - OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
  - OTEL_METRIC_EXPORT_INTERVAL (ms)
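Reading these variables in one place keeps exporter selection out of the instrumentation code. A minimal sketch follows; `loadMetricsConfig`, `MetricsConfig`, and `parseBool` are hypothetical, the two `PANGOLIN_*` defaults come from the list above, and the `http/protobuf` and 60000 ms fallbacks are my reading of the OpenTelemetry defaults and should be treated as assumptions.

```typescript
type Env = Record<string, string | undefined>;

interface MetricsConfig {
  prometheusEnabled: boolean;
  otlpEnabled: boolean;
  otlpEndpoint?: string;
  otlpProtocol: "http/protobuf" | "grpc";
  serviceName: string;
  exportIntervalMs: number;
}

function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  return ["1", "true", "yes", "on"].includes(value.trim().toLowerCase());
}

function loadMetricsConfig(env: Env): MetricsConfig {
  const protocol = env.OTEL_EXPORTER_OTLP_PROTOCOL ?? "http/protobuf";
  if (protocol !== "http/protobuf" && protocol !== "grpc") {
    throw new Error(`unsupported OTEL_EXPORTER_OTLP_PROTOCOL: ${protocol}`);
  }
  const exportIntervalMs = Number(env.OTEL_METRIC_EXPORT_INTERVAL ?? "60000");
  if (!Number.isFinite(exportIntervalMs) || exportIntervalMs <= 0) {
    throw new Error("OTEL_METRIC_EXPORT_INTERVAL must be a positive number of milliseconds");
  }
  return {
    prometheusEnabled: parseBool(env.PANGOLIN_METRICS_PROMETHEUS_ENABLED, true),
    otlpEnabled: parseBool(env.PANGOLIN_METRICS_OTLP_ENABLED, false),
    otlpEndpoint: env.OTEL_EXPORTER_OTLP_ENDPOINT, // left undefined when unset
    otlpProtocol: protocol,
    serviceName: env.OTEL_SERVICE_NAME ?? "pangolin",
    exportIntervalMs,
  };
}
```

At startup this would be called as `loadMetricsConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.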
Local testing
- Provide docker-compose.metrics.yml with:
  - Pangolin
  - OpenTelemetry Collector (example config)
  - Prometheus (scraping /metrics or Collector)
  - Grafana (optional)
- Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.
Collector example
- Include example collector.yaml demonstrating:
  - OTLP receiver
  - Transform processor to promote resource attributes (e.g., site_id, resource_id)
  - Prometheus remote_write exporter (generic endpoint)
  - Notes on name normalization and out‑of‑order ingestion if sending OTLP to Prometheus
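A starting point for such a collector.yaml might look like the fragment below. It is a sketch under assumptions: the remote_write URL and the `prometheus` hostname are placeholders, the promoted attributes are the two named above, and it presumes the contrib Collector distribution (which ships the transform processor and the prometheusremotewrite exporter). If the target is Prometheus itself, its remote-write receiver has to be enabled separately.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Copy selected resource attributes onto each data point so they become labels.
  transform:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["site_id"], resource.attributes["site_id"])
          - set(attributes["resource_id"], resource.attributes["resource_id"])
  batch: {}

exporters:
  prometheusremotewrite:
    # Placeholder endpoint; replace with the remote_write URL of the chosen backend.
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [prometheusremotewrite]
```

Placing `transform` before `batch` means attributes are promoted before data points are grouped for export.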
Documentation
- observability.md:
  - Metric catalog (name, type, labels, units, description)
  - How to enable/disable Prometheus exporter and OTLP exporter via env vars
  - How to run Docker Compose test stack
  - How to add a new metric (naming, labels, buckets)
Testing & validation
- Manual test: start the compose stack, generate traffic, curl /metrics, and verify metric names, units, labels, and histogram buckets.
- Include sample /metrics output in the PR.
- ...
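Part of the manual check could be automated by parsing the /metrics response and asserting that the expected metric names are present. The parser below is a deliberately simplified sketch: `parseExposition`, `missingMetrics`, and `Sample` are hypothetical, it ignores comment lines and timestamps, and it is not a complete implementation of the Prometheus exposition format.

```typescript
interface Sample { name: string; labels: Record<string, string>; value: number; }

// Parses sample lines such as: metric_name{key="value"} 3
function parseExposition(text: string): Sample[] {
  const samples: Sample[] = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks, HELP and TYPE lines
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(line);
    if (match === null) continue;
    const labels: Record<string, string> = {};
    const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    let pair: RegExpExecArray | null;
    while ((pair = labelPattern.exec(match[2] ?? "")) !== null) {
      labels[pair[1]] = pair[2];
    }
    samples.push({ name: match[1], labels, value: Number(match[3]) });
  }
  return samples;
}

// Returns the expected metric names that do not appear in the scrape.
// Histograms are exposed as <name>_bucket, <name>_sum and <name>_count.
function missingMetrics(samples: Sample[], expected: string[]): string[] {
  const seen = new Set(samples.map((sample) => sample.name));
  return expected.filter((name) => !seen.has(name) && !seen.has(`${name}_bucket`));
}
```

A test could fetch `/metrics` from the compose stack, pass the body to `parseExposition`, and fail when `missingMetrics` returns a non-empty list.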

References & Best Practices

Traefik – Metrics (observability) – Traefik metrics configuration and exporter options (Prometheus, OpenTelemetry). https://doc.traefik.io/traefik/reference/install-configuration/observability/metrics/
OpenTelemetry – JavaScript/TypeScript: Getting Started / Instrumentation Guide – How to instrument JavaScript/TypeScript/Node.js applications with OpenTelemetry. https://opentelemetry.io/docs/languages/js/
OpenTelemetry – JavaScript/TypeScript: Exporters – Exporter options for Node.js and browser (OTLP, Prometheus, etc.). https://opentelemetry.io/docs/languages/js/exporters/

Practical walkthroughs & blog posts

OpenTelemetry Blog – Prometheus + OpenTelemetry (2024) – Practical notes on combining Prometheus and OpenTelemetry. https://opentelemetry.io/blog/2024/prom-and-otel/
Grafana Blog – A Practical Guide to Data Collection with OpenTelemetry and Prometheus (Jul 2023) – Hands‑on examples and best practices for OTEL + Prometheus. https://grafana.com/blog/2023/07/20/a-practical-guide-to-data-collection-with-opentelemetry-and-prometheus/
BetterStack – OpenTelemetry for Node.js – Practical guide for instrumenting Node.js apps with OpenTelemetry. https://betterstack.com/community/guides/observability/opentelemetry-metrics-nodejs/
BetterStack – OpenTelemetry Metrics vs Prometheus Metrics – Comparison and guidance on when to use OTEL vs Prometheus metrics. https://betterstack.com/community/guides/observability/opentelemetry-metrics-vs-prometheus-metrics/

MrUnknownDE closed this issue

2026-04-05 18:02:51 +02:00

Reference: github/pangolin#950