[Feature Request] Implement OpenTelemetry Metrics in Pangolin #950

Closed
opened 2026-04-05 18:02:51 +02:00 by MrUnknownDE · 0 comments

Originally created by @marcschaeferger on 9/7/2025

Add OpenTelemetry-based observability to Pangolin

Summary / Goal

Instrument Pangolin with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:

  • Metrics are emitted using the OpenTelemetry JS SDK (vendor-neutral API).
  • Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
  • Semantic conventions, base units (seconds and bytes, carried in names as _seconds and _bytes), and low-cardinality labels are enforced.
  • The initial scope is metrics; the design leaves room to add traces and logs later.
  • An out-of-the-box /metrics endpoint (Prometheus exporter) is available for Prometheus scraping, together with an example OTel Collector config for production pipelines.
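
To make the /metrics goal concrete, the following sketch shows roughly what one counter looks like in the Prometheus text exposition format. In the proposed design this output would come from the OTel Prometheus exporter; the `CounterSample` type, the `renderCounter` helper, the help text, and the sample label values are hypothetical and exist only for illustration.

```typescript
// Illustration only: a real deployment would rely on the OTel Prometheus exporter.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Escape label values as the Prometheus text format requires:
// backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one counter family (HELP, TYPE, then one line per label set).
function renderCounter(metric: string, help: string, samples: CounterSample[]): string {
  const lines: string[] = [`# HELP ${metric} ${help}`, `# TYPE ${metric} counter`];
  for (const sample of samples) {
    const pairs = Object.entries(sample.labels)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([key, value]) => `${key}="${escapeLabelValue(value)}"`);
    const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
    lines.push(`${metric}${labelPart} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}

const exampleOutput = renderCounter(
  "pangolin_resource_requests_total",
  "Proxied requests handled per resource.",
  [{ labels: { site_id: "site-a", resource_id: "res-1", method: "GET", status: "200" }, value: 42 }],
);
```

Each sample line is the metric name, an optional label set in braces, and the value, which is the shape a Prometheus scrape of /metrics would parse.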

Why OpenTelemetry (OTel)

  • OpenTelemetry is a CNCF project and a widely adopted vendor-neutral standard for multi-signal observability (metrics, traces, logs).
  • Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
  • Use the OTel Collector for attribute enrichment, normalization, batching, and remote_write.

Requirements & Constraints

  • Use the OpenTelemetry JavaScript/TypeScript SDKs and official instrumentation packages.
  • Provide a /metrics endpoint in Prometheus exposition format via the OTel Prometheus exporter.
  • All durations in seconds and sizes in bytes. Metric names should carry units where applicable (_seconds, _bytes) and counters use _total.
  • Enforce low‑cardinality label design (e.g., site_id, resource_id). Do not use per‑request unique values as labels.
  • All exporters configurable at runtime through environment variables or configuration (no code change to switch exporter).
  • Provide an example OTel Collector configuration that demonstrates OTLP ingestion, attribute promotion and prometheusremotewrite usage.
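
The naming requirements in this list could be enforced with a small check at instrument registration time or in a unit test. This is a minimal sketch: `checkMetricName` is a hypothetical helper, and treating the `pangolin_` prefix as mandatory is an assumption drawn from the proposed metric names rather than a stated rule.

```typescript
type InstrumentKind = "counter" | "gauge" | "histogram";
type BaseUnit = "seconds" | "bytes" | "none";

// Returns human-readable violations; an empty array means the name passes.
function checkMetricName(metric: string, kind: InstrumentKind, unit: BaseUnit): string[] {
  const problems: string[] = [];
  // Assumption: every Pangolin metric is snake_case and starts with "pangolin_".
  if (!/^pangolin_[a-z0-9]+(_[a-z0-9]+)*$/.test(metric)) {
    problems.push("name must be snake_case and start with pangolin_");
  }
  if (kind === "counter" && !metric.endsWith("_total")) {
    problems.push("counters must end in _total");
  }
  // The unit suffix sits before _total, e.g. pangolin_site_bandwidth_bytes_total.
  const base = metric.endsWith("_total") ? metric.slice(0, -"_total".length) : metric;
  if (unit !== "none" && !base.endsWith(`_${unit}`)) {
    problems.push(`name must carry the _${unit} unit suffix`);
  }
  return problems;
}

const validNameProblems = checkMetricName("pangolin_site_bandwidth_bytes_total", "counter", "bytes");
const wrongUnitProblems = checkMetricName("pangolin_tunnel_latency_ms", "histogram", "seconds");
```

Running a check like this over the whole metric catalog in CI would catch a mis-named instrument before it ships.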

Recommended Pangolin Metrics (TypeScript implementation)

Use snake_case names and include unit and _total suffixes where applicable.

| Category | Metric Name | Type | Labels | Unit / Notes |
|---|---|---|---|---|
| Site / Global | pangolin_site_active_sites | Gauge | site_id, region | count |
| | pangolin_site_online | Gauge (0/1) | site_id, transport | bool |
| | pangolin_site_bandwidth_bytes_total | Counter | site_id, direction, protocol | bytes |
| | pangolin_site_uptime_seconds_total | Counter | site_id | seconds |
| | pangolin_site_connection_drops_total | Counter | site_id | count |
| | pangolin_site_handshake_latency_seconds | Histogram | site_id, transport | seconds |
| Resource / App | pangolin_resource_requests_total | Counter | site_id, resource_id, backend, method, status | count |
| | pangolin_resource_request_duration_seconds | Histogram | site_id, resource_id, backend, method | seconds |
| | pangolin_resource_active_connections | Gauge | site_id, resource_id, protocol | count |
| | pangolin_resource_errors_total | Counter | site_id, resource_id, backend, error_type | count |
| | pangolin_resource_bandwidth_bytes_total | Counter | site_id, resource_id, direction | bytes |
| Tunnel / Transport | pangolin_tunnel_up | Gauge (0/1) | site_id, transport | bool |
| | pangolin_tunnel_reconnects_total | Counter | site_id, transport, reason | count |
| | pangolin_tunnel_latency_seconds | Histogram | site_id, transport | seconds |
| | pangolin_tunnel_bytes_total | Counter | site_id, transport, direction | bytes |
| | pangolin_wg_handshake_total | Counter | site_id, result | count |
| Backend | pangolin_backend_health_status | Gauge (0/1) | backend, site_id | bool |
| | pangolin_backend_connection_errors_total | Counter | backend, site_id, error_type | count |
| | pangolin_backend_response_size_bytes | Histogram | backend, site_id | bytes |
| Auth / Identity | pangolin_auth_requests_total | Counter | site_id, auth_method, result | count |
| | pangolin_auth_request_duration_seconds | Histogram | auth_method, result | seconds |
| | pangolin_auth_active_users | Gauge | site_id, auth_method | count |
| | pangolin_auth_failure_reasons_total | Counter | site_id, reason, auth_method | count |
| Tokens / Sessions | pangolin_token_issued_total | Counter | site_id, auth_method | count |
| | pangolin_token_revoked_total | Counter | reason | count |
| | pangolin_token_refresh_total | Counter | site_id, result | count |
| UI / API | pangolin_ui_requests_total | Counter | endpoint, method, status | count |
| | pangolin_ui_active_sessions | Gauge | (none) | count |
| Operational | pangolin_config_reloads_total | Counter | result | count |
| | pangolin_restart_count_total | Counter | (none) | count |
| | pangolin_background_jobs_total | Counter | job_type, status | count |
| | pangolin_certificates_expiry_days | Gauge | site_id, resource_id | days |

Label guidelines: prefer site_id/resource_id. Avoid per‑request unique labels (user IDs, full URLs). Use enums and stable identifiers.
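
One way to apply these label guidelines in code is to pass every label set through an allowlist before it reaches an instrument. In this sketch the allowed keys come from the metric table, while `sanitizeLabels`, the enum values for `direction` and `result`, and the `other` fallback are hypothetical choices for illustration.

```typescript
// Label keys taken from the proposed metric table.
const ALLOWED_LABEL_KEYS = new Set<string>([
  "site_id", "resource_id", "region", "transport", "direction", "protocol",
  "backend", "method", "status", "error_type", "reason", "result",
  "auth_method", "endpoint", "job_type",
]);

// Hypothetical enum values; the real sets would be defined by Pangolin.
const ENUM_VALUES: Record<string, readonly string[]> = {
  direction: ["ingress", "egress"],
  result: ["success", "failure"],
};

// Drop keys that are not allowlisted and collapse unknown enum values to "other",
// so that a bug cannot create an unbounded number of time series.
function sanitizeLabels(labels: Record<string, string>): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [key, value] of Object.entries(labels)) {
    if (!ALLOWED_LABEL_KEYS.has(key)) continue; // e.g. user_id or a full URL
    const allowedValues = ENUM_VALUES[key];
    const isUnknownEnum = allowedValues !== undefined && !allowedValues.includes(value);
    clean[key] = isUnknownEnum ? "other" : value;
  }
  return clean;
}

const sanitizedExample = sanitizeLabels({ site_id: "s1", user_id: "u-123", direction: "sideways" });
```

Here the per-user `user_id` label is dropped and the unexpected `direction` value is folded into a single bucket.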


Implementation Plan

  1. Dependencies (example packages)

    • Add OpenTelemetry JS packages to the Node app (install via npm/yarn):
      • @opentelemetry/api
      • @opentelemetry/sdk-metrics (or current stable metrics SDK)
      • @opentelemetry/exporter-prometheus
      • @opentelemetry/exporter-metrics-otlp-http (or OTLP exporter variant)
      • @opentelemetry/instrumentation-http
      • Framework instrumentation if used (e.g., @opentelemetry/instrumentation-express, Next.js instrumentation patterns)
      • ...
  2. Central metrics module

    • Create src/metrics/ (or server/metrics/) that:
      • Initializes OTel MeterProvider.
      • Registers Prometheus exporter (when enabled) and exposes the exporter handler on /metrics (or mounts to existing server route).
      • Optionally registers OTLP exporter when configured via env vars.
      • Exposes a singleton metrics object with helper functions:
        • inc(name, labels), observe(name, value, labels), setGauge(name, value, labels) — mapped to pre-registered instruments.
      • Provides shutdown() to flush metrics.
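
The shape of the central module could look roughly like the sketch below. To stay dependency-free it does not call the OTel SDK: the `Recorder` interface is a hypothetical stand-in for the instruments a real module would create through a MeterProvider, and `MemoryRecorder` is an in-memory test double.

```typescript
type Labels = Record<string, string>;
type InstrumentKind = "counter" | "histogram" | "gauge";

// Hypothetical stand-in for OTel instruments created through a MeterProvider.
interface Recorder {
  record(metric: string, kind: InstrumentKind, value: number, labels: Labels): void;
  flush(): Promise<void>;
}

class Metrics {
  private readonly recorder: Recorder;
  private readonly instruments = new Map<string, InstrumentKind>();

  constructor(recorder: Recorder) {
    this.recorder = recorder;
  }

  // Instruments are registered once at startup so typos fail fast later.
  register(metric: string, kind: InstrumentKind): void {
    this.instruments.set(metric, kind);
  }

  inc(metric: string, labels: Labels = {}): void {
    this.emit(metric, "counter", 1, labels);
  }

  observe(metric: string, value: number, labels: Labels = {}): void {
    this.emit(metric, "histogram", value, labels);
  }

  setGauge(metric: string, value: number, labels: Labels = {}): void {
    this.emit(metric, "gauge", value, labels);
  }

  // Flush pending data before the process exits.
  shutdown(): Promise<void> {
    return this.recorder.flush();
  }

  private emit(metric: string, kind: InstrumentKind, value: number, labels: Labels): void {
    if (this.instruments.get(metric) !== kind) {
      throw new Error(`${metric} is not registered as a ${kind}`);
    }
    this.recorder.record(metric, kind, value, labels);
  }
}

// In-memory recorder, useful for unit tests of instrumented code.
class MemoryRecorder implements Recorder {
  readonly events: string[] = [];
  record(metric: string, kind: InstrumentKind, value: number): void {
    this.events.push(`${kind} ${metric} ${value}`);
  }
  flush(): Promise<void> {
    return Promise.resolve();
  }
}

const memoryRecorder = new MemoryRecorder();
const metrics = new Metrics(memoryRecorder);
metrics.register("pangolin_config_reloads_total", "counter");
metrics.inc("pangolin_config_reloads_total", { result: "success" });
```

Rejecting unregistered names keeps the set of emitted metrics identical to the documented catalog, at the cost of one registration call per metric at startup.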
  3. Instrumentation approach

    • HTTP: use @opentelemetry/instrumentation-http plus a manual wrapper to label proxied requests with site_id, resource_id, backend, etc.
    • Proxy logic: instrument where Pangolin forwards requests to backends; record counts, statuses and latencies.
    • Auth: instrument login/logout flows, failed attempts, and the active sessions gauge.
    • Tunnel events: instrument connect/disconnect/reconnect and throughput/latency where Pangolin has visibility.
    • Background jobs, config reloads, cert expiry checks: instrument events and counters.
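
For the proxy path, the manual wrapper could be as small as a start/finish pair around the forwarding call. This is a sketch under assumptions: `MetricsSink` is a hypothetical subset of the central metrics helper, `observeProxiedRequest` and the `error_type` value are made up for illustration, and the clock is injectable only to keep the example deterministic.

```typescript
type Labels = Record<string, string>;

// Hypothetical subset of the central metrics helper used by this sketch.
interface MetricsSink {
  inc(metric: string, labels: Labels): void;
  observe(metric: string, value: number, labels: Labels): void;
}

type ProxyLabels = { site_id: string; resource_id: string; backend: string; method: string };

interface ProxyRequestObserver {
  finish(statusCode: number): void;
  fail(errorType: string): void;
}

// Call when Pangolin starts forwarding a request; call finish() or fail() once it settles.
function observeProxiedRequest(
  sink: MetricsSink,
  labels: ProxyLabels,
  nowMs: () => number = Date.now,
): ProxyRequestObserver {
  const startedAt = nowMs();
  const recordDuration = (): void => {
    const seconds = (nowMs() - startedAt) / 1000; // durations are reported in seconds
    sink.observe("pangolin_resource_request_duration_seconds", seconds, labels);
  };
  return {
    finish(statusCode: number): void {
      sink.inc("pangolin_resource_requests_total", { ...labels, status: String(statusCode) });
      recordDuration();
    },
    fail(errorType: string): void {
      sink.inc("pangolin_resource_errors_total", { ...labels, error_type: errorType });
      recordDuration();
    },
  };
}
```

In async code the observer would typically be created before the forwarding call, with `finish(response.status)` on success and `fail("backend_unreachable")` (or a similar enum value) in the error path.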
  4. Histograms & buckets

    • Configure histogram buckets per spec (duration buckets and byte-size buckets).
    • Use seconds for durations; bytes for sizes.
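
The bucket boundaries themselves are not specified here, so the values below are assumptions: the duration list mirrors the default buckets commonly used by Prometheus client libraries, and the byte-size list is an illustrative exponential series. `bucketIndex` is a hypothetical helper that shows how a value maps to a bucket with inclusive upper bounds.

```typescript
// Assumed duration boundaries in seconds (same values as the usual Prometheus client defaults).
const DURATION_BUCKETS_SECONDS: readonly number[] = [
  0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
];

// Build `count` boundaries starting at `start`, each `factor` times the previous one.
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  const bounds: number[] = [];
  let bound = start;
  for (let i = 0; i < count; i++) {
    bounds.push(bound);
    bound *= factor;
  }
  return bounds;
}

// Assumed size boundaries in bytes: 1 KiB up to 16 MiB in steps of 4x.
const SIZE_BUCKETS_BYTES: readonly number[] = exponentialBuckets(1024, 4, 8);

// Index of the bucket that counts `value`; boundaries are inclusive upper bounds.
// A result equal to boundaries.length means the overflow (+Inf) bucket.
function bucketIndex(boundaries: readonly number[], value: number): number {
  const index = boundaries.findIndex((upperBound) => value <= upperBound);
  return index === -1 ? boundaries.length : index;
}
```

With the OTel SDK, boundaries like these would be supplied when the histogram is configured rather than computed by hand; the helper only illustrates the semantics.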
  5. Exporter configuration (runtime)

    • Environment variables (suggested defaults):
      • PANGOLIN_METRICS_PROMETHEUS_ENABLED=true
      • PANGOLIN_METRICS_OTLP_ENABLED=false
      • OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
      • OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
      • OTEL_SERVICE_NAME=pangolin
      • OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
      • OTEL_METRIC_EXPORT_INTERVAL (ms)
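
A sketch of how these variables might be read into a typed config object follows. `loadMetricsConfig` and `MetricsConfig` are hypothetical names; the fallbacks for the protocol (`http/protobuf`) and the export interval (60000 ms) are assumptions chosen to match the usual OpenTelemetry defaults. The function takes the environment as a parameter so it can be tested without touching the real process environment.

```typescript
type Env = Record<string, string | undefined>;

interface MetricsConfig {
  prometheusEnabled: boolean;
  otlpEnabled: boolean;
  otlpEndpoint: string | undefined;
  otlpProtocol: "http/protobuf" | "grpc";
  serviceName: string;
  exportIntervalMs: number;
}

// Accept common truthy spellings; fall back when the variable is unset or empty.
function parseBool(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === "") return fallback;
  return ["1", "true", "yes", "on"].includes(raw.trim().toLowerCase());
}

function loadMetricsConfig(env: Env): MetricsConfig {
  const interval = Number(env["OTEL_METRIC_EXPORT_INTERVAL"]);
  return {
    prometheusEnabled: parseBool(env["PANGOLIN_METRICS_PROMETHEUS_ENABLED"], true),
    otlpEnabled: parseBool(env["PANGOLIN_METRICS_OTLP_ENABLED"], false),
    otlpEndpoint: env["OTEL_EXPORTER_OTLP_ENDPOINT"],
    otlpProtocol: env["OTEL_EXPORTER_OTLP_PROTOCOL"] === "grpc" ? "grpc" : "http/protobuf",
    serviceName: env["OTEL_SERVICE_NAME"] ?? "pangolin",
    // Ignore missing, non-numeric, or non-positive values.
    exportIntervalMs: Number.isFinite(interval) && interval > 0 ? interval : 60000,
  };
}
```

In the application this would be called once at startup with the process environment, and the result would decide which exporters the metrics module registers.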
  6. Local testing

    • Provide docker-compose.metrics.yml with:
      • Pangolin
      • OpenTelemetry Collector (example config)
      • Prometheus (scraping /metrics or Collector)
      • Grafana (optional)
    • Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.
  7. Collector example

    • Include an example collector.yaml demonstrating:
      • OTLP receiver
      • Transform processor to promote resource attributes (e.g., site_id, resource_id)
      • Prometheus remote_write exporter (generic endpoint)
      • Notes on name normalization and out‑of‑order ingestion if sending OTLP to Prometheus
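
A starting point for collector.yaml might look like the fragment below. It assumes the Collector contrib distribution (which ships the transform processor and the prometheusremotewrite exporter), assumes that site_id and resource_id arrive as resource attributes, and uses a placeholder remote_write URL that must be replaced.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}
  transform:
    metric_statements:
      - context: datapoint
        statements:
          # Promote selected resource attributes to data point attributes (labels).
          - set(attributes["site_id"], resource.attributes["site_id"]) where resource.attributes["site_id"] != nil
          - set(attributes["resource_id"], resource.attributes["resource_id"]) where resource.attributes["resource_id"] != nil

exporters:
  prometheusremotewrite:
    # Placeholder endpoint; point this at the remote_write URL of your backend.
    endpoint: https://prometheus.example.com/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [prometheusremotewrite]
```

Running the transform before batching keeps the promoted labels on every data point that reaches the exporter.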
  8. Documentation

    • observability.md:
      • Metric catalog (name, type, labels, units, description)
      • How to enable/disable Prometheus exporter and OTLP exporter via env vars
      • How to run Docker Compose test stack
      • How to add a new metric (naming, labels, buckets)
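
The metric catalog in observability.md could be generated from a typed list so that the documentation and the code cannot drift apart. This is only a sketch: `CatalogEntry` and `renderCatalog` are hypothetical, and the descriptions are illustrative wording rather than agreed text.

```typescript
interface CatalogEntry {
  metric: string;
  type: "Counter" | "Gauge" | "Histogram";
  labels: string[];
  unit: string;
  description: string; // illustrative wording only
}

// Render the catalog as a markdown table, sorted by metric name.
function renderCatalog(entries: CatalogEntry[]): string {
  const header = [
    "| Metric Name | Type | Labels | Unit | Description |",
    "|---|---|---|---|---|",
  ];
  const rows = [...entries]
    .sort((a, b) => a.metric.localeCompare(b.metric))
    .map((entry) => {
      const labels =
        entry.labels.length > 0 ? entry.labels.map((label) => `\`${label}\``).join(", ") : "(none)";
      return `| \`${entry.metric}\` | ${entry.type} | ${labels} | ${entry.unit} | ${entry.description} |`;
    });
  return [...header, ...rows].join("\n");
}

const catalogMarkdown = renderCatalog([
  {
    metric: "pangolin_tunnel_up",
    type: "Gauge",
    labels: ["site_id", "transport"],
    unit: "bool",
    description: "1 when the tunnel is up, 0 otherwise.",
  },
  {
    metric: "pangolin_restart_count_total",
    type: "Counter",
    labels: [],
    unit: "count",
    description: "Process restarts.",
  },
]);
```

The same list could also drive instrument registration, which would make "how to add a new metric" a single-entry change.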
  9. Testing & validation

    • Manual test: start compose, generate traffic, curl /metrics, and verify metric names, units, labels and histogram buckets.
    • Include sample /metrics output in the PR.
    • ...
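
Part of the manual check could be automated by scanning the scraped text for names that break the conventions above. In this sketch `findNamingViolations` is a hypothetical helper, the list of rejected unit suffixes is an assumption, and the sample scrape is invented purely to exercise the checker.

```typescript
// Scan Prometheus exposition text and report metric names that break the naming rules.
function findNamingViolations(exposition: string): string[] {
  const violations: string[] = [];
  for (const rawLine of exposition.split("\n")) {
    const match = /^# TYPE (\S+) (\S+)$/.exec(rawLine.trim());
    if (match === null) continue;
    const [, metric = "", metricType = ""] = match;
    if (!metric.startsWith("pangolin_")) continue; // ignore runtime or exporter metrics
    if (metricType === "counter" && !metric.endsWith("_total")) {
      violations.push(`${metric}: counter without _total suffix`);
    }
    // Assumed list of scaled-unit suffixes that should not appear.
    if (/_(ms|milliseconds|kb|mb)(_|$)/.test(metric)) {
      violations.push(`${metric}: use seconds or bytes instead of a scaled unit`);
    }
  }
  return violations;
}

// Invented scrape output, not real Pangolin metrics.
const sampleScrape = [
  "# TYPE pangolin_resource_requests_total counter",
  'pangolin_resource_requests_total{site_id="s1",status="200"} 3',
  "# TYPE pangolin_tunnel_latency_ms histogram",
  "# TYPE pangolin_auth_requests counter",
].join("\n");
const scrapeViolations = findNamingViolations(sampleScrape);
```

Feeding the output of `curl /metrics` from the compose stack into a check like this would turn the naming review into a repeatable test.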

References & Best Practices

  • Traefik, Metrics (observability) (https://doc.traefik.io/traefik/reference/install-configuration/observability/metrics/): Traefik metrics configuration and exporter options (Prometheus, OpenTelemetry).
  • OpenTelemetry, JavaScript/TypeScript: Getting Started / Instrumentation Guide (https://opentelemetry.io/docs/languages/js/): how to instrument JavaScript/TypeScript/Node.js applications with OpenTelemetry.
  • OpenTelemetry, JavaScript/TypeScript: Exporters (https://opentelemetry.io/docs/languages/js/exporters/): exporter options for Node.js and browser (OTLP, Prometheus, etc.).

Practical walkthroughs & blog posts

  • OpenTelemetry Blog, Prometheus + OpenTelemetry (2024) (https://opentelemetry.io/blog/2024/prom-and-otel/): practical notes on combining Prometheus and OpenTelemetry.
  • Grafana Blog, A Practical Guide to Data Collection with OpenTelemetry and Prometheus (Jul 2023) (https://grafana.com/blog/2023/07/20/a-practical-guide-to-data-collection-with-opentelemetry-and-prometheus/): hands-on examples and best practices for OTel + Prometheus.
  • BetterStack, OpenTelemetry for Node.js (https://betterstack.com/community/guides/observability/opentelemetry-metrics-nodejs/): practical guide for instrumenting Node.js apps with OpenTelemetry.
  • BetterStack, OpenTelemetry Metrics vs Prometheus Metrics (https://betterstack.com/community/guides/observability/opentelemetry-metrics-vs-prometheus-metrics/): comparison and guidance on when to use OTel vs Prometheus metrics.

Reference: github/pangolin#950