Plan: Bring OneUptime Metrics to Industry Parity and Beyond
Context
OneUptime's metrics implementation provides OTLP ingestion (HTTP and gRPC), ClickHouse storage with support for Gauge, Sum, Histogram, and ExponentialHistogram metric types, basic aggregations (Avg, Sum, Min, Max, Count), single-attribute GROUP BY, formula support for calculated metrics, threshold-based metric monitors, and a metric explorer with line/bar charts. Auto-discovery creates MetricType metadata (name, description, unit) on first ingest. Data retention is configured per service via TTL (default 15 days).
This plan identifies the remaining gaps vs DataDog and New Relic, and proposes a phased implementation to close them and build a best-in-class metrics product.
Completed
The following features have been implemented:
- OTLP Ingestion - HTTP and gRPC metric ingestion with async queue-based batch processing
- Metric Types - Gauge, Sum, Histogram, ExponentialHistogram support
- ClickHouse Storage - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
- Aggregations - Avg, Sum, Min, Max, Count
- Single-Attribute GROUP BY - Group by one attribute at a time
- Formulas - Calculated metrics using aliases (e.g., `a / b * 100`)
- Metric Explorer - Time range selection, multiple queries with aliases, URL state persistence
- Threshold-Based Monitors - Static threshold alerting on aggregated metric values
- MetricType Auto-Discovery - Name, description, unit captured on first ingest
- Attribute Storage - Full JSON with extracted `attributeKeys` array for fast enumeration
- BloomFilter index on `name`, Set index on `serviceType`
- Phase 2.3 - Render Metric Units on Charts (MetricType.unit passed through to chart Y-axis and tooltip formatting)
- S.1 - Fix Sort Key Order (sort key changed to `projectId, name, serviceId, time` for optimal metric name filtering)
- S.2 - Upgrade time to DateTime64 (time column uses `TableColumnType.DateTime64` for sub-second precision)
- S.3 - Add Skip Index on metricPointType (Set skip index `idx_metric_point_type` added to Metric model)
Gap Analysis Summary
| Feature | OneUptime | DataDog | New Relic | Priority |
|---|---|---|---|---|
| Percentile aggregations (p50/p75/p90/p95/p99) | None | DDSketch distributions | NRQL percentile() | P0 |
| Rate/derivative calculations | None | Native Rate type + .as_rate() | rate() NRQL function | P0 |
| Multi-attribute GROUP BY | Single attribute only | Multiple tags | FACET on multiple attrs | P0 |
| Rollup/downsampling for long-range queries | None (raw data, 15-day TTL) | Automatic tiered rollups | 30-day raw + 13-month rollups | P0 |
| Anomaly detection | Static thresholds only | Watchdog + anomaly monitors | Anomaly detection + sigma bands | P1 |
| SLO/SLI tracking | None | Metric-based + Time Slice SLOs | One-click setup + error budgets | P1 |
| Heatmap visualization | None | Purpose-built for distributions | Built-in chart type | P1 |
| Time-over-time comparison | None | Yes | COMPARE WITH in NRQL | P1 |
| Summary metric type | Not supported | N/A (uses Distribution) | Yes | P1 |
| Query language | Form-based UI only | Graphing editor + NLQ | Full NRQL language | P2 |
| Predictive alerting | None | Watchdog forecasting | GA predictive alerting | P2 |
| Metric correlations | None | Auto-surfaces related metrics | Applied Intelligence correlation | P2 |
| Golden Signals dashboards | None | Available via APM | Pre-built with default alerts | P2 |
| Cardinality management | None | Metrics Without Limits + Explorer | Budget system + pruning rules | P2 |
| More chart types | Line and bar only | 12+ types | 10+ types with conditional coloring | P2 |
| Dashboard templates | None | Pre-built integration dashboards | Pre-built entity dashboards | P2 |
| Natural language querying | None | NLQ translates English to queries | None | P3 |
| Metric cost/volume management | None | Cost attribution dashboards | Volume dashboards | P3 |
Phase 1: Foundation (P0) — Close Critical Gaps
These are table-stakes features without which the metrics product is fundamentally limited.
1.1 Percentile Aggregations (p50, p75, p90, p95, p99)
Current: Only Avg, Sum, Min, Max, Count aggregations.
Target: Support percentile aggregations on all metric data, especially histograms and distributions.
Implementation:
- Add `P50`, `P75`, `P90`, `P95`, `P99` to the `AggregationType` enum
- For raw metric values: use ClickHouse `quantile(0.50)(value)`, `quantile(0.95)(value)`, etc.
- For histogram data (with `bucketCounts` and `explicitBounds`): implement approximate percentile calculation from bucket data using linear interpolation between bucket boundaries
- Update the metric query builder to include percentile options in the aggregation dropdown
- Update chart rendering to display percentile series
Files to modify:
- `Common/Types/BaseDatabase/AggregationType.ts` (add P50, P75, P90, P95, P99)
- `Common/Server/Services/MetricService.ts` (generate quantile SQL)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (add to dropdown)
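The histogram step in 1.1 (percentiles from bucket data via linear interpolation) can be sketched as follows. This is a minimal illustration, not the planned implementation: `HistogramPoint` and `estimateQuantile` are hypothetical names, and the sketch assumes non-negative values and clamps the open-ended last bucket to its lower bound.

```typescript
// Hypothetical shape of one histogram data point. The field names follow the
// bucketCounts / explicitBounds columns mentioned above.
interface HistogramPoint {
  bucketCounts: number[]; // one entry per bucket: explicitBounds.length + 1
  explicitBounds: number[]; // sorted upper bounds of every bucket but the last
}

// Estimate the q-quantile (0 <= q <= 1) by finding the bucket that contains
// the target rank, then interpolating linearly between its boundaries.
function estimateQuantile(point: HistogramPoint, q: number): number | null {
  const total = point.bucketCounts.reduce((sum, count) => sum + count, 0);
  if (total === 0 || q < 0 || q > 1) {
    return null;
  }
  const bounds = point.explicitBounds;
  const targetRank = q * total;
  let cumulative = 0;
  for (let i = 0; i < point.bucketCounts.length; i++) {
    const count = point.bucketCounts[i] ?? 0;
    if (count > 0 && cumulative + count >= targetRank) {
      // Assumption: values are non-negative, so the first bucket starts at 0.
      const lower = i === 0 ? 0 : (bounds[i - 1] ?? 0);
      // Assumption: the open-ended last bucket is clamped to its lower bound.
      const upper = bounds[i] ?? lower;
      const fraction = (targetRank - cumulative) / count;
      return lower + (upper - lower) * fraction;
    }
    cumulative += count;
  }
  return null;
}
```

The accuracy of this estimate depends on bucket width, so results for coarse `explicitBounds` should be treated as approximate.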
1.2 Rate/Derivative Calculations
Current: No rate or delta computation. Raw cumulative counters are meaningless without rate calculation.
Target: Compute per-second rates and deltas from counter/sum metrics.
Implementation:
- Add `Rate` and `Delta` as aggregation options
- For cumulative sums: compute `(value_t - value_t-1) / (time_t - time_t-1)` using ClickHouse `runningDifference()`
- Handle counter resets (when value decreases, treat as reset and skip that interval)
- For delta temporality sums: rate is simply `value / interval_seconds`
- Display rate with appropriate units (e.g., "req/s", "bytes/s")
Files to modify:
- `Common/Types/BaseDatabase/AggregationType.ts` (add Rate, Delta)
- `Common/Server/Services/MetricService.ts` (generate rate SQL with runningDifference)
- `Common/Types/Metrics/MetricsQuery.ts` (support rate in query config)
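The rate formula and reset rule from 1.2 can be illustrated outside ClickHouse with a small sketch. `CounterSample`, `RatePoint`, and `cumulativeToRate` are hypothetical names; the plan itself proposes doing this in SQL, so this only shows the intended arithmetic.

```typescript
// One observation of a cumulative counter (hypothetical shape).
interface CounterSample {
  timeMs: number;
  value: number;
}

interface RatePoint {
  timeMs: number;
  perSecond: number;
}

// Convert consecutive cumulative samples into per-second rates.
// Intervals where the counter decreased are treated as resets and skipped.
function cumulativeToRate(samples: CounterSample[]): RatePoint[] {
  const rates: RatePoint[] = [];
  for (let i = 1; i < samples.length; i++) {
    const previous = samples[i - 1];
    const current = samples[i];
    if (!previous || !current) {
      continue;
    }
    const seconds = (current.timeMs - previous.timeMs) / 1000;
    const delta = current.value - previous.value;
    // Skip non-increasing timestamps and counter resets.
    if (seconds <= 0 || delta < 0) {
      continue;
    }
    rates.push({ timeMs: current.timeMs, perSecond: delta / seconds });
  }
  return rates;
}
```

Samples are assumed to be sorted by time and to belong to a single series; mixing series would produce spurious resets.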
1.3 Multi-Attribute GROUP BY
Current: Single groupByAttribute: string field.
Target: Group by multiple attributes simultaneously (e.g., by region AND status_code).
Implementation:
- Change `groupByAttribute` from `string` to `string[]` in `MetricsQuery`
- Update ClickHouse query generation to GROUP BY multiple extracted JSON attributes
- Update chart rendering to handle multi-dimensional grouping (composite legend labels)
- Update the UI to allow selecting multiple group-by attributes
Files to modify:
- `Common/Types/Metrics/MetricsQuery.ts` (change type)
- `Common/Server/Services/MetricService.ts` (update query generation)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (multi-select UI)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (composite legends)
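To make the "composite legend labels" idea in 1.3 concrete, here is a small sketch that groups rows by several attributes at once. The types `MetricAttributes` and `MetricRow` and both function names are assumptions for illustration, as is the `(unset)` placeholder for missing attributes.

```typescript
type MetricAttributes = Record<string, string | number | boolean>;

// Hypothetical row shape returned by a metric query.
interface MetricRow {
  value: number;
  attributes: MetricAttributes;
}

// Build one legend label from all group-by attributes, in the order given.
function compositeLabel(attributes: MetricAttributes, groupBy: string[]): string {
  return groupBy
    .map((key) => {
      const raw = attributes[key];
      return `${key}=${raw === undefined ? "(unset)" : String(raw)}`;
    })
    .join(", ");
}

// Bucket rows by their composite label so each label becomes one chart series.
function groupRows(rows: MetricRow[], groupBy: string[]): Record<string, MetricRow[]> {
  const groups: Record<string, MetricRow[]> = {};
  for (const row of rows) {
    const label = compositeLabel(row.attributes, groupBy);
    const bucket = groups[label] ?? [];
    bucket.push(row);
    groups[label] = bucket;
  }
  return groups;
}
```

Grouping by `["region", "status_code"]` would yield labels such as `region=us, status_code=500`, one per series.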
1.4 Rollups / Downsampling
Current: Raw data only with 15-day default TTL. No rollups means long-range queries are slow and historical analysis is limited.
Target: Pre-aggregated rollups at multiple resolutions with extended retention.
Implementation:
- Create ClickHouse materialized views for automatic rollup:

      Raw Data (1s resolution) -> 15-day retention
        |-> Materialized View -> 1-min rollups -> 90-day retention
        |-> Materialized View -> 1-hour rollups -> 13-month retention
        |-> Materialized View -> 1-day rollups -> 3-year retention

- Each rollup table stores: min, max, sum, count, avg, and quantile sketches per metric name + attributes
- Route queries based on time range:
  - < 6 hours: raw data
  - 6 hours - 7 days: 1-min rollups
  - 7 days - 30 days: 1-hour rollups
  - 30+ days: 1-day rollups
- Automatic query routing in the metric service layer
Files to modify:
- `Common/Models/AnalyticsModels/MetricRollup1Min.ts` (new)
- `Common/Models/AnalyticsModels/MetricRollup1Hour.ts` (new)
- `Common/Models/AnalyticsModels/MetricRollup1Day.ts` (new)
- `Common/Server/Services/MetricService.ts` (query routing by time range)
- `Worker/DataMigrations/` (new migration to create materialized views)
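The routing rule in 1.4 reduces to a lookup on the query's time span. A minimal sketch follows; `MetricSource` and `selectMetricSource` are hypothetical names, and because the plan does not say which tier owns the exact boundaries (for example, exactly 7 days), the sketch assigns each boundary to the coarser tier.

```typescript
// Hypothetical identifiers for the raw table and the three rollup tables.
type MetricSource = "raw" | "rollup_1min" | "rollup_1hour" | "rollup_1day";

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Pick the coarsest source that still gives reasonable resolution for the span.
function selectMetricSource(start: Date, end: Date): MetricSource {
  const spanMs = end.getTime() - start.getTime();
  if (spanMs < 6 * HOUR_MS) {
    return "raw";
  }
  if (spanMs < 7 * DAY_MS) {
    return "rollup_1min";
  }
  if (spanMs < 30 * DAY_MS) {
    return "rollup_1hour";
  }
  return "rollup_1day";
}
```

A real router would likely also need to check that the chosen tier's retention still covers the requested start time, which this sketch leaves out.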
Phase 2: Visualization & UX (P1) — Match Industry Standard
2.1 More Chart Types
Current: Line and bar charts only.
Target: Add Heatmap, Stacked Area, Pie/Donut, Scatter, Single-Value Billboard, and Gauge.
Implementation:
- Heatmap: Essential for histogram/distribution data. Use a heatmap library that renders time on X-axis, bucket values on Y-axis, and color intensity for count
- Stacked Area: Extension of existing line chart with fill and stacking
- Pie/Donut: For showing proportional breakdowns (e.g., request distribution by service)
- Scatter: For correlation analysis between two metrics
- Billboard: Large single-value display with configurable thresholds for color coding (green/yellow/red)
- Gauge: Circular gauge showing a value against a min/max range
Files to modify:
- `Common/Types/Dashboard/Chart/ChartType.ts` (add new chart types)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render new chart types)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricCharts.tsx` (chart type selector)
2.2 Time-Over-Time Comparison
Current: No comparison capability.
Target: Overlay current metric data with data from a previous period (1h ago, 1d ago, 1w ago).
Implementation:
- Add a "Compare with" dropdown in the metric explorer toolbar (options: 1 hour ago, 1 day ago, 1 week ago, custom)
- Execute the same query twice with shifted time ranges
- Render the comparison series as a dashed/translucent overlay on the same chart
- Show the delta (absolute and percentage) in tooltips
Files to modify:
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricExplorer.tsx` (add compare dropdown)
- `Common/Types/Metrics/MetricsQuery.ts` (add compareWith field)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render comparison series)
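The two mechanical pieces of 2.2, shifting the time range for the second query and computing the tooltip delta, might look like the sketch below. `QueryTimeRange`, `CompareWith`, and both functions are illustrative names, not existing OneUptime types, and the custom offset option is omitted.

```typescript
interface QueryTimeRange {
  start: Date;
  end: Date;
}

// Offsets for the preset "Compare with" options, in milliseconds.
const COMPARE_OFFSETS_MS = {
  "1h": 60 * 60 * 1000,
  "1d": 24 * 60 * 60 * 1000,
  "1w": 7 * 24 * 60 * 60 * 1000,
} as const;

type CompareWith = keyof typeof COMPARE_OFFSETS_MS;

// Move the whole range back in time so the same query can be run again.
function shiftTimeRange(range: QueryTimeRange, compareWith: CompareWith): QueryTimeRange {
  const offset = COMPARE_OFFSETS_MS[compareWith];
  return {
    start: new Date(range.start.getTime() - offset),
    end: new Date(range.end.getTime() - offset),
  };
}

// Absolute and percentage change for the tooltip.
// percent is null when the previous value is 0, since the ratio is undefined.
function computeDelta(current: number, previous: number): { absolute: number; percent: number | null } {
  const absolute = current - previous;
  const percent = previous === 0 ? null : (absolute / Math.abs(previous)) * 100;
  return { absolute, percent };
}
```

When rendering, the comparison series timestamps would need to be shifted forward by the same offset so both series share one X-axis.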
2.4 Dashboard Templates
Current: No templates.
Target: Pre-built dashboards for common scenarios that auto-populate based on detected metrics.
Implementation:
- Create MetricsViewConfig templates for:
  - HTTP Service Health (request rate, error rate, latency percentiles)
  - Database Performance (query duration, connection pool, error rate)
  - Kubernetes Metrics (CPU, memory, pod restarts, network)
  - Host Metrics (CPU, memory, disk, network)
  - Runtime Metrics (GC, heap, threads - per language)
- Auto-detect which templates are relevant based on ingested metric names
- "One-click apply" creates a dashboard from the template
Files to modify:
- `Common/Types/Metrics/DashboardTemplates/` (new directory with template definitions)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (new - template gallery)
2.5 Summary Metric Type Support
Current: Summary type not supported.
Target: Ingest and store Summary metrics from OTLP.
Implementation:
- Add `Summary` to the metric point type enum
- Store quantile values from summary data points
- Display summary quantiles in the metric explorer
Files to modify:
- `Telemetry/Services/OtelMetricsIngestService.ts` (handle summary type)
- `Common/Models/AnalyticsModels/Metric.ts` (add summary-specific columns if needed)
Phase 3: Alerting & Intelligence (P1-P2) — Smart Monitoring
3.1 Anomaly Detection
Current: Static threshold alerting only.
Target: Detect metrics deviating from expected patterns using statistical methods.
Implementation:
- Start with rolling mean + N standard deviations (configurable sensitivity: low/medium/high)
- Account for daily/weekly seasonality by comparing to same-time-last-week baselines
- Store baselines in ClickHouse (periodic computation job, hourly)
- Baseline table: metric name, service, hour_of_week, mean, stddev
- On each evaluation: compare current value to baseline, alert if deviation > configured sigma
- Surface anomalies as visual highlights on metric charts (shaded band showing expected range)
Files to modify:
- `Common/Models/AnalyticsModels/MetricBaseline.ts` (new - baseline storage)
- `Worker/Jobs/Metrics/ComputeMetricBaselines.ts` (new - periodic baseline computation)
- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add anomaly detection)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render anomaly bands)
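The evaluation step in 3.1 (compare the current value to an hour-of-week baseline and flag deviations beyond N sigma) can be sketched as below. All names are hypothetical, and the mapping from sensitivity to sigma (4, 3, 2) is an assumed placeholder rather than a value from the plan.

```typescript
// One baseline row, mirroring the proposed table: hour_of_week, mean, stddev.
interface MetricBaselineRow {
  hourOfWeek: number; // 0..167, where 0 is Sunday 00:00 UTC
  mean: number;
  stddev: number;
}

type Sensitivity = "low" | "medium" | "high";

// Assumed sigma multipliers; higher sensitivity means a narrower expected band.
const SIGMA_BY_SENSITIVITY: Record<Sensitivity, number> = { low: 4, medium: 3, high: 2 };

// Map a timestamp to its slot in the weekly seasonality cycle (UTC).
function hourOfWeek(time: Date): number {
  return time.getUTCDay() * 24 + time.getUTCHours();
}

interface AnomalyResult {
  isAnomaly: boolean;
  lowerBound: number; // bottom of the expected band, usable for chart shading
  upperBound: number; // top of the expected band
}

function evaluateAnomaly(value: number, baseline: MetricBaselineRow, sensitivity: Sensitivity): AnomalyResult {
  const sigma = SIGMA_BY_SENSITIVITY[sensitivity];
  const lowerBound = baseline.mean - sigma * baseline.stddev;
  const upperBound = baseline.mean + sigma * baseline.stddev;
  return { isAnomaly: value < lowerBound || value > upperBound, lowerBound, upperBound };
}
```

The returned bounds double as the shaded expected-range band on charts, so the alerting and visualization paths can share one computation.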
3.2 SLO/SLI Tracking
Current: No SLO support.
Target: Define Service Level Objectives based on metric queries, track attainment over rolling windows, calculate error budgets.
Implementation:
- Create `SLO` PostgreSQL model:
  - Name, description, target percentage (e.g., 99.9%)
  - SLI definition: good events query / total events query (both metric queries)
  - Time window: 7-day, 28-day, or 30-day rolling
  - Alert thresholds: error budget remaining %, burn rate
- SLO dashboard page showing:
  - Current attainment vs target (e.g., 99.85% / 99.9%)
  - Error budget remaining (absolute and percentage)
  - Burn rate chart (current burn rate vs sustainable burn rate)
  - SLI time series chart
- Alert when error budget drops below threshold or burn rate exceeds sustainable rate
- Integrate with existing monitor/incident system
Files to modify:
- `Common/Models/DatabaseModels/SLO.ts` (new)
- `Common/Server/Services/SLOService.ts` (new - SLI computation, budget calculation)
- `Worker/Jobs/SLO/EvaluateSLOs.ts` (new - periodic SLO evaluation)
- `App/FeatureSet/Dashboard/src/Pages/SLO/` (new - SLO list, detail, creation pages)
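The attainment, error budget, and burn rate figures listed in 3.2 follow from simple ratios. The sketch below uses hypothetical names (`SloStatus`, `evaluateSlo`) and assumes the good/total counts cover the full SLO window, in which case the burn rate equals the fraction of budget consumed and a sustainable burn rate is 1.

```typescript
interface SloStatus {
  attainment: number; // good / total, e.g. 0.9985
  burnRate: number; // observed error rate divided by the allowed error rate
  errorBudgetRemaining: number; // fraction of budget left; negative when overspent
}

// target is a fraction strictly between 0 and 1, e.g. 0.999 for 99.9%.
function evaluateSlo(goodEvents: number, totalEvents: number, target: number): SloStatus | null {
  if (totalEvents <= 0 || target <= 0 || target >= 1) {
    return null;
  }
  const attainment = goodEvents / totalEvents;
  const allowedErrorRate = 1 - target;
  const observedErrorRate = 1 - attainment;
  const burnRate = observedErrorRate / allowedErrorRate;
  return { attainment, burnRate, errorBudgetRemaining: 1 - burnRate };
}
```

With the plan's example of 99.85% attainment against a 99.9% target, the observed error rate is 1.5 times the allowed rate, so the budget is overspent by roughly half.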
3.3 Metric Correlations
Current: No correlation capability.
Target: When an anomaly is detected, automatically identify other metrics that changed around the same time.
Implementation:
- When an anomaly is detected on a metric, query all metrics for the same service/project in the surrounding time window (e.g., +/- 30 minutes)
- Compute Pearson correlation coefficient between the anomalous metric and each candidate
- Rank by absolute correlation value
- Surface top 5-10 correlated metrics in the alert/incident view
- Show correlation chart: anomalous metric overlaid with top correlated metrics
Files to modify:
- `Common/Server/Services/MetricCorrelationService.ts` (new)
- `App/FeatureSet/Dashboard/src/Components/Metrics/CorrelatedMetrics.tsx` (new - correlation view)
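The ranking step in 3.3 can be sketched as a Pearson correlation followed by a sort on absolute value. `pearson`, `rankCorrelations`, and `CorrelatedMetric` are illustrative names; the sketch assumes all series are already aligned to the same timestamps, which a real service would have to ensure first.

```typescript
// Pearson correlation coefficient of two equally spaced, aligned series.
// Returns null when it is undefined (fewer than 2 points or a constant series).
function pearson(xs: number[], ys: number[]): number | null {
  const n = Math.min(xs.length, ys.length);
  if (n < 2) {
    return null;
  }
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i] ?? 0;
    sumY += ys[i] ?? 0;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = (xs[i] ?? 0) - meanX;
    const dy = (ys[i] ?? 0) - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  if (varianceX === 0 || varianceY === 0) {
    return null;
  }
  return covariance / Math.sqrt(varianceX * varianceY);
}

interface CorrelatedMetric {
  metricName: string;
  correlation: number;
}

// Rank candidate series by absolute correlation with the anomalous series.
function rankCorrelations(anomalous: number[], candidates: Record<string, number[]>, topN: number): CorrelatedMetric[] {
  const ranked: CorrelatedMetric[] = [];
  for (const metricName of Object.keys(candidates)) {
    const series = candidates[metricName];
    const correlation = series ? pearson(anomalous, series) : null;
    if (correlation !== null) {
      ranked.push({ metricName, correlation });
    }
  }
  ranked.sort((a, b) => Math.abs(b.correlation) - Math.abs(a.correlation));
  return ranked.slice(0, topN);
}
```

Ranking by absolute value keeps strongly inverse relationships (for example, throughput dropping while latency climbs) near the top of the list.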
Phase 4: Scale & Power Features (P2-P3) — Differentiation
4.1 Cardinality Management
Current: No cardinality visibility or controls.
Target: Track unique series count, alert on spikes, allow attribute allowlist/blocklist.
Implementation:
- Track unique series count per metric name (via periodic ClickHouse `uniq()` queries)
- Store in a dedicated cardinality tracking table
- Dashboard showing: top metrics by cardinality, cardinality trend over time, per-attribute breakdown
- Allow configuring attribute allowlists/blocklists per metric (applied at ingest time)
- Alert when cardinality exceeds configured budget
Files to modify:
- `Worker/Jobs/Metrics/TrackMetricCardinality.ts` (new - periodic cardinality computation)
- `Common/Models/DatabaseModels/MetricCardinalityConfig.ts` (new - allowlist/blocklist)
- `Telemetry/Services/OtelMetricsIngestService.ts` (apply attribute filtering)
- `App/FeatureSet/Dashboard/src/Pages/Settings/MetricCardinality.tsx` (new - cardinality dashboard)
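The ingest-time filtering in 4.1 and its effect on series count can be illustrated with a short sketch. `AttributeFilterConfig`, `filterAttributes`, and `countUniqueSeries` are hypothetical names, and the precedence rules (no allowlist means allow everything; the blocklist always wins) are assumptions the plan does not spell out.

```typescript
// Hypothetical per-metric configuration.
interface AttributeFilterConfig {
  allowlist?: string[];
  blocklist?: string[];
}

// Keep only attributes that pass the allowlist and are not blocklisted.
function filterAttributes(attributes: Record<string, string>, config: AttributeFilterConfig): Record<string, string> {
  const filtered: Record<string, string> = {};
  for (const key of Object.keys(attributes)) {
    const value = attributes[key];
    if (value === undefined) {
      continue;
    }
    const allowed = !config.allowlist || config.allowlist.indexOf(key) !== -1;
    const blocked = config.blocklist ? config.blocklist.indexOf(key) !== -1 : false;
    if (allowed && !blocked) {
      filtered[key] = value;
    }
  }
  return filtered;
}

// Count distinct attribute combinations, i.e. the number of unique series.
function countUniqueSeries(points: Array<Record<string, string>>): number {
  const seen: Record<string, true> = {};
  for (const attributes of points) {
    const seriesKey = Object.keys(attributes)
      .sort()
      .map((key) => `${key}=${attributes[key] ?? ""}`)
      .join(",");
    seen[seriesKey] = true;
  }
  return Object.keys(seen).length;
}
```

Dropping a high-cardinality attribute such as a user ID at ingest collapses many series into one, which is the effect the cardinality budget is meant to control.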
4.2 Query Language
Current: Form-based UI only.
Target: Text-based metrics query language inspired by PromQL/NRQL for advanced users.
Implementation:
- Define a grammar supporting: `metric_name{attribute="value", attribute2=~"regex"} | aggregation(duration) by (attribute1, attribute2)`
- Build a parser that translates to the existing ClickHouse query builder
- Offer both UI builder and text modes (toggle like New Relic's basic/advanced)
- Syntax highlighting and autocomplete in the text editor (metric names, attribute keys, functions)
- Functions: `rate()`, `delta()`, `avg()`, `sum()`, `min()`, `max()`, `p50()`, `p95()`, `p99()`, `count()`, `topk()`, `bottomk()`
Files to modify:
- `Common/Utils/Metrics/MetricsQueryLanguage.ts` (new - parser and translator)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryEditor.tsx` (new - text editor with autocomplete)
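As a rough feasibility check for the grammar in 4.2, the example form can be parsed with two regular expressions. This is only a sketch under stated assumptions: `ParsedMetricQuery`, `LabelMatcher`, and `parseMetricQuery` are hypothetical names, quoted values may not contain `}` or `"`, and a production parser would more likely use a proper tokenizer with error positions for the editor.

```typescript
interface LabelMatcher {
  key: string;
  op: "=" | "=~"; // exact match or regex match
  value: string;
}

interface ParsedMetricQuery {
  metric: string;
  matchers: LabelMatcher[];
  aggregation: string;
  duration: string; // raw text such as "5m"; empty when omitted
  groupBy: string[];
}

// metric{selectors} | aggregation(duration) by (attributes)
const QUERY_PATTERN =
  /^\s*([A-Za-z_][\w.]*)\s*(?:\{([^}]*)\})?\s*\|\s*([A-Za-z_]\w*)\(\s*(\w*)\s*\)\s*(?:by\s*\(([^)]*)\))?\s*$/;

function parseMetricQuery(text: string): ParsedMetricQuery | null {
  const match = QUERY_PATTERN.exec(text);
  if (!match) {
    return null;
  }
  const metric = match[1];
  const aggregation = match[3];
  if (!metric || !aggregation) {
    return null;
  }
  const matchers: LabelMatcher[] = [];
  const matcherPattern = /([A-Za-z_][\w.]*)\s*(=~|=)\s*"([^"]*)"/g;
  let found: RegExpExecArray | null;
  while ((found = matcherPattern.exec(match[2] ?? "")) !== null) {
    const key = found[1];
    const op = found[2];
    const value = found[3];
    if (key !== undefined && value !== undefined && (op === "=" || op === "=~")) {
      matchers.push({ key, op, value });
    }
  }
  const groupBy = (match[5] ?? "")
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  return { metric, matchers, aggregation, duration: match[4] ?? "", groupBy };
}
```

The parsed structure maps directly onto fields the form-based builder already needs (metric, filters, aggregation, group-by), which is what would let the UI and text modes share one query representation.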
4.3 Golden Signals Dashboards
Current: No auto-generated dashboards.
Target: Auto-generated dashboards showing Latency, Traffic, Errors, Saturation for each service.
Implementation:
- Detect common OpenTelemetry metric names per service:
  - Latency: `http.server.duration`, `http.server.request.duration`
  - Traffic: `http.server.request.count`, `http.server.active_requests`
  - Errors: `http.server.request.count` where status_code >= 500
  - Saturation: `process.runtime.*.memory`, `system.cpu.utilization`
- Auto-create a Golden Signals dashboard for each service with detected metrics
- Include default alert thresholds
Files to modify:
- `Worker/Jobs/Metrics/GenerateGoldenSignalsDashboards.ts` (new)
- `Common/Utils/Metrics/GoldenSignalsDetector.ts` (new - metric name pattern matching)
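The name-matching part of 4.3 could be sketched as a pattern table plus a small wildcard matcher. The metric names come from the list above; `GoldenSignal`, `SIGNAL_PATTERNS`, and the functions are hypothetical, and the sketch assumes `*` stands for exactly one dot-delimited segment. The `status_code >= 500` condition for errors is an attribute filter, so it is outside what name matching can express.

```typescript
type GoldenSignal = "latency" | "traffic" | "errors" | "saturation";

const SIGNAL_PATTERNS: Record<GoldenSignal, string[]> = {
  latency: ["http.server.duration", "http.server.request.duration"],
  traffic: ["http.server.request.count", "http.server.active_requests"],
  // Errors reuse the request count; the status_code filter is applied at query time.
  errors: ["http.server.request.count"],
  saturation: ["process.runtime.*.memory", "system.cpu.utilization"],
};

// Turn a pattern into an anchored RegExp; "*" matches one name segment.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("[^.]+");
  return new RegExp("^" + escaped + "$");
}

// For each signal, list the ingested metric names that match its patterns.
function detectGoldenSignals(metricNames: string[]): Record<GoldenSignal, string[]> {
  const signals: GoldenSignal[] = ["latency", "traffic", "errors", "saturation"];
  const detected: Record<GoldenSignal, string[]> = { latency: [], traffic: [], errors: [], saturation: [] };
  for (const signal of signals) {
    const matchersForSignal = SIGNAL_PATTERNS[signal].map(patternToRegExp);
    detected[signal] = metricNames.filter((metricName) =>
      matchersForSignal.some((matcher) => matcher.test(metricName)),
    );
  }
  return detected;
}
```

A dashboard generator could then skip any signal whose list comes back empty instead of rendering an empty panel.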
4.4 Predictive Alerting
Current: No forecasting capability.
Target: Forecast metric values and alert before thresholds are breached.
Implementation:
- Use linear regression or Holt-Winters on recent data to project forward
- Alert if projected value crosses threshold within configurable forecast window (e.g., "disk full in 4 hours")
- Particularly valuable for capacity planning metrics (disk, memory, connection pools)
- Show forecast as a dashed line extension on metric charts
Files to modify:
- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add predictive evaluation)
- `Common/Utils/Metrics/Forecasting.ts` (new - regression/Holt-Winters)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render forecast line)
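For the linear regression option in 4.4, a least-squares fit and a time-to-threshold projection are enough to express alerts like "disk full in 4 hours". The sketch below covers only that option, not Holt-Winters, and all names (`ForecastSample`, `fitLine`, `secondsUntilThreshold`, `shouldAlertOnForecast`) are hypothetical.

```typescript
interface ForecastSample {
  timeSec: number;
  value: number;
}

// Ordinary least-squares fit of value = slope * timeSec + intercept.
function fitLine(samples: ForecastSample[]): { slope: number; intercept: number } | null {
  const n = samples.length;
  if (n < 2) {
    return null;
  }
  let sumX = 0;
  let sumY = 0;
  for (const sample of samples) {
    sumX += sample.timeSec;
    sumY += sample.value;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let sxy = 0;
  let sxx = 0;
  for (const sample of samples) {
    const dx = sample.timeSec - meanX;
    sxy += dx * (sample.value - meanY);
    sxx += dx * dx;
  }
  if (sxx === 0) {
    return null; // all samples share one timestamp
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Seconds from the last sample until the fitted line reaches the threshold.
// Returns null when the fitted line does not reach it after the last sample.
function secondsUntilThreshold(samples: ForecastSample[], threshold: number): number | null {
  const line = fitLine(samples);
  const last = samples[samples.length - 1];
  if (!line || !last || line.slope === 0) {
    return null;
  }
  const crossingTime = (threshold - line.intercept) / line.slope;
  const remaining = crossingTime - last.timeSec;
  return remaining > 0 ? remaining : null;
}

// Alert when the projected breach falls inside the forecast window.
function shouldAlertOnForecast(samples: ForecastSample[], threshold: number, windowSec: number): boolean {
  const remaining = secondsUntilThreshold(samples, threshold);
  return remaining !== null && remaining <= windowSec;
}
```

A straight-line fit suits steadily growing capacity metrics such as disk usage; metrics with daily cycles would need the seasonal model the plan also mentions.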
ClickHouse Storage Improvements
S.4 Evaluate Map Type for Attributes (MEDIUM)
Current: Attributes stored as JSON.
Target: Evaluate Map(LowCardinality(String), String) for faster attribute-based filtering.
S.5 Upgrade count/bucketCounts to Int64 (LOW)
Current: Int32 for count and Array(Int32) for bucketCounts.
Target: Int64 / Array(Int64) to prevent overflow in high-throughput systems.
Quick Wins (Can Ship This Week)
- Add p50/p95/p99 to aggregation dropdown - ClickHouse `quantile()` is straightforward to add
- Extend default retention - 15 days is too short; increase default to 30 days
- Multi-attribute GROUP BY - Change `groupByAttribute: string` to `groupByAttribute: string[]`
- Add stacked area chart type - Simple extension of existing line chart
Recommended Implementation Order
- Quick Wins - Ship p50/p95/p99, multi-attribute GROUP BY, stacked area
- Phase 1.1 - Percentile aggregations (full implementation beyond quick win)
- Phase 1.2 - Rate/derivative calculations
- Phase 1.4 - Rollups/downsampling (enables long-range queries)
- Phase 2.1 - More chart types (heatmap, pie, gauge, billboard)
- Phase 2.2 - Time-over-time comparison
- Phase 1.3 - Multi-attribute GROUP BY (full implementation)
- Phase 3.1 - Anomaly detection
- Phase 3.2 - SLO/SLI tracking
- Phase 2.4 - Dashboard templates
- Phase 4.1 - Cardinality management
- Phase 4.2 - Query language
- Phase 4.3 - Golden Signals dashboards
- Phase 4.4 - Predictive alerting
- Phase 3.3 - Metric correlations
Verification
For each feature:
- Unit tests for new aggregation types, rate calculations, unit formatting, query language parser
- Integration tests for new API endpoints (percentiles, rollup queries, SLO evaluation)
- Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/metrics`
- Check ClickHouse query performance with `EXPLAIN` for new query patterns
- Verify rollup accuracy by comparing rollup results to raw data results for overlapping time ranges
- Load test cardinality tracking and anomaly detection jobs to ensure they don't impact ingestion