oneuptime/Internal/Roadmap/Traces.md

Plan: Bring OneUptime Traces to Industry Parity and Beyond

Context

OneUptime's trace implementation provides OTLP-native ingestion (HTTP and gRPC), ClickHouse storage with a full OpenTelemetry span model (events, links, status, attributes, resources, scope), a Gantt/waterfall visualization, trace-to-log and trace-to-exception correlation, a basic service dependency graph, queue-based async ingestion, and per-service data retention with TTL. The ClickHouse schema has been optimized with BloomFilter indexes on traceId/spanId/parentSpanId, Set indexes on statusCode/kind/hasException, a TokenBF index on name, and ZSTD compression on key columns.

This plan identifies the remaining gaps versus DataDog, NewRelic, Honeycomb, and Grafana Tempo, and proposes a phased implementation to close them and surpass the competition.

Completed

The following features have been implemented:

  • OTLP Ingestion - HTTP and gRPC trace ingestion with async queue-based processing
  • ClickHouse Storage - MergeTree with sipHash64(projectId) % 16 partitioning, per-service TTL
  • Gantt/Waterfall View - Hierarchical span visualization with color-coded services, time-unit auto-scaling, error indicators
  • Trace-to-Log Correlation - Log model has traceId/spanId columns; SpanViewer shows associated logs
  • Trace-to-Exception Correlation - ExceptionInstance model links to traceId/spanId with stack trace parsing and fingerprinting
  • Span Detail Panel - Side-over with tabs for Basic Info, Logs, Attributes, Events, Exceptions
  • BloomFilter indexes on traceId, spanId, parentSpanId
  • Set indexes on statusCode, kind, hasException
  • TokenBF index on name
  • ZSTD compression on time/ID/attribute columns
  • hasException boolean column for fast error span filtering
  • links default value corrected to []
  • Basic Trace-Based Alerting - MonitorType.Traces with span count threshold alerting, span name/status/service/attribute filtering, time window (5s-24h), worker job running every minute, frontend form with preview
  • S.1 - Migrate attributes to Map(String, String) (TableColumnType.MapStringString in Span model with attributeKeys array for fast enumeration)
  • S.2 - Aggregation Projections (proj_agg_by_service for service-level COUNT/AVG/P99 aggregation, proj_trace_by_id for trace-by-ID queries)

Gap Analysis Summary

| Feature | OneUptime | DataDog | NewRelic | Tempo/Honeycomb | Priority |
| --- | --- | --- | --- | --- | --- |
| Trace analytics / aggregation engine | None | Trace Explorer with COUNT/percentiles | NRQL on span data | TraceQL rate/count/quantile | P0 |
| RED metrics from traces | None | Auto-computed on 100% traffic | Derived golden signals | Metrics-generator to Prometheus | P0 |
| Trace-based alerting | Partial — span count only, no latency/error rate/Apdex | APM Monitors (p50-p99, error rate, Apdex) | NRQL alert conditions | Via Grafana alerting / Triggers | P0 |
| Sampling controls | None (100% ingestion) | Head-based adaptive + retention filters | Infinite Tracing (tail-based) | Refinery (rules/dynamic/tail) | P0 |
| Flame graph view | None | Yes (default view) | No | No | P1 |
| Latency breakdown / critical path | None | Per-hop latency, bottleneck detection | No | BubbleUp (Honeycomb) | P1 |
| In-trace search | None | Yes | No | No | P1 |
| Per-trace service map | None | Yes (Map view) | No | No | P1 |
| Trace-to-metric exemplars | None | Pivot from metric graph to traces | Metric-to-trace linking | Prometheus exemplars | P1 |
| Custom metrics from spans | None | Generate count/distribution/gauge from tags | Via NRQL | SLOs from span data | P2 |
| Structural trace queries | None | Trace Queries (multi-span relationships) | Via NRQL | TraceQL spanset pipelines | P2 |
| Trace comparison / diffing | None | Partial | Side-by-side comparison | compare() in TraceQL | P2 |
| AI/ML on traces | None | Watchdog (auto anomaly + RCA) | NRAI | BubbleUp (pattern detection) | P3 |
| RUM correlation | None | Frontend-to-backend trace linking | Yes | Faro / frontend observability | P3 |
| Continuous profiling | None | Code Hotspots (span-to-profile) | Partial | Pyroscope | P3 |

Phase 1: Analytics & Alerting Foundation (P0) — Highest Impact

Without these, users cannot answer basic questions like "is my service healthy?" from trace data.

1.1 Trace Analytics / Aggregation Engine

Current: Can list/filter individual spans and view individual traces. No way to aggregate or compute statistics. Target: Full trace analytics supporting COUNT, AVG, SUM, MIN, MAX, P50/P75/P90/P95/P99 aggregations with GROUP BY on any span attribute and time-series bucketing.

Implementation:

  • Build a trace analytics API endpoint that translates query configs into ClickHouse aggregation queries
  • Use ClickHouse's native functions: quantile(0.99)(durationUnixNano), countIf(statusCode = 2), toStartOfInterval(startTime, INTERVAL 1 MINUTE)
  • Support GROUP BY on service, span name, kind, status, and any custom attribute (via Map key access, since S.1 migrated attributes to Map(String, String))
  • Frontend: Add an "Analytics" tab to the Traces page with chart types (timeseries, top list, table) similar to the existing LogsAnalyticsView
  • Support switching between "List" view (current) and "Analytics" view

Files to modify:

  • Common/Server/API/TelemetryAPI.ts (add trace analytics endpoint)
  • Common/Server/Services/SpanService.ts (add aggregation query methods)
  • Common/Types/Traces/TraceAnalyticsQuery.ts (new - query interface)
  • App/FeatureSet/Dashboard/src/Pages/Traces/Index.tsx (add analytics view toggle)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceAnalyticsView.tsx (new - analytics UI)
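A minimal sketch of the query-translation step, assuming a simplified config shape (the field names below are illustrative, not the final Common/Types/Traces/TraceAnalyticsQuery.ts interface):

```typescript
// Translate a hypothetical analytics query config into a ClickHouse
// aggregation statement. Aggregations and bucketing mirror the functions
// listed above (quantile, countIf is omitted here for brevity).

type Aggregation = "count" | "avg" | "p50" | "p95" | "p99";

interface TraceAnalyticsQuery {
  aggregation: Aggregation;
  groupBy?: string; // e.g. "serviceId" or "name"
  intervalMinutes: number; // time-series bucket size
}

function toAggregateExpr(agg: Aggregation): string {
  switch (agg) {
    case "count":
      return "count()";
    case "avg":
      return "avg(durationUnixNano)";
    case "p50":
      return "quantile(0.50)(durationUnixNano)";
    case "p95":
      return "quantile(0.95)(durationUnixNano)";
    case "p99":
      return "quantile(0.99)(durationUnixNano)";
  }
}

function buildAnalyticsSql(q: TraceAnalyticsQuery): string {
  const bucket = `toStartOfInterval(startTime, INTERVAL ${q.intervalMinutes} MINUTE) AS bucket`;
  const dims = ["bucket", ...(q.groupBy ? [q.groupBy] : [])];
  return [
    `SELECT ${bucket}, ${q.groupBy ? q.groupBy + ", " : ""}${toAggregateExpr(q.aggregation)} AS value`,
    `FROM SpanItem`,
    `WHERE projectId = {projectId: UUID}`,
    `GROUP BY ${dims.join(", ")}`,
    `ORDER BY bucket`,
  ].join("\n");
}
```

The endpoint would bind projectId as a server-side query parameter and feed the result rows straight to the timeseries/top-list/table renderers.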

1.2 RED Metrics from Traces (Request Rate, Error Rate, Duration)

Current: No automatic computation of service-level metrics from trace data. Target: Auto-computed per-service, per-operation RED metrics displayed on a Service Overview page.

Implementation:

  • Create a ClickHouse materialized view that aggregates spans into per-service, per-operation metrics at 1-minute intervals:
    CREATE MATERIALIZED VIEW span_red_metrics
    ENGINE = AggregatingMergeTree()
    ORDER BY (projectId, serviceId, name, minute)
    AS SELECT
      projectId, serviceId, name,
      toStartOfMinute(startTime) AS minute,
      countState() AS request_count,
      countIfState(statusCode = 2) AS error_count,
      quantileState(0.50)(durationUnixNano) AS p50_duration,
      quantileState(0.95)(durationUnixNano) AS p95_duration,
      quantileState(0.99)(durationUnixNano) AS p99_duration
    FROM SpanItem
    GROUP BY projectId, serviceId, name, minute
    
  • Build a Service Overview page showing: request rate chart, error rate chart, p50/p95/p99 latency charts
  • Add an API endpoint to query the materialized view

Files to modify:

  • Common/Models/AnalyticsModels/SpanRedMetrics.ts (new - materialized view model)
  • Telemetry/Services/SpanRedMetricsService.ts (new - query service)
  • App/FeatureSet/Dashboard/src/Pages/Service/View/Overview.tsx (new or enhanced - RED dashboard)
  • Worker/DataMigrations/ (new migration to create materialized view)
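One detail worth capturing now: because the view uses AggregatingMergeTree with -State functions, the read side must apply the matching -Merge combinators to finalize the stored states. A sketch of the query the SpanRedMetricsService could build (column names follow the view definition above; the `{projectId: UUID}` placeholder assumes ClickHouse server-side query parameters):

```typescript
// Read-side query for the span_red_metrics materialized view. Each -State
// column is finalized with its -Merge counterpart (countMerge, countIfMerge,
// quantileMerge) — selecting the raw columns would return opaque state blobs.

function buildRedMetricsReadSql(windowMinutes: number): string {
  return [
    "SELECT",
    "  serviceId, name, minute,",
    "  countMerge(request_count) AS requests,",
    "  countIfMerge(error_count) AS errors,",
    "  quantileMerge(0.50)(p50_duration) AS p50,",
    "  quantileMerge(0.95)(p95_duration) AS p95,",
    "  quantileMerge(0.99)(p99_duration) AS p99",
    "FROM span_red_metrics",
    "WHERE projectId = {projectId: UUID}",
    `  AND minute >= now() - INTERVAL ${windowMinutes} MINUTE`,
    "GROUP BY serviceId, name, minute",
    "ORDER BY minute",
  ].join("\n");
}
```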

1.3 Trace-Based Alerting — Extend Beyond Span Count

Current: Basic trace alerting is implemented with MonitorType.Traces. The existing system supports:

  • Filtering by span name, span status (Unset/Ok/Error), service, and attributes
  • Configurable time windows (5s to 24h)
  • Worker job evaluating every minute via MonitorTelemetryMonitor
  • Only one criteria check, CheckOn.SpanCount, which compares the matching span count against a threshold
  • Frontend form (TraceMonitorStepForm.tsx) with preview of matching spans

What's missing: The current implementation can only answer "are there more/fewer than N spans matching this filter?" It cannot alert on latency, error rates, or throughput — the core APM alerting use cases.

Target: Full APM-grade alerting with latency percentiles, error rate, request rate, and Apdex.

Implementation — extend existing infrastructure:

1.3.1 Add Latency Percentile Alerts (P50/P90/P95/P99)

  • Add CheckOn.P50Latency, CheckOn.P90Latency, CheckOn.P95Latency, CheckOn.P99Latency to CriteriaFilter.ts
  • In monitorTrace() worker function, compute quantile(0.50)(durationUnixNano) etc. via ClickHouse instead of just countBy()
  • Return latency values in TraceMonitorResponse alongside span count
  • Add latency criteria evaluation in TraceMonitorCriteria.ts

1.3.2 Add Error Rate Alerts

  • Add CheckOn.ErrorRate to CriteriaFilter.ts
  • Compute countIf(statusCode = 2) / count() * 100 in the worker query
  • Return error rate percentage in TraceMonitorResponse
  • Criteria: "alert if error rate > 5%"

1.3.3 Add Average/Max Duration Alerts

  • Add CheckOn.AvgDuration, CheckOn.MaxDuration to CriteriaFilter.ts
  • Compute avg(durationUnixNano), max(durationUnixNano) in worker query
  • Useful for simpler latency alerts without percentile overhead

1.3.4 Add Request Rate (Throughput) Alerts

  • Add CheckOn.SpanRate to CriteriaFilter.ts
  • Compute count() / time_window_seconds to normalize to spans/second
  • Criteria: "alert if request rate drops below 10 req/s" (detects outages)

1.3.5 Add Apdex Score (Nice-to-have)

  • Add CheckOn.ApdexScore to CriteriaFilter.ts
  • Compute from duration thresholds: (satisfied + tolerating*0.5) / total
  • Allow configuring satisfied/tolerating thresholds per monitor (e.g., satisfied < 500ms, tolerating < 2s)
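The Apdex formula is simple enough to sketch directly; the 500ms/2s defaults below mirror the example thresholds in this section and would come from per-monitor configuration:

```typescript
// Apdex = (satisfied + tolerating * 0.5) / total, where "satisfied" requests
// finish within the satisfied threshold and "tolerating" requests finish
// within the (larger) tolerating threshold.

function apdexScore(
  durationsMs: number[],
  satisfiedMs: number = 500,
  toleratingMs: number = 2000,
): number {
  if (durationsMs.length === 0) {
    return 1; // no traffic in the window: treat as fully satisfied
  }
  const satisfied = durationsMs.filter((d) => d <= satisfiedMs).length;
  const tolerating = durationsMs.filter(
    (d) => d > satisfiedMs && d <= toleratingMs,
  ).length;
  return (satisfied + tolerating * 0.5) / durationsMs.length;
}
```

In the worker this would likely be computed in ClickHouse instead (two countIf() terms over durationUnixNano), but the in-process version is useful for unit-testing the criteria evaluation.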

Files to modify:

  • Common/Types/Monitor/CriteriaFilter.ts (add new CheckOn values: P50Latency, P90Latency, P95Latency, P99Latency, ErrorRate, AvgDuration, MaxDuration, SpanRate, ApdexScore)
  • Common/Types/Monitor/TraceMonitor/TraceMonitorResponse.ts (add latency, error rate, throughput fields)
  • Common/Server/Utils/Monitor/Criteria/TraceMonitorCriteria.ts (add evaluation for new criteria types)
  • Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts (change monitorTrace() from countBy() to aggregation query returning all metrics)
  • App/FeatureSet/Dashboard/src/Components/Form/Monitor/TraceMonitor/TraceMonitorStepForm.tsx (add criteria type selector for latency/error rate/throughput)

1.4 Head-Based Probabilistic Sampling

Current: Ingests 100% of received traces. Target: Configurable per-service probabilistic sampling with rules to always keep errors and slow traces.

Implementation:

  • Create TraceSamplingRule PostgreSQL model: service filter, sample rate (0-100%), conditions to always keep (error status, duration > threshold)
  • Evaluate sampling rules in OtelTracesIngestService.ts before ClickHouse insert
  • Use deterministic sampling based on traceId hash (so all spans from the same trace are kept or dropped together)
  • UI under Settings > Trace Configuration > Sampling Rules
  • Show estimated storage savings
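The deterministic-sampling point deserves a concrete sketch: hashing the traceId means every span of a trace gets the same keep/drop decision regardless of which ingest worker sees it. The FNV-1a hash below is an illustrative stand-in; production code could use sipHash64 to match the ClickHouse partitioning function. Rule field names are assumptions, not the final TraceSamplingRule schema:

```typescript
// Deterministic head-based sampling: keep/drop is a pure function of the
// traceId and the rule, so spans arriving on different workers agree.

function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

interface SamplingDecisionInput {
  traceId: string;
  isError: boolean; // statusCode === 2
  durationMs: number;
  sampleRatePercent: number; // 0-100, from the sampling rule
  slowThresholdMs: number; // always-keep threshold, from the rule
}

function shouldKeepTrace(s: SamplingDecisionInput): boolean {
  // Always-keep conditions override the probabilistic decision.
  if (s.isError || s.durationMs >= s.slowThresholdMs) {
    return true;
  }
  return fnv1a32(s.traceId) % 100 < s.sampleRatePercent;
}
```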

Files to modify:

  • Common/Models/DatabaseModels/TraceSamplingRule.ts (new)
  • Telemetry/Services/OtelTracesIngestService.ts (add sampling logic)
  • Dashboard: new Settings page for sampling configuration

Phase 2: Visualization & Debugging UX (P1) — Industry-Standard Features

2.1 Flame Graph View

Current: Only Gantt/waterfall view. Target: Flame graph visualization showing proportional time spent in each span, with service color coding.

Implementation:

  • Build a flame graph component that renders spans as horizontally stacked rectangles proportional to duration
  • Allow switching between Waterfall and Flame Graph views in TraceExplorer
  • Color-code by service (consistent with waterfall view)
  • Click a span rectangle to focus/zoom into that subtree
  • Show tooltip with span name, service, duration, self-time on hover

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/FlameGraph.tsx (new)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add view toggle)

2.2 Latency Breakdown / Critical Path Analysis

Current: Shows individual span durations but no automated analysis. Target: Compute and display critical path, self-time vs child-time, and bottleneck identification.

Implementation:

  • Compute critical path: the longest sequential chain of spans through the trace (accounts for parallelism)
  • Calculate "self time" per span: span.duration - sum(child.duration) (clamped to 0 for overlapping children)
  • Display latency breakdown by service: percentage of total trace time spent in each service
  • Highlight bottleneck spans (spans contributing most to critical path duration)
  • Add "Critical Path" toggle in TraceExplorer that highlights the critical path spans
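The self-time clamp mentioned above needs interval merging to be correct: overlapping (parallel) children must not be double-counted. A sketch of that calculation, using start/end pairs as they would come from the stored span timestamps:

```typescript
// Self time = span duration minus the union of its children's intervals,
// clipped to the parent's window. Merging overlaps first means two parallel
// children covering the same wall-clock time only subtract once.

interface SpanTiming {
  start: number; // e.g. unix nanos
  end: number;
}

function selfTime(span: SpanTiming, children: SpanTiming[]): number {
  // Clip children to the parent's window and sort by start time.
  const clipped = children
    .map((c) => ({
      start: Math.max(c.start, span.start),
      end: Math.min(c.end, span.end),
    }))
    .filter((c) => c.end > c.start)
    .sort((a, b) => a.start - b.start);

  // Sweep: merge overlapping intervals and sum the covered time.
  let covered = 0;
  let curStart = -1;
  let curEnd = -1;
  for (const c of clipped) {
    if (c.start > curEnd) {
      if (curEnd > curStart) covered += curEnd - curStart;
      curStart = c.start;
      curEnd = c.end;
    } else {
      curEnd = Math.max(curEnd, c.end);
    }
  }
  if (curEnd > curStart) covered += curEnd - curStart;

  return Math.max(0, span.end - span.start - covered);
}
```

The same merged-interval pass is the building block for the critical path: at each level, follow the child whose interval ends the parent's covered chain.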

Files to modify:

  • Common/Utils/Traces/CriticalPath.ts (new - critical path algorithm)
  • App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx (show self-time)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add critical path view)

2.3 In-Trace Span Search

Current: TraceExplorer shows all spans with service filtering and an error toggle, but no text search. Target: Search box to filter spans by name, attribute values, or status within the current trace.

Implementation:

  • Add a search input in TraceExplorer toolbar
  • Client-side filtering: match span name, service name, attribute keys/values against search text
  • Highlight matching spans in the waterfall/flame graph
  • Show match count (e.g., "3 of 47 spans")
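Since the trace is already fully loaded client-side, the match predicate can be a plain function; the Span shape below is a simplified stand-in for the dashboard's span objects:

```typescript
// Case-insensitive substring match against span name, service name, and
// attribute keys/values — the fields the search box should cover.

interface SearchableSpan {
  name: string;
  serviceName: string;
  attributes: Record<string, string>;
}

function spanMatches(span: SearchableSpan, search: string): boolean {
  const needle = search.trim().toLowerCase();
  if (needle === "") return true; // empty search shows all spans
  if (span.name.toLowerCase().includes(needle)) return true;
  if (span.serviceName.toLowerCase().includes(needle)) return true;
  return Object.entries(span.attributes).some(
    ([k, v]) =>
      k.toLowerCase().includes(needle) || v.toLowerCase().includes(needle),
  );
}
```

The "3 of 47 spans" counter is then `spans.filter((s) => spanMatches(s, search)).length` over the loaded trace.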

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add search bar and filtering)

2.4 Per-Trace Service Flow Map

Current: Service dependency graph exists globally but not per-trace. Target: Per-trace visualization showing the path of a request through services with latency annotations.

Implementation:

  • Build a directed graph from the spans in a single trace (services as nodes, calls as edges)
  • Annotate edges with call count and latency
  • Color-code nodes by error status
  • Add as a new view tab alongside Waterfall and Flame Graph
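Edge construction is a single pass over parent/child pairs; a sketch, with span fields mirroring the stored columns (serviceName standing in for however the UI resolves serviceId to a display name):

```typescript
// Build service-level edges from one trace's spans: emit an edge whenever a
// parent->child hop crosses a service boundary, accumulating call count and
// total latency per (from, to) pair.

interface MapSpan {
  spanId: string;
  parentSpanId: string | null;
  serviceName: string;
  durationMs: number;
}

interface ServiceEdge {
  from: string;
  to: string;
  calls: number;
  totalLatencyMs: number;
}

function buildServiceEdges(spans: MapSpan[]): ServiceEdge[] {
  const byId = new Map<string, MapSpan>(spans.map((s) => [s.spanId, s]));
  const edges = new Map<string, ServiceEdge>();
  for (const span of spans) {
    const parent = span.parentSpanId ? byId.get(span.parentSpanId) : undefined;
    // Same-service hops stay inside a node; only cross-service calls are edges.
    if (!parent || parent.serviceName === span.serviceName) continue;
    const key = `${parent.serviceName}->${span.serviceName}`;
    const edge = edges.get(key) ?? {
      from: parent.serviceName,
      to: span.serviceName,
      calls: 0,
      totalLatencyMs: 0,
    };
    edge.calls += 1;
    edge.totalLatencyMs += span.durationMs;
    edges.set(key, edge);
  }
  return [...edges.values()];
}
```

Node error coloring can reuse the same pass by flagging any service owning a span with statusCode = 2.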

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/TraceServiceMap.tsx (new)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add view tab)

2.5 Span Link Navigation

Current: Span links data is stored but not navigable in the UI. Target: Clickable links in the span detail panel that navigate to related traces/spans.

Implementation:

  • In the SpanViewer detail panel, render the links array as clickable items
  • Each link shows the linked traceId, spanId, and relationship type
  • Clicking navigates to the linked trace view

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx (render clickable links)

Phase 3: Advanced Analytics & Correlation (P2) — Power Features

3.1 Trace-to-Metric Exemplars

Current: Metric model has no traceId/spanId fields. Target: Link metric data points to trace IDs; show exemplar dots on metric charts that navigate to traces.

Implementation:

  • Add optional traceId and spanId columns to the Metric ClickHouse model
  • During metric ingestion, extract exemplar trace/span IDs from OTLP exemplar fields
  • On metric charts, render exemplar dots at data points that have associated traces
  • Clicking an exemplar dot navigates to the trace view

Files to modify:

  • Common/Models/AnalyticsModels/Metric.ts (add traceId/spanId columns)
  • Telemetry/Services/OtelMetricsIngestService.ts (extract exemplars)
  • App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx (render exemplar dots)

3.2 Custom Metrics from Spans

Current: No way to create persistent metrics from trace data. Target: Users define custom metrics from span attributes that are computed via ClickHouse materialized views and available for alerting and dashboards.

Implementation:

  • Create SpanDerivedMetric model: name, filter query (which spans), aggregation (count/avg/p99 of what field), GROUP BY attributes
  • Use ClickHouse materialized views for efficient computation
  • Surface derived metrics in the metric explorer and alerting system

Files to modify:

  • Common/Models/DatabaseModels/SpanDerivedMetric.ts (new)
  • Common/Server/Services/SpanDerivedMetricService.ts (new)
  • Dashboard: UI for defining derived metrics

3.3 Structural Trace Queries

Current: Can only filter on individual span attributes. Target: Query traces based on properties of multiple spans and their relationships (e.g., "find traces where service A called service B and B returned an error").

Implementation:

  • Design a visual query builder for structural queries (easier adoption than a query language)
  • Translate structural queries to ClickHouse subqueries with JOINs on traceId
  • Example: "Find traces where span with service=frontend has child span with service=database AND duration > 500ms"
    SELECT DISTINCT s1.traceId FROM SpanItem s1
    JOIN SpanItem s2 ON s1.traceId = s2.traceId AND s1.spanId = s2.parentSpanId
    WHERE s1.projectId = {pid}
      AND s1.attributes['service.name'] = 'frontend'
      AND s2.attributes['service.name'] = 'database'
      AND s2.durationUnixNano > 500000000
    

Files to modify:

  • Common/Types/Traces/StructuralTraceQuery.ts (new - query model)
  • Common/Server/Services/SpanService.ts (add structural query execution)
  • App/FeatureSet/Dashboard/src/Components/Traces/StructuralQueryBuilder.tsx (new - visual builder)

3.4 Trace Comparison / Diffing

Current: No way to compare traces. Target: Side-by-side comparison of two traces of the same operation, highlighting differences in span count, latency, and structure.

Implementation:

  • Add "Compare" action to trace list (select two traces)
  • Build a diff view showing: added/removed spans, latency differences per span, structural changes
  • Useful for comparing a slow trace to a fast trace of the same operation

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/TraceComparison.tsx (new)
  • App/FeatureSet/Dashboard/src/Pages/Traces/Compare.tsx (new page)

Phase 4: Competitive Differentiation (P3) — Long-Term

4.1 Rules-Based and Tail-Based Sampling

Current: Phase 1 adds head-based probabilistic sampling. Target: Rules-based sampling (always keep errors/slow traces, sample successes) and eventually tail-based sampling (buffer complete traces, decide after seeing all spans).

Implementation:

  • Rules engine: configurable conditions (service, status, duration, attributes) with per-rule sample rates
  • Tail-based: buffer spans for a configurable window (30s), assemble complete traces, then apply retention decisions
  • Tail-based is complex; consider integrating with OpenTelemetry Collector's tail sampling processor as an alternative

4.2 AI/ML on Trace Data

  • Anomaly detection on RED metrics (statistical deviation from baseline)
  • Auto-surfacing correlated attributes when latency spikes (similar to Honeycomb BubbleUp)
  • Natural language trace queries ("show me slow database calls from the last hour")
  • Automatic root cause analysis from trace data during incidents

4.3 RUM (Real User Monitoring) Correlation

  • Browser SDK that propagates W3C trace context from frontend to backend
  • Link frontend page loads, interactions, and web vitals to backend traces
  • Show end-to-end user experience from browser to backend services

4.4 Continuous Profiling Integration

  • Integrate with a profiling backend (e.g., Pyroscope)
  • Link profile data to span time windows
  • Show "Code Hotspots" within spans (similar to DataDog)

ClickHouse Storage Improvements

S.3 Add Trace-by-ID Projection (LOW)

Current: The trace detail view originally relied on the BloomFilter skip index for traceId lookups; the proj_trace_by_id projection added in S.2 now serves trace-by-ID queries but has not been benchmarked. Target: Confirm via EXPLAIN that the projection is chosen for trace-by-ID queries, and extend its sort order to (projectId, traceId, startTime) if the current definition is insufficient.


Quick Wins (Can Ship This Week)

  1. In-trace span search - Add a text filter in TraceExplorer (a few hours of work)
  2. Self-time calculation - Show "self time" (span duration minus child durations) in SpanViewer
  3. Span link navigation - Links data is stored but not clickable in UI
  4. Top-N slowest operations - Simple ClickHouse query: ORDER BY durationUnixNano DESC LIMIT N
  5. Error rate by service - Aggregate statusCode=2 counts grouped by serviceId
  6. Trace duration distribution histogram - Use ClickHouse histogram() on durationUnixNano
  7. Span count per service display - Already tracked in servicesInTrace, just needs better display

Recommended Implementation Order

  1. Phase 1.1 - Trace Analytics Engine (highest impact, unlocks everything else)
  2. Phase 1.2 - RED Metrics from Traces (prerequisite for alerting, service overview)
  3. Quick Wins - Ship in-trace search, self-time, span links, top-N operations
  4. Phase 1.3 - Trace-Based Alerting (core observability workflow)
  5. Phase 2.1 - Flame Graph View (industry-standard visualization)
  6. Phase 2.2 - Critical Path Analysis (key debugging capability)
  7. Phase 1.4 - Head-Based Sampling (essential for high-volume users)
  8. Phase 2.3-2.5 - In-trace search, per-trace map, span links
  9. Phase 3.1 - Trace-to-Metric Exemplars
  10. Phase 3.2-3.4 - Custom metrics, structural queries, comparison
  11. Phase 4.x - AI/ML, RUM, profiling (long-term)

Verification

For each feature:

  1. Unit tests for new query builders, critical path algorithm, sampling logic
  2. Integration tests for new API endpoints (analytics, RED metrics, sampling)
  3. Manual verification via the dev server at https://oneuptimedev.genosyn.com/dashboard/{projectId}/traces
  4. Check ClickHouse query performance with EXPLAIN for new aggregation queries
  5. Verify trace correlation (logs, exceptions, metrics) still works correctly with new features
  6. Load test sampling logic to ensure it doesn't add ingestion latency