oneuptime/Internal/Roadmap/Traces.md

Plan: Bring OneUptime Traces to Industry Parity and Beyond

Context

OneUptime's trace implementation provides OTLP-native ingestion (HTTP and gRPC), ClickHouse storage with a full OpenTelemetry span model (events, links, status, attributes, resources, scope), a Gantt/waterfall visualization, trace-to-log and trace-to-exception correlation, a basic service dependency graph, queue-based async ingestion, and per-service data retention with TTL. The ClickHouse schema has been optimized with BloomFilter indexes on traceId/spanId/parentSpanId, Set indexes on statusCode/kind/hasException, a TokenBF index on name, and ZSTD compression on key columns.

This plan identifies the remaining gaps versus DataDog, NewRelic, Honeycomb, and Grafana Tempo, and proposes a phased implementation to close them and surpass the competition.

Completed

The following features have been implemented:

  • OTLP Ingestion - HTTP and gRPC trace ingestion with async queue-based processing
  • ClickHouse Storage - MergeTree with sipHash64(projectId) % 16 partitioning, per-service TTL
  • Gantt/Waterfall View - Hierarchical span visualization with color-coded services, time-unit auto-scaling, error indicators
  • Trace-to-Log Correlation - Log model has traceId/spanId columns; SpanViewer shows associated logs
  • Trace-to-Exception Correlation - ExceptionInstance model links to traceId/spanId with stack trace parsing and fingerprinting
  • Span Detail Panel - Side-over with tabs for Basic Info, Logs, Attributes, Events, Exceptions
  • BloomFilter indexes on traceId, spanId, parentSpanId
  • Set indexes on statusCode, kind, hasException
  • TokenBF index on name
  • ZSTD compression on time/ID/attribute columns
  • hasException boolean column for fast error span filtering
  • links default value corrected to []
  • Basic Trace-Based Alerting - MonitorType.Traces with span count threshold alerting, span name/status/service/attribute filtering, time window (5s-24h), worker job running every minute, frontend form with preview
  • S.1 - Migrate attributes to Map(String, String) (TableColumnType.MapStringString in Span model with attributeKeys array for fast enumeration)
  • S.2 - Aggregation Projections (proj_agg_by_service for service-level COUNT/AVG/P99 aggregation, proj_trace_by_id for trace-by-ID queries)

Gap Analysis Summary

| Feature | OneUptime | DataDog | NewRelic | Tempo/Honeycomb | Priority |
| --- | --- | --- | --- | --- | --- |
| Trace analytics / aggregation engine | None | Trace Explorer with COUNT/percentiles | NRQL on span data | TraceQL rate/count/quantile | P0 |
| RED metrics from traces | None | Auto-computed on 100% traffic | Derived golden signals | Metrics-generator to Prometheus | P0 |
| Trace-based alerting | Partial — span count only, no latency/error rate/Apdex | APM Monitors (p50-p99, error rate, Apdex) | NRQL alert conditions | Via Grafana alerting / Triggers | P0 |
| Sampling controls | None (100% ingestion) | Head-based adaptive + retention filters | Infinite Tracing (tail-based) | Refinery (rules/dynamic/tail) | P0 |
| Flame graph view | None | Yes (default view) | No | No | P1 |
| Latency breakdown / critical path | None | Per-hop latency, bottleneck detection | No | BubbleUp (Honeycomb) | P1 |
| In-trace search | None | Yes | No | No | P1 |
| Per-trace service map | None | Yes (Map view) | No | No | P1 |
| Trace-to-metric exemplars | None | Pivot from metric graph to traces | Metric-to-trace linking | Prometheus exemplars | P1 |
| Custom metrics from spans | None | Generate count/distribution/gauge from tags | Via NRQL | SLOs from span data | P2 |
| Structural trace queries | None | Trace Queries (multi-span relationships) | Via NRQL | TraceQL spanset pipelines | P2 |
| Trace comparison / diffing | None | Partial | Side-by-side comparison | compare() in TraceQL | P2 |
| AI/ML on traces | None | Watchdog (auto anomaly + RCA) | NRAI | BubbleUp (pattern detection) | P3 |
| RUM correlation | None | Frontend-to-backend trace linking | Yes | Faro / frontend observability | P3 |
| Continuous profiling | None | Code Hotspots (span-to-profile) | Partial | Pyroscope | P3 |

Phase 1: Analytics & Alerting Foundation (P0) — Highest Impact

Without these, users cannot answer basic questions like "is my service healthy?" from trace data.

1.1 Trace Analytics / Aggregation Engine

Current: Can list/filter individual spans and view individual traces. No way to aggregate or compute statistics. Target: Full trace analytics supporting COUNT, AVG, SUM, MIN, MAX, P50/P75/P90/P95/P99 aggregations with GROUP BY on any span attribute and time-series bucketing.

Implementation:

  • Build a trace analytics API endpoint that translates query configs into ClickHouse aggregation queries
  • Use ClickHouse's native functions: quantile(0.99)(durationUnixNano), countIf(statusCode = 2), toStartOfInterval(startTime, INTERVAL 1 MINUTE)
  • Support GROUP BY on service, span name, kind, status, and any custom attribute (via Map key access, since S.1 migrated attributes to Map(String, String))
  • Frontend: Add an "Analytics" tab to the Traces page with chart types (timeseries, top list, table) similar to the existing LogsAnalyticsView
  • Support switching between "List" view (current) and "Analytics" view

Files to modify:

  • Common/Server/API/TelemetryAPI.ts (add trace analytics endpoint)
  • Common/Server/Services/SpanService.ts (add aggregation query methods)
  • Common/Types/Traces/TraceAnalyticsQuery.ts (new - query interface)
  • App/FeatureSet/Dashboard/src/Pages/Traces/Index.tsx (add analytics view toggle)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceAnalyticsView.tsx (new - analytics UI)
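A minimal sketch of the query-translation step, assuming a simplified config shape (the field names below are illustrative, not the final Common/Types/Traces/TraceAnalyticsQuery.ts interface):

```typescript
// Translate a hypothetical analytics query config into a ClickHouse
// aggregation statement. Aggregations and bucketing mirror the functions
// listed above (quantile, countIf is omitted here for brevity).

type Aggregation = "count" | "avg" | "p50" | "p95" | "p99";

interface TraceAnalyticsQuery {
  aggregation: Aggregation;
  groupBy?: string; // e.g. "serviceId" or "name"
  intervalMinutes: number; // time-series bucket size
}

function toAggregateExpr(agg: Aggregation): string {
  switch (agg) {
    case "count":
      return "count()";
    case "avg":
      return "avg(durationUnixNano)";
    case "p50":
      return "quantile(0.50)(durationUnixNano)";
    case "p95":
      return "quantile(0.95)(durationUnixNano)";
    case "p99":
      return "quantile(0.99)(durationUnixNano)";
  }
}

function buildAnalyticsSql(q: TraceAnalyticsQuery): string {
  const bucket = `toStartOfInterval(startTime, INTERVAL ${q.intervalMinutes} MINUTE) AS bucket`;
  const dims = ["bucket", ...(q.groupBy ? [q.groupBy] : [])];
  return [
    `SELECT ${bucket}, ${q.groupBy ? q.groupBy + ", " : ""}${toAggregateExpr(q.aggregation)} AS value`,
    `FROM SpanItem`,
    `WHERE projectId = {projectId: UUID}`,
    `GROUP BY ${dims.join(", ")}`,
    `ORDER BY bucket`,
  ].join("\n");
}
```

The endpoint would bind projectId as a server-side query parameter and feed the result rows straight to the timeseries/top-list/table renderers.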

1.2 RED Metrics from Traces (Request Rate, Error Rate, Duration)

Current: No automatic computation of service-level metrics from trace data. Target: Auto-computed per-service, per-operation RED metrics displayed on a Service Overview page.

Implementation:

  • Create a ClickHouse materialized view that aggregates spans into per-service, per-operation metrics at 1-minute intervals:
    CREATE MATERIALIZED VIEW span_red_metrics
    ENGINE = AggregatingMergeTree()
    ORDER BY (projectId, serviceId, name, minute)
    AS SELECT
      projectId, serviceId, name,
      toStartOfMinute(startTime) AS minute,
      countState() AS request_count,
      countIfState(statusCode = 2) AS error_count,
      quantileState(0.50)(durationUnixNano) AS p50_duration,
      quantileState(0.95)(durationUnixNano) AS p95_duration,
      quantileState(0.99)(durationUnixNano) AS p99_duration
    FROM SpanItem
    GROUP BY projectId, serviceId, name, minute
    
  • Build a Service Overview page showing: request rate chart, error rate chart, p50/p95/p99 latency charts
  • Add an API endpoint to query the materialized view

Files to modify:

  • Common/Models/AnalyticsModels/SpanRedMetrics.ts (new - materialized view model)
  • Telemetry/Services/SpanRedMetricsService.ts (new - query service)
  • App/FeatureSet/Dashboard/src/Pages/Service/View/Overview.tsx (new or enhanced - RED dashboard)
  • Worker/DataMigrations/ (new migration to create materialized view)
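One detail worth capturing now: because the view uses AggregatingMergeTree with -State functions, the read side must apply the matching -Merge combinators to finalize the stored states. A sketch of the query the SpanRedMetricsService could build (column names follow the view definition above; the `{projectId: UUID}` placeholder assumes ClickHouse server-side query parameters):

```typescript
// Read-side query for the span_red_metrics materialized view. Each -State
// column is finalized with its -Merge counterpart (countMerge, countIfMerge,
// quantileMerge) — selecting the raw columns would return opaque state blobs.

function buildRedMetricsReadSql(windowMinutes: number): string {
  return [
    "SELECT",
    "  serviceId, name, minute,",
    "  countMerge(request_count) AS requests,",
    "  countIfMerge(error_count) AS errors,",
    "  quantileMerge(0.50)(p50_duration) AS p50,",
    "  quantileMerge(0.95)(p95_duration) AS p95,",
    "  quantileMerge(0.99)(p99_duration) AS p99",
    "FROM span_red_metrics",
    "WHERE projectId = {projectId: UUID}",
    `  AND minute >= now() - INTERVAL ${windowMinutes} MINUTE`,
    "GROUP BY serviceId, name, minute",
    "ORDER BY minute",
  ].join("\n");
}
```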

1.3 Trace-Based Alerting — Extend Beyond Span Count

Current: Basic trace alerting is implemented with MonitorType.Traces. The existing system supports:

  • Filtering by span name, span status (Unset/Ok/Error), service, and attributes
  • Configurable time windows (5s to 24h)
  • Worker job evaluating every minute via MonitorTelemetryMonitor
  • Only one criteria check, CheckOn.SpanCount, which compares the matching span count against a threshold
  • Frontend form (TraceMonitorStepForm.tsx) with preview of matching spans

What's missing: The current implementation can only answer "are there more/fewer than N spans matching this filter?" It cannot alert on latency, error rates, or throughput — the core APM alerting use cases.

Target: Full APM-grade alerting with latency percentiles, error rate, request rate, and Apdex.

Implementation — extend existing infrastructure:

1.3.1 Add Latency Percentile Alerts (P50/P90/P95/P99)

  • Add CheckOn.P50Latency, CheckOn.P90Latency, CheckOn.P95Latency, CheckOn.P99Latency to CriteriaFilter.ts
  • In monitorTrace() worker function, compute quantile(0.50)(durationUnixNano) etc. via ClickHouse instead of just countBy()
  • Return latency values in TraceMonitorResponse alongside span count
  • Add latency criteria evaluation in TraceMonitorCriteria.ts

1.3.2 Add Error Rate Alerts

  • Add CheckOn.ErrorRate to CriteriaFilter.ts
  • Compute countIf(statusCode = 2) / count() * 100 in the worker query
  • Return error rate percentage in TraceMonitorResponse
  • Criteria: "alert if error rate > 5%"

1.3.3 Add Average/Max Duration Alerts

  • Add CheckOn.AvgDuration, CheckOn.MaxDuration to CriteriaFilter.ts
  • Compute avg(durationUnixNano), max(durationUnixNano) in worker query
  • Useful for simpler latency alerts without percentile overhead

1.3.4 Add Request Rate (Throughput) Alerts

  • Add CheckOn.SpanRate to CriteriaFilter.ts
  • Compute count() / time_window_seconds to normalize to spans/second
  • Criteria: "alert if request rate drops below 10 req/s" (detects outages)

1.3.5 Add Apdex Score (Nice-to-have)

  • Add CheckOn.ApdexScore to CriteriaFilter.ts
  • Compute from duration thresholds: (satisfied + tolerating*0.5) / total
  • Allow configuring satisfied/tolerating thresholds per monitor (e.g., satisfied < 500ms, tolerating < 2s)
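The Apdex formula is simple enough to sketch directly; the 500ms/2s defaults below mirror the example thresholds in this section and would come from per-monitor configuration:

```typescript
// Apdex = (satisfied + tolerating * 0.5) / total, where "satisfied" requests
// finish within the satisfied threshold and "tolerating" requests finish
// within the (larger) tolerating threshold.

function apdexScore(
  durationsMs: number[],
  satisfiedMs: number = 500,
  toleratingMs: number = 2000,
): number {
  if (durationsMs.length === 0) {
    return 1; // no traffic in the window: treat as fully satisfied
  }
  const satisfied = durationsMs.filter((d) => d <= satisfiedMs).length;
  const tolerating = durationsMs.filter(
    (d) => d > satisfiedMs && d <= toleratingMs,
  ).length;
  return (satisfied + tolerating * 0.5) / durationsMs.length;
}
```

In the worker this would likely be computed in ClickHouse instead (two countIf() terms over durationUnixNano), but the in-process version is useful for unit-testing the criteria evaluation.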

Files to modify:

  • Common/Types/Monitor/CriteriaFilter.ts (add new CheckOn values: P50Latency, P90Latency, P95Latency, P99Latency, ErrorRate, AvgDuration, MaxDuration, SpanRate, ApdexScore)
  • Common/Types/Monitor/TraceMonitor/TraceMonitorResponse.ts (add latency, error rate, throughput fields)
  • Common/Server/Utils/Monitor/Criteria/TraceMonitorCriteria.ts (add evaluation for new criteria types)
  • Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts (change monitorTrace() from countBy() to aggregation query returning all metrics)
  • App/FeatureSet/Dashboard/src/Components/Form/Monitor/TraceMonitor/TraceMonitorStepForm.tsx (add criteria type selector for latency/error rate/throughput)

1.4 Head-Based Probabilistic Sampling

Current: Ingests 100% of received traces. Target: Configurable per-service probabilistic sampling with rules to always keep errors and slow traces.

Implementation:

  • Create TraceSamplingRule PostgreSQL model: service filter, sample rate (0-100%), conditions to always keep (error status, duration > threshold)
  • Evaluate sampling rules in OtelTracesIngestService.ts before ClickHouse insert
  • Use deterministic sampling based on traceId hash (so all spans from the same trace are kept or dropped together)
  • UI under Settings > Trace Configuration > Sampling Rules
  • Show estimated storage savings
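The deterministic-sampling point deserves a concrete sketch: hashing the traceId means every span of a trace gets the same keep/drop decision regardless of which ingest worker sees it. The FNV-1a hash below is an illustrative stand-in; production code could use sipHash64 to match the ClickHouse partitioning function. Rule field names are assumptions, not the final TraceSamplingRule schema:

```typescript
// Deterministic head-based sampling: keep/drop is a pure function of the
// traceId and the rule, so spans arriving on different workers agree.

function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

interface SamplingDecisionInput {
  traceId: string;
  isError: boolean; // statusCode === 2
  durationMs: number;
  sampleRatePercent: number; // 0-100, from the sampling rule
  slowThresholdMs: number; // always-keep threshold, from the rule
}

function shouldKeepTrace(s: SamplingDecisionInput): boolean {
  // Always-keep conditions override the probabilistic decision.
  if (s.isError || s.durationMs >= s.slowThresholdMs) {
    return true;
  }
  return fnv1a32(s.traceId) % 100 < s.sampleRatePercent;
}
```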

Files to modify:

  • Common/Models/DatabaseModels/TraceSamplingRule.ts (new)
  • Telemetry/Services/OtelTracesIngestService.ts (add sampling logic)
  • Dashboard: new Settings page for sampling configuration

Phase 2: Visualization & Debugging UX (P1) — Industry-Standard Features

2.1 Flame Graph View

Current: Only Gantt/waterfall view. Target: Flame graph visualization showing proportional time spent in each span, with service color coding.

Implementation:

  • Build a flame graph component that renders spans as horizontally stacked rectangles proportional to duration
  • Allow switching between Waterfall and Flame Graph views in TraceExplorer
  • Color-code by service (consistent with waterfall view)
  • Click a span rectangle to focus/zoom into that subtree
  • Show tooltip with span name, service, duration, self-time on hover

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/FlameGraph.tsx (new)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add view toggle)

2.2 Latency Breakdown / Critical Path Analysis

Current: Shows individual span durations but no automated analysis. Target: Compute and display critical path, self-time vs child-time, and bottleneck identification.

Implementation:

  • Compute critical path: the longest sequential chain of spans through the trace (accounts for parallelism)
  • Calculate "self time" per span: span.duration - sum(child.duration) (clamped to 0 for overlapping children)
  • Display latency breakdown by service: percentage of total trace time spent in each service
  • Highlight bottleneck spans (spans contributing most to critical path duration)
  • Add "Critical Path" toggle in TraceExplorer that highlights the critical path spans
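The self-time clamp mentioned above needs interval merging to be correct: overlapping (parallel) children must not be double-counted. A sketch of that calculation, using start/end pairs as they would come from the stored span timestamps:

```typescript
// Self time = span duration minus the union of its children's intervals,
// clipped to the parent's window. Merging overlaps first means two parallel
// children covering the same wall-clock time only subtract once.

interface SpanTiming {
  start: number; // e.g. unix nanos
  end: number;
}

function selfTime(span: SpanTiming, children: SpanTiming[]): number {
  // Clip children to the parent's window and sort by start time.
  const clipped = children
    .map((c) => ({
      start: Math.max(c.start, span.start),
      end: Math.min(c.end, span.end),
    }))
    .filter((c) => c.end > c.start)
    .sort((a, b) => a.start - b.start);

  // Sweep: merge overlapping intervals and sum the covered time.
  let covered = 0;
  let curStart = -1;
  let curEnd = -1;
  for (const c of clipped) {
    if (c.start > curEnd) {
      if (curEnd > curStart) covered += curEnd - curStart;
      curStart = c.start;
      curEnd = c.end;
    } else {
      curEnd = Math.max(curEnd, c.end);
    }
  }
  if (curEnd > curStart) covered += curEnd - curStart;

  return Math.max(0, span.end - span.start - covered);
}
```

The same merged-interval pass is the building block for the critical path: at each level, follow the child whose interval ends the parent's covered chain.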

Files to modify:

  • Common/Utils/Traces/CriticalPath.ts (new - critical path algorithm)
  • App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx (show self-time)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add critical path view)

2.3 In-Trace Span Search

Current: TraceExplorer shows all spans with service filtering and an error toggle, but no text search. Target: Search box to filter spans by name, attribute values, or status within the current trace.

Implementation:

  • Add a search input in TraceExplorer toolbar
  • Client-side filtering: match span name, service name, attribute keys/values against search text
  • Highlight matching spans in the waterfall/flame graph
  • Show match count (e.g., "3 of 47 spans")
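Since the trace is already fully loaded client-side, the match predicate can be a plain function; the Span shape below is a simplified stand-in for the dashboard's span objects:

```typescript
// Case-insensitive substring match against span name, service name, and
// attribute keys/values — the fields the search box should cover.

interface SearchableSpan {
  name: string;
  serviceName: string;
  attributes: Record<string, string>;
}

function spanMatches(span: SearchableSpan, search: string): boolean {
  const needle = search.trim().toLowerCase();
  if (needle === "") return true; // empty search shows all spans
  if (span.name.toLowerCase().includes(needle)) return true;
  if (span.serviceName.toLowerCase().includes(needle)) return true;
  return Object.entries(span.attributes).some(
    ([k, v]) =>
      k.toLowerCase().includes(needle) || v.toLowerCase().includes(needle),
  );
}
```

The "3 of 47 spans" counter is then `spans.filter((s) => spanMatches(s, search)).length` over the loaded trace.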

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add search bar and filtering)

2.4 Per-Trace Service Flow Map

Current: Service dependency graph exists globally but not per-trace. Target: Per-trace visualization showing the path of a request through services with latency annotations.

Implementation:

  • Build a directed graph from the spans in a single trace (services as nodes, calls as edges)
  • Annotate edges with call count and latency
  • Color-code nodes by error status
  • Add as a new view tab alongside Waterfall and Flame Graph
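Edge construction is a single pass over parent/child pairs; a sketch, with span fields mirroring the stored columns (serviceName standing in for however the UI resolves serviceId to a display name):

```typescript
// Build service-level edges from one trace's spans: emit an edge whenever a
// parent->child hop crosses a service boundary, accumulating call count and
// total latency per (from, to) pair.

interface MapSpan {
  spanId: string;
  parentSpanId: string | null;
  serviceName: string;
  durationMs: number;
}

interface ServiceEdge {
  from: string;
  to: string;
  calls: number;
  totalLatencyMs: number;
}

function buildServiceEdges(spans: MapSpan[]): ServiceEdge[] {
  const byId = new Map<string, MapSpan>(spans.map((s) => [s.spanId, s]));
  const edges = new Map<string, ServiceEdge>();
  for (const span of spans) {
    const parent = span.parentSpanId ? byId.get(span.parentSpanId) : undefined;
    // Same-service hops stay inside a node; only cross-service calls are edges.
    if (!parent || parent.serviceName === span.serviceName) continue;
    const key = `${parent.serviceName}->${span.serviceName}`;
    const edge = edges.get(key) ?? {
      from: parent.serviceName,
      to: span.serviceName,
      calls: 0,
      totalLatencyMs: 0,
    };
    edge.calls += 1;
    edge.totalLatencyMs += span.durationMs;
    edges.set(key, edge);
  }
  return [...edges.values()];
}
```

Node error coloring can reuse the same pass by flagging any service owning a span with statusCode = 2.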

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/TraceServiceMap.tsx (new)
  • App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx (add view tab)

2.5 Span Link Navigation

Current: Span links data is stored but not navigable in the UI. Target: Clickable links in the span detail panel that navigate to related traces/spans.

Implementation:

  • In the SpanViewer detail panel, render the links array as clickable items
  • Each link shows the linked traceId, spanId, and relationship type
  • Clicking navigates to the linked trace view

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx (render clickable links)

Phase 3: Advanced Analytics & Correlation (P2) — Power Features

3.1 Trace-to-Metric Exemplars

Current: Metric model has no traceId/spanId fields. Target: Link metric data points to trace IDs; show exemplar dots on metric charts that navigate to traces.

Implementation:

  • Add optional traceId and spanId columns to the Metric ClickHouse model
  • During metric ingestion, extract exemplar trace/span IDs from OTLP exemplar fields
  • On metric charts, render exemplar dots at data points that have associated traces
  • Clicking an exemplar dot navigates to the trace view

Files to modify:

  • Common/Models/AnalyticsModels/Metric.ts (add traceId/spanId columns)
  • Telemetry/Services/OtelMetricsIngestService.ts (extract exemplars)
  • App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx (render exemplar dots)

3.2 Custom Metrics from Spans

Current: No way to create persistent metrics from trace data. Target: Users define custom metrics from span attributes that are computed via ClickHouse materialized views and available for alerting and dashboards.

Implementation:

  • Create SpanDerivedMetric model: name, filter query (which spans), aggregation (count/avg/p99 of what field), GROUP BY attributes
  • Use ClickHouse materialized views for efficient computation
  • Surface derived metrics in the metric explorer and alerting system

Files to modify:

  • Common/Models/DatabaseModels/SpanDerivedMetric.ts (new)
  • Common/Server/Services/SpanDerivedMetricService.ts (new)
  • Dashboard: UI for defining derived metrics

3.3 Structural Trace Queries

Current: Can only filter on individual span attributes. Target: Query traces based on properties of multiple spans and their relationships (e.g., "find traces where service A called service B and B returned an error").

Implementation:

  • Design a visual query builder for structural queries (easier adoption than a query language)
  • Translate structural queries to ClickHouse subqueries with JOINs on traceId
  • Example: "Find traces where span with service=frontend has child span with service=database AND duration > 500ms"
    SELECT DISTINCT s1.traceId FROM SpanItem s1
    JOIN SpanItem s2 ON s1.traceId = s2.traceId AND s1.spanId = s2.parentSpanId
    WHERE s1.projectId = {pid}
      AND s1.attributes['service.name'] = 'frontend'
      AND s2.attributes['service.name'] = 'database'
      AND s2.durationUnixNano > 500000000
    

Files to modify:

  • Common/Types/Traces/StructuralTraceQuery.ts (new - query model)
  • Common/Server/Services/SpanService.ts (add structural query execution)
  • App/FeatureSet/Dashboard/src/Components/Traces/StructuralQueryBuilder.tsx (new - visual builder)

3.4 Trace Comparison / Diffing

Current: No way to compare traces. Target: Side-by-side comparison of two traces of the same operation, highlighting differences in span count, latency, and structure.

Implementation:

  • Add "Compare" action to trace list (select two traces)
  • Build a diff view showing: added/removed spans, latency differences per span, structural changes
  • Useful for comparing a slow trace to a fast trace of the same operation

Files to modify:

  • App/FeatureSet/Dashboard/src/Components/Traces/TraceComparison.tsx (new)
  • App/FeatureSet/Dashboard/src/Pages/Traces/Compare.tsx (new page)

Phase 4: Competitive Differentiation (P3) — Long-Term

4.1 Rules-Based and Tail-Based Sampling

Current: Phase 1 adds head-based probabilistic sampling. Target: Rules-based sampling (always keep errors/slow traces, sample successes) and eventually tail-based sampling (buffer complete traces, decide after seeing all spans).

Implementation:

  • Rules engine: configurable conditions (service, status, duration, attributes) with per-rule sample rates
  • Tail-based: buffer spans for a configurable window (30s), assemble complete traces, then apply retention decisions
  • Tail-based is complex; consider integrating with OpenTelemetry Collector's tail sampling processor as an alternative

4.2 AI/ML on Trace Data

  • Anomaly detection on RED metrics (statistical deviation from baseline)
  • Auto-surfacing correlated attributes when latency spikes (similar to Honeycomb BubbleUp)
  • Natural language trace queries ("show me slow database calls from the last hour")
  • Automatic root cause analysis from trace data during incidents

4.3 RUM (Real User Monitoring) Correlation

  • Browser SDK that propagates W3C trace context from frontend to backend
  • Link frontend page loads, interactions, and web vitals to backend traces
  • Show end-to-end user experience from browser to backend services

4.4 Continuous Profiling Integration

  • Integrate with a profiling backend (e.g., Pyroscope)
  • Link profile data to span time windows
  • Show "Code Hotspots" within spans (similar to DataDog)

ClickHouse Storage Improvements

S.3 Add Trace-by-ID Projection (LOW)

Current: The trace detail view originally relied on the BloomFilter skip index for traceId lookups; the proj_trace_by_id projection added in S.2 now serves trace-by-ID queries but has not been benchmarked. Target: Confirm via EXPLAIN that the projection is chosen for trace-by-ID queries, and extend its sort order to (projectId, traceId, startTime) if the current definition is insufficient.


Quick Wins (Can Ship This Week)

  1. In-trace span search - Add a text filter in TraceExplorer (a few hours of work)
  2. Self-time calculation - Show "self time" (span duration minus child durations) in SpanViewer
  3. Span link navigation - Links data is stored but not clickable in UI
  4. Top-N slowest operations - Simple ClickHouse query: ORDER BY durationUnixNano DESC LIMIT N
  5. Error rate by service - Aggregate statusCode=2 counts grouped by serviceId
  6. Trace duration distribution histogram - Use ClickHouse histogram() on durationUnixNano
  7. Span count per service display - Already tracked in servicesInTrace, just needs better display

Recommended Implementation Order

  1. Phase 1.1 - Trace Analytics Engine (highest impact, unlocks everything else)
  2. Phase 1.2 - RED Metrics from Traces (prerequisite for alerting, service overview)
  3. Quick Wins - Ship in-trace search, self-time, span links, top-N operations
  4. Phase 1.3 - Trace-Based Alerting (core observability workflow)
  5. Phase 2.1 - Flame Graph View (industry-standard visualization)
  6. Phase 2.2 - Critical Path Analysis (key debugging capability)
  7. Phase 1.4 - Head-Based Sampling (essential for high-volume users)
  8. Phase 2.3-2.5 - In-trace search, per-trace map, span links
  9. Phase 3.1 - Trace-to-Metric Exemplars
  10. Phase 3.2-3.4 - Custom metrics, structural queries, comparison
  11. Phase 4.x - AI/ML, RUM, profiling (long-term)

Verification

For each feature:

  1. Unit tests for new query builders, critical path algorithm, sampling logic
  2. Integration tests for new API endpoints (analytics, RED metrics, sampling)
  3. Manual verification via the dev server at https://oneuptimedev.genosyn.com/dashboard/{projectId}/traces
  4. Check ClickHouse query performance with EXPLAIN for new aggregation queries
  5. Verify trace correlation (logs, exceptions, metrics) still works correctly with new features
  6. Load test sampling logic to ensure it doesn't add ingestion latency