mirror of https://github.com/OneUptime/oneuptime.git (synced 2026-04-06 00:32:12 +02:00)
# Plan: Bring OneUptime Traces to Industry Parity and Beyond

## Context

OneUptime's trace implementation provides OTLP-native ingestion (HTTP and gRPC), ClickHouse storage with a full OpenTelemetry span model (events, links, status, attributes, resources, scope), a Gantt/waterfall visualization, trace-to-log and trace-to-exception correlation, a basic service dependency graph, queue-based async ingestion, and per-service data retention with TTL. The ClickHouse schema has been optimized with BloomFilter indexes on traceId/spanId/parentSpanId, Set indexes on statusCode/kind/hasException, a TokenBF index on name, and ZSTD compression on key columns.

This plan identifies the remaining gaps versus DataDog, NewRelic, Honeycomb, and Grafana Tempo, and proposes a phased implementation to close them and surpass the competition.

## Completed

The following features have been implemented:

- **OTLP Ingestion** - HTTP and gRPC trace ingestion with async queue-based processing
- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
- **Gantt/Waterfall View** - Hierarchical span visualization with color-coded services, time-unit auto-scaling, error indicators
- **Trace-to-Log Correlation** - Log model has traceId/spanId columns; SpanViewer shows associated logs
- **Trace-to-Exception Correlation** - ExceptionInstance model links to traceId/spanId with stack trace parsing and fingerprinting
- **Span Detail Panel** - Side-over with tabs for Basic Info, Logs, Attributes, Events, Exceptions
- **BloomFilter indexes** on traceId, spanId, parentSpanId
- **Set indexes** on statusCode, kind, hasException
- **TokenBF index** on name
- **ZSTD compression** on time/ID/attribute columns
- **hasException boolean column** for fast error span filtering
- **links default value** corrected to `[]`
- **Basic Trace-Based Alerting** - MonitorType.Traces with span count threshold alerting, span name/status/service/attribute filtering, time windows (5s-24h), a worker job running every minute, and a frontend form with preview
- **S.1** - Migrated `attributes` to Map(String, String) (TableColumnType.MapStringString in the Span model, with an `attributeKeys` array for fast enumeration)
- **S.2** - Aggregation projections (`proj_agg_by_service` for service-level COUNT/AVG/P99 aggregation, `proj_trace_by_id` for trace-by-ID queries)

## Gap Analysis Summary

| Feature | OneUptime | DataDog | NewRelic | Tempo/Honeycomb | Priority |
|---------|-----------|---------|----------|-----------------|----------|
| Trace analytics / aggregation engine | None | Trace Explorer with COUNT/percentiles | NRQL on span data | TraceQL rate/count/quantile | **P0** |
| RED metrics from traces | None | Auto-computed on 100% traffic | Derived golden signals | Metrics-generator to Prometheus | **P0** |
| Trace-based alerting | **Partial** — span count only, no latency/error rate/Apdex | APM Monitors (p50-p99, error rate, Apdex) | NRQL alert conditions | Via Grafana alerting / Triggers | **P0** |
| Sampling controls | None (100% ingestion) | Head-based adaptive + retention filters | Infinite Tracing (tail-based) | Refinery (rules/dynamic/tail) | **P0** |
| Flame graph view | None | Yes (default view) | No | No | **P1** |
| Latency breakdown / critical path | None | Per-hop latency, bottleneck detection | No | BubbleUp (Honeycomb) | **P1** |
| In-trace search | None | Yes | No | No | **P1** |
| Per-trace service map | None | Yes (Map view) | No | No | **P1** |
| Trace-to-metric exemplars | None | Pivot from metric graph to traces | Metric-to-trace linking | Prometheus exemplars | **P1** |
| Custom metrics from spans | None | Generate count/distribution/gauge from tags | Via NRQL | SLOs from span data | **P2** |
| Structural trace queries | None | Trace Queries (multi-span relationships) | Via NRQL | TraceQL spanset pipelines | **P2** |
| Trace comparison / diffing | None | Partial | Side-by-side comparison | compare() in TraceQL | **P2** |
| AI/ML on traces | None | Watchdog (auto anomaly + RCA) | NRAI | BubbleUp (pattern detection) | **P3** |
| RUM correlation | None | Frontend-to-backend trace linking | Yes | Faro / frontend observability | **P3** |
| Continuous profiling | None | Code Hotspots (span-to-profile) | Partial | Pyroscope | **P3** |

---

## Phase 1: Analytics & Alerting Foundation (P0) — Highest Impact

Without these, users cannot answer basic questions like "is my service healthy?" from trace data.

### 1.1 Trace Analytics / Aggregation Engine

**Current**: Can list/filter individual spans and view individual traces. No way to aggregate or compute statistics.

**Target**: Full trace analytics supporting COUNT, AVG, SUM, MIN, MAX, and P50/P75/P90/P95/P99 aggregations, with GROUP BY on any span attribute and time-series bucketing.

**Implementation**:

- Build a trace analytics API endpoint that translates query configs into ClickHouse aggregation queries
- Use ClickHouse's native functions: `quantile(0.99)(durationUnixNano)`, `countIf(statusCode = 2)`, `toStartOfInterval(startTime, INTERVAL 1 MINUTE)`
- Support GROUP BY on service, span name, kind, status, and any custom attribute (via JSON extraction)
- Frontend: Add an "Analytics" tab to the Traces page with chart types (timeseries, top list, table), similar to the existing LogsAnalyticsView
- Support switching between the current "List" view and the new "Analytics" view
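
The translation step can be sketched as follows. The `TraceAnalyticsQuery` interface is only proposed in this plan, so the shape below is an assumption; table and column names (`SpanItem`, `durationUnixNano`, `startTime`) follow the plan.

```typescript
// Illustrative sketch: translate a hypothetical TraceAnalyticsQuery config
// into a ClickHouse aggregation query string. The interface shape is an
// assumption, not the final design.
interface TraceAnalyticsQuery {
  aggregation: "count" | "avg" | "p50" | "p95" | "p99";
  groupBy?: string; // e.g. "serviceId" or "name"
  intervalMinutes: number; // time-series bucket size
}

function toClickHouseSql(q: TraceAnalyticsQuery): string {
  const aggExpr: Record<TraceAnalyticsQuery["aggregation"], string> = {
    count: "count()",
    avg: "avg(durationUnixNano)",
    p50: "quantile(0.50)(durationUnixNano)",
    p95: "quantile(0.95)(durationUnixNano)",
    p99: "quantile(0.99)(durationUnixNano)",
  };
  const bucket = `toStartOfInterval(startTime, INTERVAL ${q.intervalMinutes} MINUTE)`;
  const groupCols = q.groupBy ? `${q.groupBy}, ` : "";
  return (
    `SELECT ${groupCols}${bucket} AS bucket, ${aggExpr[q.aggregation]} AS value ` +
    `FROM SpanItem WHERE projectId = {projectId:String} ` +
    `GROUP BY ${groupCols}bucket ORDER BY bucket`
  );
}
```

The `{projectId:String}` placeholder assumes ClickHouse bound query parameters; GROUP BY on a custom attribute would extend `groupCols` with a map access on the `attributes` column.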

**Files to modify**:

- `Common/Server/API/TelemetryAPI.ts` (add trace analytics endpoint)
- `Common/Server/Services/SpanService.ts` (add aggregation query methods)
- `Common/Types/Traces/TraceAnalyticsQuery.ts` (new - query interface)
- `App/FeatureSet/Dashboard/src/Pages/Traces/Index.tsx` (add analytics view toggle)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceAnalyticsView.tsx` (new - analytics UI)

### 1.2 RED Metrics from Traces (Request Rate, Error Rate, Duration)

**Current**: No automatic computation of service-level metrics from trace data.

**Target**: Auto-computed per-service, per-operation RED metrics displayed on a Service Overview page.

**Implementation**:

- Create a ClickHouse materialized view that aggregates spans into per-service, per-operation metrics at 1-minute intervals:
```sql
CREATE MATERIALIZED VIEW span_red_metrics
ENGINE = AggregatingMergeTree()
ORDER BY (projectId, serviceId, name, minute)
AS SELECT
  projectId, serviceId, name,
  toStartOfMinute(startTime) AS minute,
  countState() AS request_count,
  countIfState(statusCode = 2) AS error_count,
  quantileState(0.50)(durationUnixNano) AS p50_duration,
  quantileState(0.95)(durationUnixNano) AS p95_duration,
  quantileState(0.99)(durationUnixNano) AS p99_duration
FROM SpanItem
GROUP BY projectId, serviceId, name, minute
```

- Build a Service Overview page showing request rate, error rate, and p50/p95/p99 latency charts
- Add an API endpoint to query the materialized view
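
One detail the read side must get right: `AggregatingMergeTree` stores partial aggregation states (from `countState`/`quantileState`), so queries must finalize them with ClickHouse `-Merge` combinators. A sketch of the query the proposed service would issue (the function name is illustrative):

```typescript
// Sketch: query builder for the proposed span_red_metrics view. States
// written with -State combinators are read back with -Merge combinators.
// NOTE: real code should use bound query parameters; string interpolation
// here only keeps the sketch short.
function redMetricsSql(projectId: string, serviceId: string): string {
  return `
    SELECT
      minute,
      countMerge(request_count) AS requests,
      countIfMerge(error_count) AS errors,
      quantileMerge(0.50)(p50_duration) AS p50,
      quantileMerge(0.95)(p95_duration) AS p95,
      quantileMerge(0.99)(p99_duration) AS p99
    FROM span_red_metrics
    WHERE projectId = '${projectId}' AND serviceId = '${serviceId}'
    GROUP BY minute
    ORDER BY minute`;
}
```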

**Files to modify**:

- `Common/Models/AnalyticsModels/SpanRedMetrics.ts` (new - materialized view model)
- `Telemetry/Services/SpanRedMetricsService.ts` (new - query service)
- `App/FeatureSet/Dashboard/src/Pages/Service/View/Overview.tsx` (new or enhanced - RED dashboard)
- `Worker/DataMigrations/` (new migration to create the materialized view)

### 1.3 Trace-Based Alerting — Extend Beyond Span Count

**Current**: Basic trace alerting is implemented with `MonitorType.Traces`. The existing system supports:

- Filtering by span name, span status (Unset/Ok/Error), service, and attributes
- Configurable time windows (5s to 24h)
- A worker job evaluating every minute via `MonitorTelemetryMonitor`
- **Only one criteria check**: `CheckOn.SpanCount` — compares the matching span count against a threshold
- A frontend form (`TraceMonitorStepForm.tsx`) with a preview of matching spans

**What's missing**: The current implementation can only answer "are there more/fewer than N spans matching this filter?" It cannot alert on latency, error rates, or throughput — the core APM alerting use cases.

**Target**: Full APM-grade alerting with latency percentiles, error rate, request rate, and Apdex.

**Implementation — extend the existing infrastructure**:
#### 1.3.1 Add Latency Percentile Alerts (P50/P90/P95/P99)

- Add `CheckOn.P50Latency`, `CheckOn.P90Latency`, `CheckOn.P95Latency`, `CheckOn.P99Latency` to `CriteriaFilter.ts`
- In the `monitorTrace()` worker function, compute `quantile(0.50)(durationUnixNano)` etc. via ClickHouse instead of just `countBy()`
- Return latency values in `TraceMonitorResponse` alongside span count
- Add latency criteria evaluation in `TraceMonitorCriteria.ts`

#### 1.3.2 Add Error Rate Alerts

- Add `CheckOn.ErrorRate` to `CriteriaFilter.ts`
- Compute `countIf(statusCode = 2) / count() * 100` in the worker query
- Return error rate percentage in `TraceMonitorResponse`
- Criteria: "alert if error rate > 5%"

#### 1.3.3 Add Average/Max Duration Alerts

- Add `CheckOn.AvgDuration`, `CheckOn.MaxDuration` to `CriteriaFilter.ts`
- Compute `avg(durationUnixNano)`, `max(durationUnixNano)` in worker query
- Useful for simpler latency alerts without percentile overhead

#### 1.3.4 Add Request Rate (Throughput) Alerts

- Add `CheckOn.SpanRate` to `CriteriaFilter.ts`
- Compute `count() / time_window_seconds` to normalize to spans/second
- Criteria: "alert if request rate drops below 10 req/s" (detects outages)
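
The error-rate and throughput values in 1.3.2 and 1.3.4 are simple arithmetic over counts already returned by the aggregation query; a minimal sketch (function names are illustrative, not existing code):

```typescript
// Error rate as a percentage of matching spans, guarding against an empty
// window so the monitor does not divide by zero.
function errorRatePercent(errorCount: number, totalCount: number): number {
  return totalCount === 0 ? 0 : (errorCount / totalCount) * 100;
}

// Throughput normalized to spans per second regardless of the monitor's
// configured time window (5s to 24h).
function spansPerSecond(totalCount: number, windowSeconds: number): number {
  return windowSeconds === 0 ? 0 : totalCount / windowSeconds;
}
```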

#### 1.3.5 Add Apdex Score (Nice-to-have)

- Add `CheckOn.ApdexScore` to `CriteriaFilter.ts`
- Compute from duration thresholds: `(satisfied + tolerating*0.5) / total`
- Allow configuring satisfied/tolerating thresholds per monitor (e.g., satisfied < 500ms, tolerating < 2s)
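
The Apdex formula above translates directly into code; a sketch with per-monitor thresholds passed in as parameters (the 500ms/2s values in the bullet are examples, not defaults):

```typescript
// Sketch of the proposed Apdex computation:
//   satisfied:  duration <= satisfiedMs
//   tolerating: satisfiedMs < duration <= toleratingMs
//   score = (satisfied + 0.5 * tolerating) / total
function apdexScore(
  durationsMs: number[],
  satisfiedMs: number,
  toleratingMs: number,
): number {
  if (durationsMs.length === 0) {
    return 1; // no traffic in the window: treat as fully satisfied
  }
  const satisfied = durationsMs.filter((d) => d <= satisfiedMs).length;
  const tolerating = durationsMs.filter(
    (d) => d > satisfiedMs && d <= toleratingMs,
  ).length;
  return (satisfied + tolerating * 0.5) / durationsMs.length;
}
```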

**Files to modify**:

- `Common/Types/Monitor/CriteriaFilter.ts` (add new CheckOn values: P50Latency, P90Latency, P95Latency, P99Latency, ErrorRate, AvgDuration, MaxDuration, SpanRate, ApdexScore)
- `Common/Types/Monitor/TraceMonitor/TraceMonitorResponse.ts` (add latency, error rate, throughput fields)
- `Common/Server/Utils/Monitor/Criteria/TraceMonitorCriteria.ts` (add evaluation for new criteria types)
- `Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts` (change `monitorTrace()` from `countBy()` to an aggregation query returning all metrics)
- `App/FeatureSet/Dashboard/src/Components/Form/Monitor/TraceMonitor/TraceMonitorStepForm.tsx` (add a criteria type selector for latency/error rate/throughput)

### 1.4 Head-Based Probabilistic Sampling

**Current**: Ingests 100% of received traces.

**Target**: Configurable per-service probabilistic sampling, with rules to always keep errors and slow traces.

**Implementation**:

- Create a `TraceSamplingRule` PostgreSQL model: service filter, sample rate (0-100%), and conditions to always keep (error status, duration > threshold)
- Evaluate sampling rules in `OtelTracesIngestService.ts` before the ClickHouse insert
- Use deterministic sampling based on a traceId hash (so all spans from the same trace are kept or dropped together)
- UI under Settings > Trace Configuration > Sampling Rules
- Show estimated storage savings
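
The deterministic keep/drop decision could look like the sketch below: hashing the traceId means every ingest node reaches the same decision for every span of a trace, with no coordination. FNV-1a is used purely for illustration; any stable hash over the traceId works.

```typescript
// Sketch of deterministic head sampling. Spans sharing a traceId always hash
// to the same bucket, so a trace is kept or dropped as a whole.
function shouldKeepTrace(traceId: string, sampleRatePercent: number): boolean {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return hash % 100 < sampleRatePercent;
}
```

The always-keep conditions (error status, duration above threshold) would be checked first, with this function only deciding the fate of the remaining traces.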

**Files to modify**:

- `Common/Models/DatabaseModels/TraceSamplingRule.ts` (new)
- `Telemetry/Services/OtelTracesIngestService.ts` (add sampling logic)
- Dashboard: new Settings page for sampling configuration

---

## Phase 2: Visualization & Debugging UX (P1) — Industry-Standard Features

### 2.1 Flame Graph View

**Current**: Only the Gantt/waterfall view.

**Target**: A flame graph visualization showing proportional time spent in each span, with service color coding.

**Implementation**:

- Build a flame graph component that renders spans as horizontally stacked rectangles proportional to duration
- Allow switching between Waterfall and Flame Graph views in TraceExplorer
- Color-code by service (consistent with the waterfall view)
- Click a span rectangle to focus/zoom into that subtree
- Show a tooltip with span name, service, duration, and self-time on hover
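
The geometry such a component needs can be sketched independently of rendering: each span becomes a rectangle whose x offset and width are fractions of the trace duration, and whose row is its depth in the span tree. Types below are illustrative and assume an acyclic span tree.

```typescript
// Sketch: compute normalized flame-graph rectangles from span timings.
interface FlameSpan {
  spanId: string;
  parentSpanId: string | null;
  start: number; // e.g. unix nanos
  end: number;
}
interface FlameRect {
  spanId: string;
  x: number; // 0..1 offset within the trace
  width: number; // 0..1 fraction of trace duration
  depth: number; // row index (root = 0)
}

function layoutFlameGraph(spans: FlameSpan[]): FlameRect[] {
  const traceStart = Math.min(...spans.map((s) => s.start));
  const traceEnd = Math.max(...spans.map((s) => s.end));
  const total = traceEnd - traceStart || 1; // avoid division by zero
  const byId = new Map(spans.map((s): [string, FlameSpan] => [s.spanId, s]));
  const depthOf = (s: FlameSpan): number => {
    let depth = 0;
    let cur = s;
    while (cur.parentSpanId && byId.has(cur.parentSpanId)) {
      cur = byId.get(cur.parentSpanId)!;
      depth++;
    }
    return depth;
  };
  return spans.map((s) => ({
    spanId: s.spanId,
    x: (s.start - traceStart) / total,
    width: (s.end - s.start) / total,
    depth: depthOf(s),
  }));
}
```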

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Components/Traces/FlameGraph.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view toggle)

### 2.2 Latency Breakdown / Critical Path Analysis

**Current**: Shows individual span durations but no automated analysis.

**Target**: Compute and display the critical path, self-time vs. child-time, and bottleneck identification.

**Implementation**:

- Compute the critical path: the longest sequential chain of spans through the trace (accounting for parallelism)
- Calculate "self time" per span: `span.duration - sum(child.duration)` (clamped to 0 for overlapping children)
- Display a latency breakdown by service: the percentage of total trace time spent in each service
- Highlight bottleneck spans (those contributing most to the critical path duration)
- Add a "Critical Path" toggle in TraceExplorer that highlights the critical-path spans
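
The self-time calculation needs one subtlety beyond the formula above: overlapping (parallel) children must be merged into intervals before subtracting, or their time is double-counted. A sketch of that computation (types are illustrative):

```typescript
// Sketch: self time = span duration minus the union of its children's
// time ranges, clipped to the span's own range.
interface SpanInterval {
  start: number;
  end: number;
}

function selfTime(span: SpanInterval, children: SpanInterval[]): number {
  // Clip children to the parent's range and sort by start time.
  const clipped = children
    .map((c) => ({
      start: Math.max(c.start, span.start),
      end: Math.min(c.end, span.end),
    }))
    .filter((c) => c.end > c.start)
    .sort((a, b) => a.start - b.start);

  // Sweep left to right, counting only newly covered time.
  let covered = 0;
  let cursor = span.start;
  for (const c of clipped) {
    if (c.end <= cursor) continue; // fully inside an earlier child
    covered += c.end - Math.max(c.start, cursor);
    cursor = Math.max(cursor, c.end);
  }
  return span.end - span.start - covered;
}
```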

**Files to modify**:

- `Common/Utils/Traces/CriticalPath.ts` (new - critical path algorithm)
- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (show self-time)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add critical path view)

### 2.3 In-Trace Span Search

**Current**: TraceExplorer shows all spans with service filtering and an error toggle, but no text search.

**Target**: A search box to filter spans by name, attribute values, or status within the current trace.

**Implementation**:

- Add a search input in the TraceExplorer toolbar
- Client-side filtering: match span name, service name, and attribute keys/values against the search text
- Highlight matching spans in the waterfall/flame graph
- Show a match count (e.g., "3 of 47 spans")
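
The client-side matcher can stay very simple: a case-insensitive substring match across the searchable fields. The span shape below is illustrative.

```typescript
// Sketch: does a span match the in-trace search text?
interface SearchableSpan {
  name: string;
  serviceName: string;
  attributes: Record<string, string>;
}

function matchesSearch(span: SearchableSpan, query: string): boolean {
  const q = query.trim().toLowerCase();
  if (!q) return true; // empty search matches everything
  if (span.name.toLowerCase().includes(q)) return true;
  if (span.serviceName.toLowerCase().includes(q)) return true;
  return Object.entries(span.attributes).some(
    ([key, value]) =>
      key.toLowerCase().includes(q) || value.toLowerCase().includes(q),
  );
}
```

The match count ("3 of 47 spans") is then just `spans.filter((s) => matchesSearch(s, query)).length`.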

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add search bar and filtering)

### 2.4 Per-Trace Service Flow Map

**Current**: A service dependency graph exists globally, but not per-trace.

**Target**: A per-trace visualization showing the path of a request through services, with latency annotations.

**Implementation**:

- Build a directed graph from the spans in a single trace (services as nodes, calls as edges)
- Annotate edges with call count and latency
- Color-code nodes by error status
- Add as a new view tab alongside Waterfall and Flame Graph
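
Deriving the graph is a single pass over parent/child span pairs: an edge exists wherever a parent and child span belong to different services. A sketch with illustrative types:

```typescript
// Sketch: build service-to-service call edges from the spans of one trace.
interface GraphSpan {
  spanId: string;
  parentSpanId: string | null;
  serviceId: string;
}

function serviceEdges(spans: GraphSpan[]): Map<string, number> {
  const byId = new Map(spans.map((s): [string, GraphSpan] => [s.spanId, s]));
  const edges = new Map<string, number>(); // "caller->callee" -> call count
  for (const span of spans) {
    const parent = span.parentSpanId ? byId.get(span.parentSpanId) : undefined;
    // Only cross-service hops become edges; internal spans are skipped.
    if (parent && parent.serviceId !== span.serviceId) {
      const key = `${parent.serviceId}->${span.serviceId}`;
      edges.set(key, (edges.get(key) ?? 0) + 1);
    }
  }
  return edges;
}
```

Latency annotations would extend the edge value with aggregated child-span durations.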

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Components/Traces/TraceServiceMap.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view tab)

### 2.5 Span Link Navigation

**Current**: Link data is stored in spans but not navigable in the UI.

**Target**: Clickable links in the span detail panel that navigate to related traces/spans.

**Implementation**:

- In the SpanViewer detail panel, render the `links` array as clickable items
- Each link shows the linked traceId, spanId, and relationship type
- Clicking navigates to the linked trace view

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (render clickable links)

---

## Phase 3: Advanced Analytics & Correlation (P2) — Power Features

### 3.1 Trace-to-Metric Exemplars

**Current**: The Metric model has no traceId/spanId fields.

**Target**: Link metric data points to trace IDs; show exemplar dots on metric charts that navigate to traces.

**Implementation**:

- Add optional `traceId` and `spanId` columns to the Metric ClickHouse model
- During metric ingestion, extract exemplar trace/span IDs from the OTLP exemplar fields
- On metric charts, render exemplar dots at data points that have associated traces
- Clicking an exemplar dot navigates to the trace view
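
In the OTLP metrics data model, an exemplar carries optional trace and span IDs alongside its value. The extraction step could look like the sketch below; the wrapper type is an assumption for illustration, not the OTLP SDK type.

```typescript
// Sketch: pull trace/span links out of OTLP-style exemplars attached to a
// metric data point. Field names follow the OTLP JSON encoding; the
// OtlpExemplar interface itself is illustrative.
interface OtlpExemplar {
  traceId?: string;
  spanId?: string;
  asDouble?: number;
  timeUnixNano?: string;
}

function extractExemplarLinks(
  exemplars?: OtlpExemplar[],
): { traceId: string; spanId: string }[] {
  const links: { traceId: string; spanId: string }[] = [];
  for (const e of exemplars ?? []) {
    // Only exemplars with both IDs can be turned into navigable links.
    if (e.traceId && e.spanId) {
      links.push({ traceId: e.traceId, spanId: e.spanId });
    }
  }
  return links;
}
```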

**Files to modify**:

- `Common/Models/AnalyticsModels/Metric.ts` (add traceId/spanId columns)
- `Telemetry/Services/OtelMetricsIngestService.ts` (extract exemplars)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render exemplar dots)

### 3.2 Custom Metrics from Spans

**Current**: No way to create persistent metrics from trace data.

**Target**: Users define custom metrics from span attributes, computed via ClickHouse materialized views and available for alerting and dashboards.

**Implementation**:

- Create a `SpanDerivedMetric` model: name, filter query (which spans), aggregation (count/avg/p99 of which field), and GROUP BY attributes
- Use ClickHouse materialized views for efficient computation
- Surface derived metrics in the metric explorer and alerting system

**Files to modify**:

- `Common/Models/DatabaseModels/SpanDerivedMetric.ts` (new)
- `Common/Server/Services/SpanDerivedMetricService.ts` (new)
- Dashboard: UI for defining derived metrics

### 3.3 Structural Trace Queries

**Current**: Can only filter on individual span attributes.

**Target**: Query traces based on properties of multiple spans and their relationships (e.g., "find traces where service A called service B and B returned an error").

**Implementation**:

- Design a visual query builder for structural queries (easier adoption than a query language)
- Translate structural queries to ClickHouse subqueries with JOINs on traceId
- Example: "Find traces where a span with service=frontend has a child span with service=database AND duration > 500ms"
```sql
SELECT DISTINCT s1.traceId FROM SpanItem s1
JOIN SpanItem s2 ON s1.traceId = s2.traceId AND s1.spanId = s2.parentSpanId
WHERE s1.projectId = {pid}
  AND JSONExtractString(s1.attributes, 'service.name') = 'frontend'
  AND JSONExtractString(s2.attributes, 'service.name') = 'database'
  AND s2.durationUnixNano > 500000000
```

**Files to modify**:

- `Common/Types/Traces/StructuralTraceQuery.ts` (new - query model)
- `Common/Server/Services/SpanService.ts` (add structural query execution)
- `App/FeatureSet/Dashboard/src/Components/Traces/StructuralQueryBuilder.tsx` (new - visual builder)

### 3.4 Trace Comparison / Diffing

**Current**: No way to compare traces.

**Target**: Side-by-side comparison of two traces of the same operation, highlighting differences in span count, latency, and structure.

**Implementation**:

- Add a "Compare" action to the trace list (select two traces)
- Build a diff view showing added/removed spans, latency differences per span, and structural changes
- Useful for comparing a slow trace to a fast trace of the same operation

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Components/Traces/TraceComparison.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Traces/Compare.tsx` (new page)

---

## Phase 4: Competitive Differentiation (P3) — Long-Term

### 4.1 Rules-Based and Tail-Based Sampling

**Current**: Phase 1 adds head-based probabilistic sampling.

**Target**: Rules-based sampling (always keep errors/slow traces, sample successes) and eventually tail-based sampling (buffer complete traces, decide after seeing all spans).

**Implementation**:

- Rules engine: configurable conditions (service, status, duration, attributes) with per-rule sample rates
- Tail-based: buffer spans for a configurable window (e.g., 30s), assemble complete traces, then apply retention decisions
- Tail-based sampling is complex; consider integrating with the OpenTelemetry Collector's tail sampling processor as an alternative

### 4.2 AI/ML on Trace Data

- **Anomaly detection** on RED metrics (statistical deviation from baseline)
- **Auto-surfacing correlated attributes** when latency spikes (similar to Honeycomb BubbleUp)
- **Natural language trace queries** ("show me slow database calls from the last hour")
- **Automatic root cause analysis** from trace data during incidents

### 4.3 RUM (Real User Monitoring) Correlation

- Browser SDK that propagates W3C trace context from frontend to backend
- Link frontend page loads, interactions, and web vitals to backend traces
- Show end-to-end user experience from browser to backend services

### 4.4 Continuous Profiling Integration

- Integrate with a profiling backend (e.g., Pyroscope)
- Link profile data to span time windows
- Show "Code Hotspots" within spans (similar to DataDog)

---

## ClickHouse Storage Improvements

### S.3 Add Trace-by-ID Projection (LOW)

**Current**: The trace detail view relies on the BloomFilter skip index for traceId lookups. (Note: the `proj_trace_by_id` projection has been added, but may need evaluation for further optimization.)

**Target**: A projection sorted by `(projectId, traceId, startTime)` for faster trace-by-ID queries.

---

## Quick Wins (Can Ship This Week)

1. **In-trace span search** - Add a text filter in TraceExplorer (a few hours of work)
2. **Self-time calculation** - Show "self time" (span duration minus child durations) in SpanViewer
3. **Span link navigation** - Link data is stored but not clickable in the UI
4. **Top-N slowest operations** - Simple ClickHouse query: `ORDER BY durationUnixNano DESC LIMIT N`
5. **Error rate by service** - Aggregate `statusCode = 2` counts grouped by serviceId
6. **Trace duration distribution histogram** - Use ClickHouse `histogram()` on durationUnixNano
7. **Span count per service display** - Already tracked in `servicesInTrace`; just needs better display

---

## Recommended Implementation Order

1. **Phase 1.1** - Trace Analytics Engine (highest impact, unlocks everything else)
2. **Phase 1.2** - RED Metrics from Traces (prerequisite for alerting, service overview)
3. **Quick Wins** - Ship in-trace search, self-time, span links, top-N operations
4. **Phase 1.3** - Trace-Based Alerting (core observability workflow)
5. **Phase 2.1** - Flame Graph View (industry-standard visualization)
6. **Phase 2.2** - Critical Path Analysis (key debugging capability)
7. **Phase 1.4** - Head-Based Sampling (essential for high-volume users)
8. **Phase 2.3-2.5** - In-trace search, per-trace map, span links
9. **Phase 3.1** - Trace-to-Metric Exemplars
10. **Phase 3.2-3.4** - Custom metrics, structural queries, comparison
11. **Phase 4.x** - AI/ML, RUM, profiling (long-term)

## Verification

For each feature:

1. Unit tests for new query builders, the critical path algorithm, and sampling logic
2. Integration tests for new API endpoints (analytics, RED metrics, sampling)
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/traces`
4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries
5. Verify that trace correlation (logs, exceptions, metrics) still works correctly with the new features
6. Load-test the sampling logic to ensure it doesn't add ingestion latency