feat: Add comprehensive metrics and traces roadmap for industry parity

- Introduced detailed plans for enhancing OneUptime's metrics and traces capabilities to match and exceed industry standards.
- Metrics roadmap includes features like percentile aggregations, rate calculations, multi-attribute grouping, rollups, and advanced visualizations.
- Traces roadmap outlines improvements such as trace analytics, RED metrics, trace-based alerting, and enhanced visualization options like flame graphs and critical path analysis.
- Both roadmaps emphasize phased implementation, quick wins, and verification strategies to ensure robust feature delivery and performance.
Nawaz Dhandala
2026-03-16 09:51:08 +00:00
parent 4781c6a532
commit 2c7486714f
7 changed files with 1476 additions and 108 deletions


@@ -18,7 +18,7 @@ on:
default: false
permissions:
contents: read
contents: write
packages: write
jobs:


@@ -10,7 +10,7 @@ on:
- "master"
permissions:
contents: read
contents: write
packages: write
jobs:


@@ -0,0 +1,568 @@
# Plan: Bring OneUptime Dashboards to Industry Parity and Beyond
## Context
OneUptime's dashboard implementation provides a 12-column grid layout with drag-and-drop editing, 3 widget types (Chart with Line/Bar, Value, Text with basic formatting), global time range with presets, view/edit modes, role-based permissions, and full-screen support. Dashboard config is stored as a single JSON column. Dashboards can only query OpenTelemetry metrics from ClickHouse.
This plan identifies the remaining gaps vs Grafana, Datadog, and New Relic, and proposes a phased implementation to build a best-in-class dashboard product that leverages OneUptime's unique position as an all-in-one observability + status page platform.
## Completed
The following features have been implemented:
- **12-Column Grid Layout** - Fixed grid with dynamic unit sizing, 60 default rows (expandable)
- **Drag-and-Drop Editing** - Move and resize components with bounds checking
- **Chart Widget** - Line and Bar chart types with single metric query, configurable title/description/legend
- **Value Widget** - Single metric aggregation displayed as large number
- **Text Widget** - Bold/Italic/Underline formatting (no markdown)
- **Global Time Range** - Presets (30min to 3mo) + custom date range picker
- **View/Edit Modes** - Read-only view with full-screen, edit mode with side panel settings
- **Role-Based Permissions** - ProjectOwner, ProjectAdmin, ProjectMember + custom permissions
- **Dashboard CRUD API** - Standard REST API with slug generation
- **Billing Enforcement** - Free plan limited to 1 dashboard
## Gap Analysis Summary
| Feature | OneUptime | Grafana | Datadog | New Relic | Priority |
|---------|-----------|---------|---------|-----------|----------|
| Widget types | 3 | 20+ | 40+ | 15+ | **P0** |
| Chart types | 2 (Line, Bar) | 10+ | 12+ | 10+ | **P0** |
| Template variables | None | 6+ types | Yes | 3 types | **P0** |
| Auto-refresh | None | Configurable | Real-time | Yes | **P0** |
| Log panels | None | Yes (Loki) | Yes | Yes (NRQL) | **P0** |
| Trace panels | None | Yes (Tempo) | Yes | Yes | **P0** |
| Table widget | None | Yes | Yes | Yes | **P0** |
| Multiple queries per chart | Single query | Yes | Yes | Yes | **P0** |
| Markdown support | Basic formatting only | Full markdown | Full markdown | Full markdown | **P0** |
| Threshold lines / color coding | None | Yes | Yes | Yes | **P0** |
| Legend interaction (show/hide) | None | Yes | Yes | Yes | **P0** |
| Chart zoom | None | Yes | Yes | Yes | **P0** |
| Dashboard linking / drill-down | None | Data links | Yes | Facet linking | **P1** |
| Annotations / event overlays | None | Yes | Yes | Yes (Labs) | **P1** |
| Row/section grouping | None | Collapsible rows | Groups | No | **P1** |
| Public/shared dashboards | None | Yes | Yes | Yes | **P1** |
| JSON import/export | None | Yes | Yes | Yes | **P1** |
| Dashboard versioning | None | Yes | Yes | No | **P1** |
| Alert integration | None | Create from panel + show state | Yes | NRQL alerts | **P1** |
| TV/Kiosk mode | Full-screen only | Kiosk mode | Yes | Auto-cycling | **P1** |
| CSV export | None | Yes | Yes | Yes | **P1** |
| Custom time per widget | None | Yes (per-panel relative time) | Yes | No | **P1** |
| AI dashboard creation | None | None | None | None | **P2** |
| Dashboard-as-code SDK | None | Foundation SDK | No | No | **P2** |
| Terraform provider | None | Yes | Yes | Yes | **P2** |
---
## Phase 1: Foundation (P0) — Close Critical Gaps
These gaps make OneUptime dashboards fundamentally non-competitive. Every major competitor has these.
### 1.1 Add Core Chart Types: Area, Pie, Table, Gauge, Heatmap, Histogram
**Current**: Line and Bar only.
**Target**: 8+ chart types covering all standard observability visualization needs.
**Implementation**:
- **Area Chart** (stacked and non-stacked): Extension of line chart with fill. Use existing chart library's area mode
- **Pie/Donut Chart**: For proportional breakdowns (e.g., error distribution by service). New component
- **Table Widget**: Tabular metric data, top-N lists, multi-column display with sortable columns. New component type
- **Gauge Widget**: Circular gauge with configurable min/max/thresholds and color zones. New component
- **Heatmap**: Time on X-axis, value buckets on Y-axis, color intensity for count. Essential for distribution/histogram metrics
- **Histogram**: Bar chart showing value distribution. Important for latency analysis
Each chart type needs:
- A new entry in `DashboardComponentType` or `ChartType` enum
- A rendering component in `Dashboard/Components/`
- Configuration options in the component settings side panel
**Files to modify**:
- `Common/Types/Dashboard/Chart/ChartType.ts` (add Area, Pie, Heatmap, Histogram)
- `Common/Types/Dashboard/DashboardComponentType.ts` (add Table, Gauge)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render new types)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTableComponent.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardGaugeComponent.tsx` (new)
### 1.2 Template Variables
**Current**: No template variables. Users must create separate dashboards for each service/host/environment.
**Target**: Drop-down variable selectors that dynamically filter all widgets.
**Implementation**:
- Create a `DashboardVariable` type stored in `dashboardViewConfig`:
  - Name, label, type (query-based, custom list, text input)
  - Query-based: runs a ClickHouse query to populate options (e.g., `SELECT DISTINCT service FROM MetricItem WHERE projectId = {pid}`)
  - Custom list: manually defined options
  - Multi-value selection support
- Render variables as dropdown selectors in the dashboard toolbar
- Variables can be referenced in metric queries as `$variable_name`
- When a variable changes, all widgets re-query with the new value
- Support cascading variables (variable B's query depends on variable A's value)
**Files to modify**:
- `Common/Types/Dashboard/DashboardVariable.ts` (new)
- `Common/Types/Dashboard/DashboardViewConfig.ts` (add variables array)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (render variable dropdowns)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass variable values to widgets)
- `Common/Server/Services/MetricService.ts` (resolve variable references in queries)
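
One way to resolve `$variable_name` references without splicing user-selected values into the SQL text is to rewrite them as named query parameters. The sketch below assumes ClickHouse-style `{name:Type}` placeholders; `resolveVariables`, `VariableValue`, and `ResolvedQuery` are hypothetical names, not existing OneUptime code.

```typescript
// Hypothetical helper: names and shapes are illustrative, not existing OneUptime APIs.
type VariableValue = string | string[];

interface ResolvedQuery {
  sql: string; // query text with named placeholders instead of raw values
  params: Record<string, VariableValue>; // values to bind separately
}

// Replace each `$name` reference with a named placeholder and collect the
// bound values, so variable values never become part of the SQL text.
function resolveVariables(
  query: string,
  values: Record<string, VariableValue>,
): ResolvedQuery {
  const params: Record<string, VariableValue> = {};
  const sql = query.replace(
    /\$([A-Za-z_][A-Za-z0-9_]*)/g,
    (match: string, name: string): string => {
      const value = values[name];
      if (value === undefined) {
        return match; // unknown variable: leave the reference untouched
      }
      params[name] = value;
      // Multi-value selections bind as an array parameter.
      const type = Array.isArray(value) ? "Array(String)" : "String";
      return `{${name}:${type}}`;
    },
  );
  return { sql, params };
}
```

A multi-value variable used as `service IN $service` would then bind as a single array parameter, which keeps the multi-select case and the single-value case on the same code path.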
### 1.3 Auto-Refresh
**Current**: Data goes stale after initial load.
**Target**: Configurable auto-refresh intervals.
**Implementation**:
- Add auto-refresh dropdown in toolbar with options: Off, 5s, 10s, 30s, 1m, 5m, 15m
- Store selected interval in dashboard config and URL state
- Use `setInterval` to trigger re-fetch on all metric widgets
- Show a subtle refresh indicator when data is being updated
- Pause auto-refresh when the dashboard is in edit mode
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (add refresh dropdown)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (implement refresh timer)
- `Common/Types/Dashboard/DashboardViewConfig.ts` (store refresh interval)
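
The timer behaviour described above (selectable interval, "Off", pause in edit mode) could be isolated in a small controller so the view component only wires it up. This is a minimal sketch; `AutoRefreshController` and its method names are assumptions for illustration.

```typescript
// Illustrative controller; the class and method names are assumptions.
class AutoRefreshController {
  private readonly onRefresh: () => void;
  private timer: ReturnType<typeof setInterval> | null = null;
  private intervalMs: number | null = null; // null means "Off"
  private paused = false;

  constructor(onRefresh: () => void) {
    this.onRefresh = onRefresh;
  }

  // Called when the user picks an interval in the toolbar (null for "Off").
  setIntervalMs(intervalMs: number | null): void {
    this.intervalMs = intervalMs;
    this.restart();
  }

  // Called when the dashboard enters (true) or leaves (false) edit mode.
  setPaused(paused: boolean): void {
    this.paused = paused;
    this.restart();
  }

  isRunning(): boolean {
    return this.timer !== null;
  }

  // Must be called on unmount so no timer outlives the dashboard view.
  dispose(): void {
    this.stop();
  }

  private restart(): void {
    this.stop();
    if (this.intervalMs !== null && !this.paused) {
      this.timer = setInterval(this.onRefresh, this.intervalMs);
    }
  }

  private stop(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

Restarting the timer on every change keeps the state machine trivial: there is never more than one active interval, and leaving edit mode resumes with the last selected interval.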
### 1.4 Multiple Queries per Chart
**Current**: Single `MetricQueryConfigData` per chart.
**Target**: Overlay multiple metric series on a single chart for correlation.
**Implementation**:
- Change chart component's data source from single `MetricQueryConfigData` to `MetricQueryConfigData[]`
- Each query gets its own alias and legend entry
- Support formula references across queries (e.g., `a / b * 100`)
- Y-axis: support dual Y-axes for metrics with different scales
**Files to modify**:
- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (change to array)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render multiple series)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (multi-query config UI)
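
A formula such as `a / b * 100` has to be evaluated point by point on series that may not share every timestamp. The sketch below shows that alignment step for two series; it takes the formula as a function rather than parsing a string, and `Point` and `combineSeries` are hypothetical names.

```typescript
// Illustrative types; not existing OneUptime code.
interface Point {
  timestamp: number; // epoch milliseconds
  value: number;
}

// Evaluate `formula` at every timestamp present in both series.
// Points with no counterpart, or with a non-finite result (e.g. 0 / 0),
// are dropped instead of being plotted.
function combineSeries(
  a: Point[],
  b: Point[],
  formula: (a: number, b: number) => number,
): Point[] {
  const bByTime = new Map<number, number>();
  for (const point of b) {
    bByTime.set(point.timestamp, point.value);
  }
  const result: Point[] = [];
  for (const point of a) {
    const other = bByTime.get(point.timestamp);
    if (other === undefined) {
      continue;
    }
    const value = formula(point.value, other);
    if (Number.isFinite(value)) {
      result.push({ timestamp: point.timestamp, value });
    }
  }
  return result;
}
```

For an error-rate chart, `combineSeries(errors, requests, (a, b) => (a / b) * 100)` would yield a percentage series; this only works cleanly if both queries use the same time bucket size.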
### 1.5 Full Markdown Support for Text Widget
**Current**: Only bold, italic, underline formatting.
**Target**: Full markdown rendering including headers, links, lists, code blocks, tables, and images.
**Implementation**:
- Replace the current custom formatting with a markdown renderer (e.g., `react-markdown` or `marked`)
- Support: headers (h1-h6), links, ordered/unordered lists, code blocks with syntax highlighting, tables, images, blockquotes
- Edit mode: raw markdown text area with preview toggle
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTextComponent.tsx` (replace with markdown renderer)
- `Common/Utils/Dashboard/Components/DashboardTextComponent.ts` (store raw markdown)
### 1.6 Threshold Lines & Color Coding
**Current**: No threshold visualization.
**Target**: Configurable warning/critical thresholds on charts with color-coded regions.
**Implementation**:
- Add threshold configuration to chart settings: value, label, color (default: yellow for warning, red for critical)
- Render horizontal lines on the chart at threshold values
- Optionally fill regions above/below thresholds with translucent color
- For value/billboard widgets: change background color based on which threshold range the value falls in (green/yellow/red)
**Files to modify**:
- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (add thresholds config)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render threshold lines)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardValueComponent.tsx` (color coding)
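
For the value widget, picking the background color reduces to finding the highest threshold the value has reached. A minimal sketch, assuming "higher is worse" semantics and a hypothetical `Threshold` shape:

```typescript
// Hypothetical threshold config; field names are assumptions.
interface Threshold {
  value: number;
  label: string;
  color: string;
}

// Return the color of the highest threshold that `value` meets or exceeds,
// or `baseColor` when it is below every threshold.
function colorForValue(
  value: number,
  thresholds: Threshold[],
  baseColor: string = "green",
): string {
  // Sort a copy so the caller's configuration order does not matter.
  const sorted = [...thresholds].sort((x, y) => x.value - y.value);
  let color = baseColor;
  for (const threshold of sorted) {
    if (value >= threshold.value) {
      color = threshold.color;
    }
  }
  return color;
}
```

Metrics where lower is worse (for example, availability) would need an inverted comparison, which could be a per-widget setting.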
### 1.7 Legend Interaction (Show/Hide Series)
**Current**: Legends are display-only.
**Target**: Click legend items to toggle series visibility.
**Implementation**:
- Add click handler on legend items to toggle series visibility
- Clicked-off series should be visually dimmed in the legend and removed from the chart
- Support "isolate" mode: Ctrl+Click shows only that series and hides all others
- Persist toggled state during the session (reset on page reload)
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add legend click handlers)
### 1.8 Chart Zoom (Click-Drag Time Selection)
**Current**: No zoom capability.
**Target**: Click and drag on a time series chart to zoom into a time range.
**Implementation**:
- Enable brush/selection mode on time series charts
- When user drags to select a range, update the global time range to the selected range
- Show a "Reset zoom" button to return to the previous time range
- Maintain a zoom stack so users can zoom in multiple times and zoom back out
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add brush selection)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (handle time range updates from zoom)
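
The zoom stack mentioned above can be a plain history of time ranges. This is a sketch under assumed names (`ZoomStack`, `TimeRange`); timestamps are epoch milliseconds to keep the example self-contained.

```typescript
// Illustrative types; not existing OneUptime code.
interface TimeRange {
  start: number; // epoch milliseconds
  end: number;
}

class ZoomStack {
  private readonly history: TimeRange[] = [];
  private current: TimeRange;

  constructor(initial: TimeRange) {
    this.current = initial;
  }

  // Called when the user finishes a brush selection.
  zoomTo(range: TimeRange): TimeRange {
    this.history.push(this.current);
    this.current = range;
    return this.current;
  }

  // Step back one zoom level; a no-op at the original range.
  zoomOut(): TimeRange {
    const previous = this.history.pop();
    if (previous !== undefined) {
      this.current = previous;
    }
    return this.current;
  }

  // "Reset zoom": jump straight back to the original range.
  reset(): TimeRange {
    const original = this.history[0];
    if (original !== undefined) {
      this.current = original;
      this.history.length = 0;
    }
    return this.current;
  }

  canZoomOut(): boolean {
    return this.history.length > 0;
  }
}
```

The "Reset zoom" button would only be shown while `canZoomOut()` is true.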
---
## Phase 2: Observability Integration (P0-P1) — Leverage the Full Platform
This is where OneUptime can differentiate: metrics, logs, and traces in one platform.
### 2.1 Log Stream Widget
**Current**: Dashboards can only show metrics.
**Target**: Widget that displays a live log stream with filtering.
**Implementation**:
- New `DashboardComponentType.LogStream` widget type
- Configuration: log query filter, severity filter, service filter, max rows
- Renders as a scrolling log list with severity color coding, timestamp, and body
- Click a log entry to expand and see full details
- Respects dashboard time range and template variables
**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add LogStream)
- `Common/Utils/Dashboard/Components/DashboardLogStreamComponent.ts` (new - config)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardLogStreamComponent.tsx` (new - rendering)
### 2.2 Trace List Widget
**Current**: No trace visualization in dashboards.
**Target**: Widget showing a filtered trace list with duration and status.
**Implementation**:
- New `DashboardComponentType.TraceList` widget type
- Configuration: service filter, operation filter, status filter, min duration
- Renders as a table: trace ID, operation, service, duration, status, timestamp
- Click a row to navigate to the full trace view
- Respects dashboard time range and template variables
**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add TraceList)
- `Common/Utils/Dashboard/Components/DashboardTraceListComponent.ts` (new)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTraceListComponent.tsx` (new)
### 2.3 Click-to-Correlate Across Signals
**Current**: No cross-signal correlation in dashboards.
**Target**: Click a point on a metric chart to instantly see related logs and traces from that timestamp.
**Implementation**:
- When clicking a data point on a metric chart, open a correlation panel showing:
  - Logs from the same service and time window (+/- 5 minutes around the clicked point)
  - Traces from the same service and time window
  - Filtered by the same template variables
- The correlation panel appears as a slide-over or split view below the chart
- This is a major differentiator vs Grafana (which requires separate datasources) and ties into OneUptime's all-in-one advantage
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add click handler)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/CorrelationPanel.tsx` (new - shows correlated logs/traces)
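
The correlation panel's log and trace queries share the same inputs: service, a window around the clicked point, and the active template variables. A small sketch of building that shared filter, with `CorrelationWindow` and `buildCorrelationWindow` as hypothetical names:

```typescript
// Hypothetical filter shape passed to both the log and trace queries.
interface CorrelationWindow {
  serviceId: string;
  startTime: Date;
  endTime: Date;
  variables: Record<string, string>;
}

// Build a symmetric window around the clicked data point
// (default +/- 5 minutes, as described in the plan).
function buildCorrelationWindow(
  clickedAt: Date,
  serviceId: string,
  variables: Record<string, string>,
  windowMinutes: number = 5,
): CorrelationWindow {
  const halfWindowMs = windowMinutes * 60 * 1000;
  return {
    serviceId,
    startTime: new Date(clickedAt.getTime() - halfWindowMs),
    endTime: new Date(clickedAt.getTime() + halfWindowMs),
    variables: { ...variables }, // copy so later variable changes do not leak in
  };
}
```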
### 2.4 Annotations / Event Overlays
**Current**: No event markers on charts.
**Target**: Show deployment events, incidents, and alerts as vertical markers on time series charts.
**Implementation**:
- Query OneUptime's own data for events in the chart's time range:
  - Incidents (from Incident model)
  - Deployments (can be sent as OTLP resource attributes or a custom event API)
  - Alert triggers (from monitor alert history)
- Render as vertical dashed lines with icons on hover
- Color-code by type: red for incidents, blue for deployments, yellow for alerts
- Allow users to add manual annotations (text + timestamp)
**Files to modify**:
- `Common/Types/Dashboard/DashboardAnnotation.ts` (new)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render annotation markers)
- `Common/Server/API/DashboardAnnotationAPI.ts` (new - query events)
### 2.5 Alert Integration
**Current**: No connection between dashboards and alerting.
**Target**: Create alerts from dashboard panels and display alert state on panels.
**Implementation**:
- "Create Alert" button in chart settings that pre-fills a metric monitor with the chart's query
- Show alert state indicator on chart headers (green/yellow/red dot) based on associated monitor status
- Alert status widget: shows a summary of all active alerts with severity and duration
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (add "Create Alert" button)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (show alert state)
- `Common/Types/Dashboard/DashboardComponentType.ts` (add AlertStatus widget type)
### 2.6 SLO/SLI Widget
**Current**: No SLO visualization.
**Target**: Dedicated widget showing SLO status, error budget burn rate, and remaining budget.
**Implementation** (depends on Metrics roadmap Phase 3.2 - SLO/SLI Tracking):
- New `DashboardComponentType.SLO` widget type
- Configuration: select an SLO definition
- Displays: current attainment (%), target (%), error budget remaining (%), burn rate chart
- Color-coded: green (healthy), yellow (burning fast), red (budget exhausted)
**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add SLO)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSLOComponent.tsx` (new)
---
## Phase 3: Collaboration & Sharing (P1) — Production Workflows
### 3.1 Public/Shared Dashboards
**Current**: Dashboards require login.
**Target**: Share dashboards with external stakeholders without requiring authentication.
**Implementation**:
- Add `isPublic` flag and `publicAccessToken` to Dashboard model
- Generate a shareable URL with token: `/public/dashboard/{token}`
- Public view is read-only with no editing controls
- Option to restrict public access to specific IP ranges
- Render without the OneUptime navigation chrome
**Files to modify**:
- `Common/Models/DatabaseModels/Dashboard.ts` (add isPublic, publicAccessToken)
- `App/FeatureSet/Dashboard/src/Pages/Public/Dashboard.tsx` (new - public dashboard view)
### 3.2 JSON Import/Export
**Current**: No import/export capability.
**Target**: Export dashboards as JSON and re-import for backup, migration, and dashboard-as-code.
**Implementation**:
- Export: serialize `dashboardViewConfig` + metadata (name, description, variables) as a JSON file download
- Import: upload a JSON file, validate schema, create a new dashboard from the config
- Handle version compatibility (include a schema version in the export)
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Dashboards.tsx` (add import button)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/Settings.tsx` (add export button)
- `Common/Server/API/DashboardImportExportAPI.ts` (new)
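
The export format and import validation might look like the sketch below. The envelope fields and the functions `exportDashboard` and `parseDashboardExport` are assumptions for illustration; only the idea of embedding a schema version comes from the plan.

```typescript
// Assumed current version of the export format.
const EXPORT_SCHEMA_VERSION = 1;

// Hypothetical export envelope; not an existing OneUptime type.
interface DashboardExport {
  schemaVersion: number;
  name: string;
  description: string;
  dashboardViewConfig: Record<string, unknown>;
}

function exportDashboard(
  name: string,
  description: string,
  dashboardViewConfig: Record<string, unknown>,
): string {
  const payload: DashboardExport = {
    schemaVersion: EXPORT_SCHEMA_VERSION,
    name,
    description,
    dashboardViewConfig,
  };
  return JSON.stringify(payload, null, 2);
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validate an uploaded file before creating a dashboard from it.
function parseDashboardExport(json: string): DashboardExport {
  const data: unknown = JSON.parse(json);
  if (!isPlainObject(data)) {
    throw new Error("Export must be a JSON object");
  }
  const schemaVersion = data["schemaVersion"];
  if (typeof schemaVersion !== "number" || schemaVersion > EXPORT_SCHEMA_VERSION) {
    throw new Error("Unsupported or missing schema version");
  }
  const name = data["name"];
  if (typeof name !== "string" || name.length === 0) {
    throw new Error("Export is missing a dashboard name");
  }
  const config = data["dashboardViewConfig"];
  if (!isPlainObject(config)) {
    throw new Error("Export is missing dashboardViewConfig");
  }
  const description = data["description"];
  return {
    schemaVersion,
    name,
    description: typeof description === "string" ? description : "",
    dashboardViewConfig: config,
  };
}
```

Rejecting files with a newer schema version than the server understands avoids silently dropping fields; older versions could be migrated forward at import time.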
### 3.3 Dashboard Versioning
**Current**: No change history.
**Target**: Track changes to dashboards over time with the ability to view history and revert.
**Implementation**:
- Create `DashboardVersion` model: dashboardId, version number, config snapshot, changedBy, timestamp
- On each save, create a new version entry
- UI: "Version History" in settings showing a list of versions with timestamps and authors
- "Restore" button to revert to a previous version
- Optional: diff view comparing two versions
**Files to modify**:
- `Common/Models/DatabaseModels/DashboardVersion.ts` (new)
- `Common/Server/Services/DashboardService.ts` (create version on save)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/VersionHistory.tsx` (new)
### 3.4 Row/Section Grouping
**Current**: Components placed freely with no grouping.
**Target**: Collapsible rows/sections for organizing related panels.
**Implementation**:
- Add a "Section" component type that acts as a collapsible container
- Section has a title bar that can be clicked to collapse/expand
- When collapsed, hides all components within the section's vertical range
- Sections can be nested one level deep
**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add Section)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSectionComponent.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/Index.tsx` (handle section collapse)
### 3.5 TV/Kiosk Mode
**Current**: Full-screen only.
**Target**: Dedicated kiosk mode optimized for wall-mounted monitors with auto-cycling.
**Implementation**:
- Kiosk mode: hides all chrome (toolbar, navigation, URL bar), shows only the dashboard grid
- Auto-cycle: rotate through a list of dashboards at a configurable interval (30s, 1m, 5m)
- Dashboard playlist: define an ordered list of dashboards to cycle through
- Support per-dashboard display duration
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Kiosk.tsx` (new - kiosk view)
- `Common/Models/DatabaseModels/DashboardPlaylist.ts` (new - playlist model)
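
Auto-cycling with per-dashboard durations comes down to a wrap-around step function. A sketch with hypothetical `PlaylistItem` and `nextPlaylistStep` names:

```typescript
// Hypothetical playlist entry; durationSeconds overrides the playlist default.
interface PlaylistItem {
  dashboardId: string;
  durationSeconds?: number;
}

interface PlaylistStep {
  index: number;
  dashboardId: string;
  durationSeconds: number;
}

// Return the dashboard to show after `currentIndex`, wrapping at the end.
// Pass -1 as `currentIndex` to get the first entry.
function nextPlaylistStep(
  items: PlaylistItem[],
  currentIndex: number,
  defaultDurationSeconds: number,
): PlaylistStep | null {
  if (items.length === 0) {
    return null;
  }
  const index = (currentIndex + 1) % items.length;
  const item = items[index];
  if (item === undefined) {
    return null;
  }
  return {
    index,
    dashboardId: item.dashboardId,
    durationSeconds: item.durationSeconds ?? defaultDurationSeconds,
  };
}
```

The kiosk view would schedule a single timeout for `durationSeconds` after each step rather than a fixed interval, so per-dashboard durations are respected.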
### 3.6 CSV Export
**Current**: No data export.
**Target**: Export chart/table data as CSV for offline analysis.
**Implementation**:
- Add "Export CSV" option in chart/table context menu
- Client-side: serialize the current rendered data to CSV format
- Include column headers, timestamps, and values
- Trigger browser download
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add export option)
- `Common/Utils/Dashboard/CSVExport.ts` (new - CSV serialization)
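
The client-side serialization step mostly needs correct field quoting. A minimal sketch following the usual CSV quoting rules (fields containing commas, quotes, or line breaks are wrapped in double quotes, with embedded quotes doubled); `toCsv` and `escapeCsvField` are hypothetical names.

```typescript
// Quote a field only when it contains a comma, a double quote, or a line break.
function escapeCsvField(field: string): string {
  if (/[",\r\n]/.test(field)) {
    return `"${field.replace(/"/g, '""')}"`;
  }
  return field;
}

// Serialize a header row plus data rows into CSV text with CRLF line endings.
function toCsv(
  headers: string[],
  rows: Array<Array<string | number>>,
): string {
  const allRows: Array<Array<string | number>> = [headers, ...rows];
  return allRows
    .map((row) => row.map((cell) => escapeCsvField(String(cell))).join(","))
    .join("\r\n");
}
```

In the browser, the resulting string would be wrapped in a `Blob` and offered via a temporary download link; that part is omitted here to keep the sketch environment-independent.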
### 3.7 Custom Time Range per Widget
**Current**: All widgets share the global time range.
**Target**: Individual widgets can override the global time range.
**Implementation**:
- Add optional `timeRangeOverride` to each component's config
- When set, the widget uses its own time range instead of the global one
- Show a small clock icon on widgets with custom time ranges
- Configuration in the component settings side panel
**Files to modify**:
- `Common/Utils/Dashboard/Components/DashboardBaseComponent.ts` (add timeRangeOverride)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass per-widget time ranges)
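
Resolving which range a widget queries with is a one-line fallback, sketched here with assumed type names (`TimeRange`, `ComponentTimeConfig`); only the `timeRangeOverride` field name comes from the plan.

```typescript
// Illustrative shapes; not existing OneUptime types.
interface TimeRange {
  startTime: Date;
  endTime: Date;
}

interface ComponentTimeConfig {
  componentId: string;
  timeRangeOverride?: TimeRange;
}

// A widget uses its own range when one is configured, else the global one.
function effectiveTimeRange(
  component: ComponentTimeConfig,
  globalRange: TimeRange,
): TimeRange {
  return component.timeRangeOverride ?? globalRange;
}

// Drives the small clock icon shown on widgets with a custom range.
function hasCustomTimeRange(component: ComponentTimeConfig): boolean {
  return component.timeRangeOverride !== undefined;
}
```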
---
## Phase 4: Differentiation (P2-P3) — Surpass Competition
### 4.1 AI-Powered Dashboard Creation
**Current**: Manual dashboard creation only.
**Target**: Natural language dashboard creation - "Show me CPU usage by service for the last 24 hours" auto-creates the right widget.
**Implementation**:
- Natural language input in the "Add Widget" dialog
- AI translates to: metric name, aggregation, group by, chart type, time range
- Uses available MetricType metadata to match metric names
- Preview the generated widget before adding to dashboard
- This is a feature NO competitor has done well yet - major differentiator
### 4.2 Pre-Built Dashboard Templates
**Current**: No templates.
**Target**: One-click dashboard templates for common stacks.
**Implementation**:
- Template library: Node.js, Python, Go, Java, Kubernetes, PostgreSQL, Redis, Nginx, MongoDB, etc.
- Auto-detect relevant templates based on ingested telemetry data
- "One-click create" instantiates a full dashboard from the template
- Community template sharing (future)
### 4.3 Auto-Generated Dashboards
**Current**: Users must manually build dashboards.
**Target**: When a service connects, auto-generate a relevant dashboard.
**Implementation**:
- On first telemetry ingest from a new service, analyze the metric names and types
- Auto-create a service dashboard with relevant charts based on detected metrics
- Include golden signals (latency, traffic, errors, saturation) where applicable
- Notify the user and link to the auto-generated dashboard
### 4.4 Customer-Facing Dashboards on Status Pages
**Current**: Status pages and dashboards are separate.
**Target**: Embed dashboard widgets on status pages for real-time performance visibility.
**Implementation**:
- Allow selecting specific dashboard widgets to embed on a status page
- Render widgets in read-only mode without internal navigation
- Respect public/private data boundaries (only show metrics the customer should see)
- This is unique to OneUptime - no competitor has integrated observability dashboards with status pages
### 4.5 Dashboard-as-Code SDK
**Current**: No programmatic dashboard creation.
**Target**: TypeScript SDK for defining dashboards as code.
**Implementation**:
```typescript
const dashboard = new Dashboard("Service Health")
  .addVariable("service", { type: "query", query: "SELECT DISTINCT service FROM MetricItem" })
  .addRow("Latency")
  .addChart({ metric: "http.server.duration", aggregation: "p99", groupBy: ["$service"] })
  .addChart({ metric: "http.server.duration", aggregation: "p50", groupBy: ["$service"] })
  .addRow("Throughput")
  .addChart({ metric: "http.server.request.count", aggregation: "rate", groupBy: ["$service"] });
```
- SDK generates the JSON config and uses the Dashboard API to create/update
- Git-based provisioning: store dashboard definitions in repo, CI/CD syncs to OneUptime
### 4.6 Anomaly Detection Overlays
**Current**: No anomaly visualization.
**Target**: AI highlights anomalous data points on charts without manual threshold configuration.
**Implementation** (depends on Metrics roadmap Phase 3.1 - Anomaly Detection):
- Automatically overlay expected range bands (baseline +/- N sigma) on metric charts
- Highlight data points outside the expected range with color indicators
- Click an anomaly to see correlated changes across metrics, logs, and traces
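
To illustrate the "baseline +/- N sigma" band, the sketch below uses the simplest possible baseline: the mean and population standard deviation of the visible window. This is an assumption for illustration only; the anomaly detection in the Metrics roadmap may use a different baseline (for example, a seasonal or rolling one). `expectedBand` and `findAnomalies` are hypothetical names.

```typescript
interface Band {
  mean: number;
  lower: number;
  upper: number;
}

// Expected range as mean +/- nSigma population standard deviations.
function expectedBand(values: number[], nSigma: number): Band {
  if (values.length === 0) {
    throw new Error("values must not be empty");
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  const sigma = Math.sqrt(variance);
  return { mean, lower: mean - nSigma * sigma, upper: mean + nSigma * sigma };
}

// Indices of points that fall outside the expected band.
function findAnomalies(values: number[], nSigma: number): number[] {
  const band = expectedBand(values, nSigma);
  const indices: number[] = [];
  values.forEach((v, i) => {
    if (v < band.lower || v > band.upper) {
      indices.push(i);
    }
  });
  return indices;
}
```

Because the outlier itself is included in the baseline here, large spikes widen the band; a production baseline would normally be computed from historical data that excludes the points being evaluated.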
---
## Quick Wins (Can Ship This Week)
1. **Auto-refresh** - Add a simple `setInterval` refresh with dropdown selector in toolbar
2. **Full markdown for text widget** - Replace custom formatting with a markdown renderer
3. **Legend show/hide** - Add click handler on legend items to toggle series
4. **Stacked area chart** - Simple extension of existing line chart with fill
5. **Chart zoom** - Enable brush selection on time series charts
---
## Recommended Implementation Order
1. **Quick Wins** - Auto-refresh, markdown, legend toggle, stacked area, chart zoom
2. **Phase 1.1** - More chart types (Area, Pie, Table, Gauge)
3. **Phase 1.2** - Template variables (highest-impact feature for dashboard usability)
4. **Phase 1.4** - Multiple queries per chart
5. **Phase 1.6** - Threshold lines & color coding
6. **Phase 2.1** - Log stream widget (leverages all-in-one platform)
7. **Phase 2.2** - Trace list widget
8. **Phase 2.3** - Click-to-correlate (major differentiator)
9. **Phase 2.4** - Annotations / event overlays
10. **Phase 2.5** - Alert integration
11. **Phase 3.1** - Public/shared dashboards
12. **Phase 3.2** - JSON import/export
13. **Phase 3.4** - Row/section grouping
14. **Phase 3.5** - TV/Kiosk mode
15. **Phase 3.3** - Dashboard versioning
16. **Phase 2.6** - SLO widget (depends on SLO/SLI from Metrics roadmap)
17. **Phase 4.2** - Pre-built dashboard templates
18. **Phase 4.3** - Auto-generated dashboards
19. **Phase 4.1** - AI-powered dashboard creation
20. **Phase 4.4** - Customer-facing dashboards on status pages
21. **Phase 4.5** - Dashboard-as-code SDK
## Verification
For each feature:
1. Unit tests for new widget types, template variable resolution, CSV export logic
2. Integration tests for new API endpoints (annotations, public dashboards, import/export)
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/dashboards`
4. Visual regression testing for new chart types (ensure correct rendering across browsers)
5. Performance testing: verify dashboards with 20+ widgets and auto-refresh don't cause excessive API load
6. Test template variables with edge cases: empty results, special characters, multi-value selections
7. Verify public dashboards don't leak private data


@@ -20,125 +20,41 @@ The following features have been implemented and removed from this plan:
- **Phase 2.2** - Log Analytics View (LogsAnalyticsView with timeseries, toplist, table charts; analytics endpoint)
- **Phase 2.3** - Column Customization (ColumnSelector with dynamic columns from log attributes)
- **Phase 5.8** - Store Missing OpenTelemetry Log Fields (observedTimeUnixNano, droppedAttributesCount, flags columns + ingestion + migration)
- **Phase 3.1** - Log Context / Surrounding Logs (Context tab in LogDetailsPanel, context endpoint in TelemetryAPI)
- **Phase 3.2** - Log Pipelines (LogPipeline + LogPipelineProcessor models, GrokParser/AttributeRemapper/SeverityRemapper/CategoryProcessor, pipeline execution service)
- **Phase 3.3** - Drop Filters (LogDropFilter model, LogDropFilterService, dashboard UI for configuration)
- **Phase 3.4** - Export to CSV/JSON (Export button in toolbar, LogExport utility with CSV and JSON support)
- **Phase 4.2** - Keyboard Shortcuts (j/k navigation, Enter expand/collapse, Esc close, / focus search, Ctrl+Enter apply filters, ? help)
- **Phase 4.3** - Sensitive Data Scrubbing (LogScrubRule model with PII patterns: Email, CreditCard, SSN, PhoneNumber, IPAddress, custom regex)
## Gap Analysis Summary
| Feature | OneUptime | Datadog | New Relic | Priority |
|---------|-----------|---------|-----------|----------|
| Log Patterns (ML clustering) | None | Auto-clustering + Pattern Inspector | ML clustering + anomaly | **P1** |
| Log context (surrounding logs) | None | Before/after from same host/service | Automatic via APM agent | **P2** |
| Log Pipelines (server-side processing) | None (raw storage only) | 270+ OOTB, 14+ processor types | Grok parsing, built-in rules | **P2** |
| Log-based Metrics | None | Count + Distribution, 15-month retention | Via NRQL | **P2** |
| Drop Filters (pre-storage filtering) | None | Exclusion filters with sampling | Drop rules per NRQL | **P2** |
| Export to CSV/JSON | None | CSV up to 100K rows | CSV/JSON up to 5K | **P2** |
| Keyboard shortcuts | None | Full keyboard nav | Basic | **P3** |
| Sensitive Data Scrubbing | None | Multi-layer (SaaS + agent + pipeline) | Auto-obfuscation + custom rules | **P3** |
| Data retention config UI | Referenced but no UI | Multi-tier (Standard/Flex/Archive) | Partitions + Live Archives | **P3** |
---
## Phase 3: Processing & Operations (P2) — Platform Capabilities
### 3.1 Log Context (Surrounding Logs)
**Current**: Clicking a log shows only that log's details.
**Target**: A "Context" tab in the log detail panel showing N logs before/after from the same service.
**Implementation**:
- When a log is expanded, add a "Context" tab that queries ClickHouse:
```sql
(SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time < {logTime} ORDER BY time DESC LIMIT 5)
UNION ALL
(SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time >= {logTime} ORDER BY time ASC LIMIT 6)
```
- Display as a mini log list with the current log highlighted
- Add to `LogDetailsPanel.tsx` as a tabbed section alongside the existing body/attributes view
**Files to modify**:
- `Common/Server/API/TelemetryAPI.ts` (add context endpoint)
- `Common/UI/Components/LogsViewer/components/LogDetailsPanel.tsx` (add tabs + context view)
### 3.2 Log Pipelines (Server-Side Processing)
**Current**: Logs are stored raw as received (after OTLP normalization).
**Target**: Configurable processing pipelines that transform logs at ingest time.
**Implementation**:
- Create `LogPipeline` and `LogPipelineProcessor` PostgreSQL models
- Pipeline has: name, filter (which logs it applies to), enabled flag, sort order
- Processor types (start with these 4):
  - **Grok Parser**: Parse body text into structured attributes using Grok patterns
  - **Attribute Remapper**: Rename/copy one attribute to another
  - **Severity Remapper**: Map an attribute value to the severity field
  - **Category Processor**: Assign a new attribute value based on if/else conditions
- Processing runs in the telemetry ingestion worker (`Telemetry/Jobs/TelemetryIngest/ProcessTelemetry.ts`) after normalization but before ClickHouse insert
- Pipeline configuration UI under Settings > Log Pipelines
**Files to modify**:
- `Common/Models/DatabaseModels/LogPipeline.ts` (new)
- `Common/Models/DatabaseModels/LogPipelineProcessor.ts` (new)
- `Telemetry/Services/LogPipelineService.ts` (new - pipeline execution engine)
- `Telemetry/Services/OtelLogsIngestService.ts` (hook pipeline execution before insert)
- Dashboard: new Settings page for pipeline configuration
### 3.3 Drop Filters (Pre-Storage Filtering)
**Current**: All ingested logs are stored.
**Target**: Configurable rules to drop or sample logs before storage.
**Implementation**:
- Create `LogDropFilter` PostgreSQL model: name, filter query, action (drop or sample at N%), enabled
- Evaluate drop filters in the ingestion pipeline before ClickHouse insert
- UI under Settings > Log Configuration > Drop Filters
- Show estimated volume savings based on recent log volume
### 3.4 Export to CSV/JSON
**Current**: No export capability.
**Target**: Download current filtered log results as CSV or JSON.
**Implementation**:
- Add "Export" button in the toolbar
- Client-side: serialize current page of logs to CSV/JSON and trigger browser download
- Server-side (for large exports): new endpoint that streams results to a downloadable file (up to 10K rows)
**Files to modify**:
- `Common/UI/Components/LogsViewer/components/LogsViewerToolbar.tsx` (add export button)
- `Common/UI/Utils/LogExport.ts` (new - CSV/JSON serialization)
- `Common/Server/API/TelemetryAPI.ts` (add export endpoint for large exports)
---
## Phase 4: Advanced Features (P3) — Differentiation
## Remaining Features
### Log Patterns (ML Clustering) — P1
### 4.2 Keyboard Shortcuts
**Current**: No pattern detection.
**Target**: Auto-cluster similar log messages and surface pattern groups with anomaly detection.
- `j`/`k` to navigate between log rows
- `Enter` to expand/collapse selected log
- `Escape` to close detail panel
- `/` to focus search bar
- `Ctrl+Enter` to apply filters
### Log-based Metrics — P2
### 4.3 Sensitive Data Scrubbing
**Current**: No log-to-metric conversion.
**Target**: Create count/distribution metrics from log queries with long-term retention.
- Auto-detect common PII patterns (credit cards, SSNs, emails) at ingest time
- Configurable scrubbing rules: mask, hash, or redact
- UI under Settings > Data Privacy
### Data Retention Config UI — P3
---
## Recommended Implementation Order
1. **Phase 3.4** - Export CSV/JSON (small effort, table-stakes feature)
2. **Phase 3.1** - Log Context (moderate effort, high debugging value)
3. **Phase 3.2** - Log Pipelines (large effort, platform capability)
4. **Phase 3.3** - Drop Filters (moderate effort, cost optimization)
5. **Phase 4.x** - Patterns, Shortcuts, Data Scrubbing (future)
**Current**: `retainTelemetryDataForDays` exists on the service model and is displayed in usage history, but there is no dedicated UI to configure retention settings.
**Target**: Settings page for configuring per-service log data retention.
## Phase 5: ClickHouse Storage & Query Optimizations (P0) — Performance Foundation
@@ -205,18 +121,22 @@ These optimizations address fundamental storage and indexing gaps in the telemet
| 5.3 DateTime64 time column | Sub-second log ordering | Correctness fix | Medium |
| 5.7 Histogram projections | Histogram and severity aggregation | 5-10x | Medium |
### 5.x Recommended Remaining Order
---
## Recommended Remaining Implementation Order
1. **5.3** — DateTime64 upgrade (correctness)
2. **5.7** — Projections (performance polish)
3. **Log-based Metrics** (platform capability)
4. **Data Retention Config UI** (operational)
5. **Log Patterns / ML Clustering** (advanced, larger effort)
---
## Verification
For each feature:
1. Unit tests for new parsers/utilities (LogQueryParser, CSV export, etc.)
2. Integration tests for new API endpoints (histogram, facets, analytics, context)
For each remaining feature:
1. Unit tests for new utilities
2. Integration tests for new API endpoints
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/logs`
4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries
5. Verify real-time/live mode still works correctly with new UI components

468
Internal/Roadmap/Metrics.md Normal file

@@ -0,0 +1,468 @@
# Plan: Bring OneUptime Metrics to Industry Parity and Beyond
## Context
OneUptime's metrics implementation provides OTLP ingestion (HTTP and gRPC), ClickHouse storage with support for Gauge, Sum, Histogram, and ExponentialHistogram metric types, basic aggregations (Avg, Sum, Min, Max, Count), single-attribute GROUP BY, formula support for calculated metrics, threshold-based metric monitors, and a metric explorer with line/bar charts. Auto-discovery creates MetricType metadata (name, description, unit) on first ingest. Data retention is configured per service via TTL (default 15 days).
This plan identifies the remaining gaps vs Datadog and New Relic, and proposes a phased implementation to close them and build a best-in-class metrics product.
## Completed
The following features have been implemented:
- **OTLP Ingestion** - HTTP and gRPC metric ingestion with async queue-based batch processing
- **Metric Types** - Gauge, Sum, Histogram, ExponentialHistogram support
- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
- **Aggregations** - Avg, Sum, Min, Max, Count
- **Single-Attribute GROUP BY** - Group by one attribute at a time
- **Formulas** - Calculated metrics using aliases (e.g., `a / b * 100`)
- **Metric Explorer** - Time range selection, multiple queries with aliases, URL state persistence
- **Threshold-Based Monitors** - Static threshold alerting on aggregated metric values
- **MetricType Auto-Discovery** - Name, description, unit captured on first ingest
- **Attribute Storage** - Full JSON with extracted `attributeKeys` array for fast enumeration
- **BloomFilter index** on `name`, Set index on `serviceType`
## Gap Analysis Summary
| Feature | OneUptime | Datadog | New Relic | Priority |
|---------|-----------|---------|-----------|----------|
| Percentile aggregations (p50/p75/p90/p95/p99) | None | DDSketch distributions | NRQL percentile() | **P0** |
| Rate/derivative calculations | None | Native Rate type + .as_rate() | rate() NRQL function | **P0** |
| Multi-attribute GROUP BY | Single attribute only | Multiple tags | FACET on multiple attrs | **P0** |
| Rollup/downsampling for long-range queries | None (raw data, 15-day TTL) | Automatic tiered rollups | 30-day raw + 13-month rollups | **P0** |
| Anomaly detection | Static thresholds only | Watchdog + anomaly monitors | Anomaly detection + sigma bands | **P1** |
| SLO/SLI tracking | None | Metric-based + Time Slice SLOs | One-click setup + error budgets | **P1** |
| Heatmap visualization | None | Purpose-built for distributions | Built-in chart type | **P1** |
| Time-over-time comparison | None | Yes | COMPARE WITH in NRQL | **P1** |
| Summary metric type | Not supported | N/A (uses Distribution) | Yes | **P1** |
| Query language | Form-based UI only | Graphing editor + NLQ | Full NRQL language | **P2** |
| Predictive alerting | None | Watchdog forecasting | GA predictive alerting | **P2** |
| Metric correlations | None | Auto-surfaces related metrics | Applied Intelligence correlation | **P2** |
| Golden Signals dashboards | None | Available via APM | Pre-built with default alerts | **P2** |
| Cardinality management | None | Metrics Without Limits + Explorer | Budget system + pruning rules | **P2** |
| More chart types | Line and bar only | 12+ types | 10+ types with conditional coloring | **P2** |
| Dashboard templates | None | Pre-built integration dashboards | Pre-built entity dashboards | **P2** |
| Units on charts | Stored but not rendered | Auto-formatted by unit type | Y-axis unit customization | **P2** |
| Natural language querying | None | NLQ translates English to queries | None | **P3** |
| Metric cost/volume management | None | Cost attribution dashboards | Volume dashboards | **P3** |
---
## Phase 1: Foundation (P0) — Close Critical Gaps
These are table-stakes features without which the metrics product is fundamentally limited.
### 1.1 Percentile Aggregations (p50, p75, p90, p95, p99)
**Current**: Only Avg, Sum, Min, Max, Count aggregations.
**Target**: Support percentile aggregations on all metric data, especially histograms and distributions.
**Implementation**:
- Add `P50`, `P75`, `P90`, `P95`, `P99` to the `AggregationType` enum
- For raw metric values: use ClickHouse `quantile(0.50)(value)`, `quantile(0.95)(value)`, etc.
- For histogram data (with `bucketCounts` and `explicitBounds`): implement approximate percentile calculation from bucket data using linear interpolation between bucket boundaries
- Update the metric query builder to include percentile options in the aggregation dropdown
- Update chart rendering to display percentile series
**Files to modify**:
- `Common/Types/BaseDatabase/AggregationType.ts` (add P50, P75, P90, P95, P99)
- `Common/Server/Services/MetricService.ts` (generate quantile SQL)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (add to dropdown)
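The histogram interpolation described in 1.1 could look roughly like the sketch below. It assumes the OTLP explicit-bounds layout (`bucketCounts` has one more entry than `explicitBounds`) and, as a labeled assumption, treats the first bucket as starting at 0 for non-negative measurements. The function name `histogramPercentile` is hypothetical and not part of the existing codebase.

```typescript
// Approximate a percentile from explicit-bounds histogram buckets.
// Bucket i covers (explicitBounds[i - 1], explicitBounds[i]]; the last
// bucket is open-ended, so its value is clamped to the highest bound.
function histogramPercentile(
  bucketCounts: number[],
  explicitBounds: number[],
  q: number, // quantile in [0, 1], e.g. 0.95 for p95
): number | null {
  if (
    explicitBounds.length === 0 ||
    bucketCounts.length !== explicitBounds.length + 1
  ) {
    return null;
  }
  const total = bucketCounts.reduce((a, b) => a + b, 0);
  if (total === 0) {
    return null;
  }
  const highestBound = explicitBounds[explicitBounds.length - 1];
  const target = q * total;
  let cumulative = 0;
  for (let i = 0; i < bucketCounts.length; i++) {
    const count = bucketCounts[i];
    if (count > 0 && cumulative + count >= target) {
      if (i === explicitBounds.length) {
        return highestBound; // open-ended bucket: nothing to interpolate toward
      }
      // Assumption: the first bucket starts at 0 (or below, if the bound is negative).
      const lower = i === 0 ? Math.min(0, explicitBounds[0]) : explicitBounds[i - 1];
      const upper = explicitBounds[i];
      const fraction = (target - cumulative) / count;
      return lower + (upper - lower) * fraction;
    }
    cumulative += count;
  }
  return highestBound;
}
```

Accuracy depends on bucket width: values are assumed to be spread uniformly inside each bucket, so wide buckets give coarse estimates.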
### 1.2 Rate/Derivative Calculations
**Current**: No rate or delta computation. Raw cumulative counters only grow over time, so they are hard to interpret without a rate calculation.
**Target**: Compute per-second rates and deltas from counter/sum metrics.
**Implementation**:
- Add `Rate` and `Delta` as aggregation options
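- For cumulative sums: compute `(value_t - value_t-1) / (time_t - time_t-1)` using ClickHouse `runningDifference()` (note that `runningDifference()` only computes differences within a single data block, so a window function such as `lagInFrame()` may be needed for correct results across blocks)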
- Handle counter resets (when value decreases, treat as reset and skip that interval)
- For delta temporality sums: rate is simply `value / interval_seconds`
- Display rate with appropriate units (e.g., "req/s", "bytes/s")
**Files to modify**:
- `Common/Types/BaseDatabase/AggregationType.ts` (add Rate, Delta)
- `Common/Server/Services/MetricService.ts` (generate rate SQL with runningDifference)
- `Common/Types/Metrics/MetricsQuery.ts` (support rate in query config)
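As an illustration of the rate logic in 1.2, here is a minimal in-memory sketch of converting cumulative counter samples to per-second rates while skipping counter resets. The plan proposes doing this in ClickHouse SQL; this TypeScript version only shows the intended semantics, and the type and function names are hypothetical.

```typescript
interface CounterSample {
  timeMs: number; // sample timestamp in milliseconds
  value: number; // cumulative counter value
}

interface RatePoint {
  timeMs: number;
  ratePerSecond: number;
}

// Convert time-ordered cumulative samples into per-second rates.
// An interval where the value decreases is treated as a counter reset
// and skipped, as described in the plan.
function cumulativeToRate(samples: CounterSample[]): RatePoint[] {
  const rates: RatePoint[] = [];
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const curr = samples[i];
    const elapsedSeconds = (curr.timeMs - prev.timeMs) / 1000;
    const delta = curr.value - prev.value;
    if (elapsedSeconds <= 0 || delta < 0) {
      continue; // out-of-order sample or counter reset
    }
    rates.push({ timeMs: curr.timeMs, ratePerSecond: delta / elapsedSeconds });
  }
  return rates;
}
```

Skipping the reset interval loses one data point per restart. An alternative, used by some systems, is to treat the post-reset value as the delta; which behavior is preferable here is an open design choice.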
### 1.3 Multi-Attribute GROUP BY
**Current**: Single `groupByAttribute: string` field.
**Target**: Group by multiple attributes simultaneously (e.g., by region AND status_code).
**Implementation**:
- Change `groupByAttribute` from `string` to `string[]` in `MetricsQuery`
- Update ClickHouse query generation to GROUP BY multiple extracted JSON attributes
- Update chart rendering to handle multi-dimensional grouping (composite legend labels)
- Update the UI to allow selecting multiple group-by attributes
**Files to modify**:
- `Common/Types/Metrics/MetricsQuery.ts` (change type)
- `Common/Server/Services/MetricService.ts` (update query generation)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (multi-select UI)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (composite legends)
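The composite legend labels mentioned in 1.3 could be derived as in the following sketch. The `Attributes` type, the `(none)` placeholder for missing attributes, and both function names are assumptions made for illustration.

```typescript
type Attributes = Record<string, string | number | boolean>;

interface MetricPoint {
  value: number;
  attributes: Attributes;
}

// Build one legend label from several group-by attributes,
// e.g. "region=us-east, status_code=500".
function buildGroupLabel(groupBy: string[], attributes: Attributes): string {
  return groupBy
    .map((key) => `${key}=${attributes[key] ?? "(none)"}`)
    .join(", ");
}

// Bucket points into series keyed by their composite label.
function groupPointsByAttributes(
  points: MetricPoint[],
  groupBy: string[],
): Map<string, number[]> {
  const series = new Map<string, number[]>();
  for (const point of points) {
    const label = buildGroupLabel(groupBy, point.attributes);
    const values = series.get(label) ?? [];
    values.push(point.value);
    series.set(label, values);
  }
  return series;
}
```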
### 1.4 Rollups / Downsampling
**Current**: Raw data only with 15-day default TTL. No rollups means long-range queries are slow and historical analysis is limited.
**Target**: Pre-aggregated rollups at multiple resolutions with extended retention.
**Implementation**:
- Create ClickHouse materialized views for automatic rollup:
```
Raw Data (1s resolution) -> 15-day retention
|-> Materialized View -> 1-min rollups -> 90-day retention
|-> Materialized View -> 1-hour rollups -> 13-month retention
|-> Materialized View -> 1-day rollups -> 3-year retention
```
- Each rollup table stores: min, max, sum, count, avg, and quantile sketches per metric name + attributes
- Route queries based on time range:
- < 6 hours: raw data
- 6 hours - 7 days: 1-min rollups
- 7 days - 30 days: 1-hour rollups
- 30+ days: 1-day rollups
- Automatic query routing in the metric service layer
**Files to modify**:
- `Common/Models/AnalyticsModels/MetricRollup1Min.ts` (new)
- `Common/Models/AnalyticsModels/MetricRollup1Hour.ts` (new)
- `Common/Models/AnalyticsModels/MetricRollup1Day.ts` (new)
- `Common/Server/Services/MetricService.ts` (query routing by time range)
- `Worker/DataMigrations/` (new migration to create materialized views)
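The query routing rule in 1.4 can be sketched as a small pure function. The source names (`rollup_1min` and so on) are placeholders, not actual table names, and the handling of exact boundaries (for example, a span of exactly 7 days) is an assumption because the plan does not specify it.

```typescript
// Hypothetical identifiers for the raw table and the three rollup tables.
type MetricSource = "raw" | "rollup_1min" | "rollup_1hour" | "rollup_1day";

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Pick the data source from the query span, following the thresholds in the plan.
// Assumption: each boundary value falls into the coarser tier.
function selectMetricSource(startMs: number, endMs: number): MetricSource {
  const spanMs = endMs - startMs;
  if (spanMs < 6 * HOUR_MS) {
    return "raw";
  }
  if (spanMs < 7 * DAY_MS) {
    return "rollup_1min";
  }
  if (spanMs < 30 * DAY_MS) {
    return "rollup_1hour";
  }
  return "rollup_1day";
}
```

A full implementation would probably also need to consider how old the requested range is: a one-hour query from 20 days ago falls outside the 15-day raw retention and would have to be served from a rollup even though its span is short.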
---
## Phase 2: Visualization & UX (P1) — Match Industry Standard
### 2.1 More Chart Types
**Current**: Line and bar charts only.
**Target**: Add Heatmap, Stacked Area, Pie/Donut, Scatter, Single-Value Billboard, and Gauge.
**Implementation**:
- **Heatmap**: Essential for histogram/distribution data. Use a heatmap library that renders time on X-axis, bucket values on Y-axis, and color intensity for count
- **Stacked Area**: Extension of existing line chart with fill and stacking
- **Pie/Donut**: For showing proportional breakdowns (e.g., request distribution by service)
- **Scatter**: For correlation analysis between two metrics
- **Billboard**: Large single-value display with configurable thresholds for color coding (green/yellow/red)
- **Gauge**: Circular gauge showing a value against a min/max range
**Files to modify**:
- `Common/Types/Dashboard/Chart/ChartType.ts` (add new chart types)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render new chart types)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricCharts.tsx` (chart type selector)
### 2.2 Time-Over-Time Comparison
**Current**: No comparison capability.
**Target**: Overlay current metric data with data from a previous period (1h ago, 1d ago, 1w ago).
**Implementation**:
- Add a "Compare with" dropdown in the metric explorer toolbar (options: 1 hour ago, 1 day ago, 1 week ago, custom)
- Execute the same query twice with shifted time ranges
- Render the comparison series as a dashed/translucent overlay on the same chart
- Show the delta (absolute and percentage) in tooltips
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricExplorer.tsx` (add compare dropdown)
- `Common/Types/Metrics/MetricsQuery.ts` (add compareWith field)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render comparison series)
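To make the "same query with shifted time ranges" step in 2.2 concrete, here is a minimal sketch of the range shift and the tooltip delta. The `CompareWith` values and helper names are illustrative assumptions; the custom offset option is omitted.

```typescript
type CompareWith = "1h" | "1d" | "1w";

interface TimeRange {
  startMs: number;
  endMs: number;
}

const COMPARE_OFFSETS_MS: Record<CompareWith, number> = {
  "1h": 60 * 60 * 1000,
  "1d": 24 * 60 * 60 * 1000,
  "1w": 7 * 24 * 60 * 60 * 1000,
};

// Shift the selected range back in time for the comparison query.
function shiftTimeRange(range: TimeRange, compareWith: CompareWith): TimeRange {
  const offset = COMPARE_OFFSETS_MS[compareWith];
  return { startMs: range.startMs - offset, endMs: range.endMs - offset };
}

// Absolute and percentage delta for tooltips; percent is null when the
// previous value is 0, because the ratio is undefined.
function computeDelta(
  current: number,
  previous: number,
): { absolute: number; percent: number | null } {
  const absolute = current - previous;
  const percent = previous === 0 ? null : (absolute / previous) * 100;
  return { absolute, percent };
}
```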
### 2.3 Render Metric Units on Charts
**Current**: Units stored in MetricType but not rendered on chart axes.
**Target**: Display units on Y-axis labels and tooltips with smart formatting.
**Implementation**:
- Pass `MetricType.unit` through to chart rendering
- Implement unit-aware formatting:
- Bytes: auto-convert to KB/MB/GB/TB
- Duration: auto-convert ns/us/ms/s
- Percentage: append `%`
- Rate: append `/s`
- Display formatted unit on Y-axis label and in tooltip values
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (Y-axis unit formatting)
- `Common/Utils/Metrics/UnitFormatter.ts` (new - unit formatting logic)
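A possible shape for the unit-aware formatting in 2.3 is sketched below. It assumes units arrive as UCUM-style strings such as `By`, `ns`, `ms`, `s`, and `%`, and that byte conversion uses a base of 1024; neither choice is confirmed by the plan, and the function names are hypothetical.

```typescript
// Scale a value through a list of unit labels, dividing by `base` at each step.
function scaleValue(value: number, units: string[], base: number): string {
  let scaled = value;
  let index = 0;
  while (Math.abs(scaled) >= base && index < units.length - 1) {
    scaled /= base;
    index++;
  }
  // parseFloat drops trailing zeros, e.g. "1.50" -> 1.5
  return `${parseFloat(scaled.toFixed(2))} ${units[index]}`;
}

function formatBytes(bytes: number): string {
  return scaleValue(bytes, ["B", "KB", "MB", "GB", "TB"], 1024);
}

function formatDurationNs(nanoseconds: number): string {
  return scaleValue(nanoseconds, ["ns", "us", "ms", "s"], 1000);
}

// Assumption: `unit` follows UCUM-style codes as commonly sent over OTLP.
function formatMetricValue(value: number, unit: string): string {
  switch (unit) {
    case "By":
      return formatBytes(value);
    case "ns":
      return formatDurationNs(value);
    case "ms":
      return formatDurationNs(value * 1e6);
    case "s":
      return formatDurationNs(value * 1e9);
    case "%":
      return `${value}%`;
    default:
      return unit ? `${value} ${unit}` : `${value}`;
  }
}
```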
### 2.4 Dashboard Templates
**Current**: No templates.
**Target**: Pre-built dashboards for common scenarios that auto-populate based on detected metrics.
**Implementation**:
- Create MetricsViewConfig templates for:
- HTTP Service Health (request rate, error rate, latency percentiles)
- Database Performance (query duration, connection pool, error rate)
- Kubernetes Metrics (CPU, memory, pod restarts, network)
- Host Metrics (CPU, memory, disk, network)
- Runtime Metrics (GC, heap, threads - per language)
- Auto-detect which templates are relevant based on ingested metric names
- "One-click apply" creates a dashboard from the template
**Files to modify**:
- `Common/Types/Metrics/DashboardTemplates/` (new directory with template definitions)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (new - template gallery)
### 2.5 Summary Metric Type Support
**Current**: Summary type not supported.
**Target**: Ingest and store Summary metrics from OTLP.
**Implementation**:
- Add `Summary` to the metric point type enum
- Store quantile values from summary data points
- Display summary quantiles in the metric explorer
**Files to modify**:
- `Telemetry/Services/OtelMetricsIngestService.ts` (handle summary type)
- `Common/Models/AnalyticsModels/Metric.ts` (add summary-specific columns if needed)
---
## Phase 3: Alerting & Intelligence (P1-P2) — Smart Monitoring
### 3.1 Anomaly Detection
**Current**: Static threshold alerting only.
**Target**: Detect metrics deviating from expected patterns using statistical methods.
**Implementation**:
- Start with rolling mean + N standard deviations (configurable sensitivity: low/medium/high)
- Account for daily/weekly seasonality by comparing to same-time-last-week baselines
- Store baselines in ClickHouse (periodic computation job, hourly)
- Baseline table: metric name, service, hour_of_week, mean, stddev
- On each evaluation: compare current value to baseline, alert if deviation > configured sigma
- Surface anomalies as visual highlights on metric charts (shaded band showing expected range)
**Files to modify**:
- `Common/Models/AnalyticsModels/MetricBaseline.ts` (new - baseline storage)
- `Worker/Jobs/Metrics/ComputeMetricBaselines.ts` (new - periodic baseline computation)
- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add anomaly detection)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render anomaly bands)
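The baseline comparison in 3.1 might be implemented along these lines. The mapping from sensitivity to sigma (low = 4, medium = 3, high = 2) is an assumption chosen for illustration, as are all names; the plan only states that sensitivity is configurable.

```typescript
type Sensitivity = "low" | "medium" | "high";

// Assumed mapping: higher sensitivity means a narrower expected band.
const SIGMA_BY_SENSITIVITY: Record<Sensitivity, number> = {
  low: 4,
  medium: 3,
  high: 2,
};

interface Baseline {
  mean: number;
  stddev: number;
}

// Index into a weekly baseline table: 0 = Sunday 00:00 UTC, 167 = Saturday 23:00 UTC.
function hourOfWeek(date: Date): number {
  return date.getUTCDay() * 24 + date.getUTCHours();
}

// Mean and population standard deviation of historical values for one hour-of-week slot.
function computeBaseline(values: number[]): Baseline | null {
  if (values.length === 0) {
    return null;
  }
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
  return { mean, stddev: Math.sqrt(variance) };
}

// Expected band, usable for the shaded region on charts.
function expectedRange(
  baseline: Baseline,
  sensitivity: Sensitivity,
): { lower: number; upper: number } {
  const width = SIGMA_BY_SENSITIVITY[sensitivity] * baseline.stddev;
  return { lower: baseline.mean - width, upper: baseline.mean + width };
}

function isAnomalous(
  value: number,
  baseline: Baseline,
  sensitivity: Sensitivity,
): boolean {
  const { lower, upper } = expectedRange(baseline, sensitivity);
  return value < lower || value > upper;
}
```

With a standard deviation of 0 (a perfectly flat history), any change counts as anomalous under this rule, so a minimum band width may be worth adding in practice.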
### 3.2 SLO/SLI Tracking
**Current**: No SLO support.
**Target**: Define Service Level Objectives based on metric queries, track attainment over rolling windows, calculate error budgets.
**Implementation**:
- Create `SLO` PostgreSQL model:
- Name, description, target percentage (e.g., 99.9%)
- SLI definition: good events query / total events query (both metric queries)
- Time window: 7-day, 28-day, or 30-day rolling
- Alert thresholds: error budget remaining %, burn rate
- SLO dashboard page showing:
- Current attainment vs target (e.g., 99.85% / 99.9%)
- Error budget remaining (absolute and percentage)
- Burn rate chart (current burn rate vs sustainable burn rate)
- SLI time series chart
- Alert when error budget drops below threshold or burn rate exceeds sustainable rate
- Integrate with existing monitor/incident system
**Files to modify**:
- `Common/Models/DatabaseModels/SLO.ts` (new)
- `Common/Server/Services/SLOService.ts` (new - SLI computation, budget calculation)
- `Worker/Jobs/SLO/EvaluateSLOs.ts` (new - periodic SLO evaluation)
- `App/FeatureSet/Dashboard/src/Pages/SLO/` (new - SLO list, detail, creation pages)
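The attainment, error budget, and burn rate figures in 3.2 reduce to simple arithmetic on good and total event counts. The sketch below shows one common formulation; the names are hypothetical, and it assumes the budget is evaluated over the full SLO window while the burn rate is evaluated over a shorter recent window.

```typescript
interface EventCounts {
  goodEvents: number;
  totalEvents: number;
}

// SLI attainment as a percentage, e.g. 99.85.
function attainmentPercent(counts: EventCounts): number | null {
  if (counts.totalEvents === 0) {
    return null;
  }
  return (counts.goodEvents / counts.totalEvents) * 100;
}

// Ratio of observed bad events to the bad events the target allows.
// 1.0 means the budget for this window is exactly used up.
function budgetConsumedRatio(
  counts: EventCounts,
  targetPercent: number,
): number | null {
  const allowedBad = counts.totalEvents * ((100 - targetPercent) / 100);
  if (counts.totalEvents === 0 || allowedBad <= 0) {
    return null; // no traffic, or a 100% target with no budget at all
  }
  return (counts.totalEvents - counts.goodEvents) / allowedBad;
}

// Error budget remaining over the full SLO window (may go negative when overspent).
function errorBudgetRemainingPercent(
  windowCounts: EventCounts,
  targetPercent: number,
): number | null {
  const consumed = budgetConsumedRatio(windowCounts, targetPercent);
  return consumed === null ? null : (1 - consumed) * 100;
}

// Burn rate over a short recent window: 1.0 is the sustainable rate,
// 2.0 would exhaust the budget in half the SLO window.
function burnRate(
  recentCounts: EventCounts,
  targetPercent: number,
): number | null {
  return budgetConsumedRatio(recentCounts, targetPercent);
}
```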
### 3.3 Metric Correlations
**Current**: No correlation capability.
**Target**: When an anomaly is detected, automatically identify other metrics that changed around the same time.
**Implementation**:
- When an anomaly is detected on a metric, query all metrics for the same service/project in the surrounding time window (e.g., +/- 30 minutes)
- Compute Pearson correlation coefficient between the anomalous metric and each candidate
- Rank by absolute correlation value
- Surface top 5-10 correlated metrics in the alert/incident view
- Show correlation chart: anomalous metric overlaid with top correlated metrics
**Files to modify**:
- `Common/Server/Services/MetricCorrelationService.ts` (new)
- `App/FeatureSet/Dashboard/src/Components/Metrics/CorrelatedMetrics.tsx` (new - correlation view)
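The Pearson ranking step in 3.3 can be sketched as follows. It assumes all series have already been aligned to the same time buckets, which the plan does not spell out, and the function names are hypothetical.

```typescript
// Pearson correlation coefficient of two equally long, time-aligned series.
// Returns null when it is undefined (too few points or a constant series).
function pearson(xs: number[], ys: number[]): number | null {
  const n = xs.length;
  if (n < 2 || ys.length !== n) {
    return null;
  }
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  if (varianceX === 0 || varianceY === 0) {
    return null;
  }
  return covariance / Math.sqrt(varianceX * varianceY);
}

// Rank candidate metrics by absolute correlation with the anomalous metric.
function rankCorrelatedMetrics(
  anomalous: number[],
  candidates: Record<string, number[]>,
  topN: number = 5,
): Array<{ name: string; correlation: number }> {
  const scored: Array<{ name: string; correlation: number }> = [];
  for (const name of Object.keys(candidates)) {
    const correlation = pearson(anomalous, candidates[name]);
    if (correlation !== null) {
      scored.push({ name, correlation });
    }
  }
  scored.sort((a, b) => Math.abs(b.correlation) - Math.abs(a.correlation));
  return scored.slice(0, topN);
}
```

Pearson only captures linear relationships at zero lag, so metrics that react with a delay may rank lower than expected.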
---
## Phase 4: Scale & Power Features (P2-P3) — Differentiation
### 4.1 Cardinality Management
**Current**: No cardinality visibility or controls.
**Target**: Track unique series count, alert on spikes, allow attribute allowlist/blocklist.
**Implementation**:
- Track unique series count per metric name (via periodic ClickHouse `uniq()` queries)
- Store in a dedicated cardinality tracking table
- Dashboard showing: top metrics by cardinality, cardinality trend over time, per-attribute breakdown
- Allow configuring attribute allowlists/blocklists per metric (applied at ingest time)
- Alert when cardinality exceeds configured budget
**Files to modify**:
- `Worker/Jobs/Metrics/TrackMetricCardinality.ts` (new - periodic cardinality computation)
- `Common/Models/DatabaseModels/MetricCardinalityConfig.ts` (new - allowlist/blocklist)
- `Telemetry/Services/OtelMetricsIngestService.ts` (apply attribute filtering)
- `App/FeatureSet/Dashboard/src/Pages/Settings/MetricCardinality.tsx` (new - cardinality dashboard)
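As a rough sketch of the ingest-time attribute filtering and series counting in 4.1: the config shape and function names below are assumptions, and the real series count would come from ClickHouse `uniq()` as the plan states rather than from an in-memory set.

```typescript
interface AttributeFilterConfig {
  allowlist?: string[]; // if set, only these attribute keys are kept
  blocklist?: string[]; // these attribute keys are always dropped
}

type StringAttributes = Record<string, string>;

// Apply the allowlist first, then the blocklist.
function filterAttributes(
  attributes: StringAttributes,
  config: AttributeFilterConfig,
): StringAttributes {
  const allowed = config.allowlist ? new Set(config.allowlist) : null;
  const blocked = new Set(config.blocklist ?? []);
  const result: StringAttributes = {};
  for (const key of Object.keys(attributes)) {
    if (allowed !== null && !allowed.has(key)) {
      continue;
    }
    if (blocked.has(key)) {
      continue;
    }
    result[key] = attributes[key];
  }
  return result;
}

// A series is identified by its full, order-independent attribute set.
function seriesKey(attributes: StringAttributes): string {
  const sortedKeys = Object.keys(attributes).sort();
  return JSON.stringify(sortedKeys.map((key) => [key, attributes[key]]));
}

function countUniqueSeries(points: StringAttributes[]): number {
  return new Set(points.map(seriesKey)).size;
}
```

Dropping a high-cardinality attribute such as a user ID before counting shows how much a blocklist entry would reduce the series count.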
### 4.2 Query Language
**Current**: Form-based UI only.
**Target**: Text-based metrics query language inspired by PromQL/NRQL for advanced users.
**Implementation**:
- Define a grammar supporting:
```
metric_name{attribute="value", attribute2=~"regex"}
| aggregation(duration)
by (attribute1, attribute2)
```
- Build a parser that translates to the existing ClickHouse query builder
- Offer both UI builder and text modes (toggle like New Relic's basic/advanced)
- Syntax highlighting and autocomplete in the text editor (metric names, attribute keys, functions)
- Functions: `rate()`, `delta()`, `avg()`, `sum()`, `min()`, `max()`, `p50()`, `p95()`, `p99()`, `count()`, `topk()`, `bottomk()`
**Files to modify**:
- `Common/Utils/Metrics/MetricsQueryLanguage.ts` (new - parser and translator)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryEditor.tsx` (new - text editor with autocomplete)
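To give a feel for the parser in 4.2, here is a deliberately small sketch that handles only the selector part of the grammar (`metric_name{attribute="value", attribute2=~"regex"}`), not the aggregation pipeline or the `by` clause. It is regex-based and does not reject stray text between matchers, so it is a starting point rather than a complete parser; all names are hypothetical.

```typescript
type MatchOperator = "=" | "=~";

interface AttributeMatcher {
  key: string;
  operator: MatchOperator; // "=" exact match, "=~" regex match
  value: string;
}

interface MetricSelector {
  metricName: string;
  matchers: AttributeMatcher[];
}

// Parse `name{key="value", key2=~"regex"}`; the braces are optional.
function parseMetricSelector(input: string): MetricSelector | null {
  const selector = /^\s*([A-Za-z_][\w.]*)\s*(?:\{(.*)\})?\s*$/.exec(input);
  if (selector === null) {
    return null;
  }
  const matchers: AttributeMatcher[] = [];
  const body = selector[2] ?? "";
  // key, operator, then a double-quoted value that may contain escaped characters
  const matcherPattern = /([A-Za-z_][\w.]*)\s*(=~|=)\s*"((?:[^"\\]|\\.)*)"/g;
  let match: RegExpExecArray | null;
  while ((match = matcherPattern.exec(body)) !== null) {
    matchers.push({
      key: match[1],
      operator: match[2] as MatchOperator,
      value: match[3],
    });
  }
  return { metricName: selector[1], matchers };
}
```

The resulting `MetricSelector` could then be translated into the existing query builder's filter structure, which keeps the text mode and the UI builder on the same execution path.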
### 4.3 Golden Signals Dashboards
**Current**: No auto-generated dashboards.
**Target**: Auto-generated dashboards showing Latency, Traffic, Errors, Saturation for each service.
**Implementation**:
- Detect common OpenTelemetry metric names per service:
- Latency: `http.server.duration`, `http.server.request.duration`
- Traffic: `http.server.request.count`, `http.server.active_requests`
- Errors: `http.server.request.count` where status_code >= 500
- Saturation: `process.runtime.*.memory`, `system.cpu.utilization`
- Auto-create a Golden Signals dashboard for each service with detected metrics
- Include default alert thresholds
**Files to modify**:
- `Worker/Jobs/Metrics/GenerateGoldenSignalsDashboards.ts` (new)
- `Common/Utils/Metrics/GoldenSignalsDetector.ts` (new - metric name pattern matching)
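The metric name pattern matching for 4.3 could be sketched as below, using the names listed in the plan. Treating `*` as "exactly one dot-separated segment" is an assumption. Errors are not a separate entry because the plan derives them from `http.server.request.count` filtered by status code.

```typescript
type GoldenSignal = "latency" | "traffic" | "saturation";

// Patterns taken from the plan; `*` matches one dot-separated segment (assumption).
const SIGNAL_PATTERNS: Record<GoldenSignal, string[]> = {
  latency: ["http.server.duration", "http.server.request.duration"],
  traffic: ["http.server.request.count", "http.server.active_requests"],
  saturation: ["process.runtime.*.memory", "system.cpu.utilization"],
};

// Escape regex metacharacters, then expand `*` into a single-segment wildcard.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, "[^.]+")}$`);
}

// Group a service's ingested metric names by the golden signal they can feed.
function detectGoldenSignals(
  metricNames: string[],
): Record<GoldenSignal, string[]> {
  const detected: Record<GoldenSignal, string[]> = {
    latency: [],
    traffic: [],
    saturation: [],
  };
  for (const signal of Object.keys(SIGNAL_PATTERNS) as GoldenSignal[]) {
    const matchers = SIGNAL_PATTERNS[signal].map(globToRegExp);
    detected[signal] = metricNames.filter((name) =>
      matchers.some((matcher) => matcher.test(name)),
    );
  }
  return detected;
}
```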
### 4.4 Predictive Alerting
**Current**: No forecasting capability.
**Target**: Forecast metric values and alert before thresholds are breached.
**Implementation**:
- Use linear regression or Holt-Winters on recent data to project forward
- Alert if projected value crosses threshold within configurable forecast window (e.g., "disk full in 4 hours")
- Particularly valuable for capacity planning metrics (disk, memory, connection pools)
- Show forecast as a dashed line extension on metric charts
**Files to modify**:
- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add predictive evaluation)
- `Common/Utils/Metrics/Forecasting.ts` (new - regression/Holt-Winters)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render forecast line)
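The linear-regression option in 4.4 can be illustrated with an ordinary least-squares fit and a time-to-threshold projection. This sketch covers only the linear case, not Holt-Winters, and the names and the choice of hours as the time unit are assumptions.

```typescript
interface SamplePoint {
  x: number; // time, in hours
  y: number; // metric value
}

// Ordinary least-squares fit of y = slope * x + intercept.
function fitLine(
  points: SamplePoint[],
): { slope: number; intercept: number } | null {
  const n = points.length;
  if (n < 2) {
    return null;
  }
  const meanX = points.reduce((acc, p) => acc + p.x, 0) / n;
  const meanY = points.reduce((acc, p) => acc + p.y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const p of points) {
    sxy += (p.x - meanX) * (p.y - meanY);
    sxx += (p.x - meanX) ** 2;
  }
  if (sxx === 0) {
    return null; // all samples share one timestamp
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Hours from the latest sample until the fitted line reaches `threshold`.
// Returns null if the trend is flat or the crossing is not in the future.
function hoursUntilThreshold(
  points: SamplePoint[],
  threshold: number,
): number | null {
  const line = fitLine(points);
  if (line === null || line.slope === 0) {
    return null;
  }
  const lastX = Math.max(...points.map((p) => p.x));
  const crossingX = (threshold - line.intercept) / line.slope;
  return crossingX > lastX ? crossingX - lastX : null;
}
```

A monitor could then alert when the returned value is smaller than the configured forecast window, for example when disk usage is projected to reach 100% within 4 hours.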
---
## ClickHouse Storage Improvements
### S.1 Fix Sort Key Order (CRITICAL)
**Current**: Sort key is `(projectId, time, serviceId)`.
**Target**: Change to `(projectId, name, serviceId, time)`.
**Impact**: ~100x improvement for name-filtered queries. Virtually every metric query filters by `name`, but currently ClickHouse must scan all metric names within the time range.
**Migration**: Requires creating `MetricItem_v2` with the new sort key and migrating data (ClickHouse's `ALTER TABLE ... MODIFY ORDER BY` can only append newly added columns to the sort key; it cannot reorder existing columns).
**Files to modify**:
- `Common/Models/AnalyticsModels/Metric.ts` (change sort key)
- `Worker/DataMigrations/` (new migration - create v2 table, backfill, swap)
### S.2 Upgrade time to DateTime64 (HIGH)
**Current**: `DateTime` with second precision.
**Target**: `DateTime64(3)` or `DateTime64(6)` for sub-second precision.
**Impact**: Correct sub-second metric ordering. Removes need for separate `timeUnixNano`/`startTimeUnixNano` columns.
**Files to modify**:
- `Common/Models/AnalyticsModels/Metric.ts` (change column type)
- `Common/Types/AnalyticsDatabase/TableColumnType.ts` (add DateTime64 type if not present)
- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle DateTime64)
- `Worker/DataMigrations/` (migration)
### S.3 Add Skip Index on metricPointType (MEDIUM)
**Current**: No index support for metric type filtering.
**Target**: Set skip index on `metricPointType`.
**Files to modify**:
- `Common/Models/AnalyticsModels/Metric.ts` (add skip index)
### S.4 Evaluate Map Type for Attributes (MEDIUM)
**Current**: Attributes stored as JSON.
**Target**: Evaluate `Map(LowCardinality(String), String)` for faster attribute-based filtering.
### S.5 Upgrade count/bucketCounts to Int64 (LOW)
**Current**: `Int32` for count and `Array(Int32)` for bucketCounts.
**Target**: `Int64` / `Array(Int64)` to prevent overflow in high-throughput systems.
---
## Quick Wins (Can Ship This Week)
1. **Display units on chart Y-axes** - Data exists in MetricType, just needs wiring to chart rendering
2. **Add p50/p95/p99 to aggregation dropdown** - ClickHouse `quantile()` is straightforward to add
3. **Extend default retention** - 15 days is too short; increase default to 30 days
4. **Multi-attribute GROUP BY** - Change `groupByAttribute: string` to `groupByAttribute: string[]`
5. **Add stacked area chart type** - Simple extension of existing line chart
6. **Add skip index on metricPointType** - Low effort, faster type-filtered queries
---
## Recommended Implementation Order
1. **Quick Wins** - Ship units on charts, p50/p95/p99, multi-attribute GROUP BY, stacked area
2. **Phase 1.1** - Percentile aggregations (full implementation beyond quick win)
3. **Phase 1.2** - Rate/derivative calculations
4. **S.1** - Fix sort key order (critical performance improvement)
5. **Phase 1.4** - Rollups/downsampling (enables long-range queries)
6. **Phase 2.1** - More chart types (heatmap, pie, gauge, billboard)
7. **Phase 2.2** - Time-over-time comparison
8. **Phase 1.3** - Multi-attribute GROUP BY (full implementation)
9. **S.2** - Upgrade time to DateTime64
10. **Phase 3.1** - Anomaly detection
11. **Phase 3.2** - SLO/SLI tracking
12. **Phase 2.4** - Dashboard templates
13. **Phase 4.1** - Cardinality management
14. **Phase 4.2** - Query language
15. **Phase 4.3** - Golden Signals dashboards
16. **Phase 4.4** - Predictive alerting
17. **Phase 3.3** - Metric correlations
## Verification
For each feature:
1. Unit tests for new aggregation types, rate calculations, unit formatting, query language parser
2. Integration tests for new API endpoints (percentiles, rollup queries, SLO evaluation)
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/metrics`
4. Check ClickHouse query performance with `EXPLAIN` for new query patterns
5. Verify rollup accuracy by comparing rollup results to raw data results for overlapping time ranges
6. Load test cardinality tracking and anomaly detection jobs to ensure they don't impact ingestion

412
Internal/Roadmap/Traces.md Normal file

@@ -0,0 +1,412 @@
# Plan: Bring OneUptime Traces to Industry Parity and Beyond
## Context
OneUptime's trace implementation provides OTLP-native ingestion (HTTP and gRPC), ClickHouse storage with a full OpenTelemetry span model (events, links, status, attributes, resources, scope), a Gantt/waterfall visualization, trace-to-log and trace-to-exception correlation, a basic service dependency graph, queue-based async ingestion, and per-service data retention with TTL. ClickHouse schema has been optimized with BloomFilter indexes on traceId/spanId/parentSpanId, Set indexes on statusCode/kind/hasException, TokenBF on name, and ZSTD compression on key columns.
This plan identifies the remaining gaps vs Datadog, New Relic, Honeycomb, and Grafana Tempo, and proposes a phased implementation to close them and surpass the competition.
## Completed
The following features have been implemented:
- **OTLP Ingestion** - HTTP and gRPC trace ingestion with async queue-based processing
- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
- **Gantt/Waterfall View** - Hierarchical span visualization with color-coded services, time-unit auto-scaling, error indicators
- **Trace-to-Log Correlation** - Log model has traceId/spanId columns; SpanViewer shows associated logs
- **Trace-to-Exception Correlation** - ExceptionInstance model links to traceId/spanId with stack trace parsing and fingerprinting
- **Span Detail Panel** - Side-over with tabs for Basic Info, Logs, Attributes, Events, Exceptions
- **BloomFilter indexes** on traceId, spanId, parentSpanId
- **Set indexes** on statusCode, kind, hasException
- **TokenBF index** on name
- **ZSTD compression** on time/ID/attribute columns
- **hasException boolean column** for fast error span filtering
- **links default value** corrected to `[]`
## Gap Analysis Summary
| Feature | OneUptime | Datadog | New Relic | Tempo/Honeycomb | Priority |
|---------|-----------|---------|----------|-----------------|----------|
| Trace analytics / aggregation engine | None | Trace Explorer with COUNT/percentiles | NRQL on span data | TraceQL rate/count/quantile | **P0** |
| RED metrics from traces | None | Auto-computed on 100% traffic | Derived golden signals | Metrics-generator to Prometheus | **P0** |
| Trace-based alerting | None | APM Monitors (p50-p99, error rate, Apdex) | NRQL alert conditions | Via Grafana alerting / Triggers | **P0** |
| Sampling controls | None (100% ingestion) | Head-based adaptive + retention filters | Infinite Tracing (tail-based) | Refinery (rules/dynamic/tail) | **P0** |
| Flame graph view | None | Yes (default view) | No | No | **P1** |
| Latency breakdown / critical path | None | Per-hop latency, bottleneck detection | No | BubbleUp (Honeycomb) | **P1** |
| In-trace search | None | Yes | No | No | **P1** |
| Per-trace service map | None | Yes (Map view) | No | No | **P1** |
| Trace-to-metric exemplars | None | Pivot from metric graph to traces | Metric-to-trace linking | Prometheus exemplars | **P1** |
| Custom metrics from spans | None | Generate count/distribution/gauge from tags | Via NRQL | SLOs from span data | **P2** |
| Structural trace queries | None | Trace Queries (multi-span relationships) | Via NRQL | TraceQL spanset pipelines | **P2** |
| Trace comparison / diffing | None | Partial | Side-by-side comparison | compare() in TraceQL | **P2** |
| AI/ML on traces | None | Watchdog (auto anomaly + RCA) | NRAI | BubbleUp (pattern detection) | **P3** |
| RUM correlation | None | Frontend-to-backend trace linking | Yes | Faro / frontend observability | **P3** |
| Continuous profiling | None | Code Hotspots (span-to-profile) | Partial | Pyroscope | **P3** |
---
## Phase 1: Analytics & Alerting Foundation (P0) — Highest Impact
Without these, users cannot answer basic questions like "is my service healthy?" from trace data.
### 1.1 Trace Analytics / Aggregation Engine
**Current**: Can list/filter individual spans and view individual traces. No way to aggregate or compute statistics.
**Target**: Full trace analytics supporting COUNT, AVG, SUM, MIN, MAX, P50/P75/P90/P95/P99 aggregations with GROUP BY on any span attribute and time-series bucketing.
**Implementation**:
- Build a trace analytics API endpoint that translates query configs into ClickHouse aggregation queries
- Use ClickHouse's native functions: `quantile(0.99)(durationUnixNano)`, `countIf(statusCode = 2)`, `toStartOfInterval(startTime, INTERVAL 1 MINUTE)`
- Support GROUP BY on service, span name, kind, status, and any custom attribute (via JSON extraction)
- Frontend: Add an "Analytics" tab to the Traces page with chart types (timeseries, top list, table) similar to the existing LogsAnalyticsView
- Support switching between "List" view (current) and "Analytics" view
**Files to modify**:
- `Common/Server/API/TelemetryAPI.ts` (add trace analytics endpoint)
- `Common/Server/Services/SpanService.ts` (add aggregation query methods)
- `Common/Types/Traces/TraceAnalyticsQuery.ts` (new - query interface)
- `App/FeatureSet/Dashboard/src/Pages/Traces/Index.tsx` (add analytics view toggle)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceAnalyticsView.tsx` (new - analytics UI)
### 1.2 RED Metrics from Traces (Request Rate, Error Rate, Duration)
**Current**: No automatic computation of service-level metrics from trace data.
**Target**: Auto-computed per-service, per-operation RED metrics displayed on a Service Overview page.
**Implementation**:
- Create a ClickHouse materialized view that aggregates spans into per-service, per-operation metrics at 1-minute intervals:
```sql
CREATE MATERIALIZED VIEW span_red_metrics
ENGINE = AggregatingMergeTree()
ORDER BY (projectId, serviceId, name, minute)
AS SELECT
projectId, serviceId, name,
toStartOfMinute(startTime) AS minute,
countState() AS request_count,
countIfState(statusCode = 2) AS error_count,
quantileState(0.50)(durationUnixNano) AS p50_duration,
quantileState(0.95)(durationUnixNano) AS p95_duration,
quantileState(0.99)(durationUnixNano) AS p99_duration
FROM SpanItem
GROUP BY projectId, serviceId, name, minute
```
- Build a Service Overview page showing: request rate chart, error rate chart, p50/p95/p99 latency charts
- Add an API endpoint to query the materialized view
**Files to modify**:
- `Common/Models/AnalyticsModels/SpanRedMetrics.ts` (new - materialized view model)
- `Telemetry/Services/SpanRedMetricsService.ts` (new - query service)
- `App/FeatureSet/Dashboard/src/Pages/Service/View/Overview.tsx` (new or enhanced - RED dashboard)
- `Worker/DataMigrations/` (new migration to create materialized view)
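
Because the view stores aggregate function states, reads have to finalize them with the matching `-Merge` combinators (`countMerge`, `countIfMerge`, `quantileMerge`). A minimal sketch of the read side follows; the row and point shapes, the parameter names, and the choice to report only p95 are assumptions for illustration.

```typescript
// Read query for the span_red_metrics view defined above.
// {name:Type} placeholders are ClickHouse query parameters (names assumed here).
const RED_METRICS_SQL: string = `
SELECT
  minute,
  countMerge(request_count) AS requests,
  countIfMerge(error_count) AS errors,
  quantileMerge(0.95)(p95_duration) AS p95Nanos
FROM span_red_metrics
WHERE projectId = {projectId:String}
  AND serviceId = {serviceId:String}
  AND minute >= {from:DateTime} AND minute < {to:DateTime}
GROUP BY minute
ORDER BY minute`;

// Hypothetical row shape returned by the query above.
interface RedRow {
  minute: string;
  requests: number;
  errors: number;
  p95Nanos: number;
}

// Hypothetical shape consumed by the Service Overview charts.
interface RedPoint {
  minute: string;
  requestsPerSecond: number;
  errorRatePercent: number;
  p95Millis: number;
}

function toRedPoint(row: RedRow): RedPoint {
  return {
    minute: row.minute,
    // Each row covers one minute, so divide by 60 for a per-second rate.
    requestsPerSecond: row.requests / 60,
    // Guard against division by zero for idle minutes.
    errorRatePercent: row.requests === 0 ? 0 : (row.errors * 100) / row.requests,
    p95Millis: row.p95Nanos / 1000000,
  };
}
```

Grouping by `minute` at read time is still needed because an AggregatingMergeTree table can hold several unmerged partial states for the same key.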
### 1.3 Trace-Based Alerting
**Current**: No ability to alert on trace data.
**Target**: Create alerts on p50/p75/p90/p95/p99 latency thresholds, error rate thresholds, and request rate anomalies per service/operation.
**Implementation**:
- Extend the existing monitor system to add a `TraceMonitor` type
- Monitor evaluates against the RED metrics materialized view (depends on 1.2)
- Alert conditions: latency exceeds threshold, error rate exceeds threshold, request rate drops below threshold
- Integrate with existing OneUptime alerting/incident system
- UI: Add "Trace Monitor" as a new monitor type in the monitor creation wizard
**Files to modify**:
- `Common/Types/Monitor/MonitorType.ts` (add Trace monitor type)
- `Common/Types/Monitor/MonitorStepTraceMonitor.ts` (new - trace monitor config)
- `Common/Server/Utils/Monitor/Criteria/TraceMonitorCriteria.ts` (new - evaluation logic)
- `App/FeatureSet/Dashboard/src/Components/Form/Monitor/TraceMonitor/` (new - monitor form UI)
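
The evaluation logic for the three alert conditions can be sketched as a pure function over one window of RED metrics. The type names below are hypothetical and do not reflect the existing monitor criteria classes.

```typescript
// Hypothetical metric names derived from the RED metrics view.
type TraceMetric = "p95LatencyMs" | "errorRatePercent" | "requestsPerSecond";

interface TraceCriterion {
  metric: TraceMetric;
  comparison: "above" | "below";
  threshold: number;
}

// One evaluation window of RED metrics for a service or operation.
type TraceWindow = Record<TraceMetric, number>;

// Returns the criteria that are currently breached for this window.
function breachedCriteria(
  window: TraceWindow,
  criteria: TraceCriterion[],
): TraceCriterion[] {
  return criteria.filter((criterion: TraceCriterion) => {
    const value: number = window[criterion.metric];
    return criterion.comparison === "above"
      ? value > criterion.threshold
      : value < criterion.threshold;
  });
}
```

For example, latency and error rate would use `"above"` while a request rate drop would use `"below"`; an empty result means no alert should fire for the window.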
### 1.4 Head-Based Probabilistic Sampling
**Current**: Ingests 100% of received traces.
**Target**: Configurable per-service probabilistic sampling with rules to always keep errors and slow traces.
**Implementation**:
- Create `TraceSamplingRule` PostgreSQL model: service filter, sample rate (0-100%), conditions to always keep (error status, duration > threshold)
- Evaluate sampling rules in `OtelTracesIngestService.ts` before ClickHouse insert
- Use deterministic sampling based on traceId hash (so all spans from the same trace are kept or dropped together)
- UI under Settings > Trace Configuration > Sampling Rules
- Show estimated storage savings
**Files to modify**:
- `Common/Models/DatabaseModels/TraceSamplingRule.ts` (new)
- `Telemetry/Services/OtelTracesIngestService.ts` (add sampling logic)
- Dashboard: new Settings page for sampling configuration
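
A minimal sketch of the deterministic keep/drop decision is shown below. It hashes the traceId with 32-bit FNV-1a (a choice made here for illustration; any stable hash works) and compares the result against the sample rate. The `SamplingRuleSketch` and `TraceFacts` shapes are hypothetical.

```typescript
// Hypothetical shapes; the real TraceSamplingRule model may differ.
interface SamplingRuleSketch {
  sampleRatePercent: number; // 0-100
  alwaysKeepErrors: boolean;
  alwaysKeepSlowerThanMs?: number;
}

interface TraceFacts {
  traceId: string;
  hasError: boolean;
  durationMs: number;
}

// 32-bit FNV-1a hash, used only to spread trace IDs evenly over [0, 2^32).
function fnv1a32(input: string): number {
  let hash: number = 0x811c9dc5;
  for (let i: number = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

function shouldKeep(trace: TraceFacts, rule: SamplingRuleSketch): boolean {
  if (rule.alwaysKeepErrors && trace.hasError) {
    return true;
  }
  if (
    rule.alwaysKeepSlowerThanMs !== undefined &&
    trace.durationMs > rule.alwaysKeepSlowerThanMs
  ) {
    return true;
  }
  // Same traceId always maps to the same bucket in [0, 1), so every span
  // of a trace gets the same probabilistic decision.
  const bucket: number = fnv1a32(trace.traceId) / 0x100000000;
  return bucket < rule.sampleRatePercent / 100;
}
```

Note that the always-keep conditions depend on facts that a single span may not carry: if they are evaluated per span at ingest, a dropped trace can still retain its error or slow spans, which is one reason the tail-based approach in Phase 4.1 buffers whole traces.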
---
## Phase 2: Visualization & Debugging UX (P1) — Industry-Standard Features
### 2.1 Flame Graph View
**Current**: Only Gantt/waterfall view.
**Target**: Flame graph visualization showing proportional time spent in each span, with service color coding.
**Implementation**:
- Build a flame graph component that renders spans as horizontally stacked rectangles proportional to duration
- Allow switching between Waterfall and Flame Graph views in TraceExplorer
- Color-code by service (consistent with waterfall view)
- Click a span rectangle to focus/zoom into that subtree
- Show tooltip with span name, service, duration, self-time on hover
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Traces/FlameGraph.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view toggle)
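
The layout step behind such a component can be sketched independently of rendering: each span becomes a rectangle whose row is its depth in the span tree and whose horizontal position and width are fractions of the total trace duration. The `FlameSpan` and `FlameRect` shapes are assumptions, times are millisecond offsets from the trace start, and parent links are assumed to be acyclic.

```typescript
// Hypothetical input shape; times are offsets in ms from the trace start.
interface FlameSpan {
  spanId: string;
  parentSpanId?: string;
  startMs: number;
  endMs: number;
}

// left and width are fractions (0..1) of the whole trace duration.
interface FlameRect {
  spanId: string;
  depth: number;
  left: number;
  width: number;
}

function layoutFlameGraph(spans: FlameSpan[]): FlameRect[] {
  if (spans.length === 0) {
    return [];
  }
  const traceStart: number = Math.min(...spans.map((s: FlameSpan) => s.startMs));
  const traceEnd: number = Math.max(...spans.map((s: FlameSpan) => s.endMs));
  const total: number = Math.max(traceEnd - traceStart, 1); // avoid divide by zero

  const byId: Map<string, FlameSpan> = new Map();
  for (const span of spans) {
    byId.set(span.spanId, span);
  }

  // Depth = number of ancestors present in this trace (memoized).
  const depthCache: Map<string, number> = new Map();
  const depthOf = (span: FlameSpan): number => {
    const cached: number | undefined = depthCache.get(span.spanId);
    if (cached !== undefined) {
      return cached;
    }
    const parent: FlameSpan | undefined =
      span.parentSpanId !== undefined ? byId.get(span.parentSpanId) : undefined;
    const depth: number = parent ? depthOf(parent) + 1 : 0;
    depthCache.set(span.spanId, depth);
    return depth;
  };

  return spans.map((span: FlameSpan): FlameRect => ({
    spanId: span.spanId,
    depth: depthOf(span),
    left: (span.startMs - traceStart) / total,
    width: (span.endMs - span.startMs) / total,
  }));
}
```

Keeping the layout as fractions lets the component scale rectangles to any container width and makes the click-to-zoom interaction a matter of re-running the layout on a subtree.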
### 2.2 Latency Breakdown / Critical Path Analysis
**Current**: Shows individual span durations but no automated analysis.
**Target**: Compute and display critical path, self-time vs child-time, and bottleneck identification.
**Implementation**:
- Compute critical path: the longest sequential chain of spans through the trace (accounts for parallelism)
- Calculate "self time" per span: `span.duration - covered(child intervals)`, where overlapping (parallel) child intervals are merged before subtracting so they are not double-counted; clamp the result to 0
- Display latency breakdown by service: percentage of total trace time spent in each service
- Highlight bottleneck spans (spans contributing most to critical path duration)
- Add "Critical Path" toggle in TraceExplorer that highlights the critical path spans
**Files to modify**:
- `Common/Utils/Traces/CriticalPath.ts` (new - critical path algorithm)
- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (show self-time)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add critical path view)
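
For the self-time part, summing child durations double-counts children that run in parallel, so the sketch below subtracts the union of child intervals instead. The `TimedSpan` shape is hypothetical and times are millisecond offsets; the critical path computation itself is not covered here.

```typescript
// Hypothetical input shape; times are offsets in ms from the trace start.
interface TimedSpan {
  spanId: string;
  parentSpanId?: string;
  startMs: number;
  endMs: number;
}

// Returns spanId -> self time (duration not covered by any direct child).
function computeSelfTimes(spans: TimedSpan[]): Map<string, number> {
  const childrenByParent: Map<string, TimedSpan[]> = new Map();
  for (const span of spans) {
    if (span.parentSpanId === undefined) {
      continue;
    }
    const siblings: TimedSpan[] = childrenByParent.get(span.parentSpanId) ?? [];
    siblings.push(span);
    childrenByParent.set(span.parentSpanId, siblings);
  }

  const selfTimes: Map<string, number> = new Map();
  for (const span of spans) {
    // Clip children to the parent's bounds, then sweep them in start order.
    const intervals = (childrenByParent.get(span.spanId) ?? [])
      .map((child: TimedSpan) => ({
        start: Math.max(child.startMs, span.startMs),
        end: Math.min(child.endMs, span.endMs),
      }))
      .filter((interval) => interval.end > interval.start)
      .sort((a, b) => a.start - b.start);

    let covered: number = 0;
    let cursor: number = span.startMs; // everything before cursor is already counted
    for (const interval of intervals) {
      const from: number = Math.max(interval.start, cursor);
      if (interval.end > from) {
        covered += interval.end - from;
        cursor = interval.end;
      }
    }
    selfTimes.set(span.spanId, Math.max(span.endMs - span.startMs - covered, 0));
  }
  return selfTimes;
}
```

With a 100 ms parent and two overlapping children at 10-50 ms and 30-70 ms, the covered time is 60 ms, so the parent's self time is 40 ms rather than the 20 ms a plain sum would give.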
### 2.3 In-Trace Span Search
**Current**: TraceExplorer shows all spans with service filtering and error toggle, but no text search.
**Target**: Search box to filter spans by name, attribute values, or status within the current trace.
**Implementation**:
- Add a search input in TraceExplorer toolbar
- Client-side filtering: match span name, service name, attribute keys/values against search text
- Highlight matching spans in the waterfall/flame graph
- Show match count (e.g., "3 of 47 spans")
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add search bar and filtering)
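
The client-side filter and match count can be sketched as below. The `SearchableSpan` shape is hypothetical; matching is a case-insensitive substring test over the span name, service name, and attribute keys and values.

```typescript
// Hypothetical, flattened view of a span for search purposes.
interface SearchableSpan {
  name: string;
  serviceName: string;
  attributes: Record<string, string>;
}

interface SpanSearchResult {
  matches: SearchableSpan[];
  label: string; // e.g. "3 of 47 spans"
}

// needle must already be lower-cased.
function spanMatches(span: SearchableSpan, needle: string): boolean {
  const haystack: string[] = [span.name, span.serviceName];
  for (const [key, value] of Object.entries(span.attributes)) {
    haystack.push(key, value);
  }
  return haystack.some((text: string) => text.toLowerCase().includes(needle));
}

function searchSpans(spans: SearchableSpan[], text: string): SpanSearchResult {
  const needle: string = text.trim().toLowerCase();
  // An empty search box matches everything.
  const matches: SearchableSpan[] =
    needle === "" ? spans : spans.filter((s: SearchableSpan) => spanMatches(s, needle));
  return { matches, label: `${matches.length} of ${spans.length} spans` };
}
```

Returning the matches rather than removing non-matching spans lets the waterfall keep its full structure and only highlight the hits.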
### 2.4 Per-Trace Service Flow Map
**Current**: Service dependency graph exists globally but not per-trace.
**Target**: Per-trace visualization showing the path of a request through services with latency annotations.
**Implementation**:
- Build a directed graph from the spans in a single trace (services as nodes, calls as edges)
- Annotate edges with call count and latency
- Color-code nodes by error status
- Add as a new view tab alongside Waterfall and Flame Graph
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceServiceMap.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view tab)
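
Building the per-trace graph is a single pass over the spans: every parent/child pair that crosses a service boundary contributes to an edge. The shapes below are hypothetical, and edge latency is taken as the summed duration of the called (child) spans, which is one possible annotation.

```typescript
// Hypothetical input shape for one span of the trace.
interface FlowSpan {
  spanId: string;
  parentSpanId?: string;
  serviceName: string;
  durationMs: number;
  hasError: boolean;
}

interface FlowEdge {
  from: string;
  to: string;
  callCount: number;
  totalDurationMs: number;
}

interface ServiceFlow {
  nodes: Map<string, { hasError: boolean }>;
  edges: FlowEdge[];
}

function buildServiceFlow(spans: FlowSpan[]): ServiceFlow {
  const byId: Map<string, FlowSpan> = new Map();
  for (const span of spans) {
    byId.set(span.spanId, span);
  }

  const nodes: Map<string, { hasError: boolean }> = new Map();
  const edges: Map<string, FlowEdge> = new Map();
  for (const span of spans) {
    // A service node is marked as errored if any of its spans errored.
    const node = nodes.get(span.serviceName) ?? { hasError: false };
    node.hasError = node.hasError || span.hasError;
    nodes.set(span.serviceName, node);

    const parent: FlowSpan | undefined =
      span.parentSpanId !== undefined ? byId.get(span.parentSpanId) : undefined;
    if (!parent || parent.serviceName === span.serviceName) {
      continue; // root span or in-process call: no cross-service edge
    }
    const key: string = `${parent.serviceName}->${span.serviceName}`;
    const edge: FlowEdge = edges.get(key) ?? {
      from: parent.serviceName,
      to: span.serviceName,
      callCount: 0,
      totalDurationMs: 0,
    };
    edge.callCount += 1;
    edge.totalDurationMs += span.durationMs;
    edges.set(key, edge);
  }
  return { nodes, edges: Array.from(edges.values()) };
}
```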
### 2.5 Span Link Navigation
**Current**: Links data is stored in spans but not navigable in the UI.
**Target**: Clickable links in the span detail panel that navigate to related traces/spans.
**Implementation**:
- In the SpanViewer detail panel, render the `links` array as clickable items
- Each link shows the linked traceId, spanId, and relationship type
- Clicking navigates to the linked trace view
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (render clickable links)
---
## Phase 3: Advanced Analytics & Correlation (P2) — Power Features
### 3.1 Trace-to-Metric Exemplars
**Current**: Metric model has no traceId/spanId fields.
**Target**: Link metric data points to trace IDs; show exemplar dots on metric charts that navigate to traces.
**Implementation**:
- Add optional `traceId` and `spanId` columns to the Metric ClickHouse model
- During metric ingestion, extract exemplar trace/span IDs from OTLP exemplar fields
- On metric charts, render exemplar dots at data points that have associated traces
- Clicking an exemplar dot navigates to the trace view
**Files to modify**:
- `Common/Models/AnalyticsModels/Metric.ts` (add traceId/spanId columns)
- `Telemetry/Services/OtelMetricsIngestService.ts` (extract exemplars)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render exemplar dots)
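
The extraction step during ingestion might look like the sketch below. It models only the subset of the OTLP/JSON exemplar fields needed here (`traceId`, `spanId`, `timeUnixNano`, `asDouble`, `asInt`); the `ExemplarRef` output shape is hypothetical, and how the result maps onto the new Metric columns is left open.

```typescript
// Subset of the OTLP/JSON exemplar shape. 64-bit integers are encoded as
// strings in OTLP/JSON, so asInt and timeUnixNano are typed accordingly.
interface OtlpExemplar {
  traceId?: string;
  spanId?: string;
  timeUnixNano?: string;
  asDouble?: number;
  asInt?: string | number;
}

interface OtlpNumberDataPoint {
  exemplars?: OtlpExemplar[];
}

// Hypothetical shape stored alongside a metric data point.
interface ExemplarRef {
  traceId: string;
  spanId: string;
  timeUnixNano: string;
  value: number;
}

function extractExemplars(point: OtlpNumberDataPoint): ExemplarRef[] {
  const refs: ExemplarRef[] = [];
  for (const exemplar of point.exemplars ?? []) {
    // Exemplars without trace context cannot link to a trace, so skip them.
    if (!exemplar.traceId || !exemplar.spanId) {
      continue;
    }
    const value: number = exemplar.asDouble ?? Number(exemplar.asInt ?? NaN);
    refs.push({
      traceId: exemplar.traceId,
      spanId: exemplar.spanId,
      timeUnixNano: exemplar.timeUnixNano ?? "0",
      value,
    });
  }
  return refs;
}
```

Since a data point can carry several exemplars while the plan adds a single `traceId`/`spanId` pair per metric row, the ingest service would need a selection policy, for example keeping the first or the most recent exemplar.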
### 3.2 Custom Metrics from Spans
**Current**: No way to create persistent metrics from trace data.
**Target**: Users define custom metrics from span attributes that are computed via ClickHouse materialized views and available for alerting and dashboards.
**Implementation**:
- Create `SpanDerivedMetric` model: name, filter query (which spans), aggregation (count/avg/p99 of what field), GROUP BY attributes
- Use ClickHouse materialized views for efficient computation
- Surface derived metrics in the metric explorer and alerting system
**Files to modify**:
- `Common/Models/DatabaseModels/SpanDerivedMetric.ts` (new)
- `Common/Server/Services/SpanDerivedMetricService.ts` (new)
- Dashboard: UI for defining derived metrics
### 3.3 Structural Trace Queries
**Current**: Can only filter on individual span attributes.
**Target**: Query traces based on properties of multiple spans and their relationships (e.g., "find traces where service A called service B and B returned an error").
**Implementation**:
- Design a visual query builder for structural queries (easier adoption than a query language)
- Translate structural queries to ClickHouse subqueries with JOINs on traceId
- Example: "Find traces where span with service=frontend has child span with service=database AND duration > 500ms"
```sql
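-- s1 is the parent span (frontend); s2 is its direct child (database)
SELECT DISTINCT s1.traceId FROM SpanItem s1
JOIN SpanItem s2 ON s1.traceId = s2.traceId AND s1.spanId = s2.parentSpanId
WHERE s1.projectId = {pid}
AND s2.projectId = {pid} -- scope both sides of the join to the project
AND JSONExtractString(s1.attributes, 'service.name') = 'frontend'
AND JSONExtractString(s2.attributes, 'service.name') = 'database'
AND s2.durationUnixNano > 500000000 -- 500 ms in nanoseconds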
```
**Files to modify**:
- `Common/Types/Traces/StructuralTraceQuery.ts` (new - query model)
- `Common/Server/Services/SpanService.ts` (add structural query execution)
- `App/FeatureSet/Dashboard/src/Components/Traces/StructuralQueryBuilder.tsx` (new - visual builder)
### 3.4 Trace Comparison / Diffing
**Current**: No way to compare traces.
**Target**: Side-by-side comparison of two traces of the same operation, highlighting differences in span count, latency, and structure.
**Implementation**:
- Add "Compare" action to trace list (select two traces)
- Build a diff view showing: added/removed spans, latency differences per span, structural changes
- Useful for comparing a slow trace to a fast trace of the same operation
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceComparison.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Traces/Compare.tsx` (new page)
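
One simple way to compute the diff is to summarize each trace per operation (service plus span name) and compare the summaries. The sketch below does that under hypothetical type names; it covers span count and latency differences but not structural (parent/child) changes.

```typescript
// Hypothetical, flattened span shape for comparison.
interface DiffSpan {
  serviceName: string;
  name: string;
  durationMs: number;
}

interface OperationStats {
  count: number;
  totalMs: number;
}

interface OperationDiff {
  operation: string;
  status: "added" | "removed" | "changed" | "unchanged";
  countDelta: number;
  durationDeltaMs: number;
}

function summarize(spans: DiffSpan[]): Map<string, OperationStats> {
  const stats: Map<string, OperationStats> = new Map();
  for (const span of spans) {
    const key: string = `${span.serviceName} / ${span.name}`;
    const entry: OperationStats = stats.get(key) ?? { count: 0, totalMs: 0 };
    entry.count += 1;
    entry.totalMs += span.durationMs;
    stats.set(key, entry);
  }
  return stats;
}

// Deltas are candidate minus baseline, per operation, sorted by operation key.
function diffTraces(baseline: DiffSpan[], candidate: DiffSpan[]): OperationDiff[] {
  const before: Map<string, OperationStats> = summarize(baseline);
  const after: Map<string, OperationStats> = summarize(candidate);
  const keys: string[] = Array.from(
    new Set([...Array.from(before.keys()), ...Array.from(after.keys())]),
  ).sort();

  return keys.map((operation: string): OperationDiff => {
    const b: OperationStats | undefined = before.get(operation);
    const a: OperationStats | undefined = after.get(operation);
    const countDelta: number = (a?.count ?? 0) - (b?.count ?? 0);
    const durationDeltaMs: number = (a?.totalMs ?? 0) - (b?.totalMs ?? 0);
    let status: OperationDiff["status"] = "unchanged";
    if (!b) {
      status = "added";
    } else if (!a) {
      status = "removed";
    } else if (countDelta !== 0 || durationDeltaMs !== 0) {
      status = "changed";
    }
    return { operation, status, countDelta, durationDeltaMs };
  });
}
```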
---
## Phase 4: Competitive Differentiation (P3) — Long-Term
### 4.1 Rules-Based and Tail-Based Sampling
**Current**: Phase 1 adds head-based probabilistic sampling.
**Target**: Rules-based sampling (always keep errors/slow traces, sample successes) and eventually tail-based sampling (buffer complete traces, decide after seeing all spans).
**Implementation**:
- Rules engine: configurable conditions (service, status, duration, attributes) with per-rule sample rates
- Tail-based: buffer spans for a configurable window (30s), assemble complete traces, then apply retention decisions
- Tail-based is complex; consider integrating with OpenTelemetry Collector's tail sampling processor as an alternative
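
To make the buffering idea concrete, here is a deliberately small in-memory sketch: spans are grouped by traceId, and once a trace's window has elapsed it is either returned for storage or dropped. The class and field names are hypothetical, the clock is passed in explicitly, and concerns such as memory limits, late spans, and multi-instance ingestion are ignored.

```typescript
interface BufferedSpan {
  traceId: string;
  isError: boolean;
  durationMs: number;
}

interface PendingTrace {
  firstSeenMs: number;
  spans: BufferedSpan[];
}

class TailSamplingBuffer {
  private readonly pending: Map<string, PendingTrace> = new Map();
  private readonly windowMs: number;
  private readonly slowMs: number;

  constructor(windowMs: number, slowMs: number) {
    this.windowMs = windowMs;
    this.slowMs = slowMs;
  }

  // Buffer a span; the window starts when the first span of a trace arrives.
  add(span: BufferedSpan, nowMs: number): void {
    const entry: PendingTrace | undefined = this.pending.get(span.traceId);
    if (entry) {
      entry.spans.push(span);
    } else {
      this.pending.set(span.traceId, { firstSeenMs: nowMs, spans: [span] });
    }
  }

  // Returns traces whose window has elapsed and that contain an error or a
  // slow span; expired traces that match neither condition are dropped.
  flush(nowMs: number): BufferedSpan[][] {
    const kept: BufferedSpan[][] = [];
    this.pending.forEach((entry: PendingTrace, traceId: string) => {
      if (nowMs - entry.firstSeenMs < this.windowMs) {
        return; // still inside the buffering window
      }
      this.pending.delete(traceId);
      const interesting: boolean = entry.spans.some(
        (s: BufferedSpan) => s.isError || s.durationMs > this.slowMs,
      );
      if (interesting) {
        kept.push(entry.spans);
      }
    });
    return kept;
  }
}
```

Even this toy version shows why the plan flags tail-based sampling as complex: every in-flight trace is held in memory for the full window, and all spans of a trace must reach the same buffer.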
### 4.2 AI/ML on Trace Data
- **Anomaly detection** on RED metrics (statistical deviation from baseline)
- **Auto-surfacing correlated attributes** when latency spikes (similar to Honeycomb BubbleUp)
- **Natural language trace queries** ("show me slow database calls from the last hour")
- **Automatic root cause analysis** from trace data during incidents
### 4.3 RUM (Real User Monitoring) Correlation
- Browser SDK that propagates W3C trace context from frontend to backend
- Link frontend page loads, interactions, and web vitals to backend traces
- Show end-to-end user experience from browser to backend services
### 4.4 Continuous Profiling Integration
- Integrate with a profiling backend (e.g., Pyroscope)
- Link profile data to span time windows
- Show "Code Hotspots" within spans (similar to Datadog)
---
## ClickHouse Storage Improvements
### S.1 Migrate `attributes` to Map(String, String) (HIGH)
**Current**: `attributes` is stored as opaque `String` (JSON). Querying by attribute value requires `LIKE` or `JSONExtract()` scans.
**Target**: `Map(String, String)` type enabling `attributes['http.method'] = 'GET'` without JSON parsing.
**Impact**: Significant query speedup for attribute-based span filtering -- the most common query pattern after time-range filtering.
**Files to modify**:
- `Common/Models/AnalyticsModels/Span.ts` (change column type)
- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle Map type)
- `Telemetry/Services/OtelTracesIngestService.ts` (write Map format)
- `Worker/DataMigrations/` (new migration)
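
Writing the Map format means every attribute value has to become a string. A sketch of one possible conversion follows: nested objects are flattened into dot-separated keys and non-string values are JSON-encoded. This flattening convention is an assumption made for illustration, not a decision recorded in this plan.

```typescript
// Converts a parsed attributes object into string -> string pairs suitable
// for a ClickHouse Map(String, String) column.
function toAttributeMap(
  attributes: Record<string, unknown>,
  prefix: string = "",
): Record<string, string> {
  const flat: Record<string, string> = {};
  for (const [key, value] of Object.entries(attributes)) {
    const path: string = prefix === "" ? key : `${prefix}.${key}`;
    if (value === null || value === undefined) {
      continue; // skip empty values instead of storing "null"
    }
    if (typeof value === "object" && !Array.isArray(value)) {
      // Nested object: recurse and merge the flattened keys.
      Object.assign(flat, toAttributeMap(value as Record<string, unknown>, path));
    } else if (typeof value === "string") {
      flat[path] = value;
    } else {
      // Numbers, booleans, and arrays are stored as their JSON text.
      flat[path] = JSON.stringify(value);
    }
  }
  return flat;
}
```

Because every value is stored as a string, numeric comparisons such as `http.status_code >= 500` would need a cast at query time, which is worth weighing against the faster equality lookups.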
### S.2 Add Aggregation Projection (MEDIUM)
**Current**: `projections: []` is empty.
**Target**: Pre-aggregation projection for common dashboard queries.
```sql
PROJECTION agg_by_service (
SELECT
serviceId,
toStartOfMinute(startTime) AS minute,
count(),
avg(durationUnixNano),
quantile(0.99)(durationUnixNano)
GROUP BY serviceId, minute
)
```
**Impact**: Expected to speed up aggregation queries for service overview dashboards (the 5-10x figure is an estimate; confirm with benchmarks on production-sized data).
### S.3 Add Trace-by-ID Projection (LOW)
**Current**: Trace detail view relies on BloomFilter skip index for traceId lookups.
**Target**: Projection sorted by `(projectId, traceId, startTime)` for faster trace-by-ID queries.
---
## Quick Wins (Can Ship This Week)
1. **In-trace span search** - Add a text filter in TraceExplorer (a few hours of work)
2. **Self-time calculation** - Show "self time" (span duration minus child durations) in SpanViewer
3. **Span link navigation** - Links data is stored but not clickable in UI
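4. **Top-N slowest operations** - Simple ClickHouse query: group by span `name` and sort by latency, e.g. `GROUP BY name ORDER BY quantile(0.95)(durationUnixNano) DESC LIMIT N` (a plain `ORDER BY durationUnixNano DESC LIMIT N` returns the slowest individual spans instead)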
5. **Error rate by service** - Aggregate `statusCode=2` counts grouped by serviceId
6. **Trace duration distribution histogram** - Use ClickHouse's parametric `histogram(number_of_bins)(durationUnixNano)` aggregate function
7. **Span count per service display** - Already tracked in `servicesInTrace`, just needs better display
---
## Recommended Implementation Order
1. **Phase 1.1** - Trace Analytics Engine (highest impact, unlocks everything else)
2. **Phase 1.2** - RED Metrics from Traces (prerequisite for alerting, service overview)
3. **Quick Wins** - Ship in-trace search, self-time, span links, top-N operations
4. **Phase 1.3** - Trace-Based Alerting (core observability workflow)
5. **Phase 2.1** - Flame Graph View (industry-standard visualization)
6. **Phase 2.2** - Critical Path Analysis (key debugging capability)
7. **Phase 1.4** - Head-Based Sampling (essential for high-volume users)
8. **S.1** - Migrate attributes to Map type (storage optimization)
9. **Phase 2.3-2.5** - Per-trace map, plus any in-trace search and span link work not already shipped as Quick Wins
10. **Phase 3.1** - Trace-to-Metric Exemplars
11. **Phase 3.2-3.4** - Custom metrics, structural queries, comparison
12. **Phase 4.x** - AI/ML, RUM, profiling (long-term)
## Verification
For each feature:
1. Unit tests for new query builders, critical path algorithm, sampling logic
2. Integration tests for new API endpoints (analytics, RED metrics, sampling)
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/traces`
4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries
5. Verify trace correlation (logs, exceptions, metrics) still works correctly with new features
6. Load test sampling logic to ensure it doesn't add ingestion latency

@@ -1 +1 @@
10.0.32
10.0.33