mirror of
https://github.com/OneUptime/oneuptime.git
synced 2026-04-06 00:32:12 +02:00
feat: Add comprehensive metrics and traces roadmap for industry parity
- Introduced detailed plans for enhancing OneUptime's metrics and traces capabilities to match and exceed industry standards.
- Metrics roadmap includes features like percentile aggregations, rate calculations, multi-attribute grouping, rollups, and advanced visualizations.
- Traces roadmap outlines improvements such as trace analytics, RED metrics, trace-based alerting, and enhanced visualization options like flame graphs and critical path analysis.
- Both roadmaps emphasize phased implementation, quick wins, and verification strategies to ensure robust feature delivery and performance.
2
.github/workflows/release.yml
vendored
@@ -18,7 +18,7 @@ on:
         default: false
 
 permissions:
-  contents: read
+  contents: write
   packages: write
 
 jobs:
||||
2
.github/workflows/test-release.yaml
vendored
@@ -10,7 +10,7 @@ on:
       - "master"
 
 permissions:
-  contents: read
+  contents: write
   packages: write
 
 jobs:
||||
568
Internal/Roadmap/Dashboards.md
Normal file
@@ -0,0 +1,568 @@
# Plan: Bring OneUptime Dashboards to Industry Parity and Beyond

## Context

OneUptime's dashboard implementation provides a 12-column grid layout with drag-and-drop editing, 3 widget types (Chart with Line/Bar, Value, Text with basic formatting), global time range with presets, view/edit modes, role-based permissions, and full-screen support. Dashboard config is stored as a single JSON column. Dashboards can only query OpenTelemetry metrics from ClickHouse.

This plan identifies the remaining gaps vs Grafana, Datadog, and New Relic, and proposes a phased implementation to build a best-in-class dashboard product that leverages OneUptime's unique position as an all-in-one observability + status page platform.

## Completed

The following features have been implemented:

- **12-Column Grid Layout** - Fixed grid with dynamic unit sizing, 60 default rows (expandable)
- **Drag-and-Drop Editing** - Move and resize components with bounds checking
- **Chart Widget** - Line and Bar chart types with single metric query, configurable title/description/legend
- **Value Widget** - Single metric aggregation displayed as large number
- **Text Widget** - Bold/Italic/Underline formatting (no markdown)
- **Global Time Range** - Presets (30min to 3mo) + custom date range picker
- **View/Edit Modes** - Read-only view with full-screen, edit mode with side panel settings
- **Role-Based Permissions** - ProjectOwner, ProjectAdmin, ProjectMember + custom permissions
- **Dashboard CRUD API** - Standard REST API with slug generation
- **Billing Enforcement** - Free plan limited to 1 dashboard

## Gap Analysis Summary

| Feature | OneUptime | Grafana | Datadog | New Relic | Priority |
|---------|-----------|---------|---------|-----------|----------|
| Widget types | 3 | 20+ | 40+ | 15+ | **P0** |
| Chart types | 2 (Line, Bar) | 10+ | 12+ | 10+ | **P0** |
| Template variables | None | 6+ types | Yes | 3 types | **P0** |
| Auto-refresh | None | Configurable | Real-time | Yes | **P0** |
| Log panels | None | Yes (Loki) | Yes | Yes (NRQL) | **P0** |
| Trace panels | None | Yes (Tempo) | Yes | Yes | **P0** |
| Table widget | None | Yes | Yes | Yes | **P0** |
| Multiple queries per chart | Single query | Yes | Yes | Yes | **P0** |
| Markdown support | Basic formatting only | Full markdown | Full markdown | Full markdown | **P0** |
| Threshold lines / color coding | None | Yes | Yes | Yes | **P0** |
| Legend interaction (show/hide) | None | Yes | Yes | Yes | **P0** |
| Chart zoom | None | Yes | Yes | Yes | **P0** |
| Dashboard linking / drill-down | None | Data links | Yes | Facet linking | **P1** |
| Annotations / event overlays | None | Yes | Yes | Yes (Labs) | **P1** |
| Row/section grouping | None | Collapsible rows | Groups | No | **P1** |
| Public/shared dashboards | None | Yes | Yes | Yes | **P1** |
| JSON import/export | None | Yes | Yes | Yes | **P1** |
| Dashboard versioning | None | Yes | Yes | No | **P1** |
| Alert integration | None | Create from panel + show state | Yes | NRQL alerts | **P1** |
| TV/Kiosk mode | Full-screen only | Kiosk mode | Yes | Auto-cycling | **P1** |
| CSV export | None | Yes | Yes | Yes | **P1** |
| Custom time per widget | None | No | No | No | **P1** |
| AI dashboard creation | None | None | None | None | **P2** |
| Dashboard-as-code SDK | None | Foundation SDK | No | No | **P2** |
| Terraform provider | None | Yes | Yes | Yes | **P2** |

---

## Phase 1: Foundation (P0) - Close Critical Gaps

These gaps make OneUptime dashboards fundamentally non-competitive. Every major competitor has these.

### 1.1 Add Core Chart Types: Area, Pie, Table, Gauge, Heatmap, Histogram

**Current**: Line and Bar only.
**Target**: 8+ chart types covering all standard observability visualization needs.

**Implementation**:

- **Area Chart** (stacked and non-stacked): Extension of line chart with fill. Use existing chart library's area mode
- **Pie/Donut Chart**: For proportional breakdowns (e.g., error distribution by service). New component
- **Table Widget**: Tabular metric data, top-N lists, multi-column display with sortable columns. New component type
- **Gauge Widget**: Circular gauge with configurable min/max/thresholds and color zones. New component
- **Heatmap**: Time on X-axis, value buckets on Y-axis, color intensity for count. Essential for distribution/histogram metrics
- **Histogram**: Bar chart showing value distribution. Important for latency analysis

Each chart type needs:
- A new entry in `DashboardComponentType` or `ChartType` enum
- A rendering component in `Dashboard/Components/`
- Configuration options in the component settings side panel

**Files to modify**:
- `Common/Types/Dashboard/Chart/ChartType.ts` (add Area, Pie, Heatmap, Histogram, Gauge)
- `Common/Types/Dashboard/DashboardComponentType.ts` (add Table, Gauge)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render new types)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTableComponent.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardGaugeComponent.tsx` (new)

### 1.2 Template Variables

**Current**: No template variables. Users must create separate dashboards for each service/host/environment.
**Target**: Drop-down variable selectors that dynamically filter all widgets.

**Implementation**:

- Create a `DashboardVariable` type stored in `dashboardViewConfig`:
  - Name, label, type (query-based, custom list, text input)
  - Query-based: runs a ClickHouse query to populate options (e.g., `SELECT DISTINCT service FROM MetricItem WHERE projectId = {pid}`)
  - Custom list: manually defined options
  - Multi-value selection support
- Render variables as dropdown selectors in the dashboard toolbar
- Variables can be referenced in metric queries as `$variable_name`
- When a variable changes, all widgets re-query with the new value
- Support cascading variables (variable B's query depends on variable A's value)

**Files to modify**:
- `Common/Types/Dashboard/DashboardVariable.ts` (new)
- `Common/Types/Dashboard/DashboardViewConfig.ts` (add variables array)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (render variable dropdowns)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass variable values to widgets)
- `Common/Server/Services/MetricService.ts` (resolve variable references in queries)
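
The `$variable_name` substitution described above could be sketched roughly as follows. This is a minimal illustration, not OneUptime code: `resolveVariables` and `quoteLiteral` are hypothetical names, and the escaping shown is a simplification. A real implementation would more likely bind values as query parameters than splice them into the query text.

```typescript
// Hypothetical helper: names and shapes are illustrative, not OneUptime APIs.
type VariableValue = string | string[];

// Quote a value as a SQL string literal, escaping backslashes and single quotes.
function quoteLiteral(value: string): string {
  return "'" + value.replace(/\\/g, "\\\\").replace(/'/g, "\\'") + "'";
}

// Replace each `$name` in the query with the quoted value(s) of that variable.
// Multi-value selections expand to a parenthesised list suitable for `IN`.
function resolveVariables(
  query: string,
  values: Record<string, VariableValue>,
): string {
  return query.replace(
    /\$([A-Za-z_][A-Za-z0-9_]*)/g,
    (match: string, name: string): string => {
      if (!Object.prototype.hasOwnProperty.call(values, name)) {
        return match; // unknown variable: leave the reference untouched
      }
      const value = values[name] as VariableValue;
      return Array.isArray(value)
        ? "(" + value.map(quoteLiteral).join(", ") + ")"
        : quoteLiteral(value);
    },
  );
}
```

Leaving unknown references untouched makes a misspelled variable visible in the resulting query instead of silently turning it into an empty string.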

### 1.3 Auto-Refresh

**Current**: Data goes stale after initial load.
**Target**: Configurable auto-refresh intervals.

**Implementation**:

- Add auto-refresh dropdown in toolbar with options: Off, 5s, 10s, 30s, 1m, 5m, 15m
- Store selected interval in dashboard config and URL state
- Use `setInterval` to trigger re-fetch on all metric widgets
- Show a subtle refresh indicator when data is being updated
- Pause auto-refresh when the dashboard is in edit mode

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (add refresh dropdown)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (implement refresh timer)
- `Common/Types/Dashboard/DashboardViewConfig.ts` (store refresh interval)
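
One way the refresh timer could behave is sketched below, assuming a single callback that re-fetches every metric widget. `AutoRefreshController` and `REFRESH_OPTIONS_MS` are hypothetical names. In a React view this logic would probably live in an effect hook rather than a class.

```typescript
// Illustrative only: class and option names are assumptions, not OneUptime code.
const REFRESH_OPTIONS_MS: Record<string, number | null> = {
  Off: null,
  "5s": 5000,
  "10s": 10000,
  "30s": 30000,
  "1m": 60000,
  "5m": 300000,
  "15m": 900000,
};

class AutoRefreshController {
  private timer: ReturnType<typeof setInterval> | null = null;
  private intervalMs: number | null = null;
  private editing = false;
  private readonly onRefresh: () => void;

  constructor(onRefresh: () => void) {
    this.onRefresh = onRefresh;
  }

  // Called when the user picks an option from the toolbar dropdown.
  selectInterval(label: string): void {
    if (!Object.prototype.hasOwnProperty.call(REFRESH_OPTIONS_MS, label)) {
      throw new Error(`Unknown refresh interval: ${label}`);
    }
    this.intervalMs = REFRESH_OPTIONS_MS[label] ?? null;
    this.restart();
  }

  // Auto-refresh is paused for as long as the dashboard is in edit mode.
  setEditMode(isEditing: boolean): void {
    this.editing = isEditing;
    this.restart();
  }

  isRunning(): boolean {
    return this.timer !== null;
  }

  // Stop the timer; call this when the dashboard view unmounts.
  dispose(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }

  private restart(): void {
    this.dispose();
    if (this.intervalMs !== null && !this.editing) {
      this.timer = setInterval(this.onRefresh, this.intervalMs);
    }
  }
}
```

Restarting the timer on every change keeps exactly one interval alive, which avoids duplicate fetches when the user switches options quickly.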

### 1.4 Multiple Queries per Chart

**Current**: Single `MetricQueryConfigData` per chart.
**Target**: Overlay multiple metric series on a single chart for correlation.

**Implementation**:

- Change chart component's data source from single `MetricQueryConfigData` to `MetricQueryConfigData[]`
- Each query gets its own alias and legend entry
- Support formula references across queries (e.g., `a / b * 100`)
- Y-axis: support dual Y-axes for metrics with different scales

**Files to modify**:
- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (change to array)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render multiple series)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (multi-query config UI)
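
A formula such as `a / b * 100` could be evaluated point by point across the aliased series. The sketch below assumes all series are already aligned on the same timestamps and supports only `+ - * /`, parentheses, numbers, and aliases. `evaluateFormula` and `evaluateScalar` are hypothetical names, not existing OneUptime functions.

```typescript
// Evaluate one arithmetic expression with aliases bound to scalar values.
// Small recursive-descent parser: expression -> term -> factor.
function evaluateScalar(tokens: string[], vars: Record<string, number>): number {
  let pos = 0;
  const peek = (): string | undefined => tokens[pos];
  const next = (): string | undefined => tokens[pos++];

  function parseExpression(): number {
    let value = parseTerm();
    while (peek() === "+" || peek() === "-") {
      const op = next();
      const rhs = parseTerm();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  function parseTerm(): number {
    let value = parseFactor();
    while (peek() === "*" || peek() === "/") {
      const op = next();
      const rhs = parseFactor();
      value = op === "*" ? value * rhs : value / rhs;
    }
    return value;
  }

  function parseFactor(): number {
    const token = next();
    if (token === undefined) throw new Error("Unexpected end of formula");
    if (token === "(") {
      const value = parseExpression();
      if (next() !== ")") throw new Error("Expected ')'");
      return value;
    }
    if (token === "-") return -parseFactor(); // unary minus
    if (/^\d/.test(token)) return Number(token);
    if (!Object.prototype.hasOwnProperty.call(vars, token)) {
      throw new Error(`Unknown query alias: ${token}`);
    }
    return vars[token] as number;
  }

  const result = parseExpression();
  if (pos !== tokens.length) throw new Error(`Unexpected token: ${tokens[pos]}`);
  return result;
}

// Apply the formula to every index of the aliased series (shortest series wins).
function evaluateFormula(formula: string, series: Record<string, number[]>): number[] {
  const tokens: string[] = formula.match(/[A-Za-z_]\w*|\d+(?:\.\d+)?|[-+*/()]/g) ?? [];
  if (tokens.join("") !== formula.replace(/\s+/g, "")) {
    throw new Error("Formula contains unsupported characters");
  }
  const aliases = Object.keys(series);
  const length = aliases.length === 0
    ? 0
    : Math.min(...aliases.map((alias) => (series[alias] as number[]).length));
  const output: number[] = [];
  for (let i = 0; i < length; i++) {
    const vars: Record<string, number> = {};
    for (const alias of aliases) {
      vars[alias] = (series[alias] as number[])[i] as number;
    }
    output.push(evaluateScalar(tokens, vars));
  }
  return output;
}
```

Parsing instead of calling `eval` keeps user-supplied formulas from executing arbitrary code. Division by zero yields `Infinity` or `NaN` here, which a chart would need to render as a gap.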

### 1.5 Full Markdown Support for Text Widget

**Current**: Only bold, italic, underline formatting.
**Target**: Full markdown rendering including headers, links, lists, code blocks, tables, and images.

**Implementation**:

- Replace the current custom formatting with a markdown renderer (e.g., `react-markdown` or `marked`)
- Support: headers (h1-h6), links, ordered/unordered lists, code blocks with syntax highlighting, tables, images, blockquotes
- Edit mode: raw markdown text area with preview toggle

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTextComponent.tsx` (replace with markdown renderer)
- `Common/Utils/Dashboard/Components/DashboardTextComponent.ts` (store raw markdown)

### 1.6 Threshold Lines & Color Coding

**Current**: No threshold visualization.
**Target**: Configurable warning/critical thresholds on charts with color-coded regions.

**Implementation**:

- Add threshold configuration to chart settings: value, label, color (default: yellow for warning, red for critical)
- Render horizontal lines on the chart at threshold values
- Optionally fill regions above/below thresholds with translucent color
- For value/billboard widgets: change background color based on which threshold range the value falls in (green/yellow/red)

**Files to modify**:
- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (add thresholds config)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render threshold lines)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardValueComponent.tsx` (color coding)
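
For the value widget, picking the background color from the threshold ranges might look like the sketch below. It assumes that higher values are worse and that a value at or above a threshold takes that threshold's color. `Threshold` and `colorForValue` are hypothetical names.

```typescript
// Hypothetical threshold shape mirroring the settings listed above.
interface Threshold {
  value: number;
  label: string;
  color: string;
}

// Return the colour of the highest threshold the value has reached,
// or the base colour when it is below every threshold.
function colorForValue(
  value: number,
  thresholds: Threshold[],
  baseColor: string = "green",
): string {
  const sorted = thresholds.slice().sort((a, b) => a.value - b.value);
  let color = baseColor;
  for (const threshold of sorted) {
    if (value >= threshold.value) {
      color = threshold.color;
    }
  }
  return color;
}
```

Metrics where lower is worse (for example, free disk space) would need an inverted comparison, so a direction flag on the threshold config may be worth considering.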

### 1.7 Legend Interaction (Show/Hide Series)

**Current**: Legends are display-only.
**Target**: Click legend items to toggle series visibility.

**Implementation**:

- Add click handler on legend items to toggle series visibility
- Clicked-off series should be visually dimmed in the legend and removed from the chart
- Support "isolate" mode: Ctrl+Click shows only that series and hides all others
- Persist toggled state during the session (reset on page reload)

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add legend click handlers)
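
The toggle and isolate behaviors can be expressed as pure functions over the list of hidden series, which keeps them easy to test apart from the chart. This is a sketch with hypothetical names (`toggleSeries`, `isolateSeries`). Showing everything on a second Ctrl+Click is an assumption, not something the plan specifies.

```typescript
// Click: flip one series between hidden and visible.
function toggleSeries(hidden: readonly string[], name: string): string[] {
  return hidden.indexOf(name) >= 0
    ? hidden.filter((series) => series !== name)
    : hidden.concat([name]);
}

// Ctrl+Click: show only `name`. If it is already the only visible series,
// show everything again (assumed behaviour).
function isolateSeries(
  allSeries: readonly string[],
  hidden: readonly string[],
  name: string,
): string[] {
  const others = allSeries.filter((series) => series !== name);
  const alreadyIsolated =
    hidden.indexOf(name) < 0 &&
    others.every((series) => hidden.indexOf(series) >= 0);
  return alreadyIsolated ? [] : others;
}
```

Holding the result in component state gives the session-only persistence described above, since the state is discarded on page reload.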

### 1.8 Chart Zoom (Click-Drag Time Selection)

**Current**: No zoom capability.
**Target**: Click and drag on a time series chart to zoom into a time range.

**Implementation**:

- Enable brush/selection mode on time series charts
- When user drags to select a range, update the global time range to the selected range
- Show a "Reset zoom" button to return to the previous time range
- Maintain a zoom stack so users can zoom in multiple times and zoom back out

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add brush selection)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (handle time range updates from zoom)
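
The zoom stack mentioned above could be a small history of time ranges, sketched here with epoch milliseconds. `ZoomStack` and `ZoomRange` are hypothetical names. Whether "Reset zoom" goes back one level or all the way to the original range is a product decision, so the sketch offers both.

```typescript
interface ZoomRange {
  startMs: number;
  endMs: number;
}

class ZoomStack {
  private readonly history: ZoomRange[] = [];
  private currentRange: ZoomRange;

  constructor(initial: ZoomRange) {
    this.currentRange = initial;
  }

  get current(): ZoomRange {
    return this.currentRange;
  }

  get canZoomOut(): boolean {
    return this.history.length > 0;
  }

  // Called with the brushed selection; accepts drags in either direction.
  zoomIn(selection: ZoomRange): void {
    const startMs = Math.min(selection.startMs, selection.endMs);
    const endMs = Math.max(selection.startMs, selection.endMs);
    if (startMs === endMs) {
      return; // ignore empty drags (plain clicks)
    }
    this.history.push(this.currentRange);
    this.currentRange = { startMs, endMs };
  }

  // Step back one zoom level.
  zoomOut(): void {
    const previous = this.history.pop();
    if (previous !== undefined) {
      this.currentRange = previous;
    }
  }

  // Return to the range that was active before the first zoom.
  reset(): void {
    const original = this.history[0];
    if (original !== undefined) {
      this.currentRange = original;
      this.history.length = 0;
    }
  }
}
```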

---

## Phase 2: Observability Integration (P0-P1) - Leverage the Full Platform

This is where OneUptime can differentiate: metrics, logs, and traces in one platform.

### 2.1 Log Stream Widget

**Current**: Dashboards can only show metrics.
**Target**: Widget that displays a live log stream with filtering.

**Implementation**:

- New `DashboardComponentType.LogStream` widget type
- Configuration: log query filter, severity filter, service filter, max rows
- Renders as a scrolling log list with severity color coding, timestamp, and body
- Click a log entry to expand and see full details
- Respects dashboard time range and template variables

**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add LogStream)
- `Common/Utils/Dashboard/Components/DashboardLogStreamComponent.ts` (new - config)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardLogStreamComponent.tsx` (new - rendering)

### 2.2 Trace List Widget

**Current**: No trace visualization in dashboards.
**Target**: Widget showing a filtered trace list with duration and status.

**Implementation**:

- New `DashboardComponentType.TraceList` widget type
- Configuration: service filter, operation filter, status filter, min duration
- Renders as a table: trace ID, operation, service, duration, status, timestamp
- Click a row to navigate to the full trace view
- Respects dashboard time range and template variables

**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add TraceList)
- `Common/Utils/Dashboard/Components/DashboardTraceListComponent.ts` (new)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTraceListComponent.tsx` (new)

### 2.3 Click-to-Correlate Across Signals

**Current**: No cross-signal correlation in dashboards.
**Target**: Click a point on a metric chart to instantly see related logs and traces from that timestamp.

**Implementation**:

- When clicking a data point on a metric chart, open a correlation panel showing:
  - Logs from the same service and time window (+/- 5 minutes around the clicked point)
  - Traces from the same service and time window
  - Filtered by the same template variables
- The correlation panel appears as a slide-over or split view below the chart
- This is a major differentiator vs Grafana (which requires separate datasources) and ties into OneUptime's all-in-one advantage

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add click handler)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/CorrelationPanel.tsx` (new - shows correlated logs/traces)
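
The click handler mainly has to turn the clicked timestamp into a shared query window for the log and trace lookups. A minimal sketch follows, assuming template variable values are plain strings. `CorrelationWindow` and `buildCorrelationWindow` are hypothetical names.

```typescript
interface CorrelationWindow {
  service: string;
  startTime: Date;
  endTime: Date;
  filters: Record<string, string>; // current template variable values
}

// Build the +/- N minute window around the clicked data point.
function buildCorrelationWindow(
  clickedAt: Date,
  service: string,
  variables: Record<string, string>,
  windowMinutes: number = 5,
): CorrelationWindow {
  const halfWindowMs = windowMinutes * 60 * 1000;
  return {
    service,
    startTime: new Date(clickedAt.getTime() - halfWindowMs),
    endTime: new Date(clickedAt.getTime() + halfWindowMs),
    filters: { ...variables }, // copy so later variable changes do not leak in
  };
}
```

Passing the same window object to both the log and trace queries keeps the two lists in the panel aligned on identical boundaries.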

### 2.4 Annotations / Event Overlays

**Current**: No event markers on charts.
**Target**: Show deployment events, incidents, and alerts as vertical markers on time series charts.

**Implementation**:

- Query OneUptime's own data for events in the chart's time range:
  - Incidents (from Incident model)
  - Deployments (can be sent as OTLP resource attributes or a custom event API)
  - Alert triggers (from monitor alert history)
- Render as vertical dashed lines with icons on hover
- Color-code by type: red for incidents, blue for deployments, yellow for alerts
- Allow users to add manual annotations (text + timestamp)

**Files to modify**:
- `Common/Types/Dashboard/DashboardAnnotation.ts` (new)
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render annotation markers)
- `Common/Server/API/DashboardAnnotationAPI.ts` (new - query events)

### 2.5 Alert Integration

**Current**: No connection between dashboards and alerting.
**Target**: Create alerts from dashboard panels and display alert state on panels.

**Implementation**:

- "Create Alert" button in chart settings that pre-fills a metric monitor with the chart's query
- Show alert state indicator on chart headers (green/yellow/red dot) based on associated monitor status
- Alert status widget: shows a summary of all active alerts with severity and duration

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (add "Create Alert" button)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (show alert state)
- `Common/Types/Dashboard/DashboardComponentType.ts` (add AlertStatus widget type)

### 2.6 SLO/SLI Widget

**Current**: No SLO visualization.
**Target**: Dedicated widget showing SLO status, error budget burn rate, and remaining budget.

**Implementation** (depends on Metrics roadmap Phase 3.2 - SLO/SLI Tracking):

- New `DashboardComponentType.SLO` widget type
- Configuration: select an SLO definition
- Displays: current attainment (%), target (%), error budget remaining (%), burn rate chart
- Color-coded: green (healthy), yellow (burning fast), red (budget exhausted)

**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add SLO)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSLOComponent.tsx` (new)

---

## Phase 3: Collaboration & Sharing (P1) - Production Workflows

### 3.1 Public/Shared Dashboards

**Current**: Dashboards require login.
**Target**: Share dashboards with external stakeholders without requiring authentication.

**Implementation**:

- Add `isPublic` flag and `publicAccessToken` to Dashboard model
- Generate a shareable URL with token: `/public/dashboard/{token}`
- Public view is read-only with no editing controls
- Option to restrict public access to specific IP ranges
- Render without the OneUptime navigation chrome

**Files to modify**:
- `Common/Models/DatabaseModels/Dashboard.ts` (add isPublic, publicAccessToken)
- `App/FeatureSet/Dashboard/src/Pages/Public/Dashboard.tsx` (new - public dashboard view)

### 3.2 JSON Import/Export

**Current**: No import/export capability.
**Target**: Export dashboards as JSON and re-import for backup, migration, and dashboard-as-code.

**Implementation**:

- Export: serialize `dashboardViewConfig` + metadata (name, description, variables) as a JSON file download
- Import: upload a JSON file, validate schema, create a new dashboard from the config
- Handle version compatibility (include a schema version in the export)

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Dashboards.tsx` (add import button)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/Settings.tsx` (add export button)
- `Common/Server/API/DashboardImportExportAPI.ts` (new)
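
The export envelope and the import-side validation could look roughly like this. The field names (`schemaVersion`, `name`, `description`, `dashboardViewConfig`) and the version number are assumptions for illustration. The real export format is not defined in this plan.

```typescript
// Assumed export envelope; bump the version whenever the config shape changes.
const DASHBOARD_SCHEMA_VERSION = 1;

interface DashboardExportFile {
  schemaVersion: number;
  name: string;
  description: string;
  dashboardViewConfig: Record<string, unknown>;
}

function exportDashboard(
  name: string,
  description: string,
  dashboardViewConfig: Record<string, unknown>,
): string {
  const file: DashboardExportFile = {
    schemaVersion: DASHBOARD_SCHEMA_VERSION,
    name,
    description,
    dashboardViewConfig,
  };
  return JSON.stringify(file, null, 2);
}

// Parse and validate an uploaded file before creating a dashboard from it.
function importDashboard(json: string): DashboardExportFile {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Export must be a JSON object");
  }
  const candidate = parsed as Record<string, unknown>;
  const version = candidate["schemaVersion"];
  if (version !== DASHBOARD_SCHEMA_VERSION) {
    throw new Error(`Unsupported schema version: ${String(version)}`);
  }
  const name = candidate["name"];
  if (typeof name !== "string" || name === "") {
    throw new Error("Export is missing a dashboard name");
  }
  const config = candidate["dashboardViewConfig"];
  if (typeof config !== "object" || config === null || Array.isArray(config)) {
    throw new Error("Export is missing dashboardViewConfig");
  }
  const description = candidate["description"];
  return {
    schemaVersion: DASHBOARD_SCHEMA_VERSION,
    name,
    description: typeof description === "string" ? description : "",
    dashboardViewConfig: config as Record<string, unknown>,
  };
}
```

Rejecting unknown versions outright is the simplest policy. A migration step per older version could replace the check once the format has evolved.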

### 3.3 Dashboard Versioning

**Current**: No change history.
**Target**: Track changes to dashboards over time with the ability to view history and revert.

**Implementation**:

- Create `DashboardVersion` model: dashboardId, version number, config snapshot, changedBy, timestamp
- On each save, create a new version entry
- UI: "Version History" in settings showing a list of versions with timestamps and authors
- "Restore" button to revert to a previous version
- Optional: diff view comparing two versions

**Files to modify**:
- `Common/Models/DatabaseModels/DashboardVersion.ts` (new)
- `Common/Server/Services/DashboardService.ts` (create version on save)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/VersionHistory.tsx` (new)

### 3.4 Row/Section Grouping

**Current**: Components placed freely with no grouping.
**Target**: Collapsible rows/sections for organizing related panels.

**Implementation**:

- Add a "Section" component type that acts as a collapsible container
- Section has a title bar that can be clicked to collapse/expand
- When collapsed, hides all components within the section's vertical range
- Sections can be nested one level deep

**Files to modify**:
- `Common/Types/Dashboard/DashboardComponentType.ts` (add Section)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSectionComponent.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/Index.tsx` (handle section collapse)

### 3.5 TV/Kiosk Mode

**Current**: Full-screen only.
**Target**: Dedicated kiosk mode optimized for wall-mounted monitors with auto-cycling.

**Implementation**:

- Kiosk mode: hides all chrome (toolbar, navigation, URL bar), shows only the dashboard grid
- Auto-cycle: rotate through a list of dashboards at a configurable interval (30s, 1m, 5m)
- Dashboard playlist: define an ordered list of dashboards to cycle through
- Support per-dashboard display duration

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Kiosk.tsx` (new - kiosk view)
- `Common/Models/DatabaseModels/DashboardPlaylist.ts` (new - playlist model)

### 3.6 CSV Export

**Current**: No data export.
**Target**: Export chart/table data as CSV for offline analysis.

**Implementation**:

- Add "Export CSV" option in chart/table context menu
- Client-side: serialize the current rendered data to CSV format
- Include column headers, timestamps, and values
- Trigger browser download

**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add export option)
- `Common/Utils/Dashboard/CSVExport.ts` (new - CSV serialization)
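
The serialization step is small but easy to get wrong around quoting. The sketch below follows the common RFC 4180 conventions (fields containing commas, quotes, or line breaks are wrapped in double quotes, and embedded quotes are doubled). `escapeCsvField` and `toCsv` are hypothetical names, and the browser download step is left out.

```typescript
type CsvCell = string | number;

// Quote a field only when it contains a comma, a double quote, or a line break.
function escapeCsvField(value: CsvCell): string {
  const text = String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Serialize a header row plus data rows, using CRLF line endings.
function toCsv(headers: string[], rows: CsvCell[][]): string {
  const table: CsvCell[][] = [headers, ...rows];
  return table
    .map((row) => row.map(escapeCsvField).join(","))
    .join("\r\n");
}
```

If exported files are opened in spreadsheet software, cells that start with `=`, `+`, `-`, or `@` may be interpreted as formulas, so user-controlled label values might deserve extra sanitizing.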

### 3.7 Custom Time Range per Widget

**Current**: All widgets share the global time range.
**Target**: Individual widgets can override the global time range.

**Implementation**:

- Add optional `timeRangeOverride` to each component's config
- When set, the widget uses its own time range instead of the global one
- Show a small clock icon on widgets with custom time ranges
- Configuration in the component settings side panel

**Files to modify**:
- `Common/Utils/Dashboard/Components/DashboardBaseComponent.ts` (add timeRangeOverride)
- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass per-widget time ranges)

---

## Phase 4: Differentiation (P2-P3) - Surpass Competition

### 4.1 AI-Powered Dashboard Creation

**Current**: Manual dashboard creation only.
**Target**: Natural language dashboard creation - "Show me CPU usage by service for the last 24 hours" auto-creates the right widget.

**Implementation**:

- Natural language input in the "Add Widget" dialog
- AI translates to: metric name, aggregation, group by, chart type, time range
- Uses available MetricType metadata to match metric names
- Preview the generated widget before adding to dashboard
- This is a feature NO competitor has done well yet - major differentiator

### 4.2 Pre-Built Dashboard Templates

**Current**: No templates.
**Target**: One-click dashboard templates for common stacks.

**Implementation**:

- Template library: Node.js, Python, Go, Java, Kubernetes, PostgreSQL, Redis, Nginx, MongoDB, etc.
- Auto-detect relevant templates based on ingested telemetry data
- "One-click create" instantiates a full dashboard from the template
- Community template sharing (future)

### 4.3 Auto-Generated Dashboards

**Current**: Users must manually build dashboards.
**Target**: When a service connects, auto-generate a relevant dashboard.

**Implementation**:

- On first telemetry ingest from a new service, analyze the metric names and types
- Auto-create a service dashboard with relevant charts based on detected metrics
- Include golden signals (latency, traffic, errors, saturation) where applicable
- Notify the user and link to the auto-generated dashboard

### 4.4 Customer-Facing Dashboards on Status Pages

**Current**: Status pages and dashboards are separate.
**Target**: Embed dashboard widgets on status pages for real-time performance visibility.

**Implementation**:

- Allow selecting specific dashboard widgets to embed on a status page
- Render widgets in read-only mode without internal navigation
- Respect public/private data boundaries (only show metrics the customer should see)
- This is unique to OneUptime - no competitor has integrated observability dashboards with status pages

### 4.5 Dashboard-as-Code SDK

**Current**: No programmatic dashboard creation.
**Target**: TypeScript SDK for defining dashboards as code.

**Implementation**:

```typescript
const dashboard = new Dashboard("Service Health")
  .addVariable("service", { type: "query", query: "SELECT DISTINCT service FROM MetricItem" })
  .addRow("Latency")
    .addChart({ metric: "http.server.duration", aggregation: "p99", groupBy: ["$service"] })
    .addChart({ metric: "http.server.duration", aggregation: "p50", groupBy: ["$service"] })
  .addRow("Throughput")
    .addChart({ metric: "http.server.request.count", aggregation: "rate", groupBy: ["$service"] });
```

- SDK generates the JSON config and uses the Dashboard API to create/update
- Git-based provisioning: store dashboard definitions in repo, CI/CD syncs to OneUptime

### 4.6 Anomaly Detection Overlays

**Current**: No anomaly visualization.
**Target**: AI highlights anomalous data points on charts without manual threshold configuration.

**Implementation** (depends on Metrics roadmap Phase 3.1 - Anomaly Detection):

- Automatically overlay expected range bands (baseline +/- N sigma) on metric charts
- Highlight data points outside the expected range with color indicators
- Click an anomaly to see correlated changes across metrics, logs, and traces
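
The "baseline +/- N sigma" band can be illustrated with a plain mean and standard deviation over a baseline window. This is only a sketch of that arithmetic with hypothetical names (`computeExpectedBand`, `findAnomalies`). The detection approach planned in the Metrics roadmap may differ, for example by accounting for seasonality.

```typescript
interface ExpectedBand {
  mean: number;
  lower: number;
  upper: number;
}

// Band = mean +/- sigmas * population standard deviation of the baseline.
function computeExpectedBand(baseline: number[], sigmas: number = 3): ExpectedBand {
  if (baseline.length === 0) {
    throw new Error("Baseline must contain at least one point");
  }
  const mean = baseline.reduce((sum, v) => sum + v, 0) / baseline.length;
  const variance =
    baseline.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / baseline.length;
  const deviation = Math.sqrt(variance);
  return {
    mean,
    lower: mean - sigmas * deviation,
    upper: mean + sigmas * deviation,
  };
}

// Return the indices of points that fall strictly outside the band.
function findAnomalies(points: number[], band: ExpectedBand): number[] {
  const indices: number[] = [];
  points.forEach((value, index) => {
    if (value < band.lower || value > band.upper) {
      indices.push(index);
    }
  });
  return indices;
}
```

For example, a baseline of `[8, 12, 8, 12]` has mean 10 and standard deviation 2, so with `sigmas = 2` the band is 6 to 14 and a point at 15 would be highlighted.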

---

## Quick Wins (Can Ship This Week)

1. **Auto-refresh** - Add a simple `setInterval` refresh with dropdown selector in toolbar
2. **Full markdown for text widget** - Replace custom formatting with a markdown renderer
3. **Legend show/hide** - Add click handler on legend items to toggle series
4. **Stacked area chart** - Simple extension of existing line chart with fill
5. **Chart zoom** - Enable brush selection on time series charts

---

## Recommended Implementation Order

1. **Quick Wins** - Auto-refresh, markdown, legend toggle, stacked area, chart zoom
2. **Phase 1.1** - More chart types (Area, Pie, Table, Gauge)
3. **Phase 1.2** - Template variables (highest-impact feature for dashboard usability)
4. **Phase 1.4** - Multiple queries per chart
5. **Phase 1.6** - Threshold lines & color coding
6. **Phase 2.1** - Log stream widget (leverages all-in-one platform)
7. **Phase 2.2** - Trace list widget
8. **Phase 2.3** - Click-to-correlate (major differentiator)
9. **Phase 2.4** - Annotations / event overlays
10. **Phase 2.5** - Alert integration
11. **Phase 3.1** - Public/shared dashboards
12. **Phase 3.2** - JSON import/export
13. **Phase 3.4** - Row/section grouping
14. **Phase 3.5** - TV/Kiosk mode
15. **Phase 3.3** - Dashboard versioning
16. **Phase 2.6** - SLO widget (depends on SLO/SLI from Metrics roadmap)
17. **Phase 4.2** - Pre-built dashboard templates
18. **Phase 4.3** - Auto-generated dashboards
19. **Phase 4.1** - AI-powered dashboard creation
20. **Phase 4.4** - Customer-facing dashboards on status pages
21. **Phase 4.5** - Dashboard-as-code SDK

## Verification

For each feature:
1. Unit tests for new widget types, template variable resolution, CSV export logic
2. Integration tests for new API endpoints (annotations, public dashboards, import/export)
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/dashboards`
4. Visual regression testing for new chart types (ensure correct rendering across browsers)
5. Performance testing: verify dashboards with 20+ widgets and auto-refresh don't cause excessive API load
6. Test template variables with edge cases: empty results, special characters, multi-value selections
7. Verify public dashboards don't leak private data

@@ -20,125 +20,41 @@ The following features have been implemented and removed from this plan:
- **Phase 2.2** - Log Analytics View (LogsAnalyticsView with timeseries, toplist, table charts; analytics endpoint)
- **Phase 2.3** - Column Customization (ColumnSelector with dynamic columns from log attributes)
- **Phase 5.8** - Store Missing OpenTelemetry Log Fields (observedTimeUnixNano, droppedAttributesCount, flags columns + ingestion + migration)
- **Phase 3.1** - Log Context / Surrounding Logs (Context tab in LogDetailsPanel, context endpoint in TelemetryAPI)
- **Phase 3.2** - Log Pipelines (LogPipeline + LogPipelineProcessor models, GrokParser/AttributeRemapper/SeverityRemapper/CategoryProcessor, pipeline execution service)
- **Phase 3.3** - Drop Filters (LogDropFilter model, LogDropFilterService, dashboard UI for configuration)
- **Phase 3.4** - Export to CSV/JSON (Export button in toolbar, LogExport utility with CSV and JSON support)
- **Phase 4.2** - Keyboard Shortcuts (j/k navigation, Enter expand/collapse, Esc close, / focus search, Ctrl+Enter apply filters, ? help)
- **Phase 4.3** - Sensitive Data Scrubbing (LogScrubRule model with PII patterns: Email, CreditCard, SSN, PhoneNumber, IPAddress, custom regex)

## Gap Analysis Summary

| Feature | OneUptime | Datadog | New Relic | Priority |
|---------|-----------|---------|-----------|----------|
| Log Patterns (ML clustering) | None | Auto-clustering + Pattern Inspector | ML clustering + anomaly | **P1** |
| Log context (surrounding logs) | None | Before/after from same host/service | Automatic via APM agent | **P2** |
| Log Pipelines (server-side processing) | None (raw storage only) | 270+ OOTB, 14+ processor types | Grok parsing, built-in rules | **P2** |
| Log-based Metrics | None | Count + Distribution, 15-month retention | Via NRQL | **P2** |
| Drop Filters (pre-storage filtering) | None | Exclusion filters with sampling | Drop rules per NRQL | **P2** |
| Export to CSV/JSON | None | CSV up to 100K rows | CSV/JSON up to 5K | **P2** |
| Keyboard shortcuts | None | Full keyboard nav | Basic | **P3** |
| Sensitive Data Scrubbing | None | Multi-layer (SaaS + agent + pipeline) | Auto-obfuscation + custom rules | **P3** |
| Data retention config UI | Referenced but no UI | Multi-tier (Standard/Flex/Archive) | Partitions + Live Archives | **P3** |
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Processing & Operations (P2) — Platform Capabilities
|
||||
|
||||
### 3.1 Log Context (Surrounding Logs)
|
||||
|
||||
**Current**: Clicking a log shows only that log's details.
|
||||
**Target**: A "Context" tab in the log detail panel showing N logs before/after from the same service.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- When a log is expanded, add a "Context" tab that queries ClickHouse:
|
||||
```sql
|
||||
(SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time < {logTime} ORDER BY time DESC LIMIT 5)
|
||||
UNION ALL
|
||||
(SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time >= {logTime} ORDER BY time ASC LIMIT 6)
|
||||
```
|
||||
- Display as a mini log list with the current log highlighted
|
||||
- Add to `LogDetailsPanel.tsx` as a tabbed section alongside the existing body/attributes view
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Server/API/TelemetryAPI.ts` (add context endpoint)
|
||||
- `Common/UI/Components/LogsViewer/components/LogDetailsPanel.tsx` (add tabs + context view)
|
||||
|
||||
### 3.2 Log Pipelines (Server-Side Processing)
|
||||
|
||||
**Current**: Logs are stored raw as received (after OTLP normalization).
|
||||
**Target**: Configurable processing pipelines that transform logs at ingest time.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create `LogPipeline` and `LogPipelineProcessor` PostgreSQL models
|
||||
- Pipeline has: name, filter (which logs it applies to), enabled flag, sort order
|
||||
- Processor types (start with these 4):
|
||||
- **Grok Parser**: Parse body text into structured attributes using Grok patterns
|
||||
- **Attribute Remapper**: Rename/copy one attribute to another
|
||||
- **Severity Remapper**: Map an attribute value to the severity field
|
||||
- **Category Processor**: Assign a new attribute value based on if/else conditions
|
||||
- Processing runs in the telemetry ingestion worker (`Telemetry/Jobs/TelemetryIngest/ProcessTelemetry.ts`) after normalization but before ClickHouse insert
|
||||
- Pipeline configuration UI under Settings > Log Pipelines
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/DatabaseModels/LogPipeline.ts` (new)
|
||||
- `Common/Models/DatabaseModels/LogPipelineProcessor.ts` (new)
|
||||
- `Telemetry/Services/LogPipelineService.ts` (new - pipeline execution engine)
|
||||
- `Telemetry/Services/OtelLogsIngestService.ts` (hook pipeline execution before insert)
|
||||
- Dashboard: new Settings page for pipeline configuration
|
||||
|
||||
### 3.3 Drop Filters (Pre-Storage Filtering)
|
||||
|
||||
**Current**: All ingested logs are stored.
|
||||
**Target**: Configurable rules to drop or sample logs before storage.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create `LogDropFilter` PostgreSQL model: name, filter query, action (drop or sample at N%), enabled
|
||||
- Evaluate drop filters in the ingestion pipeline before ClickHouse insert
|
||||
- UI under Settings > Log Configuration > Drop Filters
|
||||
- Show estimated volume savings based on recent log volume
|
||||
|
||||
### 3.4 Export to CSV/JSON
|
||||
|
||||
**Current**: No export capability.
|
||||
**Target**: Download current filtered log results as CSV or JSON.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add "Export" button in the toolbar
|
||||
- Client-side: serialize current page of logs to CSV/JSON and trigger browser download
|
||||
- Server-side (for large exports): new endpoint that streams results to a downloadable file (up to 10K rows)
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/UI/Components/LogsViewer/components/LogsViewerToolbar.tsx` (add export button)
|
||||
- `Common/UI/Utils/LogExport.ts` (new - CSV/JSON serialization)
|
||||
- `Common/Server/API/TelemetryAPI.ts` (add export endpoint for large exports)
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Advanced Features (P3) — Differentiation
|
||||
## Remaining Features
|
||||
|
||||
### Log Patterns (ML Clustering) — P1
|
||||
|
||||
### 4.2 Keyboard Shortcuts
|
||||
**Current**: No pattern detection.
|
||||
**Target**: Auto-cluster similar log messages and surface pattern groups with anomaly detection.
|
||||
|
||||
- `j`/`k` to navigate between log rows
|
||||
- `Enter` to expand/collapse selected log
|
||||
- `Escape` to close detail panel
|
||||
- `/` to focus search bar
|
||||
- `Ctrl+Enter` to apply filters
|
||||
### Log-based Metrics — P2
|
||||
|
||||
### 4.3 Sensitive Data Scrubbing
|
||||
**Current**: No log-to-metric conversion.
|
||||
**Target**: Create count/distribution metrics from log queries with long-term retention.
|
||||
|
||||
- Auto-detect common PII patterns (credit cards, SSNs, emails) at ingest time
|
||||
- Configurable scrubbing rules: mask, hash, or redact
|
||||
- UI under Settings > Data Privacy
|
||||
### Data Retention Config UI — P3
|
||||
|
||||
---
|
||||
|
||||
## Recommended Implementation Order
|
||||
|
||||
1. **Phase 3.4** - Export CSV/JSON (small effort, table-stakes feature)
|
||||
2. **Phase 3.1** - Log Context (moderate effort, high debugging value)
|
||||
3. **Phase 3.2** - Log Pipelines (large effort, platform capability)
|
||||
4. **Phase 3.3** - Drop Filters (moderate effort, cost optimization)
|
||||
5. **Phase 4.x** - Patterns, Shortcuts, Data Scrubbing (future)
|
||||
**Current**: `retainTelemetryDataForDays` exists on the service model and is displayed in usage history, but there is no dedicated UI to configure retention settings.
|
||||
**Target**: Settings page for configuring per-service log data retention.
|
||||
|
||||
## Phase 5: ClickHouse Storage & Query Optimizations (P0) — Performance Foundation
|
||||
|
||||
@@ -205,18 +121,22 @@ These optimizations address fundamental storage and indexing gaps in the telemet
|
||||
| 5.3 DateTime64 time column | Sub-second log ordering | Correctness fix | Medium |
|
||||
| 5.7 Histogram projections | Histogram and severity aggregation | 5-10x | Medium |
|
||||
|
||||
### 5.x Recommended Remaining Order
|
||||
---
|
||||
|
||||
## Recommended Remaining Implementation Order
|
||||
|
||||
1. **5.3** — DateTime64 upgrade (correctness)
|
||||
2. **5.7** — Projections (performance polish)
|
||||
3. **Log-based Metrics** (platform capability)
|
||||
4. **Data Retention Config UI** (operational)
|
||||
5. **Log Patterns / ML Clustering** (advanced, larger effort)
|
||||
|
||||
---
|
||||
|
||||
## Verification
|
||||
|
||||
For each feature:
|
||||
1. Unit tests for new parsers/utilities (LogQueryParser, CSV export, etc.)
|
||||
2. Integration tests for new API endpoints (histogram, facets, analytics, context)
|
||||
For each remaining feature:
|
||||
1. Unit tests for new utilities
|
||||
2. Integration tests for new API endpoints
|
||||
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/logs`
|
||||
4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries
|
||||
5. Verify real-time/live mode still works correctly with new UI components
|
||||
|
||||
468
Internal/Roadmap/Metrics.md
Normal file
@@ -0,0 +1,468 @@
|
||||
# Plan: Bring OneUptime Metrics to Industry Parity and Beyond
|
||||
|
||||
## Context
|
||||
|
||||
OneUptime's metrics implementation provides OTLP ingestion (HTTP and gRPC), ClickHouse storage with support for Gauge, Sum, Histogram, and ExponentialHistogram metric types, basic aggregations (Avg, Sum, Min, Max, Count), single-attribute GROUP BY, formula support for calculated metrics, threshold-based metric monitors, and a metric explorer with line/bar charts. Auto-discovery creates MetricType metadata (name, description, unit) on first ingest. Data retention is configured per service via TTL (default 15 days).
|
||||
|
||||
This plan identifies the remaining gaps vs Datadog and New Relic, and proposes a phased implementation to close them and build a best-in-class metrics product.
|
||||
|
||||
## Completed
|
||||
|
||||
The following features have been implemented:
|
||||
- **OTLP Ingestion** - HTTP and gRPC metric ingestion with async queue-based batch processing
|
||||
- **Metric Types** - Gauge, Sum, Histogram, ExponentialHistogram support
|
||||
- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
|
||||
- **Aggregations** - Avg, Sum, Min, Max, Count
|
||||
- **Single-Attribute GROUP BY** - Group by one attribute at a time
|
||||
- **Formulas** - Calculated metrics using aliases (e.g., `a / b * 100`)
|
||||
- **Metric Explorer** - Time range selection, multiple queries with aliases, URL state persistence
|
||||
- **Threshold-Based Monitors** - Static threshold alerting on aggregated metric values
|
||||
- **MetricType Auto-Discovery** - Name, description, unit captured on first ingest
|
||||
- **Attribute Storage** - Full JSON with extracted `attributeKeys` array for fast enumeration
|
||||
- **Skip Indexes** - BloomFilter index on `name`, Set index on `serviceType`
|
||||
|
||||
## Gap Analysis Summary
|
||||
|
||||
| Feature | OneUptime | Datadog | New Relic | Priority |
|
||||
|---------|-----------|---------|-----------|----------|
|
||||
| Percentile aggregations (p50/p75/p90/p95/p99) | None | DDSketch distributions | NRQL percentile() | **P0** |
|
||||
| Rate/derivative calculations | None | Native Rate type + .as_rate() | rate() NRQL function | **P0** |
|
||||
| Multi-attribute GROUP BY | Single attribute only | Multiple tags | FACET on multiple attrs | **P0** |
|
||||
| Rollup/downsampling for long-range queries | None (raw data, 15-day TTL) | Automatic tiered rollups | 30-day raw + 13-month rollups | **P0** |
|
||||
| Anomaly detection | Static thresholds only | Watchdog + anomaly monitors | Anomaly detection + sigma bands | **P1** |
|
||||
| SLO/SLI tracking | None | Metric-based + Time Slice SLOs | One-click setup + error budgets | **P1** |
|
||||
| Heatmap visualization | None | Purpose-built for distributions | Built-in chart type | **P1** |
|
||||
| Time-over-time comparison | None | Yes | COMPARE WITH in NRQL | **P1** |
|
||||
| Summary metric type | Not supported | N/A (uses Distribution) | Yes | **P1** |
|
||||
| Query language | Form-based UI only | Graphing editor + NLQ | Full NRQL language | **P2** |
|
||||
| Predictive alerting | None | Watchdog forecasting | GA predictive alerting | **P2** |
|
||||
| Metric correlations | None | Auto-surfaces related metrics | Applied Intelligence correlation | **P2** |
|
||||
| Golden Signals dashboards | None | Available via APM | Pre-built with default alerts | **P2** |
|
||||
| Cardinality management | None | Metrics Without Limits + Explorer | Budget system + pruning rules | **P2** |
|
||||
| More chart types | Line and bar only | 12+ types | 10+ types with conditional coloring | **P2** |
|
||||
| Dashboard templates | None | Pre-built integration dashboards | Pre-built entity dashboards | **P2** |
|
||||
| Units on charts | Stored but not rendered | Auto-formatted by unit type | Y-axis unit customization | **P2** |
|
||||
| Natural language querying | None | NLQ translates English to queries | None | **P3** |
|
||||
| Metric cost/volume management | None | Cost attribution dashboards | Volume dashboards | **P3** |
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Foundation (P0) — Close Critical Gaps
|
||||
|
||||
These are table-stakes features without which the metrics product is fundamentally limited.
|
||||
|
||||
### 1.1 Percentile Aggregations (p50, p75, p90, p95, p99)
|
||||
|
||||
**Current**: Only Avg, Sum, Min, Max, Count aggregations.
|
||||
**Target**: Support percentile aggregations on all metric data, especially histograms and distributions.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add `P50`, `P75`, `P90`, `P95`, `P99` to the `AggregationType` enum
|
||||
- For raw metric values: use ClickHouse `quantile(0.50)(value)`, `quantile(0.95)(value)`, etc.
|
||||
- For histogram data (with `bucketCounts` and `explicitBounds`): implement approximate percentile calculation from bucket data using linear interpolation between bucket boundaries
|
||||
- Update the metric query builder to include percentile options in the aggregation dropdown
|
||||
- Update chart rendering to display percentile series
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Types/BaseDatabase/AggregationType.ts` (add P50, P75, P90, P95, P99)
|
||||
- `Common/Server/Services/MetricService.ts` (generate quantile SQL)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (add to dropdown)
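The bucket interpolation step described above could look roughly like the sketch below. It is an illustration under assumptions, not the planned `MetricService` code: `histogramQuantile` is a hypothetical helper, buckets are assumed to follow the OTLP layout (`bucketCounts` has one more entry than `explicitBounds`), and the first bucket is assumed to start at 0.

```typescript
// Approximate a quantile (0 < q <= 1) from histogram buckets using
// linear interpolation inside the bucket that contains the target rank.
function histogramQuantile(
  q: number,
  explicitBounds: number[],
  bucketCounts: number[],
): number | null {
  const total = bucketCounts.reduce((sum, c) => sum + c, 0);
  if (total === 0 || explicitBounds.length === 0) return null;

  const targetRank = q * total;
  const lastBound = explicitBounds[explicitBounds.length - 1] ?? 0;
  let cumulative = 0;

  for (let i = 0; i < bucketCounts.length; i++) {
    const count = bucketCounts[i] ?? 0;
    if (count > 0 && cumulative + count >= targetRank) {
      // Overflow bucket has no upper bound: report its lower bound.
      if (i >= explicitBounds.length) return lastBound;
      const upper = explicitBounds[i] ?? lastBound;
      // Assumption: the first bucket starts at 0 (typical for latencies and sizes).
      const lower = i === 0 ? Math.min(0, upper) : explicitBounds[i - 1] ?? 0;
      const fraction = (targetRank - cumulative) / count;
      return lower + (upper - lower) * fraction;
    }
    cumulative += count;
  }
  return null;
}
```

The result is only as precise as the bucket boundaries, so percentiles that fall in the overflow bucket are clamped to the highest explicit bound.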
|
||||
|
||||
### 1.2 Rate/Derivative Calculations
|
||||
|
||||
**Current**: No rate or delta computation. Raw cumulative counters are meaningless without rate calculation.
|
||||
**Target**: Compute per-second rates and deltas from counter/sum metrics.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add `Rate` and `Delta` as aggregation options
|
||||
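- For cumulative sums: compute `(value[t] - value[t-1]) / (time[t] - time[t-1])` per series using ClickHouse `runningDifference()`; rows must be ordered by time within each series, since the function only compares neighboring rows inside a single data block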
|
||||
- Handle counter resets (when value decreases, treat as reset and skip that interval)
|
||||
- For delta temporality sums: rate is simply `value / interval_seconds`
|
||||
- Display rate with appropriate units (e.g., "req/s", "bytes/s")
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Types/BaseDatabase/AggregationType.ts` (add Rate, Delta)
|
||||
- `Common/Server/Services/MetricService.ts` (generate rate SQL with runningDifference)
|
||||
- `Common/Types/Metrics/MetricsQuery.ts` (support rate in query config)
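As a reference for the reset-handling rule above, here is a minimal in-memory sketch of the rate computation for a single cumulative series. The names `CounterSample`, `RatePoint`, and `cumulativeToRate` are hypothetical; the real implementation is expected to run in ClickHouse SQL rather than in application code.

```typescript
interface CounterSample {
  timeMs: number; // sample timestamp in milliseconds
  value: number; // cumulative counter value
}

interface RatePoint {
  timeMs: number;
  perSecond: number;
}

// Convert cumulative counter samples into per-second rates.
function cumulativeToRate(samples: CounterSample[]): RatePoint[] {
  const sorted = [...samples].sort((a, b) => a.timeMs - b.timeMs);
  const rates: RatePoint[] = [];
  let previous: CounterSample | undefined;

  for (const current of sorted) {
    if (previous !== undefined) {
      const elapsedSeconds = (current.timeMs - previous.timeMs) / 1000;
      const delta = current.value - previous.value;
      // A decrease means the counter was reset: skip that interval.
      if (elapsedSeconds > 0 && delta >= 0) {
        rates.push({ timeMs: current.timeMs, perSecond: delta / elapsedSeconds });
      }
    }
    previous = current;
  }
  return rates;
}
```

Skipping the reset interval loses one data point but avoids emitting a large negative spike.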
|
||||
|
||||
### 1.3 Multi-Attribute GROUP BY
|
||||
|
||||
**Current**: Single `groupByAttribute: string` field.
|
||||
**Target**: Group by multiple attributes simultaneously (e.g., by region AND status_code).
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Change `groupByAttribute` from `string` to `string[]` in `MetricsQuery`
|
||||
- Update ClickHouse query generation to GROUP BY multiple extracted JSON attributes
|
||||
- Update chart rendering to handle multi-dimensional grouping (composite legend labels)
|
||||
- Update the UI to allow selecting multiple group-by attributes
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Types/Metrics/MetricsQuery.ts` (change type)
|
||||
- `Common/Server/Services/MetricService.ts` (update query generation)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (multi-select UI)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (composite legends)
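To illustrate the composite legend labels mentioned above, the following sketch groups rows by several attributes in memory. It assumes a simplified row shape (`MetricRow`) and hypothetical helpers; in the plan itself the grouping happens in the ClickHouse query, and only the label building would live in the chart layer.

```typescript
type Attributes = { [key: string]: string | number | boolean | undefined };

interface MetricRow {
  value: number;
  attributes: Attributes;
}

// Build a legend label such as "region=us, status_code=200".
function compositeLabel(attributes: Attributes, groupBy: string[]): string {
  return groupBy
    .map((key) => `${key}=${String(attributes[key] ?? "unknown")}`)
    .join(", ");
}

// Sum values per combination of the selected group-by attributes.
function sumByAttributes(rows: MetricRow[], groupBy: string[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const label = compositeLabel(row.attributes, groupBy);
    totals.set(label, (totals.get(label) ?? 0) + row.value);
  }
  return totals;
}
```

Rows missing one of the attributes fall into an `unknown` group rather than being dropped, which is one possible choice and not something the plan specifies.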
|
||||
|
||||
### 1.4 Rollups / Downsampling
|
||||
|
||||
**Current**: Raw data only with 15-day default TTL. No rollups means long-range queries are slow and historical analysis is limited.
|
||||
**Target**: Pre-aggregated rollups at multiple resolutions with extended retention.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create ClickHouse materialized views for automatic rollup:
|
||||
```
|
||||
Raw Data (1s resolution) -> 15-day retention
|
||||
|-> Materialized View -> 1-min rollups -> 90-day retention
|
||||
|-> Materialized View -> 1-hour rollups -> 13-month retention
|
||||
|-> Materialized View -> 1-day rollups -> 3-year retention
|
||||
```
|
||||
- Each rollup table stores: min, max, sum, count, avg, and quantile sketches per metric name + attributes
|
||||
- Route queries based on time range:
|
||||
- < 6 hours: raw data
|
||||
- 6 hours - 7 days: 1-min rollups
|
||||
- 7 days - 30 days: 1-hour rollups
|
||||
- 30+ days: 1-day rollups
|
||||
- Automatic query routing in the metric service layer
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/MetricRollup1Min.ts` (new)
|
||||
- `Common/Models/AnalyticsModels/MetricRollup1Hour.ts` (new)
|
||||
- `Common/Models/AnalyticsModels/MetricRollup1Day.ts` (new)
|
||||
- `Common/Server/Services/MetricService.ts` (query routing by time range)
|
||||
- `Worker/DataMigrations/` (new migration to create materialized views)
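The query routing rule could be expressed as a small pure function, sketched below. The source names (`rollup_1min`, etc.) and `selectMetricSource` are hypothetical, and the handling of exact boundaries (for example, a range of exactly 30 days) is an assumption because the plan does not define it.

```typescript
type MetricSource = "raw" | "rollup_1min" | "rollup_1hour" | "rollup_1day";

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Pick the table to query from the length of the requested time range.
// Assumption: each upper boundary is exclusive.
function selectMetricSource(start: Date, end: Date): MetricSource {
  const rangeMs = end.getTime() - start.getTime();
  if (rangeMs < 6 * HOUR_MS) return "raw";
  if (rangeMs < 7 * DAY_MS) return "rollup_1min";
  if (rangeMs < 30 * DAY_MS) return "rollup_1hour";
  return "rollup_1day";
}
```

Keeping the routing in one function makes the thresholds easy to tune once real query latencies are known.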
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Visualization & UX (P1-P2) — Match Industry Standard
|
||||
|
||||
### 2.1 More Chart Types
|
||||
|
||||
**Current**: Line and bar charts only.
|
||||
**Target**: Add Heatmap, Stacked Area, Pie/Donut, Scatter, Single-Value Billboard, and Gauge.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- **Heatmap**: Essential for histogram/distribution data. Use a heatmap library that renders time on X-axis, bucket values on Y-axis, and color intensity for count
|
||||
- **Stacked Area**: Extension of existing line chart with fill and stacking
|
||||
- **Pie/Donut**: For showing proportional breakdowns (e.g., request distribution by service)
|
||||
- **Scatter**: For correlation analysis between two metrics
|
||||
- **Billboard**: Large single-value display with configurable thresholds for color coding (green/yellow/red)
|
||||
- **Gauge**: Circular gauge showing a value against a min/max range
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Types/Dashboard/Chart/ChartType.ts` (add new chart types)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render new chart types)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricCharts.tsx` (chart type selector)
|
||||
|
||||
### 2.2 Time-Over-Time Comparison
|
||||
|
||||
**Current**: No comparison capability.
|
||||
**Target**: Overlay current metric data with data from a previous period (1h ago, 1d ago, 1w ago).
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add a "Compare with" dropdown in the metric explorer toolbar (options: 1 hour ago, 1 day ago, 1 week ago, custom)
|
||||
- Execute the same query twice with shifted time ranges
|
||||
- Render the comparison series as a dashed/translucent overlay on the same chart
|
||||
- Show the delta (absolute and percentage) in tooltips
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricExplorer.tsx` (add compare dropdown)
|
||||
- `Common/Types/Metrics/MetricsQuery.ts` (add compareWith field)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render comparison series)
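A rough sketch of the "same query, shifted range" approach follows. `shiftRange` and `alignComparison` are hypothetical helpers, and the sketch assumes both series share the same bucket timestamps once the offset is applied.

```typescript
interface TimeRange {
  start: Date;
  end: Date;
}

interface SeriesPoint {
  time: Date;
  value: number;
}

interface ComparedPoint {
  time: Date;
  current: number;
  previous: number;
  absoluteDelta: number;
  percentDelta: number | null; // null when the previous value is 0
}

// Move a time range back by the comparison offset (e.g. one day).
function shiftRange(range: TimeRange, offsetMs: number): TimeRange {
  return {
    start: new Date(range.start.getTime() - offsetMs),
    end: new Date(range.end.getTime() - offsetMs),
  };
}

// Pair each current point with the previous-period point at the same relative time.
function alignComparison(
  current: SeriesPoint[],
  previous: SeriesPoint[],
  offsetMs: number,
): ComparedPoint[] {
  const previousByShiftedTime = new Map<number, number>();
  for (const point of previous) {
    previousByShiftedTime.set(point.time.getTime() + offsetMs, point.value);
  }
  const compared: ComparedPoint[] = [];
  for (const point of current) {
    const previousValue = previousByShiftedTime.get(point.time.getTime());
    if (previousValue === undefined) continue;
    const absoluteDelta = point.value - previousValue;
    compared.push({
      time: point.time,
      current: point.value,
      previous: previousValue,
      absoluteDelta,
      percentDelta: previousValue === 0 ? null : (absoluteDelta / previousValue) * 100,
    });
  }
  return compared;
}
```

The aligned points carry both deltas, so the tooltip can show the absolute and percentage change without further lookups.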
|
||||
|
||||
### 2.3 Render Metric Units on Charts
|
||||
|
||||
**Current**: Units stored in MetricType but not rendered on chart axes.
|
||||
**Target**: Display units on Y-axis labels and tooltips with smart formatting.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Pass `MetricType.unit` through to chart rendering
|
||||
- Implement unit-aware formatting:
|
||||
- Bytes: auto-convert to KB/MB/GB/TB
|
||||
- Duration: auto-convert ns/us/ms/s
|
||||
- Percentage: append `%`
|
||||
- Rate: append `/s`
|
||||
- Display formatted unit on Y-axis label and in tooltip values
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (Y-axis unit formatting)
|
||||
- `Common/Utils/Metrics/UnitFormatter.ts` (new - unit formatting logic)
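The formatting rules above might be sketched as follows. This is not the planned `UnitFormatter.ts`; `scaleUnit` and `formatMetricValue` are hypothetical, and the unit strings (`By` for bytes, `ns`/`us`/`ms`/`s` for durations) are assumed to follow the UCUM-style codes commonly seen in OpenTelemetry metrics.

```typescript
const BYTE_UNITS = ["B", "KB", "MB", "GB", "TB"];
const DURATION_UNITS = ["ns", "us", "ms", "s"];

// Divide by `step` until the value fits the largest suitable unit.
function scaleUnit(value: number, units: string[], step: number, startIndex = 0): string {
  let scaled = value;
  let index = startIndex;
  while (Math.abs(scaled) >= step && index < units.length - 1) {
    scaled /= step;
    index++;
  }
  return `${Number(scaled.toFixed(2))} ${units[index]}`;
}

// Format a value for axis labels and tooltips based on its unit.
function formatMetricValue(value: number, unit: string): string {
  if (unit === "By") return scaleUnit(value, BYTE_UNITS, 1024);
  const durationIndex = DURATION_UNITS.indexOf(unit);
  if (durationIndex >= 0) return scaleUnit(value, DURATION_UNITS, 1000, durationIndex);
  const rounded = Number(value.toFixed(2));
  if (unit === "%") return `${rounded}%`;
  // Other units (including rates such as "req/s") are appended as-is.
  return unit === "" ? String(rounded) : `${rounded} ${unit}`;
}
```

This sketch only scales upward (for example, `ms` to `s`); scaling small values down to a finer unit would need an extra branch.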
|
||||
|
||||
### 2.4 Dashboard Templates
|
||||
|
||||
**Current**: No templates.
|
||||
**Target**: Pre-built dashboards for common scenarios that auto-populate based on detected metrics.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create MetricsViewConfig templates for:
|
||||
- HTTP Service Health (request rate, error rate, latency percentiles)
|
||||
- Database Performance (query duration, connection pool, error rate)
|
||||
- Kubernetes Metrics (CPU, memory, pod restarts, network)
|
||||
- Host Metrics (CPU, memory, disk, network)
|
||||
- Runtime Metrics (GC, heap, threads - per language)
|
||||
- Auto-detect which templates are relevant based on ingested metric names
|
||||
- "One-click apply" creates a dashboard from the template
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Types/Metrics/DashboardTemplates/` (new directory with template definitions)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (new - template gallery)
|
||||
|
||||
### 2.5 Summary Metric Type Support
|
||||
|
||||
**Current**: Summary type not supported.
|
||||
**Target**: Ingest and store Summary metrics from OTLP.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add `Summary` to the metric point type enum
|
||||
- Store quantile values from summary data points
|
||||
- Display summary quantiles in the metric explorer
|
||||
|
||||
**Files to modify**:
|
||||
- `Telemetry/Services/OtelMetricsIngestService.ts` (handle summary type)
|
||||
- `Common/Models/AnalyticsModels/Metric.ts` (add summary-specific columns if needed)
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Alerting & Intelligence (P1-P2) — Smart Monitoring
|
||||
|
||||
### 3.1 Anomaly Detection
|
||||
|
||||
**Current**: Static threshold alerting only.
|
||||
**Target**: Detect metrics deviating from expected patterns using statistical methods.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Start with rolling mean + N standard deviations (configurable sensitivity: low/medium/high)
|
||||
- Account for daily/weekly seasonality by comparing to same-time-last-week baselines
|
||||
- Store baselines in ClickHouse (periodic computation job, hourly)
|
||||
- Baseline table: metric name, service, hour_of_week, mean, stddev
|
||||
- On each evaluation: compare current value to baseline, alert if deviation > configured sigma
|
||||
- Surface anomalies as visual highlights on metric charts (shaded band showing expected range)
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/MetricBaseline.ts` (new - baseline storage)
|
||||
- `Worker/Jobs/Metrics/ComputeMetricBaselines.ts` (new - periodic baseline computation)
|
||||
- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add anomaly detection)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render anomaly bands)
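The baseline comparison can be sketched as below. The sigma values per sensitivity level are assumptions chosen for illustration, and `hourOfWeek`, `computeBaseline`, and `isAnomalous` are hypothetical names; in the plan the baselines are computed by a periodic job and stored in ClickHouse.

```typescript
interface Baseline {
  mean: number;
  stddev: number;
}

type Sensitivity = "low" | "medium" | "high";

// Assumption: illustrative sigma thresholds, not values from the plan.
const SIGMA_BY_SENSITIVITY: Record<Sensitivity, number> = { low: 4, medium: 3, high: 2 };

// Bucket index 0..167 used to capture daily and weekly seasonality (UTC).
function hourOfWeek(time: Date): number {
  return time.getUTCDay() * 24 + time.getUTCHours();
}

// Mean and sample standard deviation of historical values for one bucket.
function computeBaseline(values: number[]): Baseline | null {
  if (values.length < 2) return null;
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / (values.length - 1);
  return { mean, stddev: Math.sqrt(variance) };
}

// Flag a value that deviates from the baseline by more than N sigma.
function isAnomalous(value: number, baseline: Baseline, sensitivity: Sensitivity): boolean {
  const sigma = SIGMA_BY_SENSITIVITY[sensitivity];
  return Math.abs(value - baseline.mean) > sigma * baseline.stddev;
}
```

The same `mean` and `stddev` values can drive the shaded expected-range band on charts (`mean ± sigma * stddev`).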
|
||||
|
||||
### 3.2 SLO/SLI Tracking
|
||||
|
||||
**Current**: No SLO support.
|
||||
**Target**: Define Service Level Objectives based on metric queries, track attainment over rolling windows, calculate error budgets.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create `SLO` PostgreSQL model:
|
||||
- Name, description, target percentage (e.g., 99.9%)
|
||||
- SLI definition: good events query / total events query (both metric queries)
|
||||
- Time window: 7-day, 28-day, or 30-day rolling
|
||||
- Alert thresholds: error budget remaining %, burn rate
|
||||
- SLO dashboard page showing:
|
||||
- Current attainment vs target (e.g., 99.85% / 99.9%)
|
||||
- Error budget remaining (absolute and percentage)
|
||||
- Burn rate chart (current burn rate vs sustainable burn rate)
|
||||
- SLI time series chart
|
||||
- Alert when error budget drops below threshold or burn rate exceeds sustainable rate
|
||||
- Integrate with existing monitor/incident system
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/DatabaseModels/SLO.ts` (new)
|
||||
- `Common/Server/Services/SLOService.ts` (new - SLI computation, budget calculation)
|
||||
- `Worker/Jobs/SLO/EvaluateSLOs.ts` (new - periodic SLO evaluation)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/SLO/` (new - SLO list, detail, creation pages)
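For reference, the attainment, error budget, and burn rate figures relate to each other as in the sketch below. `evaluateSlo` is a hypothetical helper, and it assumes the good/total event counts cover the whole SLO window, so that budget consumed equals the window-average burn rate.

```typescript
interface SloStatus {
  attainmentPercent: number; // e.g. 99.85
  errorBudgetRemainingPercent: number; // 100 = untouched, <= 0 = exhausted
  burnRate: number; // 1 = consuming budget exactly at the sustainable rate
}

// Evaluate an SLO from good/total event counts over the SLO window.
function evaluateSlo(
  goodEvents: number,
  totalEvents: number,
  targetPercent: number,
): SloStatus | null {
  if (totalEvents <= 0 || targetPercent >= 100) return null;
  const errorRate = (totalEvents - goodEvents) / totalEvents;
  const allowedErrorRate = 1 - targetPercent / 100;
  const burnRate = errorRate / allowedErrorRate;
  return {
    attainmentPercent: (goodEvents / totalEvents) * 100,
    errorBudgetRemainingPercent: (1 - burnRate) * 100,
    burnRate,
  };
}
```

For example, with a 99% target, 9,990 good events out of 10,000 gives a burn rate of about 0.1 and roughly 90% of the error budget remaining.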
|
||||
|
||||
### 3.3 Metric Correlations
|
||||
|
||||
**Current**: No correlation capability.
|
||||
**Target**: When an anomaly is detected, automatically identify other metrics that changed around the same time.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- When an anomaly is detected on a metric, query all metrics for the same service/project in the surrounding time window (e.g., +/- 30 minutes)
|
||||
- Compute Pearson correlation coefficient between the anomalous metric and each candidate
|
||||
- Rank by absolute correlation value
|
||||
- Surface top 5-10 correlated metrics in the alert/incident view
|
||||
- Show correlation chart: anomalous metric overlaid with top correlated metrics
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Server/Services/MetricCorrelationService.ts` (new)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/CorrelatedMetrics.tsx` (new - correlation view)
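The ranking step could be sketched as follows, assuming all series have already been resampled onto the same time buckets. `pearson` and `rankCorrelated` are hypothetical names, not existing service methods.

```typescript
// Pearson correlation coefficient of two equally spaced series.
// Returns null when it is undefined (too few points or a constant series).
function pearson(x: number[], y: number[]): number | null {
  const n = Math.min(x.length, y.length);
  if (n < 2) return null;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += x[i] ?? 0;
    sumY += y[i] ?? 0;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = (x[i] ?? 0) - meanX;
    const dy = (y[i] ?? 0) - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  if (varianceX === 0 || varianceY === 0) return null;
  return covariance / Math.sqrt(varianceX * varianceY);
}

// Rank candidate metrics by absolute correlation with the anomalous metric.
function rankCorrelated(
  target: number[],
  candidates: Record<string, number[]>,
  topN: number,
): { name: string; correlation: number }[] {
  return Object.entries(candidates)
    .map(([name, series]) => ({ name, correlation: pearson(target, series) }))
    .filter((c): c is { name: string; correlation: number } => c.correlation !== null)
    .sort((a, b) => Math.abs(b.correlation) - Math.abs(a.correlation))
    .slice(0, topN);
}
```

Ranking by absolute value keeps strongly inverse relationships (for example, throughput dropping while latency rises) near the top.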
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Scale & Power Features (P2-P3) — Differentiation
|
||||
|
||||
### 4.1 Cardinality Management
|
||||
|
||||
**Current**: No cardinality visibility or controls.
|
||||
**Target**: Track unique series count, alert on spikes, allow attribute allowlist/blocklist.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Track unique series count per metric name (via periodic ClickHouse `uniq()` queries)
|
||||
- Store in a dedicated cardinality tracking table
|
||||
- Dashboard showing: top metrics by cardinality, cardinality trend over time, per-attribute breakdown
|
||||
- Allow configuring attribute allowlists/blocklists per metric (applied at ingest time)
|
||||
- Alert when cardinality exceeds configured budget
|
||||
|
||||
**Files to modify**:
|
||||
- `Worker/Jobs/Metrics/TrackMetricCardinality.ts` (new - periodic cardinality computation)
|
||||
- `Common/Models/DatabaseModels/MetricCardinalityConfig.ts` (new - allowlist/blocklist)
|
||||
- `Telemetry/Services/OtelMetricsIngestService.ts` (apply attribute filtering)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Settings/MetricCardinality.tsx` (new - cardinality dashboard)
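To show how an allowlist/blocklist affects series count, here is a small in-memory sketch. `AttributeFilter`, `filterAttributes`, and `countUniqueSeries` are hypothetical; the plan computes cardinality with ClickHouse `uniq()` queries and applies filtering in the ingest service.

```typescript
type Attributes = Record<string, string>;

interface AttributeFilter {
  allowlist?: string[]; // when set, only these keys are kept
  blocklist?: string[]; // these keys are always dropped
}

// Drop attributes that are blocked or not explicitly allowed.
function filterAttributes(attributes: Attributes, filter: AttributeFilter): Attributes {
  const allowed = filter.allowlist ? new Set(filter.allowlist) : null;
  const blocked = new Set(filter.blocklist ?? []);
  const kept: Attributes = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (blocked.has(key)) continue;
    if (allowed !== null && !allowed.has(key)) continue;
    kept[key] = value;
  }
  return kept;
}

// A series is identified by its sorted key=value pairs.
function seriesKey(attributes: Attributes): string {
  return Object.keys(attributes)
    .sort()
    .map((key) => `${key}=${attributes[key]}`)
    .join(",");
}

function countUniqueSeries(points: Attributes[]): number {
  return new Set(points.map(seriesKey)).size;
}
```

Dropping a single high-cardinality attribute such as a user ID typically collapses many series into a few, which is the effect the blocklist is meant to achieve.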
|
||||
|
||||
### 4.2 Query Language
|
||||
|
||||
**Current**: Form-based UI only.
|
||||
**Target**: Text-based metrics query language inspired by PromQL/NRQL for advanced users.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Define a grammar supporting:
|
||||
```
|
||||
metric_name{attribute="value", attribute2=~"regex"}
|
||||
| aggregation(duration)
|
||||
by (attribute1, attribute2)
|
||||
```
|
||||
- Build a parser that translates to the existing ClickHouse query builder
|
||||
- Offer both UI builder and text modes (toggle like New Relic's basic/advanced)
|
||||
- Syntax highlighting and autocomplete in the text editor (metric names, attribute keys, functions)
|
||||
- Functions: `rate()`, `delta()`, `avg()`, `sum()`, `min()`, `max()`, `p50()`, `p95()`, `p99()`, `count()`, `topk()`, `bottomk()`
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Utils/Metrics/MetricsQueryLanguage.ts` (new - parser and translator)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryEditor.tsx` (new - text editor with autocomplete)
|
||||
|
||||
### 4.3 Golden Signals Dashboards
|
||||
|
||||
**Current**: No auto-generated dashboards.
|
||||
**Target**: Auto-generated dashboards showing Latency, Traffic, Errors, Saturation for each service.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Detect common OpenTelemetry metric names per service:
|
||||
- Latency: `http.server.duration`, `http.server.request.duration`
|
||||
- Traffic: `http.server.request.count`, `http.server.active_requests`
|
||||
- Errors: `http.server.request.count` where status_code >= 500
|
||||
- Saturation: `process.runtime.*.memory`, `system.cpu.utilization`
|
||||
- Auto-create a Golden Signals dashboard for each service with detected metrics
|
||||
- Include default alert thresholds
|
||||
|
||||
**Files to modify**:
|
||||
- `Worker/Jobs/Metrics/GenerateGoldenSignalsDashboards.ts` (new)
|
||||
- `Common/Utils/Metrics/GoldenSignalsDetector.ts` (new - metric name pattern matching)
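The name matching could start as simply as the sketch below. The patterns mirror the metric names listed above; `SIGNAL_PATTERNS` and `detectGoldenSignals` are hypothetical, and the error signal reuses the request count metric because errors are derived from it by filtering on status code.

```typescript
type GoldenSignal = "latency" | "traffic" | "errors" | "saturation";

const SIGNAL_PATTERNS: Record<GoldenSignal, RegExp[]> = {
  latency: [/^http\.server\.(request\.)?duration$/],
  traffic: [/^http\.server\.request\.count$/, /^http\.server\.active_requests$/],
  // Errors come from the request count filtered by status_code >= 500.
  errors: [/^http\.server\.request\.count$/],
  saturation: [/^process\.runtime\..+\.memory/, /^system\.cpu\.utilization$/],
};

// Map a service's ingested metric names to the signals they can back.
function detectGoldenSignals(metricNames: string[]): Record<GoldenSignal, string[]> {
  const detected: Record<GoldenSignal, string[]> = {
    latency: [],
    traffic: [],
    errors: [],
    saturation: [],
  };
  for (const signal of Object.keys(SIGNAL_PATTERNS) as GoldenSignal[]) {
    detected[signal] = metricNames.filter((metricName) =>
      SIGNAL_PATTERNS[signal].some((pattern) => pattern.test(metricName)),
    );
  }
  return detected;
}
```

A dashboard would only be generated for signals whose list is non-empty.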
|
||||
|
||||
### 4.4 Predictive Alerting
|
||||
|
||||
**Current**: No forecasting capability.
|
||||
**Target**: Forecast metric values and alert before thresholds are breached.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Use linear regression or Holt-Winters on recent data to project forward
|
||||
- Alert if projected value crosses threshold within configurable forecast window (e.g., "disk full in 4 hours")
|
||||
- Particularly valuable for capacity planning metrics (disk, memory, connection pools)
|
||||
- Show forecast as a dashed line extension on metric charts
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add predictive evaluation)
|
||||
- `Common/Utils/Metrics/Forecasting.ts` (new - regression/Holt-Winters)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render forecast line)
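The linear regression variant could look like the sketch below (Holt-Winters is not shown). `fitLine` and `msUntilThreshold` are hypothetical helpers; the sketch assumes samples are sorted by time and that the threshold is an upper bound on a rising metric.

```typescript
interface ForecastSample {
  timeMs: number;
  value: number;
}

// Ordinary least-squares fit of value = slope * timeMs + intercept.
function fitLine(samples: ForecastSample[]): { slope: number; intercept: number } | null {
  if (samples.length < 2) return null;
  const meanT = samples.reduce((sum, p) => sum + p.timeMs, 0) / samples.length;
  const meanV = samples.reduce((sum, p) => sum + p.value, 0) / samples.length;
  let covariance = 0;
  let varianceT = 0;
  for (const p of samples) {
    covariance += (p.timeMs - meanT) * (p.value - meanV);
    varianceT += (p.timeMs - meanT) * (p.timeMs - meanT);
  }
  if (varianceT === 0) return null;
  const slope = covariance / varianceT;
  return { slope, intercept: meanV - slope * meanT };
}

// Milliseconds from the latest sample until the fitted line reaches the threshold.
// Returns null when the metric is flat or falling.
function msUntilThreshold(samples: ForecastSample[], threshold: number): number | null {
  const line = fitLine(samples);
  const last = samples[samples.length - 1];
  if (line === null || last === undefined || line.slope <= 0) return null;
  const crossingTimeMs = (threshold - line.intercept) / line.slope;
  return Math.max(0, crossingTimeMs - last.timeMs);
}
```

An alert would fire when the returned duration is shorter than the configured forecast window, for example "disk full in 4 hours".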
|
||||
|
||||
---
|
||||
|
||||
## ClickHouse Storage Improvements
|
||||
|
||||
### S.1 Fix Sort Key Order (CRITICAL)
|
||||
|
||||
**Current**: Sort key is `(projectId, time, serviceId)`.
|
||||
**Target**: Change to `(projectId, name, serviceId, time)`.
|
||||
|
||||
**Impact**: ~100x improvement for name-filtered queries. Virtually every metric query filters by `name`, but currently ClickHouse must scan all metric names within the time range.
|
||||
|
||||
**Migration**: Requires creating `MetricItem_v2` with the new sort key and migrating data (ClickHouse's `ALTER TABLE ... MODIFY ORDER BY` can only append newly added columns to the sort key; it cannot reorder existing key columns).
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/Metric.ts` (change sort key)
|
||||
- `Worker/DataMigrations/` (new migration - create v2 table, backfill, swap)
|
||||
|
||||
### S.2 Upgrade time to DateTime64 (HIGH)
|
||||
|
||||
**Current**: `DateTime` with second precision.
|
||||
**Target**: `DateTime64(3)` or `DateTime64(6)` for sub-second precision.
|
||||
|
||||
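**Impact**: Correct sub-second metric ordering. May remove the need for separate `timeUnixNano`/`startTimeUnixNano` columns, although full nanosecond precision would require `DateTime64(9)` rather than `DateTime64(3)` or `DateTime64(6)`.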
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/Metric.ts` (change column type)
|
||||
- `Common/Types/AnalyticsDatabase/TableColumnType.ts` (add DateTime64 type if not present)
|
||||
- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle DateTime64)
|
||||
- `Worker/DataMigrations/` (migration)
|
||||
|
||||
### S.3 Add Skip Index on metricPointType (MEDIUM)
|
||||
|
||||
**Current**: No index support for metric type filtering.
|
||||
**Target**: Set skip index on `metricPointType`.
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/Metric.ts` (add skip index)
|
||||
|
||||
### S.4 Evaluate Map Type for Attributes (MEDIUM)
|
||||
|
||||
**Current**: Attributes stored as JSON.
|
||||
**Target**: Evaluate `Map(LowCardinality(String), String)` for faster attribute-based filtering.
|
||||
|
||||
### S.5 Upgrade count/bucketCounts to Int64 (LOW)
|
||||
|
||||
**Current**: `Int32` for count and `Array(Int32)` for bucketCounts.
|
||||
**Target**: `Int64` / `Array(Int64)` to prevent overflow in high-throughput systems.
|
||||
|
||||
---
|
||||
|
||||
## Quick Wins (Can Ship This Week)
|
||||
|
||||
1. **Display units on chart Y-axes** - Data exists in MetricType, just needs wiring to chart rendering
|
||||
2. **Add p50/p95/p99 to aggregation dropdown** - ClickHouse `quantile()` is straightforward to add
|
||||
3. **Extend default retention** - 15 days is too short; increase default to 30 days
|
||||
4. **Multi-attribute GROUP BY** - Change `groupByAttribute: string` to `groupByAttribute: string[]`
|
||||
5. **Add stacked area chart type** - Simple extension of existing line chart
|
||||
6. **Add skip index on metricPointType** - Low effort, faster type-filtered queries
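
Quick wins 2 and 4 could be sketched roughly as follows. This is a minimal illustration, not the existing OneUptime code: the `MetricAggregation` names, the `value` column used in the usage note, and both helper functions are hypothetical. Only ClickHouse's parametric `quantile(level)(expr)` function and the `groupByAttribute` field name come from the plan itself.

```typescript
// Hypothetical aggregation names; the real enum in OneUptime may differ.
type MetricAggregation =
  | "avg"
  | "sum"
  | "min"
  | "max"
  | "count"
  | "p50"
  | "p95"
  | "p99";

// Percentile dropdown entries mapped to quantile levels.
const QUANTILE_LEVELS: Record<string, number> = {
  p50: 0.5,
  p95: 0.95,
  p99: 0.99,
};

// Builds the ClickHouse aggregate expression for one aggregation over a column.
function toAggregateExpression(
  aggregation: MetricAggregation,
  column: string,
): string {
  const level: number | undefined = QUANTILE_LEVELS[aggregation];
  if (level !== undefined) {
    // quantile(level)(expr) is ClickHouse's approximate quantile aggregate.
    return `quantile(${level})(${column})`;
  }
  if (aggregation === "count") {
    return "count()";
  }
  return `${aggregation}(${column})`;
}

// Accepts both the current single-string form and the proposed string[] form,
// so existing saved queries keep working after the type change.
function normalizeGroupBy(
  groupByAttribute: string | string[] | undefined,
): string[] {
  if (groupByAttribute === undefined) {
    return [];
  }
  return Array.isArray(groupByAttribute)
    ? groupByAttribute
    : [groupByAttribute];
}
```

With these helpers, `toAggregateExpression("p95", "value")` yields `quantile(0.95)(value)`, and `normalizeGroupBy("host")` yields a one-element array, which keeps the migration from `string` to `string[]` backward compatible.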
|
||||
|
||||
---
|
||||
|
||||
## Recommended Implementation Order
|
||||
|
||||
1. **Quick Wins** - Ship units on charts, p50/p95/p99, multi-attribute GROUP BY, stacked area
|
||||
2. **Phase 1.1** - Percentile aggregations (full implementation beyond quick win)
|
||||
3. **Phase 1.2** - Rate/derivative calculations
|
||||
4. **S.1** - Fix sort key order (critical performance improvement)
|
||||
5. **Phase 1.4** - Rollups/downsampling (enables long-range queries)
|
||||
6. **Phase 2.1** - More chart types (heatmap, pie, gauge, billboard)
|
||||
7. **Phase 2.2** - Time-over-time comparison
|
||||
8. **Phase 1.3** - Multi-attribute GROUP BY (full implementation)
|
||||
9. **S.2** - Upgrade time to DateTime64
|
||||
10. **Phase 3.1** - Anomaly detection
|
||||
11. **Phase 3.2** - SLO/SLI tracking
|
||||
12. **Phase 2.4** - Dashboard templates
|
||||
13. **Phase 4.1** - Cardinality management
|
||||
14. **Phase 4.2** - Query language
|
||||
15. **Phase 4.3** - Golden Signals dashboards
|
||||
16. **Phase 4.4** - Predictive alerting
|
||||
17. **Phase 3.3** - Metric correlations
|
||||
|
||||
## Verification
|
||||
|
||||
For each feature:
|
||||
1. Unit tests for new aggregation types, rate calculations, unit formatting, query language parser
|
||||
2. Integration tests for new API endpoints (percentiles, rollup queries, SLO evaluation)
|
||||
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/metrics`
|
||||
4. Check ClickHouse query performance with `EXPLAIN` for new query patterns
|
||||
5. Verify rollup accuracy by comparing rollup results to raw data results for overlapping time ranges
|
||||
6. Load test cardinality tracking and anomaly detection jobs to ensure they don't impact ingestion
|
||||
412
Internal/Roadmap/Traces.md
Normal file
@@ -0,0 +1,412 @@
|
||||
# Plan: Bring OneUptime Traces to Industry Parity and Beyond
|
||||
|
||||
## Context
|
||||
|
||||
OneUptime's trace implementation provides OTLP-native ingestion (HTTP and gRPC), ClickHouse storage with a full OpenTelemetry span model (events, links, status, attributes, resources, scope), a Gantt/waterfall visualization, trace-to-log and trace-to-exception correlation, a basic service dependency graph, queue-based async ingestion, and per-service data retention with TTL. ClickHouse schema has been optimized with BloomFilter indexes on traceId/spanId/parentSpanId, Set indexes on statusCode/kind/hasException, TokenBF on name, and ZSTD compression on key columns.
|
||||
|
||||
This plan identifies the remaining gaps vs Datadog, New Relic, Honeycomb, and Grafana Tempo, and proposes a phased implementation to close them and surpass the competition.
|
||||
|
||||
## Completed
|
||||
|
||||
The following features have been implemented:
|
||||
- **OTLP Ingestion** - HTTP and gRPC trace ingestion with async queue-based processing
|
||||
- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
|
||||
- **Gantt/Waterfall View** - Hierarchical span visualization with color-coded services, time-unit auto-scaling, error indicators
|
||||
- **Trace-to-Log Correlation** - Log model has traceId/spanId columns; SpanViewer shows associated logs
|
||||
- **Trace-to-Exception Correlation** - ExceptionInstance model links to traceId/spanId with stack trace parsing and fingerprinting
|
||||
- **Span Detail Panel** - Side-over with tabs for Basic Info, Logs, Attributes, Events, Exceptions
|
||||
- **BloomFilter indexes** on traceId, spanId, parentSpanId
|
||||
- **Set indexes** on statusCode, kind, hasException
|
||||
- **TokenBF index** on name
|
||||
- **ZSTD compression** on time/ID/attribute columns
|
||||
- **hasException boolean column** for fast error span filtering
|
||||
- **links default value** corrected to `[]`
|
||||
|
||||
## Gap Analysis Summary
|
||||
|
||||
| Feature | OneUptime | Datadog | New Relic | Tempo/Honeycomb | Priority |
|---------|-----------|---------|-----------|-----------------|----------|
| Trace analytics / aggregation engine | None | Trace Explorer with COUNT/percentiles | NRQL on span data | TraceQL rate/count/quantile | **P0** |
| RED metrics from traces | None | Auto-computed on 100% traffic | Derived golden signals | Metrics-generator to Prometheus | **P0** |
| Trace-based alerting | None | APM Monitors (p50-p99, error rate, Apdex) | NRQL alert conditions | Via Grafana alerting / Triggers | **P0** |
| Sampling controls | None (100% ingestion) | Head-based adaptive + retention filters | Infinite Tracing (tail-based) | Refinery (rules/dynamic/tail) | **P0** |
| Flame graph view | None | Yes (default view) | No | No | **P1** |
| Latency breakdown / critical path | None | Per-hop latency, bottleneck detection | No | BubbleUp (Honeycomb) | **P1** |
| In-trace search | None | Yes | No | No | **P1** |
| Per-trace service map | None | Yes (Map view) | No | No | **P1** |
| Trace-to-metric exemplars | None | Pivot from metric graph to traces | Metric-to-trace linking | Prometheus exemplars | **P1** |
| Custom metrics from spans | None | Generate count/distribution/gauge from tags | Via NRQL | SLOs from span data | **P2** |
| Structural trace queries | None | Trace Queries (multi-span relationships) | Via NRQL | TraceQL spanset pipelines | **P2** |
| Trace comparison / diffing | None | Partial | Side-by-side comparison | compare() in TraceQL | **P2** |
| AI/ML on traces | None | Watchdog (auto anomaly + RCA) | NRAI | BubbleUp (pattern detection) | **P3** |
| RUM correlation | None | Frontend-to-backend trace linking | Yes | Faro / frontend observability | **P3** |
| Continuous profiling | None | Code Hotspots (span-to-profile) | Partial | Pyroscope | **P3** |
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Analytics & Alerting Foundation (P0) — Highest Impact
|
||||
|
||||
Without these, users cannot answer basic questions like "is my service healthy?" from trace data.
|
||||
|
||||
### 1.1 Trace Analytics / Aggregation Engine
|
||||
|
||||
**Current**: Can list/filter individual spans and view individual traces. No way to aggregate or compute statistics.
|
||||
**Target**: Full trace analytics supporting COUNT, AVG, SUM, MIN, MAX, P50/P75/P90/P95/P99 aggregations with GROUP BY on any span attribute and time-series bucketing.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Build a trace analytics API endpoint that translates query configs into ClickHouse aggregation queries
|
||||
- Use ClickHouse's native functions: `quantile(0.99)(durationUnixNano)`, `countIf(statusCode = 2)`, `toStartOfInterval(startTime, INTERVAL 1 MINUTE)`
|
||||
- Support GROUP BY on service, span name, kind, status, and any custom attribute (via JSON extraction)
|
||||
- Frontend: Add an "Analytics" tab to the Traces page with chart types (timeseries, top list, table) similar to the existing LogsAnalyticsView
|
||||
- Support switching between "List" view (current) and "Analytics" view
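
The translation from a query config into a ClickHouse aggregation query might look roughly like the sketch below. The `TraceAnalyticsQuerySketch` shape and `buildTraceAnalyticsSql` are hypothetical and are not the planned `TraceAnalyticsQuery` interface. The column names, the `SpanItem` table, and the ClickHouse functions are taken from this plan, while the `{name:Type}` query parameters and their types are assumptions. Custom-attribute grouping is left out.

```typescript
// Hypothetical query config; field names are illustrative only.
interface TraceAnalyticsQuerySketch {
  aggregation: "count" | "avg" | "p50" | "p95" | "p99" | "errorCount";
  // Restricted to known columns so user input never reaches the SQL text.
  groupBy: Array<"serviceId" | "name" | "kind" | "statusCode">;
  bucketMinutes: number;
}

// One ClickHouse aggregate expression per supported aggregation.
const AGGREGATE_SQL: Record<TraceAnalyticsQuerySketch["aggregation"], string> = {
  count: "count()",
  avg: "avg(durationUnixNano)",
  p50: "quantile(0.5)(durationUnixNano)",
  p95: "quantile(0.95)(durationUnixNano)",
  p99: "quantile(0.99)(durationUnixNano)",
  errorCount: "countIf(statusCode = 2)",
};

// Builds a time-bucketed aggregation query over spans.
function buildTraceAnalyticsSql(query: TraceAnalyticsQuerySketch): string {
  if (!Number.isInteger(query.bucketMinutes) || query.bucketMinutes < 1) {
    throw new Error("bucketMinutes must be a positive integer");
  }
  const bucket: string = `toStartOfInterval(startTime, INTERVAL ${query.bucketMinutes} MINUTE)`;
  const selectColumns: string[] = [
    `${bucket} AS bucket`,
    ...query.groupBy,
    `${AGGREGATE_SQL[query.aggregation]} AS value`,
  ];
  const groupColumns: string[] = ["bucket", ...query.groupBy];
  return [
    `SELECT ${selectColumns.join(", ")}`,
    "FROM SpanItem",
    // Parameter names and types are assumptions for this sketch.
    "WHERE projectId = {projectId:String}",
    "  AND startTime >= {from:DateTime} AND startTime < {to:DateTime}",
    `GROUP BY ${groupColumns.join(", ")}`,
    "ORDER BY bucket",
  ].join("\n");
}
```

Limiting `groupBy` to a fixed set of column names keeps the builder safe from injection. Grouping by arbitrary attributes would need its own escaping or parameter binding.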
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Server/API/TelemetryAPI.ts` (add trace analytics endpoint)
|
||||
- `Common/Server/Services/SpanService.ts` (add aggregation query methods)
|
||||
- `Common/Types/Traces/TraceAnalyticsQuery.ts` (new - query interface)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Traces/Index.tsx` (add analytics view toggle)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceAnalyticsView.tsx` (new - analytics UI)
|
||||
|
||||
### 1.2 RED Metrics from Traces (Request Rate, Error Rate, Duration)
|
||||
|
||||
**Current**: No automatic computation of service-level metrics from trace data.
|
||||
**Target**: Auto-computed per-service, per-operation RED metrics displayed on a Service Overview page.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create a ClickHouse materialized view that aggregates spans into per-service, per-operation metrics at 1-minute intervals:
|
||||
```sql
CREATE MATERIALIZED VIEW span_red_metrics
ENGINE = AggregatingMergeTree()
ORDER BY (projectId, serviceId, name, minute)
AS SELECT
    projectId, serviceId, name,
    toStartOfMinute(startTime) AS minute,
    countState() AS request_count,
    countIfState(statusCode = 2) AS error_count,
    quantileState(0.50)(durationUnixNano) AS p50_duration,
    quantileState(0.95)(durationUnixNano) AS p95_duration,
    quantileState(0.99)(durationUnixNano) AS p99_duration
FROM SpanItem
GROUP BY projectId, serviceId, name, minute
```
|
||||
- Build a Service Overview page showing: request rate chart, error rate chart, p50/p95/p99 latency charts
|
||||
- Add an API endpoint to query the materialized view
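
Once the API has read the rollup (the aggregate states have to be finalized with the matching `-Merge` combinators at query time), turning each one-minute row into the three RED signals is simple arithmetic. The sketch below assumes a hypothetical `RedRow` shape for an already-merged row. The field names are illustrative and are not part of any existing model.

```typescript
// Hypothetical shape of one merged one-minute row from the rollup.
interface RedRow {
  minute: string;
  requestCount: number;
  errorCount: number;
  p95DurationNano: number;
}

// The values a Service Overview chart would plot for that minute.
interface RedPoint {
  minute: string;
  requestsPerSecond: number;
  errorRatio: number;
  p95DurationMs: number;
}

// Converts raw counts into Rate, Errors, and Duration values.
function toRedPoint(row: RedRow, bucketSeconds: number = 60): RedPoint {
  return {
    minute: row.minute,
    // Rate: requests in the bucket divided by the bucket length.
    requestsPerSecond: row.requestCount / bucketSeconds,
    // Errors: guard against empty buckets to avoid dividing by zero.
    errorRatio: row.requestCount === 0 ? 0 : row.errorCount / row.requestCount,
    // Duration: spans store nanoseconds; charts usually show milliseconds.
    p95DurationMs: row.p95DurationNano / 1e6,
  };
}
```

For example, a row with 120 requests, 6 errors, and a p95 of 250,000,000 ns maps to 2 requests per second, an error ratio of 0.05, and a p95 of 250 ms.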
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/SpanRedMetrics.ts` (new - materialized view model)
|
||||
- `Telemetry/Services/SpanRedMetricsService.ts` (new - query service)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Service/View/Overview.tsx` (new or enhanced - RED dashboard)
|
||||
- `Worker/DataMigrations/` (new migration to create materialized view)
|
||||
|
||||
### 1.3 Trace-Based Alerting
|
||||
|
||||
**Current**: No ability to alert on trace data.
|
||||
**Target**: Create alerts on p50/p75/p90/p95/p99 latency thresholds, error rate thresholds, and request rate anomalies per service/operation.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Extend the existing monitor system to add a `TraceMonitor` type
|
||||
- Monitor evaluates against the RED metrics materialized view (depends on 1.2)
|
||||
- Alert conditions: latency exceeds threshold, error rate exceeds threshold, request rate drops below threshold
|
||||
- Integrate with existing OneUptime alerting/incident system
|
||||
- UI: Add "Trace Monitor" as a new monitor type in the monitor creation wizard
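
The three alert conditions listed above reduce to threshold comparisons against a RED snapshot for a service or operation. A minimal sketch, assuming hypothetical `TraceAlertCondition` and `RedSnapshot` types (the real `MonitorStepTraceMonitor` config may look quite different):

```typescript
// Hypothetical metric and comparison names for this sketch only.
type TraceAlertMetric = "p95LatencyMs" | "errorRatio" | "requestsPerSecond";
type Comparison = "greaterThan" | "lessThan";

interface TraceAlertCondition {
  metric: TraceAlertMetric;
  comparison: Comparison;
  threshold: number;
}

// Latest RED values for one service/operation.
type RedSnapshot = Record<TraceAlertMetric, number>;

// Returns the conditions that the snapshot currently breaches.
function findBreachedConditions(
  snapshot: RedSnapshot,
  conditions: TraceAlertCondition[],
): TraceAlertCondition[] {
  return conditions.filter((condition: TraceAlertCondition): boolean => {
    const value: number = snapshot[condition.metric];
    return condition.comparison === "greaterThan"
      ? value > condition.threshold
      : value < condition.threshold;
  });
}
```

Latency and error rate would typically use `greaterThan`, while the "request rate drops below threshold" case uses `lessThan`. A non-empty result is what would feed the existing alerting/incident system.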
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Types/Monitor/MonitorType.ts` (add Trace monitor type)
|
||||
- `Common/Types/Monitor/MonitorStepTraceMonitor.ts` (new - trace monitor config)
|
||||
- `Common/Server/Utils/Monitor/Criteria/TraceMonitorCriteria.ts` (new - evaluation logic)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Form/Monitor/TraceMonitor/` (new - monitor form UI)
|
||||
|
||||
### 1.4 Head-Based Probabilistic Sampling
|
||||
|
||||
**Current**: Ingests 100% of received traces.
|
||||
**Target**: Configurable per-service probabilistic sampling with rules to always keep errors and slow traces.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create `TraceSamplingRule` PostgreSQL model: service filter, sample rate (0-100%), conditions to always keep (error status, duration > threshold)
|
||||
- Evaluate sampling rules in `OtelTracesIngestService.ts` before ClickHouse insert
|
||||
- Use deterministic sampling based on traceId hash (so all spans from the same trace are kept or dropped together)
|
||||
- UI under Settings > Trace Configuration > Sampling Rules
|
||||
- Show estimated storage savings
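
Deterministic sampling on the trace ID can be sketched as below. This assumes the trace ID is a hex string and maps its lower 32 bits to a number in [0, 1). The `SamplingRuleSketch` and `SpanForSampling` types are hypothetical, and the sample rate is expressed as a fraction (0 to 1) rather than the 0-100% used in the proposed model.

```typescript
// Hypothetical inputs for this sketch; not the planned TraceSamplingRule model.
interface SpanForSampling {
  traceId: string; // hex-encoded trace ID
  isError: boolean;
  durationMs: number;
}

interface SamplingRuleSketch {
  sampleRate: number; // fraction of traces to keep, 0 to 1
  alwaysKeepErrors: boolean;
  alwaysKeepSlowerThanMs?: number;
}

// Maps the lower 32 bits of the trace ID to a value in [0, 1).
function traceIdToUnitInterval(traceId: string): number {
  const tail: number = parseInt(traceId.slice(-8), 16);
  if (Number.isNaN(tail)) {
    throw new Error("traceId must be a hex string");
  }
  return tail / 0x100000000;
}

// Decides whether a span is written to ClickHouse.
function shouldKeepSpan(
  span: SpanForSampling,
  rule: SamplingRuleSketch,
): boolean {
  if (rule.alwaysKeepErrors && span.isError) {
    return true;
  }
  if (
    rule.alwaysKeepSlowerThanMs !== undefined &&
    span.durationMs > rule.alwaysKeepSlowerThanMs
  ) {
    return true;
  }
  // Same trace ID always gives the same answer, so a trace is kept or dropped whole.
  return traceIdToUnitInterval(span.traceId) < rule.sampleRate;
}
```

One caveat with this design: the always-keep overrides are evaluated per span, so an error span can be kept while the rest of its (unsampled) trace is dropped. Keeping whole traces on error needs the tail-based approach discussed in Phase 4.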
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/DatabaseModels/TraceSamplingRule.ts` (new)
|
||||
- `Telemetry/Services/OtelTracesIngestService.ts` (add sampling logic)
|
||||
- Dashboard: new Settings page for sampling configuration
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Visualization & Debugging UX (P1) — Industry-Standard Features
|
||||
|
||||
### 2.1 Flame Graph View
|
||||
|
||||
**Current**: Only Gantt/waterfall view.
|
||||
**Target**: Flame graph visualization showing proportional time spent in each span, with service color coding.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Build a flame graph component that renders spans as horizontally stacked rectangles proportional to duration
|
||||
- Allow switching between Waterfall and Flame Graph views in TraceExplorer
|
||||
- Color-code by service (consistent with waterfall view)
|
||||
- Click a span rectangle to focus/zoom into that subtree
|
||||
- Show tooltip with span name, service, duration, self-time on hover
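
The layout step behind such a component could be sketched as follows: each span becomes a rectangle whose row is its depth in the tree and whose horizontal extent is proportional to its share of the trace. The `FlameSpan` and `FlameRect` types are hypothetical. Times are assumed to be millisecond numbers, since raw Unix-nanosecond timestamps do not fit safely in a JavaScript `number`.

```typescript
// Hypothetical span shape; times are assumed to be milliseconds.
interface FlameSpan {
  spanId: string;
  parentSpanId?: string;
  startMs: number;
  endMs: number;
}

// Rectangle in relative units: multiply fractions by the canvas width to draw.
interface FlameRect {
  spanId: string;
  depth: number;
  leftFraction: number;
  widthFraction: number;
}

function layoutFlameGraph(spans: FlameSpan[]): FlameRect[] {
  if (spans.length === 0) {
    return [];
  }
  const byId: Map<string, FlameSpan> = new Map<string, FlameSpan>();
  for (const span of spans) {
    byId.set(span.spanId, span);
  }
  const traceStart: number = Math.min(...spans.map((s: FlameSpan) => s.startMs));
  const traceEnd: number = Math.max(...spans.map((s: FlameSpan) => s.endMs));
  const total: number = Math.max(traceEnd - traceStart, 1); // avoid divide-by-zero

  // Depth = number of ancestors present in this trace (orphans count as roots).
  const depthOf = (span: FlameSpan): number => {
    let depth: number = 0;
    let parentId: string | undefined = span.parentSpanId;
    const seen: Set<string> = new Set<string>([span.spanId]); // cycle guard
    while (parentId !== undefined) {
      const parent: FlameSpan | undefined = byId.get(parentId);
      if (parent === undefined || seen.has(parent.spanId)) {
        break;
      }
      seen.add(parent.spanId);
      depth += 1;
      parentId = parent.parentSpanId;
    }
    return depth;
  };

  return spans.map((span: FlameSpan): FlameRect => ({
    spanId: span.spanId,
    depth: depthOf(span),
    leftFraction: (span.startMs - traceStart) / total,
    widthFraction: (span.endMs - span.startMs) / total,
  }));
}
```

Returning fractions rather than pixels keeps the layout independent of the rendering layer, so the same output could also back click-to-zoom by re-running the layout on a subtree.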
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/FlameGraph.tsx` (new)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view toggle)
|
||||
|
||||
### 2.2 Latency Breakdown / Critical Path Analysis
|
||||
|
||||
**Current**: Shows individual span durations but no automated analysis.
|
||||
**Target**: Compute and display critical path, self-time vs child-time, and bottleneck identification.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Compute critical path: the longest sequential chain of spans through the trace (accounts for parallelism)
|
||||
- Calculate "self time" per span: `span.duration - sum(child.duration)` (clamped to 0 for overlapping children)
|
||||
- Display latency breakdown by service: percentage of total trace time spent in each service
|
||||
- Highlight bottleneck spans (spans contributing most to critical path duration)
|
||||
- Add "Critical Path" toggle in TraceExplorer that highlights the critical path spans
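
A possible self-time calculation is sketched below. Note that it deliberately swaps the plain `sum(child.duration)` from the bullet above for the union of child intervals, so that parallel children are not double counted. The final clamp to 0 is kept as a safety net. The `TimedSpan` type and millisecond fields are assumptions for illustration.

```typescript
// Hypothetical span shape; times are assumed to be milliseconds.
interface TimedSpan {
  spanId: string;
  parentSpanId?: string;
  startMs: number;
  endMs: number;
}

// Returns self time per spanId: span duration minus time covered by children.
function computeSelfTimes(spans: TimedSpan[]): Map<string, number> {
  const childrenByParent: Map<string, TimedSpan[]> = new Map<string, TimedSpan[]>();
  for (const span of spans) {
    if (span.parentSpanId === undefined) {
      continue;
    }
    const siblings: TimedSpan[] = childrenByParent.get(span.parentSpanId) ?? [];
    siblings.push(span);
    childrenByParent.set(span.parentSpanId, siblings);
  }

  const selfTimes: Map<string, number> = new Map<string, number>();
  for (const span of spans) {
    const children: TimedSpan[] = [...(childrenByParent.get(span.spanId) ?? [])].sort(
      (a: TimedSpan, b: TimedSpan) => a.startMs - b.startMs,
    );
    let covered: number = 0;
    let cursor: number = span.startMs; // everything before cursor is already counted
    for (const child of children) {
      // Clip each child to the parent's window and to the uncounted region.
      const start: number = Math.max(child.startMs, cursor);
      const end: number = Math.min(child.endMs, span.endMs);
      if (end > start) {
        covered += end - start;
        cursor = end;
      }
    }
    selfTimes.set(span.spanId, Math.max(span.endMs - span.startMs - covered, 0));
  }
  return selfTimes;
}
```

For a 100 ms parent with children at 10-60 ms and 40-80 ms, the children cover 70 ms in total, so the parent's self time is 30 ms. A plain sum of child durations (90 ms) would have reported 10 ms.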
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Utils/Traces/CriticalPath.ts` (new - critical path algorithm)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (show self-time)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add critical path view)
|
||||
|
||||
### 2.3 In-Trace Span Search
|
||||
|
||||
**Current**: TraceExplorer shows all spans with service filtering and error toggle, but no text search.
|
||||
**Target**: Search box to filter spans by name, attribute values, or status within the current trace.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add a search input in TraceExplorer toolbar
|
||||
- Client-side filtering: match span name, service name, attribute keys/values against search text
|
||||
- Highlight matching spans in the waterfall/flame graph
|
||||
- Show match count (e.g., "3 of 47 spans")
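
The client-side filter described above is small enough to sketch in full. The `SearchableSpan` shape is hypothetical, and attributes are assumed to be available as a flat string map on the client.

```typescript
// Hypothetical client-side span shape.
interface SearchableSpan {
  spanId: string;
  name: string;
  serviceName: string;
  attributes: Record<string, string>;
}

// Case-insensitive substring match over name, service, and attribute keys/values.
function spanMatches(span: SearchableSpan, searchText: string): boolean {
  const needle: string = searchText.trim().toLowerCase();
  if (needle === "") {
    return true; // empty search shows everything
  }
  const haystack: string[] = [
    span.name,
    span.serviceName,
    ...Object.keys(span.attributes),
    ...Object.values(span.attributes),
  ];
  return haystack.some((text: string) => text.toLowerCase().includes(needle));
}

// Returns matching spans plus the "N of M spans" label for the toolbar.
function searchSpans(
  spans: SearchableSpan[],
  searchText: string,
): { matches: SearchableSpan[]; label: string } {
  const matches: SearchableSpan[] = spans.filter((span: SearchableSpan) =>
    spanMatches(span, searchText),
  );
  return { matches, label: `${matches.length} of ${spans.length} spans` };
}
```

Because a single trace is already loaded in the browser, this runs without any extra API call. The matched `spanId` values can then drive highlighting in whichever view is active.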
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add search bar and filtering)
|
||||
|
||||
### 2.4 Per-Trace Service Flow Map
|
||||
|
||||
**Current**: Service dependency graph exists globally but not per-trace.
|
||||
**Target**: Per-trace visualization showing the path of a request through services with latency annotations.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Build a directed graph from the spans in a single trace (services as nodes, calls as edges)
|
||||
- Annotate edges with call count and latency
|
||||
- Color-code nodes by error status
|
||||
- Add as a new view tab alongside Waterfall and Flame Graph
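
Building the per-trace graph amounts to walking parent/child span pairs and collapsing them into service-to-service edges. A minimal sketch, assuming a hypothetical `MapSpan` shape with a resolved service name per span:

```typescript
// Hypothetical span shape with the service already resolved.
interface MapSpan {
  spanId: string;
  parentSpanId?: string;
  serviceName: string;
  durationMs: number;
  isError: boolean;
}

// One directed edge between two services within a single trace.
interface ServiceEdge {
  from: string;
  to: string;
  callCount: number;
  totalDurationMs: number;
  hasError: boolean;
}

function buildServiceEdges(spans: MapSpan[]): ServiceEdge[] {
  const byId: Map<string, MapSpan> = new Map<string, MapSpan>();
  for (const span of spans) {
    byId.set(span.spanId, span);
  }

  const edges: Map<string, ServiceEdge> = new Map<string, ServiceEdge>();
  for (const span of spans) {
    if (span.parentSpanId === undefined) {
      continue;
    }
    const parent: MapSpan | undefined = byId.get(span.parentSpanId);
    // Only cross-service calls become edges; in-service spans are skipped.
    if (parent === undefined || parent.serviceName === span.serviceName) {
      continue;
    }
    const key: string = `${parent.serviceName}->${span.serviceName}`;
    const edge: ServiceEdge = edges.get(key) ?? {
      from: parent.serviceName,
      to: span.serviceName,
      callCount: 0,
      totalDurationMs: 0,
      hasError: false,
    };
    edge.callCount += 1;
    edge.totalDurationMs += span.durationMs;
    edge.hasError = edge.hasError || span.isError;
    edges.set(key, edge);
  }
  return Array.from(edges.values());
}
```

Each edge carries the call count and accumulated callee latency for annotation, and `hasError` can drive the error color coding of the target node.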
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceServiceMap.tsx` (new)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view tab)
|
||||
|
||||
### 2.5 Span Link Navigation
|
||||
|
||||
**Current**: Links data is stored in spans but not navigable in the UI.
|
||||
**Target**: Clickable links in the span detail panel that navigate to related traces/spans.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- In the SpanViewer detail panel, render the `links` array as clickable items
|
||||
- Each link shows the linked traceId, spanId, and relationship type
|
||||
- Clicking navigates to the linked trace view
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (render clickable links)
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Advanced Analytics & Correlation (P2) — Power Features
|
||||
|
||||
### 3.1 Trace-to-Metric Exemplars
|
||||
|
||||
**Current**: Metric model has no traceId/spanId fields.
|
||||
**Target**: Link metric data points to trace IDs; show exemplar dots on metric charts that navigate to traces.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add optional `traceId` and `spanId` columns to the Metric ClickHouse model
|
||||
- During metric ingestion, extract exemplar trace/span IDs from OTLP exemplar fields
|
||||
- On metric charts, render exemplar dots at data points that have associated traces
|
||||
- Clicking an exemplar dot navigates to the trace view
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/Metric.ts` (add traceId/spanId columns)
|
||||
- `Telemetry/Services/OtelMetricsIngestService.ts` (extract exemplars)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render exemplar dots)
|
||||
|
||||
### 3.2 Custom Metrics from Spans
|
||||
|
||||
**Current**: No way to create persistent metrics from trace data.
|
||||
**Target**: Users define custom metrics from span attributes that are computed via ClickHouse materialized views and available for alerting and dashboards.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Create `SpanDerivedMetric` model: name, filter query (which spans), aggregation (count/avg/p99 of what field), GROUP BY attributes
|
||||
- Use ClickHouse materialized views for efficient computation
|
||||
- Surface derived metrics in the metric explorer and alerting system
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/DatabaseModels/SpanDerivedMetric.ts` (new)
|
||||
- `Common/Server/Services/SpanDerivedMetricService.ts` (new)
|
||||
- Dashboard: UI for defining derived metrics
|
||||
|
||||
### 3.3 Structural Trace Queries
|
||||
|
||||
**Current**: Can only filter on individual span attributes.
|
||||
**Target**: Query traces based on properties of multiple spans and their relationships (e.g., "find traces where service A called service B and B returned an error").
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Design a visual query builder for structural queries (easier adoption than a query language)
|
||||
- Translate structural queries to ClickHouse subqueries with JOINs on traceId
|
||||
- Example: "Find traces where span with service=frontend has child span with service=database AND duration > 500ms"
|
||||
```sql
SELECT DISTINCT s1.traceId FROM SpanItem s1
JOIN SpanItem s2 ON s1.traceId = s2.traceId AND s1.spanId = s2.parentSpanId
WHERE s1.projectId = {pid}
  AND JSONExtractString(s1.attributes, 'service.name') = 'frontend'
  AND JSONExtractString(s2.attributes, 'service.name') = 'database'
  AND s2.durationUnixNano > 500000000
```
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Types/Traces/StructuralTraceQuery.ts` (new - query model)
|
||||
- `Common/Server/Services/SpanService.ts` (add structural query execution)
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/StructuralQueryBuilder.tsx` (new - visual builder)
|
||||
|
||||
### 3.4 Trace Comparison / Diffing
|
||||
|
||||
**Current**: No way to compare traces.
|
||||
**Target**: Side-by-side comparison of two traces of the same operation, highlighting differences in span count, latency, and structure.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Add "Compare" action to trace list (select two traces)
|
||||
- Build a diff view showing: added/removed spans, latency differences per span, structural changes
|
||||
- Useful for comparing a slow trace to a fast trace of the same operation
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Components/Traces/TraceComparison.tsx` (new)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Traces/Compare.tsx` (new page)
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Competitive Differentiation (P3) — Long-Term
|
||||
|
||||
### 4.1 Rules-Based and Tail-Based Sampling
|
||||
|
||||
**Current**: Phase 1 adds head-based probabilistic sampling.
|
||||
**Target**: Rules-based sampling (always keep errors/slow traces, sample successes) and eventually tail-based sampling (buffer complete traces, decide after seeing all spans).
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Rules engine: configurable conditions (service, status, duration, attributes) with per-rule sample rates
|
||||
- Tail-based: buffer spans for a configurable window (30s), assemble complete traces, then apply retention decisions
|
||||
- Tail-based is complex; consider integrating with OpenTelemetry Collector's tail sampling processor as an alternative
|
||||
|
||||
### 4.2 AI/ML on Trace Data
|
||||
|
||||
- **Anomaly detection** on RED metrics (statistical deviation from baseline)
|
||||
- **Auto-surfacing correlated attributes** when latency spikes (similar to Honeycomb BubbleUp)
|
||||
- **Natural language trace queries** ("show me slow database calls from the last hour")
|
||||
- **Automatic root cause analysis** from trace data during incidents
|
||||
|
||||
### 4.3 RUM (Real User Monitoring) Correlation
|
||||
|
||||
- Browser SDK that propagates W3C trace context from frontend to backend
|
||||
- Link frontend page loads, interactions, and web vitals to backend traces
|
||||
- Show end-to-end user experience from browser to backend services
|
||||
|
||||
### 4.4 Continuous Profiling Integration
|
||||
|
||||
- Integrate with a profiling backend (e.g., Pyroscope)
|
||||
- Link profile data to span time windows
|
||||
- Show "Code Hotspots" within spans (similar to Datadog)
|
||||
|
||||
---
|
||||
|
||||
## ClickHouse Storage Improvements
|
||||
|
||||
### S.1 Migrate `attributes` to Map(String, String) (HIGH)
|
||||
|
||||
**Current**: `attributes` is stored as opaque `String` (JSON). Querying by attribute value requires `LIKE` or `JSONExtract()` scans.
|
||||
**Target**: `Map(String, String)` type enabling `attributes['http.method'] = 'GET'` without JSON parsing.
|
||||
|
||||
**Impact**: Significant query speedup for attribute-based span filtering -- the most common query pattern after time-range filtering.
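
Writing `Map(String, String)` means the ingest path has to flatten nested or non-string attribute values into strings first. One possible approach is sketched below. The `flattenAttributes` helper, the dotted-key convention for nested objects, and the JSON encoding of arrays are all assumptions, not existing OneUptime behavior.

```typescript
// JSON-like attribute values as they might arrive from OTLP after decoding.
type AttributeValue =
  | string
  | number
  | boolean
  | null
  | AttributeValue[]
  | { [key: string]: AttributeValue };

// Flattens attributes into string keys and string values for a Map column.
function flattenAttributes(
  attributes: { [key: string]: AttributeValue },
  prefix: string = "",
): Record<string, string> {
  const flat: Record<string, string> = {};
  for (const key of Object.keys(attributes)) {
    const value: AttributeValue = attributes[key];
    const path: string = prefix === "" ? key : `${prefix}.${key}`;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Nested objects become dotted keys, e.g. http.status_code.
      Object.assign(flat, flattenAttributes(value, path));
    } else if (Array.isArray(value)) {
      // Arrays have no natural string form; keep them as JSON text.
      flat[path] = JSON.stringify(value);
    } else {
      // Strings, numbers, booleans, and null are stringified as-is.
      flat[path] = String(value);
    }
  }
  return flat;
}
```

Since every value becomes a string, numeric comparisons on attributes would need a cast at query time. That trade-off is worth weighing during the evaluation.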
|
||||
|
||||
**Files to modify**:
|
||||
- `Common/Models/AnalyticsModels/Span.ts` (change column type)
|
||||
- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle Map type)
|
||||
- `Telemetry/Services/OtelTracesIngestService.ts` (write Map format)
|
||||
- `Worker/DataMigrations/` (new migration)
|
||||
|
||||
### S.2 Add Aggregation Projection (MEDIUM)
|
||||
|
||||
**Current**: `projections: []` is empty.
|
||||
**Target**: Pre-aggregation projection for common dashboard queries.
|
||||
|
||||
```sql
PROJECTION agg_by_service (
    SELECT
        -- projectId is included so project-scoped queries can use the projection
        projectId,
        serviceId,
        toStartOfMinute(startTime) AS minute,
        count(),
        avg(durationUnixNano),
        quantile(0.99)(durationUnixNano)
    GROUP BY projectId, serviceId, minute
)
```
|
||||
|
||||
**Impact**: 5-10x faster aggregation queries for service overview dashboards.
|
||||
|
||||
### S.3 Add Trace-by-ID Projection (LOW)
|
||||
|
||||
**Current**: Trace detail view relies on BloomFilter skip index for traceId lookups.
|
||||
**Target**: Projection sorted by `(projectId, traceId, startTime)` for faster trace-by-ID queries.
|
||||
|
||||
---
|
||||
|
||||
## Quick Wins (Can Ship This Week)
|
||||
|
||||
1. **In-trace span search** - Add a text filter in TraceExplorer (few hours of work)
|
||||
2. **Self-time calculation** - Show "self time" (span duration minus child durations) in SpanViewer
|
||||
3. **Span link navigation** - Links data is stored but not clickable in UI
|
||||
4. **Top-N slowest operations** - Simple ClickHouse query: `ORDER BY durationUnixNano DESC LIMIT N`
|
||||
5. **Error rate by service** - Aggregate `statusCode=2` counts grouped by serviceId
|
||||
6. **Trace duration distribution histogram** - Use ClickHouse `histogram()` on durationUnixNano
|
||||
7. **Span count per service display** - Already tracked in `servicesInTrace`, just needs better display
|
||||
|
||||
---
|
||||
|
||||
## Recommended Implementation Order
|
||||
|
||||
1. **Phase 1.1** - Trace Analytics Engine (highest impact, unlocks everything else)
|
||||
2. **Phase 1.2** - RED Metrics from Traces (prerequisite for alerting, service overview)
|
||||
3. **Quick Wins** - Ship in-trace search, self-time, span links, top-N operations
|
||||
4. **Phase 1.3** - Trace-Based Alerting (core observability workflow)
|
||||
5. **Phase 2.1** - Flame Graph View (industry-standard visualization)
|
||||
6. **Phase 2.2** - Critical Path Analysis (key debugging capability)
|
||||
7. **Phase 1.4** - Head-Based Sampling (essential for high-volume users)
|
||||
8. **S.1** - Migrate attributes to Map type (storage optimization)
|
||||
9. **Phase 2.3-2.5** - In-trace search, per-trace map, span links
|
||||
10. **Phase 3.1** - Trace-to-Metric Exemplars
|
||||
11. **Phase 3.2-3.4** - Custom metrics, structural queries, comparison
|
||||
12. **Phase 4.x** - AI/ML, RUM, profiling (long-term)
|
||||
|
||||
## Verification
|
||||
|
||||
For each feature:
|
||||
1. Unit tests for new query builders, critical path algorithm, sampling logic
|
||||
2. Integration tests for new API endpoints (analytics, RED metrics, sampling)
|
||||
3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/traces`
|
||||
4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries
|
||||
5. Verify trace correlation (logs, exceptions, metrics) still works correctly with new features
|
||||
6. Load test sampling logic to ensure it doesn't add ingestion latency
|
||||