diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index b11a8ca225..1e4779528f 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -18,7 +18,7 @@ on:
         default: false
 
 permissions:
-  contents: read
+  contents: write
   packages: write
 
 jobs:
diff --git a/.github/workflows/test-release.yaml b/.github/workflows/test-release.yaml
index d8dfedb552..33246fb801 100644
--- a/.github/workflows/test-release.yaml
+++ b/.github/workflows/test-release.yaml
@@ -10,7 +10,7 @@ on:
       - "master"
 
 permissions:
-  contents: read
+  contents: write
   packages: write
 
 jobs:
diff --git a/Internal/Roadmap/Dashboards.md b/Internal/Roadmap/Dashboards.md
new file mode 100644
index 0000000000..8db64ba417
--- /dev/null
+++ b/Internal/Roadmap/Dashboards.md
@@ -0,0 +1,568 @@
+# Plan: Bring OneUptime Dashboards to Industry Parity and Beyond
+
+## Context
+
+OneUptime's dashboard implementation provides a 12-column grid layout with drag-and-drop editing, 3 widget types (Chart with Line/Bar, Value, Text with basic formatting), global time range with presets, view/edit modes, role-based permissions, and full-screen support. Dashboard config is stored as a single JSON column. Dashboards can only query OpenTelemetry metrics from ClickHouse.
+
+This plan identifies the remaining gaps vs Grafana, Datadog, and New Relic, and proposes a phased implementation to build a best-in-class dashboard product that leverages OneUptime's unique position as an all-in-one observability + status page platform.
+
+## Completed
+
+The following features have been implemented:
+- **12-Column Grid Layout** - Fixed grid with dynamic unit sizing, 60 default rows (expandable)
+- **Drag-and-Drop Editing** - Move and resize components with bounds checking
+- **Chart Widget** - Line and Bar chart types with single metric query, configurable title/description/legend
+- **Value Widget** - Single metric aggregation displayed as large number
+- **Text Widget** - Bold/Italic/Underline formatting (no markdown)
+- **Global Time Range** - Presets (30min to 3mo) + custom date range picker
+- **View/Edit Modes** - Read-only view with full-screen, edit mode with side panel settings
+- **Role-Based Permissions** - ProjectOwner, ProjectAdmin, ProjectMember + custom permissions
+- **Dashboard CRUD API** - Standard REST API with slug generation
+- **Billing Enforcement** - Free plan limited to 1 dashboard
+
+## Gap Analysis Summary
+
+| Feature | OneUptime | Grafana | Datadog | New Relic | Priority |
+|---------|-----------|---------|---------|-----------|----------|
+| Widget types | 3 | 20+ | 40+ | 15+ | **P0** |
+| Chart types | 2 (Line, Bar) | 10+ | 12+ | 10+ | **P0** |
+| Template variables | None | 6+ types | Yes | 3 types | **P0** |
+| Auto-refresh | None | Configurable | Real-time | Yes | **P0** |
+| Log panels | None | Yes (Loki) | Yes | Yes (NRQL) | **P0** |
+| Trace panels | None | Yes (Tempo) | Yes | Yes | **P0** |
+| Table widget | None | Yes | Yes | Yes | **P0** |
+| Multiple queries per chart | Single query | Yes | Yes | Yes | **P0** |
+| Markdown support | Basic formatting only | Full markdown | Full markdown | Full markdown | **P0** |
+| Threshold lines / color coding | None | Yes | Yes | Yes | **P0** |
+| Legend interaction (show/hide) | None | Yes | Yes | Yes | **P0** |
+| Chart zoom | None | Yes | Yes | Yes | **P0** |
+| Dashboard linking / drill-down | None | Data links | Yes | Facet linking | **P1** |
+| Annotations / event overlays | None | Yes | Yes | Yes (Labs) | **P1** |
+| Row/section grouping | None | Collapsible rows | Groups | No | **P1** |
+| Public/shared dashboards | None | Yes | Yes | Yes | **P1** |
+| JSON import/export | None | Yes | Yes | Yes | **P1** |
+| Dashboard versioning | None | Yes | Yes | No | **P1** |
+| Alert integration | None | Create from panel + show state | Yes | NRQL alerts | **P1** |
+| TV/Kiosk mode | Full-screen only | Kiosk mode | Yes | Auto-cycling | **P1** |
+| CSV export | None | Yes | Yes | Yes | **P1** |
+| Custom time per widget | None | No | No | No | **P1** |
+| AI dashboard creation | None | None | None | None | **P2** |
+| Dashboard-as-code SDK | None | Foundation SDK | No | No | **P2** |
+| Terraform provider | None | Yes | Yes | Yes | **P2** |
+
+---
+
+## Phase 1: Foundation (P0) — Close Critical Gaps
+
+These gaps make OneUptime dashboards fundamentally non-competitive. Every major competitor has these.
+
+### 1.1 Add Core Chart Types: Area, Pie, Table, Gauge, Heatmap, Histogram
+
+**Current**: Line and Bar only.
+**Target**: 8+ chart types covering all standard observability visualization needs.
+
+**Implementation**:
+
+- **Area Chart** (stacked and non-stacked): Extension of line chart with fill. Use existing chart library's area mode
+- **Pie/Donut Chart**: For proportional breakdowns (e.g., error distribution by service). New component
+- **Table Widget**: Tabular metric data, top-N lists, multi-column display with sortable columns. New component type
+- **Gauge Widget**: Circular gauge with configurable min/max/thresholds and color zones. New component
+- **Heatmap**: Time on X-axis, value buckets on Y-axis, color intensity for count. Essential for distribution/histogram metrics
+- **Histogram**: Bar chart showing value distribution. Important for latency analysis
+
+Each chart type needs:
+- A new entry in `DashboardComponentType` or `ChartType` enum
+- A rendering component in `Dashboard/Components/`
+- Configuration options in the component settings side panel
+
+**Files to modify**:
+- `Common/Types/Dashboard/Chart/ChartType.ts` (add Area, Pie, Heatmap, Histogram, Gauge)
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add Table, Gauge)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render new types)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTableComponent.tsx` (new)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardGaugeComponent.tsx` (new)
+
+### 1.2 Template Variables
+
+**Current**: No template variables. Users must create separate dashboards for each service/host/environment.
+**Target**: Drop-down variable selectors that dynamically filter all widgets.
+
+**Implementation**:
+
+- Create a `DashboardVariable` type stored in `dashboardViewConfig`:
+  - Name, label, type (query-based, custom list, text input)
+  - Query-based: runs a ClickHouse query to populate options (e.g., `SELECT DISTINCT service FROM MetricItem WHERE projectId = {pid}`)
+  - Custom list: manually defined options
+  - Multi-value selection support
+- Render variables as dropdown selectors in the dashboard toolbar
+- Variables can be referenced in metric queries as `$variable_name`
+- When a variable changes, all widgets re-query with the new value
+- Support cascading variables (variable B's query depends on variable A's value)
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardVariable.ts` (new)
+- `Common/Types/Dashboard/DashboardViewConfig.ts` (add variables array)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (render variable dropdowns)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass variable values to widgets)
+- `Common/Server/Services/MetricService.ts` (resolve variable references in queries)
+
+### 1.3 Auto-Refresh
+
+**Current**: Data goes stale after initial load.
+**Target**: Configurable auto-refresh intervals.
+
+**Implementation**:
+
+- Add auto-refresh dropdown in toolbar with options: Off, 5s, 10s, 30s, 1m, 5m, 15m
+- Store selected interval in dashboard config and URL state
+- Use `setInterval` to trigger re-fetch on all metric widgets
+- Show a subtle refresh indicator when data is being updated
+- Pause auto-refresh when the dashboard is in edit mode
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (add refresh dropdown)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (implement refresh timer)
+- `Common/Types/Dashboard/DashboardViewConfig.ts` (store refresh interval)
+
+### 1.4 Multiple Queries per Chart
+
+**Current**: Single `MetricQueryConfigData` per chart.
+**Target**: Overlay multiple metric series on a single chart for correlation.
+
+**Implementation**:
+
+- Change chart component's data source from single `MetricQueryConfigData` to `MetricQueryConfigData[]`
+- Each query gets its own alias and legend entry
+- Support formula references across queries (e.g., `a / b * 100`)
+- Y-axis: support dual Y-axes for metrics with different scales
+
+**Files to modify**:
+- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (change to array)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render multiple series)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (multi-query config UI)
+
+### 1.5 Full Markdown Support for Text Widget
+
+**Current**: Only bold, italic, underline formatting.
+**Target**: Full markdown rendering including headers, links, lists, code blocks, tables, and images.
+
+**Implementation**:
+
+- Replace the current custom formatting with a markdown renderer (e.g., `react-markdown` or `marked`)
+- Support: headers (h1-h6), links, ordered/unordered lists, code blocks with syntax highlighting, tables, images, blockquotes
+- Edit mode: raw markdown text area with preview toggle
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTextComponent.tsx` (replace with markdown renderer)
+- `Common/Utils/Dashboard/Components/DashboardTextComponent.ts` (store raw markdown)
+
+### 1.6 Threshold Lines & Color Coding
+
+**Current**: No threshold visualization.
+**Target**: Configurable warning/critical thresholds on charts with color-coded regions.
+
+**Implementation**:
+
+- Add threshold configuration to chart settings: value, label, color (default: yellow for warning, red for critical)
+- Render horizontal lines on the chart at threshold values
+- Optionally fill regions above/below thresholds with translucent color
+- For value/billboard widgets: change background color based on which threshold range the value falls in (green/yellow/red)
+
+**Files to modify**:
+- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (add thresholds config)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render threshold lines)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardValueComponent.tsx` (color coding)
+
+### 1.7 Legend Interaction (Show/Hide Series)
+
+**Current**: Legends are display-only.
+**Target**: Click legend items to toggle series visibility.
+
+**Implementation**:
+
+- Add click handler on legend items to toggle series visibility
+- Clicked-off series should be visually dimmed in the legend and removed from the chart
+- Support "isolate" mode: Ctrl+Click shows only that series and hides all others
+- Persist toggled state during the session (reset on page reload)
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add legend click handlers)
+
+### 1.8 Chart Zoom (Click-Drag Time Selection)
+
+**Current**: No zoom capability.
+**Target**: Click and drag on a time series chart to zoom into a time range.
+
+**Implementation**:
+
+- Enable brush/selection mode on time series charts
+- When user drags to select a range, update the global time range to the selected range
+- Show a "Reset zoom" button to return to the previous time range
+- Maintain a zoom stack so users can zoom in multiple times and zoom back out
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add brush selection)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (handle time range updates from zoom)
+
+---
+
+## Phase 2: Observability Integration (P0-P1) — Leverage the Full Platform
+
+This is where OneUptime can differentiate: metrics, logs, and traces in one platform.
+
+### 2.1 Log Stream Widget
+
+**Current**: Dashboards can only show metrics.
+**Target**: Widget that displays a live log stream with filtering.
+
+**Implementation**:
+
+- New `DashboardComponentType.LogStream` widget type
+- Configuration: log query filter, severity filter, service filter, max rows
+- Renders as a scrolling log list with severity color coding, timestamp, and body
+- Click a log entry to expand and see full details
+- Respects dashboard time range and template variables
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add LogStream)
+- `Common/Utils/Dashboard/Components/DashboardLogStreamComponent.ts` (new - config)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardLogStreamComponent.tsx` (new - rendering)
+
+### 2.2 Trace List Widget
+
+**Current**: No trace visualization in dashboards.
+**Target**: Widget showing a filtered trace list with duration and status.
+
+**Implementation**:
+
+- New `DashboardComponentType.TraceList` widget type
+- Configuration: service filter, operation filter, status filter, min duration
+- Renders as a table: trace ID, operation, service, duration, status, timestamp
+- Click a row to navigate to the full trace view
+- Respects dashboard time range and template variables
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add TraceList)
+- `Common/Utils/Dashboard/Components/DashboardTraceListComponent.ts` (new)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTraceListComponent.tsx` (new)
+
+### 2.3 Click-to-Correlate Across Signals
+
+**Current**: No cross-signal correlation in dashboards.
+**Target**: Click a point on a metric chart to instantly see related logs and traces from that timestamp.
+
+**Implementation**:
+
+- When clicking a data point on a metric chart, open a correlation panel showing:
+  - Logs from the same service and time window (+/- 5 minutes around the clicked point)
+  - Traces from the same service and time window
+  - Filtered by the same template variables
+- The correlation panel appears as a slide-over or split view below the chart
+- This is a major differentiator vs Grafana (which requires separate datasources) and ties into OneUptime's all-in-one advantage
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add click handler)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/CorrelationPanel.tsx` (new - shows correlated logs/traces)
+
+### 2.4 Annotations / Event Overlays
+
+**Current**: No event markers on charts.
+**Target**: Show deployment events, incidents, and alerts as vertical markers on time series charts.
+
+**Implementation**:
+
+- Query OneUptime's own data for events in the chart's time range:
+  - Incidents (from Incident model)
+  - Deployments (can be sent as OTLP resource attributes or a custom event API)
+  - Alert triggers (from monitor alert history)
+- Render as vertical dashed lines with icons on hover
+- Color-code by type: red for incidents, blue for deployments, yellow for alerts
+- Allow users to add manual annotations (text + timestamp)
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardAnnotation.ts` (new)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render annotation markers)
+- `Common/Server/API/DashboardAnnotationAPI.ts` (new - query events)
+
+### 2.5 Alert Integration
+
+**Current**: No connection between dashboards and alerting.
+**Target**: Create alerts from dashboard panels and display alert state on panels.
+
+**Implementation**:
+
+- "Create Alert" button in chart settings that pre-fills a metric monitor with the chart's query
+- Show alert state indicator on chart headers (green/yellow/red dot) based on associated monitor status
+- Alert status widget: shows a summary of all active alerts with severity and duration
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (add "Create Alert" button)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (show alert state)
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add AlertStatus widget type)
+
+### 2.6 SLO/SLI Widget
+
+**Current**: No SLO visualization.
+**Target**: Dedicated widget showing SLO status, error budget burn rate, and remaining budget.
+
+**Implementation** (depends on Metrics roadmap Phase 3.2 - SLO/SLI Tracking):
+
+- New `DashboardComponentType.SLO` widget type
+- Configuration: select an SLO definition
+- Displays: current attainment (%), target (%), error budget remaining (%), burn rate chart
+- Color-coded: green (healthy), yellow (burning fast), red (budget exhausted)
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add SLO)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSLOComponent.tsx` (new)
+
+---
+
+## Phase 3: Collaboration & Sharing (P1) — Production Workflows
+
+### 3.1 Public/Shared Dashboards
+
+**Current**: Dashboards require login.
+**Target**: Share dashboards with external stakeholders without requiring authentication.
+
+**Implementation**:
+
+- Add `isPublic` flag and `publicAccessToken` to Dashboard model
+- Generate a shareable URL with token: `/public/dashboard/{token}`
+- Public view is read-only with no editing controls
+- Option to restrict public access to specific IP ranges
+- Render without the OneUptime navigation chrome
+
+**Files to modify**:
+- `Common/Models/DatabaseModels/Dashboard.ts` (add isPublic, publicAccessToken)
+- `App/FeatureSet/Dashboard/src/Pages/Public/Dashboard.tsx` (new - public dashboard view)
+
+### 3.2 JSON Import/Export
+
+**Current**: No import/export capability.
+**Target**: Export dashboards as JSON and re-import for backup, migration, and dashboard-as-code.
+
+**Implementation**:
+
+- Export: serialize `dashboardViewConfig` + metadata (name, description, variables) as a JSON file download
+- Import: upload a JSON file, validate schema, create a new dashboard from the config
+- Handle version compatibility (include a schema version in the export)
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Dashboards.tsx` (add import button)
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/Settings.tsx` (add export button)
+- `Common/Server/API/DashboardImportExportAPI.ts` (new)
+
+### 3.3 Dashboard Versioning
+
+**Current**: No change history.
+**Target**: Track changes to dashboards over time with the ability to view history and revert.
+
+**Implementation**:
+
+- Create `DashboardVersion` model: dashboardId, version number, config snapshot, changedBy, timestamp
+- On each save, create a new version entry
+- UI: "Version History" in settings showing a list of versions with timestamps and authors
+- "Restore" button to revert to a previous version
+- Optional: diff view comparing two versions
+
+**Files to modify**:
+- `Common/Models/DatabaseModels/DashboardVersion.ts` (new)
+- `Common/Server/Services/DashboardService.ts` (create version on save)
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/VersionHistory.tsx` (new)
+
+### 3.4 Row/Section Grouping
+
+**Current**: Components placed freely with no grouping.
+**Target**: Collapsible rows/sections for organizing related panels.
+
+**Implementation**:
+
+- Add a "Section" component type that acts as a collapsible container
+- Section has a title bar that can be clicked to collapse/expand
+- When collapsed, hides all components within the section's vertical range
+- Sections can be nested one level deep
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add Section)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSectionComponent.tsx` (new)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/Index.tsx` (handle section collapse)
+
+### 3.5 TV/Kiosk Mode
+
+**Current**: Full-screen only.
+**Target**: Dedicated kiosk mode optimized for wall-mounted monitors with auto-cycling.
+
+**Implementation**:
+
+- Kiosk mode: hides all chrome (toolbar, navigation, URL bar), shows only the dashboard grid
+- Auto-cycle: rotate through a list of dashboards at a configurable interval (30s, 1m, 5m)
+- Dashboard playlist: define an ordered list of dashboards to cycle through
+- Support per-dashboard display duration
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Kiosk.tsx` (new - kiosk view)
+- `Common/Models/DatabaseModels/DashboardPlaylist.ts` (new - playlist model)
+
+### 3.6 CSV Export
+
+**Current**: No data export.
+**Target**: Export chart/table data as CSV for offline analysis.
+
+**Implementation**:
+
+- Add "Export CSV" option in chart/table context menu
+- Client-side: serialize the current rendered data to CSV format
+- Include column headers, timestamps, and values
+- Trigger browser download
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add export option)
+- `Common/Utils/Dashboard/CSVExport.ts` (new - CSV serialization)
+
+### 3.7 Custom Time Range per Widget
+
+**Current**: All widgets share the global time range.
+**Target**: Individual widgets can override the global time range.
+
+**Implementation**:
+
+- Add optional `timeRangeOverride` to each component's config
+- When set, the widget uses its own time range instead of the global one
+- Show a small clock icon on widgets with custom time ranges
+- Configuration in the component settings side panel
+
+**Files to modify**:
+- `Common/Utils/Dashboard/Components/DashboardBaseComponent.ts` (add timeRangeOverride)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass per-widget time ranges)
+
+---
+
+## Phase 4: Differentiation (P2-P3) — Surpass Competition
+
+### 4.1 AI-Powered Dashboard Creation
+
+**Current**: Manual dashboard creation only.
+**Target**: Natural language dashboard creation - "Show me CPU usage by service for the last 24 hours" auto-creates the right widget.
+
+**Implementation**:
+
+- Natural language input in the "Add Widget" dialog
+- AI translates to: metric name, aggregation, group by, chart type, time range
+- Uses available MetricType metadata to match metric names
+- Preview the generated widget before adding to dashboard
+- This is a feature NO competitor has done well yet - major differentiator
+
+### 4.2 Pre-Built Dashboard Templates
+
+**Current**: No templates.
+**Target**: One-click dashboard templates for common stacks.
+
+**Implementation**:
+
+- Template library: Node.js, Python, Go, Java, Kubernetes, PostgreSQL, Redis, Nginx, MongoDB, etc.
+- Auto-detect relevant templates based on ingested telemetry data
+- "One-click create" instantiates a full dashboard from the template
+- Community template sharing (future)
+
+### 4.3 Auto-Generated Dashboards
+
+**Current**: Users must manually build dashboards.
+**Target**: When a service connects, auto-generate a relevant dashboard.
+
+**Implementation**:
+
+- On first telemetry ingest from a new service, analyze the metric names and types
+- Auto-create a service dashboard with relevant charts based on detected metrics
+- Include golden signals (latency, traffic, errors, saturation) where applicable
+- Notify the user and link to the auto-generated dashboard
+
+### 4.4 Customer-Facing Dashboards on Status Pages
+
+**Current**: Status pages and dashboards are separate.
+**Target**: Embed dashboard widgets on status pages for real-time performance visibility.
+
+**Implementation**:
+
+- Allow selecting specific dashboard widgets to embed on a status page
+- Render widgets in read-only mode without internal navigation
+- Respect public/private data boundaries (only show metrics the customer should see)
+- This is unique to OneUptime - no competitor has integrated observability dashboards with status pages
+
+### 4.5 Dashboard-as-Code SDK
+
+**Current**: No programmatic dashboard creation.
+**Target**: TypeScript SDK for defining dashboards as code.
+
+**Implementation**:
+
+```typescript
+const dashboard = new Dashboard("Service Health")
+  .addVariable("service", { type: "query", query: "SELECT DISTINCT service FROM MetricItem" })
+  .addRow("Latency")
+  .addChart({ metric: "http.server.duration", aggregation: "p99", groupBy: ["$service"] })
+  .addChart({ metric: "http.server.duration", aggregation: "p50", groupBy: ["$service"] })
+  .addRow("Throughput")
+  .addChart({ metric: "http.server.request.count", aggregation: "rate", groupBy: ["$service"] })
+```
+
+- SDK generates the JSON config and uses the Dashboard API to create/update
+- Git-based provisioning: store dashboard definitions in repo, CI/CD syncs to OneUptime
+
+### 4.6 Anomaly Detection Overlays
+
+**Current**: No anomaly visualization.
+**Target**: AI highlights anomalous data points on charts without manual threshold configuration.
+
+**Implementation** (depends on Metrics roadmap Phase 3.1 - Anomaly Detection):
+
+- Automatically overlay expected range bands (baseline +/- N sigma) on metric charts
+- Highlight data points outside the expected range with color indicators
+- Click an anomaly to see correlated changes across metrics, logs, and traces
+
+---
+
+## Quick Wins (Can Ship This Week)
+
+1. **Auto-refresh** - Add a simple `setInterval` refresh with dropdown selector in toolbar
+2. **Full markdown for text widget** - Replace custom formatting with a markdown renderer
+3. **Legend show/hide** - Add click handler on legend items to toggle series
+4. **Stacked area chart** - Simple extension of existing line chart with fill
+5. **Chart zoom** - Enable brush selection on time series charts
+
+---
+
+## Recommended Implementation Order
+
+1. **Quick Wins** - Auto-refresh, markdown, legend toggle, stacked area, chart zoom
+2. **Phase 1.1** - More chart types (Area, Pie, Table, Gauge)
+3. **Phase 1.2** - Template variables (highest-impact feature for dashboard usability)
+4. **Phase 1.4** - Multiple queries per chart
+5. **Phase 1.6** - Threshold lines & color coding
+6. **Phase 2.1** - Log stream widget (leverages all-in-one platform)
+7. **Phase 2.2** - Trace list widget
+8. **Phase 2.3** - Click-to-correlate (major differentiator)
+9. **Phase 2.4** - Annotations / event overlays
+10. **Phase 2.5** - Alert integration
+11. **Phase 3.1** - Public/shared dashboards
+12. **Phase 3.2** - JSON import/export
+13. **Phase 3.4** - Row/section grouping
+14. **Phase 3.5** - TV/Kiosk mode
+15. **Phase 3.3** - Dashboard versioning
+16. **Phase 2.6** - SLO widget (depends on SLO/SLI from Metrics roadmap)
+17. **Phase 4.2** - Pre-built dashboard templates
+18. **Phase 4.3** - Auto-generated dashboards
+19. **Phase 4.1** - AI-powered dashboard creation
+20. **Phase 4.4** - Customer-facing dashboards on status pages
+21. **Phase 4.5** - Dashboard-as-code SDK
+
+## Verification
+
+For each feature:
+1. Unit tests for new widget types, template variable resolution, CSV export logic
+2. Integration tests for new API endpoints (annotations, public dashboards, import/export)
+3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/dashboards`
+4. Visual regression testing for new chart types (ensure correct rendering across browsers)
+5. Performance testing: verify dashboards with 20+ widgets and auto-refresh don't cause excessive API load
+6. Test template variables with edge cases: empty results, special characters, multi-value selections
+7. Verify public dashboards don't leak private data
diff --git a/Internal/Roadmap/Logs.md b/Internal/Roadmap/Logs.md
index adb2c66e49..b41c45adea 100644
--- a/Internal/Roadmap/Logs.md
+++ b/Internal/Roadmap/Logs.md
@@ -20,125 +20,41 @@ The following features have been implemented and removed from this plan:
 - **Phase 2.2** - Log Analytics View (LogsAnalyticsView with timeseries, toplist, table charts; analytics endpoint)
 - **Phase 2.3** - Column Customization (ColumnSelector with dynamic columns from log attributes)
 - **Phase 5.8** - Store Missing OpenTelemetry Log Fields (observedTimeUnixNano, droppedAttributesCount, flags columns + ingestion + migration)
+- **Phase 3.1** - Log Context / Surrounding Logs (Context tab in LogDetailsPanel, context endpoint in TelemetryAPI)
+- **Phase 3.2** - Log Pipelines (LogPipeline + LogPipelineProcessor models, GrokParser/AttributeRemapper/SeverityRemapper/CategoryProcessor, pipeline execution service)
+- **Phase 3.3** - Drop Filters (LogDropFilter model, LogDropFilterService, dashboard UI for configuration)
+- **Phase 3.4** - Export to CSV/JSON (Export button in toolbar, LogExport utility with CSV and JSON support)
+- **Phase 4.2** - Keyboard Shortcuts (j/k navigation, Enter expand/collapse, Esc close, / focus search, Ctrl+Enter apply filters, ? help)
+- **Phase 4.3** - Sensitive Data Scrubbing (LogScrubRule model with PII patterns: Email, CreditCard, SSN, PhoneNumber, IPAddress, custom regex)
 
 ## Gap Analysis Summary
 
 | Feature | OneUptime | Datadog | New Relic | Priority |
 |---------|-----------|---------|-----------|----------|
 | Log Patterns (ML clustering) | None | Auto-clustering + Pattern Inspector | ML clustering + anomaly | **P1** |
-| Log context (surrounding logs) | None | Before/after from same host/service | Automatic via APM agent | **P2** |
-| Log Pipelines (server-side processing) | None (raw storage only) | 270+ OOTB, 14+ processor types | Grok parsing, built-in rules | **P2** |
 | Log-based Metrics | None | Count + Distribution, 15-month retention | Via NRQL | **P2** |
-| Drop Filters (pre-storage filtering) | None | Exclusion filters with sampling | Drop rules per NRQL | **P2** |
-| Export to CSV/JSON | None | CSV up to 100K rows | CSV/JSON up to 5K | **P2** |
-| Keyboard shortcuts | None | Full keyboard nav | Basic | **P3** |
-| Sensitive Data Scrubbing | None | Multi-layer (SaaS + agent + pipeline) | Auto-obfuscation + custom rules | **P3** |
 | Data retention config UI | Referenced but no UI | Multi-tier (Standard/Flex/Archive) | Partitions + Live Archives | **P3** |
 
 ---
 
-## Phase 3: Processing & Operations (P2) — Platform Capabilities
-
-### 3.1 Log Context (Surrounding Logs)
-
-**Current**: Clicking a log shows only that log's details.
-**Target**: A "Context" tab in the log detail panel showing N logs before/after from the same service.
-
-**Implementation**:
-
-- When a log is expanded, add a "Context" tab that queries ClickHouse:
-  ```sql
-  (SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time < {logTime} ORDER BY time DESC LIMIT 5)
-  UNION ALL
-  (SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time >= {logTime} ORDER BY time ASC LIMIT 6)
-  ```
-- Display as a mini log list with the current log highlighted
-- Add to `LogDetailsPanel.tsx` as a tabbed section alongside the existing body/attributes view
-
-**Files to modify**:
-- `Common/Server/API/TelemetryAPI.ts` (add context endpoint)
-- `Common/UI/Components/LogsViewer/components/LogDetailsPanel.tsx` (add tabs + context view)
-
-### 3.2 Log Pipelines (Server-Side Processing)
-
-**Current**: Logs are stored raw as received (after OTLP normalization).
-**Target**: Configurable processing pipelines that transform logs at ingest time.
-
-**Implementation**:
-
-- Create `LogPipeline` and `LogPipelineProcessor` PostgreSQL models
-- Pipeline has: name, filter (which logs it applies to), enabled flag, sort order
-- Processor types (start with these 4):
-  - **Grok Parser**: Parse body text into structured attributes using Grok patterns
-  - **Attribute Remapper**: Rename/copy one attribute to another
-  - **Severity Remapper**: Map an attribute value to the severity field
-  - **Category Processor**: Assign a new attribute value based on if/else conditions
-- Processing runs in the telemetry ingestion worker (`Telemetry/Jobs/TelemetryIngest/ProcessTelemetry.ts`) after normalization but before ClickHouse insert
-- Pipeline configuration UI under Settings > Log Pipelines
-
-**Files to modify**:
-- `Common/Models/DatabaseModels/LogPipeline.ts` (new)
-- `Common/Models/DatabaseModels/LogPipelineProcessor.ts` (new)
-- `Telemetry/Services/LogPipelineService.ts` (new - pipeline execution engine)
-- `Telemetry/Services/OtelLogsIngestService.ts` (hook pipeline execution before insert)
-- Dashboard: new Settings page for pipeline configuration
-
-### 3.3 Drop Filters (Pre-Storage Filtering)
-
-**Current**: All ingested logs are stored.
-**Target**: Configurable rules to drop or sample logs before storage.
-
-**Implementation**:
-
-- Create `LogDropFilter` PostgreSQL model: name, filter query, action (drop or sample at N%), enabled
-- Evaluate drop filters in the ingestion pipeline before ClickHouse insert
-- UI under Settings > Log Configuration > Drop Filters
-- Show estimated volume savings based on recent log volume
-
-### 3.4 Export to CSV/JSON
-
-**Current**: No export capability.
-**Target**: Download current filtered log results as CSV or JSON.
-
-**Implementation**:
-
-- Add "Export" button in the toolbar
-- Client-side: serialize current page of logs to CSV/JSON and trigger browser download
-- Server-side (for large exports): new endpoint that streams results to a downloadable file (up to 10K rows)
-
-**Files to modify**:
-- `Common/UI/Components/LogsViewer/components/LogsViewerToolbar.tsx` (add export button)
-- `Common/UI/Utils/LogExport.ts` (new - CSV/JSON serialization)
-- `Common/Server/API/TelemetryAPI.ts` (add export endpoint for large exports)
-
 ---
 
-## Phase 4: Advanced Features (P3) — Differentiation
+## Remaining Features
 
+### Log Patterns (ML Clustering) — P1
 
-### 4.2 Keyboard Shortcuts
+**Current**: No pattern detection.
+**Target**: Auto-cluster similar log messages and surface pattern groups with anomaly detection.
 
-- `j`/`k` to navigate between log rows
-- `Enter` to expand/collapse selected log
-- `Escape` to close detail panel
-- `/` to focus search bar
-- `Ctrl+Enter` to apply filters
+### Log-based Metrics — P2
 
-### 4.3 Sensitive Data Scrubbing
+**Current**: No log-to-metric conversion.
+**Target**: Create count/distribution metrics from log queries with long-term retention.
-- Auto-detect common PII patterns (credit cards, SSNs, emails) at ingest time -- Configurable scrubbing rules: mask, hash, or redact -- UI under Settings > Data Privacy +### Data Retention Config UI — P3 ---- - -## Recommended Implementation Order - -1. **Phase 3.4** - Export CSV/JSON (small effort, table-stakes feature) -2. **Phase 3.1** - Log Context (moderate effort, high debugging value) -3. **Phase 3.2** - Log Pipelines (large effort, platform capability) -4. **Phase 3.3** - Drop Filters (moderate effort, cost optimization) -5. **Phase 4.x** - Patterns, Shortcuts, Data Scrubbing (future) +**Current**: `retainTelemetryDataForDays` exists on the service model and is displayed in usage history, but there is no dedicated UI to configure retention settings. +**Target**: Settings page for configuring per-service log data retention. ## Phase 5: ClickHouse Storage & Query Optimizations (P0) — Performance Foundation @@ -205,18 +121,22 @@ These optimizations address fundamental storage and indexing gaps in the telemet | 5.3 DateTime64 time column | Sub-second log ordering | Correctness fix | Medium | | 5.7 Histogram projections | Histogram and severity aggregation | 5-10x | Medium | -### 5.x Recommended Remaining Order +--- + +## Recommended Remaining Implementation Order 1. **5.3** — DateTime64 upgrade (correctness) 2. **5.7** — Projections (performance polish) +3. **Log-based Metrics** (platform capability) +4. **Data Retention Config UI** (operational) +5. **Log Patterns / ML Clustering** (advanced, larger effort) --- ## Verification -For each feature: -1. Unit tests for new parsers/utilities (LogQueryParser, CSV export, etc.) -2. Integration tests for new API endpoints (histogram, facets, analytics, context) +For each remaining feature: +1. Unit tests for new utilities +2. Integration tests for new API endpoints 3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/logs` 4. 
Check ClickHouse query performance with `EXPLAIN` for new aggregation queries -5. Verify real-time/live mode still works correctly with new UI components diff --git a/Internal/Roadmap/Metrics.md b/Internal/Roadmap/Metrics.md new file mode 100644 index 0000000000..9d653929b3 --- /dev/null +++ b/Internal/Roadmap/Metrics.md @@ -0,0 +1,468 @@ +# Plan: Bring OneUptime Metrics to Industry Parity and Beyond + +## Context + +OneUptime's metrics implementation provides OTLP ingestion (HTTP and gRPC), ClickHouse storage with support for Gauge, Sum, Histogram, and ExponentialHistogram metric types, basic aggregations (Avg, Sum, Min, Max, Count), single-attribute GROUP BY, formula support for calculated metrics, threshold-based metric monitors, and a metric explorer with line/bar charts. Auto-discovery creates MetricType metadata (name, description, unit) on first ingest. Per-service data retention with TTL (default 15 days). + +This plan identifies the remaining gaps vs DataDog and New Relic, and proposes a phased implementation to close them and build a best-in-class metrics product. 
+ +## Completed + +The following features have been implemented: +- **OTLP Ingestion** - HTTP and gRPC metric ingestion with async queue-based batch processing +- **Metric Types** - Gauge, Sum, Histogram, ExponentialHistogram support +- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL +- **Aggregations** - Avg, Sum, Min, Max, Count +- **Single-Attribute GROUP BY** - Group by one attribute at a time +- **Formulas** - Calculated metrics using aliases (e.g., `a / b * 100`) +- **Metric Explorer** - Time range selection, multiple queries with aliases, URL state persistence +- **Threshold-Based Monitors** - Static threshold alerting on aggregated metric values +- **MetricType Auto-Discovery** - Name, description, unit captured on first ingest +- **Attribute Storage** - Full JSON with extracted `attributeKeys` array for fast enumeration +- **BloomFilter index** on `name`, Set index on `serviceType` + +## Gap Analysis Summary + +| Feature | OneUptime | DataDog | New Relic | Priority | +|---------|-----------|---------|-----------|----------| +| Percentile aggregations (p50/p75/p90/p95/p99) | None | DDSketch distributions | NRQL percentile() | **P0** | +| Rate/derivative calculations | None | Native Rate type + .as_rate() | rate() NRQL function | **P0** | +| Multi-attribute GROUP BY | Single attribute only | Multiple tags | FACET on multiple attrs | **P0** | +| Rollup/downsampling for long-range queries | None (raw data, 15-day TTL) | Automatic tiered rollups | 30-day raw + 13-month rollups | **P0** | +| Anomaly detection | Static thresholds only | Watchdog + anomaly monitors | Anomaly detection + sigma bands | **P1** | +| SLO/SLI tracking | None | Metric-based + Time Slice SLOs | One-click setup + error budgets | **P1** | +| Heatmap visualization | None | Purpose-built for distributions | Built-in chart type | **P1** | +| Time-over-time comparison | None | Yes | COMPARE WITH in NRQL | **P1** | +| Summary metric type | Not 
supported | N/A (uses Distribution) | Yes | **P1** | +| Query language | Form-based UI only | Graphing editor + NLQ | Full NRQL language | **P2** | +| Predictive alerting | None | Watchdog forecasting | GA predictive alerting | **P2** | +| Metric correlations | None | Auto-surfaces related metrics | Applied Intelligence correlation | **P2** | +| Golden Signals dashboards | None | Available via APM | Pre-built with default alerts | **P2** | +| Cardinality management | None | Metrics Without Limits + Explorer | Budget system + pruning rules | **P2** | +| More chart types | Line and bar only | 12+ types | 10+ types with conditional coloring | **P2** | +| Dashboard templates | None | Pre-built integration dashboards | Pre-built entity dashboards | **P2** | +| Units on charts | Stored but not rendered | Auto-formatted by unit type | Y-axis unit customization | **P2** | +| Natural language querying | None | NLQ translates English to queries | None | **P3** | +| Metric cost/volume management | None | Cost attribution dashboards | Volume dashboards | **P3** | + +--- + +## Phase 1: Foundation (P0) — Close Critical Gaps + +These are table-stakes features without which the metrics product is fundamentally limited. + +### 1.1 Percentile Aggregations (p50, p75, p90, p95, p99) + +**Current**: Only Avg, Sum, Min, Max, Count aggregations. +**Target**: Support percentile aggregations on all metric data, especially histograms and distributions. + +**Implementation**: + +- Add `P50`, `P75`, `P90`, `P95`, `P99` to the `AggregationType` enum +- For raw metric values: use ClickHouse `quantile(0.50)(value)`, `quantile(0.95)(value)`, etc. 
+- For histogram data (with `bucketCounts` and `explicitBounds`): implement approximate percentile calculation from bucket data using linear interpolation between bucket boundaries +- Update the metric query builder to include percentile options in the aggregation dropdown +- Update chart rendering to display percentile series + +**Files to modify**: +- `Common/Types/BaseDatabase/AggregationType.ts` (add P50, P75, P90, P95, P99) +- `Common/Server/Services/MetricService.ts` (generate quantile SQL) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (add to dropdown) + +### 1.2 Rate/Derivative Calculations + +**Current**: No rate or delta computation. Raw cumulative counters are meaningless without rate calculation. +**Target**: Compute per-second rates and deltas from counter/sum metrics. + +**Implementation**: + +- Add `Rate` and `Delta` as aggregation options +- For cumulative sums: compute `(value_t - value_t-1) / (time_t - time_t-1)` using ClickHouse `runningDifference()` +- Handle counter resets (when value decreases, treat as reset and skip that interval) +- For delta temporality sums: rate is simply `value / interval_seconds` +- Display rate with appropriate units (e.g., "req/s", "bytes/s") + +**Files to modify**: +- `Common/Types/BaseDatabase/AggregationType.ts` (add Rate, Delta) +- `Common/Server/Services/MetricService.ts` (generate rate SQL with runningDifference) +- `Common/Types/Metrics/MetricsQuery.ts` (support rate in query config) + +### 1.3 Multi-Attribute GROUP BY + +**Current**: Single `groupByAttribute: string` field. +**Target**: Group by multiple attributes simultaneously (e.g., by region AND status_code). 
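
A minimal sketch of what grouping by several attributes means for the result shape. It assumes a hypothetical in-memory `MetricPoint` with a flat `attributes` record; the real grouping would happen in the ClickHouse query. The names `compositeKey` and `groupByAttributes` are illustrative and are not existing OneUptime APIs.

```typescript
// Hypothetical shape for illustration; not the actual OneUptime model.
interface MetricPoint {
  value: number;
  attributes: Record<string, string>;
}

// Build a composite legend label such as "region=us, status_code=200".
// A missing attribute is rendered as "(none)" so every point lands in a group.
function compositeKey(point: MetricPoint, groupBy: string[]): string {
  return groupBy
    .map((attr) => `${attr}=${point.attributes[attr] ?? "(none)"}`)
    .join(", ");
}

// Average `value` per distinct combination of the group-by attributes.
function groupByAttributes(
  points: MetricPoint[],
  groupBy: string[],
): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const point of points) {
    const key = compositeKey(point, groupBy);
    const entry = totals[key] ?? { sum: 0, count: 0 };
    entry.sum += point.value;
    entry.count += 1;
    totals[key] = entry;
  }
  const averages: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    const entry = totals[key];
    if (entry) {
      averages[key] = entry.sum / entry.count;
    }
  }
  return averages;
}
```

Each key of the returned record doubles as a composite legend label, which is one possible way to keep chart series distinguishable once more than one attribute is selected.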
+ +**Implementation**: + +- Change `groupByAttribute` from `string` to `string[]` in `MetricsQuery` +- Update ClickHouse query generation to GROUP BY multiple extracted JSON attributes +- Update chart rendering to handle multi-dimensional grouping (composite legend labels) +- Update the UI to allow selecting multiple group-by attributes + +**Files to modify**: +- `Common/Types/Metrics/MetricsQuery.ts` (change type) +- `Common/Server/Services/MetricService.ts` (update query generation) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (multi-select UI) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (composite legends) + +### 1.4 Rollups / Downsampling + +**Current**: Raw data only with 15-day default TTL. No rollups means long-range queries are slow and historical analysis is limited. +**Target**: Pre-aggregated rollups at multiple resolutions with extended retention. + +**Implementation**: + +- Create ClickHouse materialized views for automatic rollup: + ``` + Raw Data (1s resolution) -> 15-day retention + |-> Materialized View -> 1-min rollups -> 90-day retention + |-> Materialized View -> 1-hour rollups -> 13-month retention + |-> Materialized View -> 1-day rollups -> 3-year retention + ``` +- Each rollup table stores: min, max, sum, count, avg, and quantile sketches per metric name + attributes +- Route queries based on time range: + - < 6 hours: raw data + - 6 hours - 7 days: 1-min rollups + - 7 days - 30 days: 1-hour rollups + - 30+ days: 1-day rollups +- Automatic query routing in the metric service layer + +**Files to modify**: +- `Common/Models/AnalyticsModels/MetricRollup1Min.ts` (new) +- `Common/Models/AnalyticsModels/MetricRollup1Hour.ts` (new) +- `Common/Models/AnalyticsModels/MetricRollup1Day.ts` (new) +- `Common/Server/Services/MetricService.ts` (query routing by time range) +- `Worker/DataMigrations/` (new migration to create materialized views) + +--- + +## Phase 2: Visualization & UX (P1) — Match 
Industry Standard + +### 2.1 More Chart Types + +**Current**: Line and bar charts only. +**Target**: Add Heatmap, Stacked Area, Pie/Donut, Scatter, Single-Value Billboard, and Gauge. + +**Implementation**: + +- **Heatmap**: Essential for histogram/distribution data. Use a heatmap library that renders time on X-axis, bucket values on Y-axis, and color intensity for count +- **Stacked Area**: Extension of existing line chart with fill and stacking +- **Pie/Donut**: For showing proportional breakdowns (e.g., request distribution by service) +- **Scatter**: For correlation analysis between two metrics +- **Billboard**: Large single-value display with configurable thresholds for color coding (green/yellow/red) +- **Gauge**: Circular gauge showing a value against a min/max range + +**Files to modify**: +- `Common/Types/Dashboard/Chart/ChartType.ts` (add new chart types) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render new chart types) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricCharts.tsx` (chart type selector) + +### 2.2 Time-Over-Time Comparison + +**Current**: No comparison capability. +**Target**: Overlay current metric data with data from a previous period (1h ago, 1d ago, 1w ago). + +**Implementation**: + +- Add a "Compare with" dropdown in the metric explorer toolbar (options: 1 hour ago, 1 day ago, 1 week ago, custom) +- Execute the same query twice with shifted time ranges +- Render the comparison series as a dashed/translucent overlay on the same chart +- Show the delta (absolute and percentage) in tooltips + +**Files to modify**: +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricExplorer.tsx` (add compare dropdown) +- `Common/Types/Metrics/MetricsQuery.ts` (add compareWith field) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render comparison series) + +### 2.3 Render Metric Units on Charts + +**Current**: Units stored in MetricType but not rendered on chart axes. 
+**Target**: Display units on Y-axis labels and tooltips with smart formatting. + +**Implementation**: + +- Pass `MetricType.unit` through to chart rendering +- Implement unit-aware formatting: + - Bytes: auto-convert to KB/MB/GB/TB + - Duration: auto-convert ns/us/ms/s + - Percentage: append `%` + - Rate: append `/s` +- Display formatted unit on Y-axis label and in tooltip values + +**Files to modify**: +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (Y-axis unit formatting) +- `Common/Utils/Metrics/UnitFormatter.ts` (new - unit formatting logic) + +### 2.4 Dashboard Templates + +**Current**: No templates. +**Target**: Pre-built dashboards for common scenarios that auto-populate based on detected metrics. + +**Implementation**: + +- Create MetricsViewConfig templates for: + - HTTP Service Health (request rate, error rate, latency percentiles) + - Database Performance (query duration, connection pool, error rate) + - Kubernetes Metrics (CPU, memory, pod restarts, network) + - Host Metrics (CPU, memory, disk, network) + - Runtime Metrics (GC, heap, threads - per language) +- Auto-detect which templates are relevant based on ingested metric names +- "One-click apply" creates a dashboard from the template + +**Files to modify**: +- `Common/Types/Metrics/DashboardTemplates/` (new directory with template definitions) +- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (new - template gallery) + +### 2.5 Summary Metric Type Support + +**Current**: Summary type not supported. +**Target**: Ingest and store Summary metrics from OTLP. 
+ +**Implementation**: + +- Add `Summary` to the metric point type enum +- Store quantile values from summary data points +- Display summary quantiles in the metric explorer + +**Files to modify**: +- `Telemetry/Services/OtelMetricsIngestService.ts` (handle summary type) +- `Common/Models/AnalyticsModels/Metric.ts` (add summary-specific columns if needed) + +--- + +## Phase 3: Alerting & Intelligence (P1-P2) — Smart Monitoring + +### 3.1 Anomaly Detection + +**Current**: Static threshold alerting only. +**Target**: Detect metrics deviating from expected patterns using statistical methods. + +**Implementation**: + +- Start with rolling mean + N standard deviations (configurable sensitivity: low/medium/high) +- Account for daily/weekly seasonality by comparing to same-time-last-week baselines +- Store baselines in ClickHouse (periodic computation job, hourly) +- Baseline table: metric name, service, hour_of_week, mean, stddev +- On each evaluation: compare current value to baseline, alert if deviation > configured sigma +- Surface anomalies as visual highlights on metric charts (shaded band showing expected range) + +**Files to modify**: +- `Common/Models/AnalyticsModels/MetricBaseline.ts` (new - baseline storage) +- `Worker/Jobs/Metrics/ComputeMetricBaselines.ts` (new - periodic baseline computation) +- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add anomaly detection) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render anomaly bands) + +### 3.2 SLO/SLI Tracking + +**Current**: No SLO support. +**Target**: Define Service Level Objectives based on metric queries, track attainment over rolling windows, calculate error budgets. 
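
The attainment and error-budget arithmetic behind this target can be sketched as follows. This is an illustration under stated assumptions, not the planned `SLOService`: `evaluateSlo` and `SloStatus` are hypothetical names, and the good and total event counts are assumed to come from the two SLI metric queries over a single window.

```typescript
// Hypothetical result shape for illustration only.
interface SloStatus {
  attainment: number; // good / total, in [0, 1]
  burnRate: number; // observed error rate divided by the allowed error rate
  budgetRemaining: number; // fraction of the error budget left; negative when overspent
}

// `target` is a fraction, e.g. 0.999 for a 99.9% objective.
function evaluateSlo(
  goodEvents: number,
  totalEvents: number,
  target: number,
): SloStatus {
  if (target <= 0 || target >= 1) {
    throw new Error("target must be strictly between 0 and 1");
  }
  if (totalEvents <= 0) {
    // No traffic in the window: nothing has been spent.
    return { attainment: 1, burnRate: 0, budgetRemaining: 1 };
  }
  const attainment = goodEvents / totalEvents;
  const allowedErrorRate = 1 - target; // the error budget as a rate
  const observedErrorRate = 1 - attainment;
  // Over one window, burn rate equals the fraction of budget consumed.
  const burnRate = observedErrorRate / allowedErrorRate;
  return { attainment, burnRate, budgetRemaining: 1 - burnRate };
}
```

With 9,985 good events out of 10,000 against a 99.9% target, the observed error rate (0.15%) is 1.5 times the allowed rate (0.1%), so the budget is overspent by about half. A production evaluator would typically compute burn rate over a shorter window than the one used for the remaining budget.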
+ +**Implementation**: + +- Create `SLO` PostgreSQL model: + - Name, description, target percentage (e.g., 99.9%) + - SLI definition: good events query / total events query (both metric queries) + - Time window: 7-day, 28-day, or 30-day rolling + - Alert thresholds: error budget remaining %, burn rate +- SLO dashboard page showing: + - Current attainment vs target (e.g., 99.85% / 99.9%) + - Error budget remaining (absolute and percentage) + - Burn rate chart (current burn rate vs sustainable burn rate) + - SLI time series chart +- Alert when error budget drops below threshold or burn rate exceeds sustainable rate +- Integrate with existing monitor/incident system + +**Files to modify**: +- `Common/Models/DatabaseModels/SLO.ts` (new) +- `Common/Server/Services/SLOService.ts` (new - SLI computation, budget calculation) +- `Worker/Jobs/SLO/EvaluateSLOs.ts` (new - periodic SLO evaluation) +- `App/FeatureSet/Dashboard/src/Pages/SLO/` (new - SLO list, detail, creation pages) + +### 3.3 Metric Correlations + +**Current**: No correlation capability. +**Target**: When an anomaly is detected, automatically identify other metrics that changed around the same time. + +**Implementation**: + +- When an anomaly is detected on a metric, query all metrics for the same service/project in the surrounding time window (e.g., +/- 30 minutes) +- Compute Pearson correlation coefficient between the anomalous metric and each candidate +- Rank by absolute correlation value +- Surface top 5-10 correlated metrics in the alert/incident view +- Show correlation chart: anomalous metric overlaid with top correlated metrics + +**Files to modify**: +- `Common/Server/Services/MetricCorrelationService.ts` (new) +- `App/FeatureSet/Dashboard/src/Components/Metrics/CorrelatedMetrics.tsx` (new - correlation view) + +--- + +## Phase 4: Scale & Power Features (P2-P3) — Differentiation + +### 4.1 Cardinality Management + +**Current**: No cardinality visibility or controls. 
+**Target**: Track unique series count, alert on spikes, allow attribute allowlist/blocklist. + +**Implementation**: + +- Track unique series count per metric name (via periodic ClickHouse `uniq()` queries) +- Store in a dedicated cardinality tracking table +- Dashboard showing: top metrics by cardinality, cardinality trend over time, per-attribute breakdown +- Allow configuring attribute allowlists/blocklists per metric (applied at ingest time) +- Alert when cardinality exceeds configured budget + +**Files to modify**: +- `Worker/Jobs/Metrics/TrackMetricCardinality.ts` (new - periodic cardinality computation) +- `Common/Models/DatabaseModels/MetricCardinalityConfig.ts` (new - allowlist/blocklist) +- `Telemetry/Services/OtelMetricsIngestService.ts` (apply attribute filtering) +- `App/FeatureSet/Dashboard/src/Pages/Settings/MetricCardinality.tsx` (new - cardinality dashboard) + +### 4.2 Query Language + +**Current**: Form-based UI only. +**Target**: Text-based metrics query language inspired by PromQL/NRQL for advanced users. + +**Implementation**: + +- Define a grammar supporting: + ``` + metric_name{attribute="value", attribute2=~"regex"} + | aggregation(duration) + by (attribute1, attribute2) + ``` +- Build a parser that translates to the existing ClickHouse query builder +- Offer both UI builder and text modes (toggle like New Relic's basic/advanced) +- Syntax highlighting and autocomplete in the text editor (metric names, attribute keys, functions) +- Functions: `rate()`, `delta()`, `avg()`, `sum()`, `min()`, `max()`, `p50()`, `p95()`, `p99()`, `count()`, `topk()`, `bottomk()` + +**Files to modify**: +- `Common/Utils/Metrics/MetricsQueryLanguage.ts` (new - parser and translator) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryEditor.tsx` (new - text editor with autocomplete) + +### 4.3 Golden Signals Dashboards + +**Current**: No auto-generated dashboards. 
+**Target**: Auto-generated dashboards showing Latency, Traffic, Errors, Saturation for each service. + +**Implementation**: + +- Detect common OpenTelemetry metric names per service: + - Latency: `http.server.duration`, `http.server.request.duration` + - Traffic: `http.server.request.count`, `http.server.active_requests` + - Errors: `http.server.request.count` where status_code >= 500 + - Saturation: `process.runtime.*.memory`, `system.cpu.utilization` +- Auto-create a Golden Signals dashboard for each service with detected metrics +- Include default alert thresholds + +**Files to modify**: +- `Worker/Jobs/Metrics/GenerateGoldenSignalsDashboards.ts` (new) +- `Common/Utils/Metrics/GoldenSignalsDetector.ts` (new - metric name pattern matching) + +### 4.4 Predictive Alerting + +**Current**: No forecasting capability. +**Target**: Forecast metric values and alert before thresholds are breached. + +**Implementation**: + +- Use linear regression or Holt-Winters on recent data to project forward +- Alert if projected value crosses threshold within configurable forecast window (e.g., "disk full in 4 hours") +- Particularly valuable for capacity planning metrics (disk, memory, connection pools) +- Show forecast as a dashed line extension on metric charts + +**Files to modify**: +- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add predictive evaluation) +- `Common/Utils/Metrics/Forecasting.ts` (new - regression/Holt-Winters) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render forecast line) + +--- + +## ClickHouse Storage Improvements + +### S.1 Fix Sort Key Order (CRITICAL) + +**Current**: Sort key is `(projectId, time, serviceId)`. +**Target**: Change to `(projectId, name, serviceId, time)`. + +**Impact**: ~100x improvement for name-filtered queries. Virtually every metric query filters by `name`, but currently ClickHouse must scan all metric names within the time range. 
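
The effect of the key order can be illustrated with a deliberately simplified model. It assumes the engine can only narrow its scan by a leading prefix of the sort key, and it ignores granules, skip indexes, and the partial pruning ClickHouse can still do on later key columns. The types and function names are hypothetical, and the row counts it produces are not measurements.

```typescript
// Hypothetical row and filter shapes for illustration only.
interface MetricRow {
  projectId: string;
  metricName: string;
  serviceId: string;
  time: number;
}

interface MetricFilter {
  projectId: string;
  metricName: string;
  from: number;
  to: number;
}

// Key (projectId, time, serviceId): the prefix narrows by project and time,
// so rows for every metric name in the range are still read.
function rowsReadWithTimeFirstKey(rows: MetricRow[], f: MetricFilter): number {
  return rows.filter(
    (r) => r.projectId === f.projectId && r.time >= f.from && r.time <= f.to,
  ).length;
}

// Key (projectId, name, serviceId, time): the prefix narrows by project and
// name. This is an upper bound, since time can prune further in practice.
function rowsReadWithNameFirstKey(rows: MetricRow[], f: MetricFilter): number {
  return rows.filter(
    (r) => r.projectId === f.projectId && r.metricName === f.metricName,
  ).length;
}
```

In this model the ratio between the two counts grows with the number of distinct metric names in the queried time range, which is the intuition behind the improvement estimate above.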
+ +**Migration**: Requires creating `MetricItem_v2` with new sort key and migrating data (ClickHouse doesn't support `ALTER TABLE MODIFY ORDER BY`). + +**Files to modify**: +- `Common/Models/AnalyticsModels/Metric.ts` (change sort key) +- `Worker/DataMigrations/` (new migration - create v2 table, backfill, swap) + +### S.2 Upgrade time to DateTime64 (HIGH) + +**Current**: `DateTime` with second precision. +**Target**: `DateTime64(3)` or `DateTime64(6)` for sub-second precision. + +**Impact**: Correct sub-second metric ordering. Removes need for separate `timeUnixNano`/`startTimeUnixNano` columns. + +**Files to modify**: +- `Common/Models/AnalyticsModels/Metric.ts` (change column type) +- `Common/Types/AnalyticsDatabase/TableColumnType.ts` (add DateTime64 type if not present) +- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle DateTime64) +- `Worker/DataMigrations/` (migration) + +### S.3 Add Skip Index on metricPointType (MEDIUM) + +**Current**: No index support for metric type filtering. +**Target**: Set skip index on `metricPointType`. + +**Files to modify**: +- `Common/Models/AnalyticsModels/Metric.ts` (add skip index) + +### S.4 Evaluate Map Type for Attributes (MEDIUM) + +**Current**: Attributes stored as JSON. +**Target**: Evaluate `Map(LowCardinality(String), String)` for faster attribute-based filtering. + +### S.5 Upgrade count/bucketCounts to Int64 (LOW) + +**Current**: `Int32` for count and `Array(Int32)` for bucketCounts. +**Target**: `Int64` / `Array(Int64)` to prevent overflow in high-throughput systems. + +--- + +## Quick Wins (Can Ship This Week) + +1. **Display units on chart Y-axes** - Data exists in MetricType, just needs wiring to chart rendering +2. **Add p50/p95/p99 to aggregation dropdown** - ClickHouse `quantile()` is straightforward to add +3. **Extend default retention** - 15 days is too short; increase default to 30 days +4. 
**Multi-attribute GROUP BY** - Change `groupByAttribute: string` to `groupByAttribute: string[]` +5. **Add stacked area chart type** - Simple extension of existing line chart +6. **Add skip index on metricPointType** - Low effort, faster type-filtered queries + +--- + +## Recommended Implementation Order + +1. **Quick Wins** - Ship units on charts, p50/p95/p99, multi-attribute GROUP BY, stacked area +2. **Phase 1.1** - Percentile aggregations (full implementation beyond quick win) +3. **Phase 1.2** - Rate/derivative calculations +4. **S.1** - Fix sort key order (critical performance improvement) +5. **Phase 1.4** - Rollups/downsampling (enables long-range queries) +6. **Phase 2.1** - More chart types (heatmap, pie, gauge, billboard) +7. **Phase 2.2** - Time-over-time comparison +8. **Phase 1.3** - Multi-attribute GROUP BY (full implementation) +9. **S.2** - Upgrade time to DateTime64 +10. **Phase 3.1** - Anomaly detection +11. **Phase 3.2** - SLO/SLI tracking +12. **Phase 2.4** - Dashboard templates +13. **Phase 4.1** - Cardinality management +14. **Phase 4.2** - Query language +15. **Phase 4.3** - Golden Signals dashboards +16. **Phase 4.4** - Predictive alerting +17. **Phase 3.3** - Metric correlations + +## Verification + +For each feature: +1. Unit tests for new aggregation types, rate calculations, unit formatting, query language parser +2. Integration tests for new API endpoints (percentiles, rollup queries, SLO evaluation) +3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/metrics` +4. Check ClickHouse query performance with `EXPLAIN` for new query patterns +5. Verify rollup accuracy by comparing rollup results to raw data results for overlapping time ranges +6. 
Load test cardinality tracking and anomaly detection jobs to ensure they don't impact ingestion diff --git a/Internal/Roadmap/Traces.md b/Internal/Roadmap/Traces.md new file mode 100644 index 0000000000..4afb88eb6a --- /dev/null +++ b/Internal/Roadmap/Traces.md @@ -0,0 +1,412 @@ +# Plan: Bring OneUptime Traces to Industry Parity and Beyond + +## Context + +OneUptime's trace implementation provides OTLP-native ingestion (HTTP and gRPC), ClickHouse storage with a full OpenTelemetry span model (events, links, status, attributes, resources, scope), a Gantt/waterfall visualization, trace-to-log and trace-to-exception correlation, a basic service dependency graph, queue-based async ingestion, and per-service data retention with TTL. ClickHouse schema has been optimized with BloomFilter indexes on traceId/spanId/parentSpanId, Set indexes on statusCode/kind/hasException, TokenBF on name, and ZSTD compression on key columns. + +This plan identifies the remaining gaps vs DataDog, NewRelic, Honeycomb, and Grafana Tempo, and proposes a phased implementation to close them and surpass competition. 
+ +## Completed + +The following features have been implemented: +- **OTLP Ingestion** - HTTP and gRPC trace ingestion with async queue-based processing +- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL +- **Gantt/Waterfall View** - Hierarchical span visualization with color-coded services, time-unit auto-scaling, error indicators +- **Trace-to-Log Correlation** - Log model has traceId/spanId columns; SpanViewer shows associated logs +- **Trace-to-Exception Correlation** - ExceptionInstance model links to traceId/spanId with stack trace parsing and fingerprinting +- **Span Detail Panel** - Side-over with tabs for Basic Info, Logs, Attributes, Events, Exceptions +- **BloomFilter indexes** on traceId, spanId, parentSpanId +- **Set indexes** on statusCode, kind, hasException +- **TokenBF index** on name +- **ZSTD compression** on time/ID/attribute columns +- **hasException boolean column** for fast error span filtering +- **links default value** corrected to `[]` + +## Gap Analysis Summary + +| Feature | OneUptime | DataDog | NewRelic | Tempo/Honeycomb | Priority | +|---------|-----------|---------|----------|-----------------|----------| +| Trace analytics / aggregation engine | None | Trace Explorer with COUNT/percentiles | NRQL on span data | TraceQL rate/count/quantile | **P0** | +| RED metrics from traces | None | Auto-computed on 100% traffic | Derived golden signals | Metrics-generator to Prometheus | **P0** | +| Trace-based alerting | None | APM Monitors (p50-p99, error rate, Apdex) | NRQL alert conditions | Via Grafana alerting / Triggers | **P0** | +| Sampling controls | None (100% ingestion) | Head-based adaptive + retention filters | Infinite Tracing (tail-based) | Refinery (rules/dynamic/tail) | **P0** | +| Flame graph view | None | Yes (default view) | No | No | **P1** | +| Latency breakdown / critical path | None | Per-hop latency, bottleneck detection | No | BubbleUp (Honeycomb) | **P1** | +| In-trace 
search | None | Yes | No | No | **P1** | +| Per-trace service map | None | Yes (Map view) | No | No | **P1** | +| Trace-to-metric exemplars | None | Pivot from metric graph to traces | Metric-to-trace linking | Prometheus exemplars | **P1** | +| Custom metrics from spans | None | Generate count/distribution/gauge from tags | Via NRQL | SLOs from span data | **P2** | +| Structural trace queries | None | Trace Queries (multi-span relationships) | Via NRQL | TraceQL spanset pipelines | **P2** | +| Trace comparison / diffing | None | Partial | Side-by-side comparison | compare() in TraceQL | **P2** | +| AI/ML on traces | None | Watchdog (auto anomaly + RCA) | NRAI | BubbleUp (pattern detection) | **P3** | +| RUM correlation | None | Frontend-to-backend trace linking | Yes | Faro / frontend observability | **P3** | +| Continuous profiling | None | Code Hotspots (span-to-profile) | Partial | Pyroscope | **P3** | + +--- + +## Phase 1: Analytics & Alerting Foundation (P0) — Highest Impact + +Without these, users cannot answer basic questions like "is my service healthy?" from trace data. + +### 1.1 Trace Analytics / Aggregation Engine + +**Current**: Can list/filter individual spans and view individual traces. No way to aggregate or compute statistics. +**Target**: Full trace analytics supporting COUNT, AVG, SUM, MIN, MAX, P50/P75/P90/P95/P99 aggregations with GROUP BY on any span attribute and time-series bucketing. 
+ +**Implementation**: + +- Build a trace analytics API endpoint that translates query configs into ClickHouse aggregation queries +- Use ClickHouse's native functions: `quantile(0.99)(durationUnixNano)`, `countIf(statusCode = 2)`, `toStartOfInterval(startTime, INTERVAL 1 MINUTE)` +- Support GROUP BY on service, span name, kind, status, and any custom attribute (via JSON extraction) +- Frontend: Add an "Analytics" tab to the Traces page with chart types (timeseries, top list, table) similar to the existing LogsAnalyticsView +- Support switching between "List" view (current) and "Analytics" view + +**Files to modify**: +- `Common/Server/API/TelemetryAPI.ts` (add trace analytics endpoint) +- `Common/Server/Services/SpanService.ts` (add aggregation query methods) +- `Common/Types/Traces/TraceAnalyticsQuery.ts` (new - query interface) +- `App/FeatureSet/Dashboard/src/Pages/Traces/Index.tsx` (add analytics view toggle) +- `App/FeatureSet/Dashboard/src/Components/Traces/TraceAnalyticsView.tsx` (new - analytics UI) + +### 1.2 RED Metrics from Traces (Request Rate, Error Rate, Duration) + +**Current**: No automatic computation of service-level metrics from trace data. +**Target**: Auto-computed per-service, per-operation RED metrics displayed on a Service Overview page. 
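
As a rough sketch of what "RED metrics per service and operation" means, the following buckets spans by minute and derives request count, error count, and a duration percentile. It runs in memory purely for illustration; `SpanSample`, `bucketSpans`, and `percentile` are hypothetical names. The percentile uses the exact nearest-rank method, whereas ClickHouse `quantile` is approximate. Status code `2` is treated as an error, matching the OpenTelemetry `ERROR` status.

```typescript
// Hypothetical span shape for illustration; not the actual Span model.
interface SpanSample {
  serviceId: string;
  operation: string;
  startTimeMs: number;
  durationNano: number;
  statusCode: number;
}

interface RedBucket {
  requestCount: number;
  errorCount: number;
  durations: number[];
}

const STATUS_ERROR = 2; // OpenTelemetry span status ERROR

// Group spans into one bucket per (service, operation, minute).
function bucketSpans(spans: SpanSample[]): Record<string, RedBucket> {
  const buckets: Record<string, RedBucket> = {};
  for (const span of spans) {
    const minute = Math.floor(span.startTimeMs / 60000) * 60000;
    const key = [span.serviceId, span.operation, minute].join("|");
    const bucket = buckets[key] ?? {
      requestCount: 0,
      errorCount: 0,
      durations: [],
    };
    bucket.requestCount += 1;
    if (span.statusCode === STATUS_ERROR) {
      bucket.errorCount += 1;
    }
    bucket.durations.push(span.durationNano);
    buckets[key] = bucket;
  }
  return buckets;
}

// Exact nearest-rank percentile; p is a fraction such as 0.95.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    return NaN;
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1] ?? NaN;
}
```

The error rate for a bucket is then `errorCount / requestCount`, and the request rate is `requestCount / 60` per second for one-minute buckets.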
+ +**Implementation**: + +- Create a ClickHouse materialized view that aggregates spans into per-service, per-operation metrics at 1-minute intervals: + ```sql + CREATE MATERIALIZED VIEW span_red_metrics + ENGINE = AggregatingMergeTree() + ORDER BY (projectId, serviceId, name, minute) + AS SELECT + projectId, serviceId, name, + toStartOfMinute(startTime) AS minute, + countState() AS request_count, + countIfState(statusCode = 2) AS error_count, + quantileState(0.50)(durationUnixNano) AS p50_duration, + quantileState(0.95)(durationUnixNano) AS p95_duration, + quantileState(0.99)(durationUnixNano) AS p99_duration + FROM SpanItem + GROUP BY projectId, serviceId, name, minute + ``` +- Build a Service Overview page showing: request rate chart, error rate chart, p50/p95/p99 latency charts +- Add an API endpoint to query the materialized view + +**Files to modify**: +- `Common/Models/AnalyticsModels/SpanRedMetrics.ts` (new - materialized view model) +- `Telemetry/Services/SpanRedMetricsService.ts` (new - query service) +- `App/FeatureSet/Dashboard/src/Pages/Service/View/Overview.tsx` (new or enhanced - RED dashboard) +- `Worker/DataMigrations/` (new migration to create materialized view) + +### 1.3 Trace-Based Alerting + +**Current**: No ability to alert on trace data. +**Target**: Create alerts on p50/p75/p90/p95/p99 latency thresholds, error rate thresholds, and request rate anomalies per service/operation. 
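A minimal sketch of how such thresholds might be evaluated against one window of per-service metrics follows. The `ServiceWindowMetrics` and `TraceAlertCriterion` shapes are assumptions for illustration and are not the planned `MonitorStepTraceMonitor` config; anomaly detection on request rate is reduced here to a simple "below threshold" check.

```typescript
// Hypothetical shapes; the real monitor config may look quite different.
interface ServiceWindowMetrics {
  p95LatencyMs: number;
  p99LatencyMs: number;
  errorRate: number; // fraction between 0 and 1
  requestsPerSecond: number;
}

interface TraceAlertCriterion {
  metric: keyof ServiceWindowMetrics;
  comparator: "above" | "below";
  threshold: number;
}

// Returns the criteria that are breached for the given window of metrics.
function breachedCriteria(
  metrics: ServiceWindowMetrics,
  criteria: TraceAlertCriterion[]
): TraceAlertCriterion[] {
  return criteria.filter((criterion) => {
    const value = metrics[criterion.metric];
    return criterion.comparator === "above"
      ? value > criterion.threshold
      : value < criterion.threshold;
  });
}

// Example: latency is over its limit, error rate and traffic are fine.
const exampleBreaches = breachedCriteria(
  { p95LatencyMs: 800, p99LatencyMs: 1200, errorRate: 0.02, requestsPerSecond: 50 },
  [
    { metric: "p99LatencyMs", comparator: "above", threshold: 1000 },
    { metric: "errorRate", comparator: "above", threshold: 0.05 },
    { metric: "requestsPerSecond", comparator: "below", threshold: 10 },
  ]
);
```

Returning the breached criteria rather than a single boolean lets the alert message say which condition fired.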
+ +**Implementation**: + +- Extend the existing monitor system to add a `TraceMonitor` type +- Monitor evaluates against the RED metrics materialized view (depends on 1.2) +- Alert conditions: latency exceeds threshold, error rate exceeds threshold, request rate drops below threshold +- Integrate with existing OneUptime alerting/incident system +- UI: Add "Trace Monitor" as a new monitor type in the monitor creation wizard + +**Files to modify**: +- `Common/Types/Monitor/MonitorType.ts` (add Trace monitor type) +- `Common/Types/Monitor/MonitorStepTraceMonitor.ts` (new - trace monitor config) +- `Common/Server/Utils/Monitor/Criteria/TraceMonitorCriteria.ts` (new - evaluation logic) +- `App/FeatureSet/Dashboard/src/Components/Form/Monitor/TraceMonitor/` (new - monitor form UI) + +### 1.4 Head-Based Probabilistic Sampling + +**Current**: Ingests 100% of received traces. +**Target**: Configurable per-service probabilistic sampling with rules to always keep errors and slow traces. + +**Implementation**: + +- Create `TraceSamplingRule` PostgreSQL model: service filter, sample rate (0-100%), conditions to always keep (error status, duration > threshold) +- Evaluate sampling rules in `OtelTracesIngestService.ts` before ClickHouse insert +- Use deterministic sampling based on traceId hash (so all spans from the same trace are kept or dropped together) +- UI under Settings > Trace Configuration > Sampling Rules +- Show estimated storage savings + +**Files to modify**: +- `Common/Models/DatabaseModels/TraceSamplingRule.ts` (new) +- `Telemetry/Services/OtelTracesIngestService.ts` (add sampling logic) +- Dashboard: new Settings page for sampling configuration + +--- + +## Phase 2: Visualization & Debugging UX (P1) — Industry-Standard Features + +### 2.1 Flame Graph View + +**Current**: Only Gantt/waterfall view. +**Target**: Flame graph visualization showing proportional time spent in each span, with service color coding. 
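One possible way to derive the rectangles for such a view is sketched below: each span gets a depth from its position in the parent/child tree and a horizontal extent expressed as a fraction of the total trace duration. The `FlameSpan` and `FlameRect` types are hypothetical, timestamps are assumed to be offsets from the start of the trace (absolute Unix nanoseconds do not fit safely in a JavaScript `number`), and the parent links are assumed to form a tree with no cycles.

```typescript
// Hypothetical input/output shapes for illustration.
interface FlameSpan {
  spanId: string;
  parentSpanId?: string;
  startOffsetNano: number; // offset from the start of the trace
  durationUnixNano: number;
}

interface FlameRect {
  spanId: string;
  depth: number; // row index, 0 = root
  left: number;  // fraction of trace duration, 0..1
  width: number; // fraction of trace duration, 0..1
}

function layoutFlameGraph(spans: FlameSpan[]): FlameRect[] {
  if (spans.length === 0) {
    return [];
  }
  const byId: Record<string, FlameSpan> = Object.create(null);
  const children: Record<string, FlameSpan[]> = Object.create(null);
  let traceStart = Infinity;
  let traceEnd = -Infinity;
  for (const span of spans) {
    byId[span.spanId] = span;
    children[span.spanId] = [];
    traceStart = Math.min(traceStart, span.startOffsetNano);
    traceEnd = Math.max(traceEnd, span.startOffsetNano + span.durationUnixNano);
  }
  // Spans whose parent is missing from the trace are treated as roots.
  const roots: FlameSpan[] = [];
  for (const span of spans) {
    const parentId = span.parentSpanId;
    if (parentId !== undefined && byId[parentId] !== undefined) {
      children[parentId].push(span);
    } else {
      roots.push(span);
    }
  }
  const total = Math.max(1, traceEnd - traceStart);
  const byStart = (a: FlameSpan, b: FlameSpan): number => a.startOffsetNano - b.startOffsetNano;
  const rects: FlameRect[] = [];
  // Depth-first walk so each subtree is emitted directly after its parent.
  const visit = (span: FlameSpan, depth: number): void => {
    rects.push({
      spanId: span.spanId,
      depth,
      left: (span.startOffsetNano - traceStart) / total,
      width: span.durationUnixNano / total,
    });
    for (const child of children[span.spanId].slice().sort(byStart)) {
      visit(child, depth + 1);
    }
  };
  for (const root of roots.sort(byStart)) {
    visit(root, 0);
  }
  return rects;
}
```

Keeping the layout in fractions leaves pixel scaling and service color coding to the rendering component.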
+ +**Implementation**: + +- Build a flame graph component that renders spans as horizontally stacked rectangles proportional to duration +- Allow switching between Waterfall and Flame Graph views in TraceExplorer +- Color-code by service (consistent with waterfall view) +- Click a span rectangle to focus/zoom into that subtree +- Show tooltip with span name, service, duration, self-time on hover + +**Files to modify**: +- `App/FeatureSet/Dashboard/src/Components/Traces/FlameGraph.tsx` (new) +- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view toggle) + +### 2.2 Latency Breakdown / Critical Path Analysis + +**Current**: Shows individual span durations but no automated analysis. +**Target**: Compute and display critical path, self-time vs child-time, and bottleneck identification. + +**Implementation**: + +- Compute critical path: the longest sequential chain of spans through the trace (accounts for parallelism) +- Calculate "self time" per span: `span.duration - sum(child.duration)` (clamped to 0 for overlapping children) +- Display latency breakdown by service: percentage of total trace time spent in each service +- Highlight bottleneck spans (spans contributing most to critical path duration) +- Add "Critical Path" toggle in TraceExplorer that highlights the critical path spans + +**Files to modify**: +- `Common/Utils/Traces/CriticalPath.ts` (new - critical path algorithm) +- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (show self-time) +- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add critical path view) + +### 2.3 In-Trace Span Search + +**Current**: TraceExplorer shows all spans with service filtering and error toggle, but no text search. +**Target**: Search box to filter spans by name, attribute values, or status within the current trace. 
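A minimal sketch of such a filter, assuming spans are already loaded in the browser, is shown below. The `SearchableSpan` shape and the function names are hypothetical; matching is a case-insensitive substring test over span name, service name, status, and attribute keys and values.

```typescript
// Hypothetical span shape for illustration; the real span model has more fields.
interface SearchableSpan {
  spanId: string;
  name: string;
  serviceName: string;
  status: string; // e.g. "ok" or "error"
  attributes: Record<string, string>;
}

function spanMatches(span: SearchableSpan, searchText: string): boolean {
  const needle = searchText.trim().toLowerCase();
  if (needle === "") {
    return true; // an empty search matches every span
  }
  const haystack: string[] = [span.name, span.serviceName, span.status];
  for (const key of Object.keys(span.attributes)) {
    haystack.push(key, span.attributes[key]);
  }
  return haystack.some((value) => value.toLowerCase().indexOf(needle) !== -1);
}

// Returns the matching spans plus a label for the match counter in the toolbar.
function searchSpans(
  spans: SearchableSpan[],
  searchText: string
): { matches: SearchableSpan[]; label: string } {
  const matches = spans.filter((span) => spanMatches(span, searchText));
  return { matches, label: `${matches.length} of ${spans.length} spans` };
}
```

Because the filter is a pure function over spans already in memory, it can be re-run on every keystroke without a server round trip.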
+ +**Implementation**: + +- Add a search input in TraceExplorer toolbar +- Client-side filtering: match span name, service name, attribute keys/values against search text +- Highlight matching spans in the waterfall/flame graph +- Show match count (e.g., "3 of 47 spans") + +**Files to modify**: +- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add search bar and filtering) + +### 2.4 Per-Trace Service Flow Map + +**Current**: Service dependency graph exists globally but not per-trace. +**Target**: Per-trace visualization showing the path of a request through services with latency annotations. + +**Implementation**: + +- Build a directed graph from the spans in a single trace (services as nodes, calls as edges) +- Annotate edges with call count and latency +- Color-code nodes by error status +- Add as a new view tab alongside Waterfall and Flame Graph + +**Files to modify**: +- `App/FeatureSet/Dashboard/src/Components/Traces/TraceServiceMap.tsx` (new) +- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view tab) + +### 2.5 Span Link Navigation + +**Current**: Links data is stored in spans but not navigable in the UI. +**Target**: Clickable links in the span detail panel that navigate to related traces/spans. + +**Implementation**: + +- In the SpanViewer detail panel, render the `links` array as clickable items +- Each link shows the linked traceId, spanId, and relationship type +- Clicking navigates to the linked trace view + +**Files to modify**: +- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (render clickable links) + +--- + +## Phase 3: Advanced Analytics & Correlation (P2) — Power Features + +### 3.1 Trace-to-Metric Exemplars + +**Current**: Metric model has no traceId/spanId fields. +**Target**: Link metric data points to trace IDs; show exemplar dots on metric charts that navigate to traces. 
+ +**Implementation**: + +- Add optional `traceId` and `spanId` columns to the Metric ClickHouse model +- During metric ingestion, extract exemplar trace/span IDs from OTLP exemplar fields +- On metric charts, render exemplar dots at data points that have associated traces +- Clicking an exemplar dot navigates to the trace view + +**Files to modify**: +- `Common/Models/AnalyticsModels/Metric.ts` (add traceId/spanId columns) +- `Telemetry/Services/OtelMetricsIngestService.ts` (extract exemplars) +- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render exemplar dots) + +### 3.2 Custom Metrics from Spans + +**Current**: No way to create persistent metrics from trace data. +**Target**: Users define custom metrics from span attributes that are computed via ClickHouse materialized views and available for alerting and dashboards. + +**Implementation**: + +- Create `SpanDerivedMetric` model: name, filter query (which spans), aggregation (count/avg/p99 of what field), GROUP BY attributes +- Use ClickHouse materialized views for efficient computation +- Surface derived metrics in the metric explorer and alerting system + +**Files to modify**: +- `Common/Models/DatabaseModels/SpanDerivedMetric.ts` (new) +- `Common/Server/Services/SpanDerivedMetricService.ts` (new) +- Dashboard: UI for defining derived metrics + +### 3.3 Structural Trace Queries + +**Current**: Can only filter on individual span attributes. +**Target**: Query traces based on properties of multiple spans and their relationships (e.g., "find traces where service A called service B and B returned an error"). 
+ +**Implementation**: + +- Design a visual query builder for structural queries (easier adoption than a query language) +- Translate structural queries to ClickHouse subqueries with JOINs on traceId +- Example: "Find traces where span with service=frontend has child span with service=database AND duration > 500ms" + ```sql + SELECT DISTINCT s1.traceId FROM SpanItem s1 + JOIN SpanItem s2 ON s1.traceId = s2.traceId AND s1.spanId = s2.parentSpanId + WHERE s1.projectId = {pid} + AND JSONExtractString(s1.attributes, 'service.name') = 'frontend' + AND JSONExtractString(s2.attributes, 'service.name') = 'database' + AND s2.durationUnixNano > 500000000 + ``` + +**Files to modify**: +- `Common/Types/Traces/StructuralTraceQuery.ts` (new - query model) +- `Common/Server/Services/SpanService.ts` (add structural query execution) +- `App/FeatureSet/Dashboard/src/Components/Traces/StructuralQueryBuilder.tsx` (new - visual builder) + +### 3.4 Trace Comparison / Diffing + +**Current**: No way to compare traces. +**Target**: Side-by-side comparison of two traces of the same operation, highlighting differences in span count, latency, and structure. + +**Implementation**: + +- Add "Compare" action to trace list (select two traces) +- Build a diff view showing: added/removed spans, latency differences per span, structural changes +- Useful for comparing a slow trace to a fast trace of the same operation + +**Files to modify**: +- `App/FeatureSet/Dashboard/src/Components/Traces/TraceComparison.tsx` (new) +- `App/FeatureSet/Dashboard/src/Pages/Traces/Compare.tsx` (new page) + +--- + +## Phase 4: Competitive Differentiation (P3) — Long-Term + +### 4.1 Rules-Based and Tail-Based Sampling + +**Current**: Phase 1 adds head-based probabilistic sampling. +**Target**: Rules-based sampling (always keep errors/slow traces, sample successes) and eventually tail-based sampling (buffer complete traces, decide after seeing all spans). 
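The retention decision described here can be sketched as a pure function over a completed trace. Everything below is an assumption for illustration: the `SamplingRule` and `TraceSummary` shapes are hypothetical, the sample rate is a fraction between 0 and 1 rather than a percentage, and the "hash" is simply the low 32 bits of the hex trace ID, which relies on trace IDs being uniformly random.

```typescript
// Hypothetical shapes; not the planned TraceSamplingRule model.
interface SamplingRule {
  sampleRate: number;              // fraction of ordinary traces to keep, 0..1
  alwaysKeepErrors: boolean;
  alwaysKeepSlowerThanMs?: number; // keep any trace slower than this
}

interface TraceSummary {
  traceId: string; // hex string, e.g. 32 characters
  hasError: boolean;
  durationMs: number;
}

// Maps a trace ID to a number in [0, 1) using its last 8 hex characters.
function traceIdToUnitInterval(traceId: string): number {
  const tail = traceId.slice(-8);
  if (!/^[0-9a-fA-F]{8}$/.test(tail)) {
    throw new Error(`Trace ID does not end in 8 hex characters: ${traceId}`);
  }
  return parseInt(tail, 16) / 0x100000000;
}

function shouldKeepTrace(trace: TraceSummary, rule: SamplingRule): boolean {
  if (rule.alwaysKeepErrors && trace.hasError) {
    return true;
  }
  if (rule.alwaysKeepSlowerThanMs !== undefined && trace.durationMs > rule.alwaysKeepSlowerThanMs) {
    return true;
  }
  // Deterministic: the same trace ID always yields the same decision,
  // so every span of a trace is kept or dropped together.
  return traceIdToUnitInterval(trace.traceId) < rule.sampleRate;
}
```

Evaluating the always-keep rules needs the whole trace (error status and total duration), which is why this form of the decision belongs to the tail-based stage rather than to per-span ingestion.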
+ +**Implementation**: + +- Rules engine: configurable conditions (service, status, duration, attributes) with per-rule sample rates +- Tail-based: buffer spans for a configurable window (30s), assemble complete traces, then apply retention decisions +- Tail-based is complex; consider integrating with OpenTelemetry Collector's tail sampling processor as an alternative + +### 4.2 AI/ML on Trace Data + +- **Anomaly detection** on RED metrics (statistical deviation from baseline) +- **Auto-surfacing correlated attributes** when latency spikes (similar to Honeycomb BubbleUp) +- **Natural language trace queries** ("show me slow database calls from the last hour") +- **Automatic root cause analysis** from trace data during incidents + +### 4.3 RUM (Real User Monitoring) Correlation + +- Browser SDK that propagates W3C trace context from frontend to backend +- Link frontend page loads, interactions, and web vitals to backend traces +- Show end-to-end user experience from browser to backend services + +### 4.4 Continuous Profiling Integration + +- Integrate with a profiling backend (e.g., Pyroscope) +- Link profile data to span time windows +- Show "Code Hotspots" within spans (similar to Datadog) + +--- + +## ClickHouse Storage Improvements + +### S.1 Migrate `attributes` to Map(String, String) (HIGH) + +**Current**: `attributes` is stored as opaque `String` (JSON). Querying by attribute value requires `LIKE` or `JSONExtract()` scans. +**Target**: `Map(String, String)` type enabling `attributes['http.method'] = 'GET'` without JSON parsing. + +**Impact**: Significant query speedup for attribute-based span filtering -- the most common query pattern after time-range filtering. 
+ +**Files to modify**: +- `Common/Models/AnalyticsModels/Span.ts` (change column type) +- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle Map type) +- `Telemetry/Services/OtelTracesIngestService.ts` (write Map format) +- `Worker/DataMigrations/` (new migration) + +### S.2 Add Aggregation Projection (MEDIUM) + +**Current**: `projections: []` is empty. +**Target**: Pre-aggregation projection for common dashboard queries. + +```sql +PROJECTION agg_by_service ( + SELECT + serviceId, + toStartOfMinute(startTime) AS minute, + count(), + avg(durationUnixNano), + quantile(0.99)(durationUnixNano) + GROUP BY serviceId, minute +) +``` + +**Impact**: 5-10x faster aggregation queries for service overview dashboards. + +### S.3 Add Trace-by-ID Projection (LOW) + +**Current**: Trace detail view relies on BloomFilter skip index for traceId lookups. +**Target**: Projection sorted by `(projectId, traceId, startTime)` for faster trace-by-ID queries. + +--- + +## Quick Wins (Can Ship This Week) + +1. **In-trace span search** - Add a text filter in TraceExplorer (few hours of work) +2. **Self-time calculation** - Show "self time" (span duration minus child durations) in SpanViewer +3. **Span link navigation** - Links data is stored but not clickable in UI +4. **Top-N slowest operations** - Simple ClickHouse query: `ORDER BY durationUnixNano DESC LIMIT N` +5. **Error rate by service** - Aggregate `statusCode=2` counts grouped by serviceId +6. **Trace duration distribution histogram** - Use ClickHouse `histogram()` on durationUnixNano +7. **Span count per service display** - Already tracked in `servicesInTrace`, just needs better display + +--- + +## Recommended Implementation Order + +1. **Phase 1.1** - Trace Analytics Engine (highest impact, unlocks everything else) +2. **Phase 1.2** - RED Metrics from Traces (prerequisite for alerting, service overview) +3. **Quick Wins** - Ship in-trace search, self-time, span links, top-N operations +4. 
**Phase 1.3** - Trace-Based Alerting (core observability workflow) +5. **Phase 2.1** - Flame Graph View (industry-standard visualization) +6. **Phase 2.2** - Critical Path Analysis (key debugging capability) +7. **Phase 1.4** - Head-Based Sampling (essential for high-volume users) +8. **S.1** - Migrate attributes to Map type (storage optimization) +9. **Phase 2.3-2.5** - In-trace search, per-trace map, span links +10. **Phase 3.1** - Trace-to-Metric Exemplars +11. **Phase 3.2-3.4** - Custom metrics, structural queries, comparison +12. **Phase 4.x** - AI/ML, RUM, profiling (long-term) + +## Verification + +For each feature: +1. Unit tests for new query builders, critical path algorithm, sampling logic +2. Integration tests for new API endpoints (analytics, RED metrics, sampling) +3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/traces` +4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries +5. Verify trace correlation (logs, exceptions, metrics) still works correctly with new features +6. Load test sampling logic to ensure it doesn't add ingestion latency diff --git a/VERSION b/VERSION index 3c7e4cfc33..fa79f63915 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -10.0.32 \ No newline at end of file +10.0.33 \ No newline at end of file