feat: Add comprehensive metrics and traces roadmap for industry parity

- Introduced detailed plans for enhancing OneUptime's metrics and traces capabilities to match and exceed industry standards.
- Metrics roadmap includes features like percentile aggregations, rate calculations, multi-attribute grouping, rollups, and advanced visualizations.
- Traces roadmap outlines improvements such as trace analytics, RED metrics, trace-based alerting, and enhanced visualization options like flame graphs and critical path analysis.
- Both roadmaps emphasize phased implementation, quick wins, and verification strategies to ensure robust feature delivery and performance.
2026-04-06 00:32:12 +02:00 · 2026-03-16 09:51:08 +00:00
parent 4781c6a532
commit 2c7486714f
7 changed files with 1476 additions and 108 deletions
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -18,7 +18,7 @@ on:
        default: false

 permissions:
-  contents: read
+  contents: write
  packages: write

 jobs:
--- a/.github/workflows/test-release.yaml
+++ b/.github/workflows/test-release.yaml
@@ -10,7 +10,7 @@ on:
      - "master"

 permissions:
-  contents: read
+  contents: write
  packages: write

 jobs:
--- a/Internal/Roadmap/Dashboards.md
+++ b/Internal/Roadmap/Dashboards.md
@@ -0,0 +1,568 @@
+# Plan: Bring OneUptime Dashboards to Industry Parity and Beyond
+
+## Context
+
+OneUptime's dashboard implementation provides a 12-column grid layout with drag-and-drop editing, 3 widget types (Chart with Line/Bar, Value, Text with basic formatting), global time range with presets, view/edit modes, role-based permissions, and full-screen support. Dashboard config is stored as a single JSON column. Dashboards can only query OpenTelemetry metrics from ClickHouse.
+
+This plan identifies the remaining gaps vs Grafana, Datadog, and New Relic, and proposes a phased implementation to build a best-in-class dashboard product that leverages OneUptime's unique position as an all-in-one observability + status page platform.
+
+## Completed
+
+The following features have been implemented:
+- **12-Column Grid Layout** - Fixed grid with dynamic unit sizing, 60 default rows (expandable)
+- **Drag-and-Drop Editing** - Move and resize components with bounds checking
+- **Chart Widget** - Line and Bar chart types with single metric query, configurable title/description/legend
+- **Value Widget** - Single metric aggregation displayed as large number
+- **Text Widget** - Bold/Italic/Underline formatting (no markdown)
+- **Global Time Range** - Presets (30min to 3mo) + custom date range picker
+- **View/Edit Modes** - Read-only view with full-screen, edit mode with side panel settings
+- **Role-Based Permissions** - ProjectOwner, ProjectAdmin, ProjectMember + custom permissions
+- **Dashboard CRUD API** - Standard REST API with slug generation
+- **Billing Enforcement** - Free plan limited to 1 dashboard
+
+## Gap Analysis Summary
+
+| Feature | OneUptime | Grafana | Datadog | New Relic | Priority |
+|---------|-----------|---------|---------|-----------|----------|
+| Widget types | 3 | 20+ | 40+ | 15+ | **P0** |
+| Chart types | 2 (Line, Bar) | 10+ | 12+ | 10+ | **P0** |
+| Template variables | None | 6+ types | Yes | 3 types | **P0** |
+| Auto-refresh | None | Configurable | Real-time | Yes | **P0** |
+| Log panels | None | Yes (Loki) | Yes | Yes (NRQL) | **P0** |
+| Trace panels | None | Yes (Tempo) | Yes | Yes | **P0** |
+| Table widget | None | Yes | Yes | Yes | **P0** |
+| Multiple queries per chart | Single query | Yes | Yes | Yes | **P0** |
+| Markdown support | Basic formatting only | Full markdown | Full markdown | Full markdown | **P0** |
+| Threshold lines / color coding | None | Yes | Yes | Yes | **P0** |
+| Legend interaction (show/hide) | None | Yes | Yes | Yes | **P0** |
+| Chart zoom | None | Yes | Yes | Yes | **P0** |
+| Dashboard linking / drill-down | None | Data links | Yes | Facet linking | **P1** |
+| Annotations / event overlays | None | Yes | Yes | Yes (Labs) | **P1** |
+| Row/section grouping | None | Collapsible rows | Groups | No | **P1** |
+| Public/shared dashboards | None | Yes | Yes | Yes | **P1** |
+| JSON import/export | None | Yes | Yes | Yes | **P1** |
+| Dashboard versioning | None | Yes | Yes | No | **P1** |
+| Alert integration | None | Create from panel + show state | Yes | NRQL alerts | **P1** |
+| TV/Kiosk mode | Full-screen only | Kiosk mode | Yes | Auto-cycling | **P1** |
+| CSV export | None | Yes | Yes | Yes | **P1** |
+| Custom time per widget | None | Yes (per-panel override) | Yes | No | **P1** |
+| AI dashboard creation | None | None | None | None | **P2** |
+| Dashboard-as-code SDK | None | Foundation SDK | No | No | **P2** |
+| Terraform provider | None | Yes | Yes | Yes | **P2** |
+
+---
+
+## Phase 1: Foundation (P0) — Close Critical Gaps
+
+These gaps make OneUptime dashboards fundamentally non-competitive. Every major competitor has these.
+
+### 1.1 Add Core Chart Types: Area, Pie, Table, Gauge, Heatmap, Histogram
+
+**Current**: Line and Bar only.
+**Target**: 8+ chart types covering all standard observability visualization needs.
+
+**Implementation**:
+
+- **Area Chart** (stacked and non-stacked): Extension of line chart with fill. Use existing chart library's area mode
+- **Pie/Donut Chart**: For proportional breakdowns (e.g., error distribution by service). New component
+- **Table Widget**: Tabular metric data, top-N lists, multi-column display with sortable columns. New component type
+- **Gauge Widget**: Circular gauge with configurable min/max/thresholds and color zones. New component
+- **Heatmap**: Time on X-axis, value buckets on Y-axis, color intensity for count. Essential for distribution/histogram metrics
+- **Histogram**: Bar chart showing value distribution. Important for latency analysis
+
+Each chart type needs:
+- A new entry in `DashboardComponentType` or `ChartType` enum
+- A rendering component in `Dashboard/Components/`
+- Configuration options in the component settings side panel
+
+**Files to modify**:
+- `Common/Types/Dashboard/Chart/ChartType.ts` (add Area, Pie, Heatmap, Histogram, Gauge)
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add Table, Gauge)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render new types)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTableComponent.tsx` (new)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardGaugeComponent.tsx` (new)
+
+### 1.2 Template Variables
+
+**Current**: No template variables. Users must create separate dashboards for each service/host/environment.
+**Target**: Drop-down variable selectors that dynamically filter all widgets.
+
+**Implementation**:
+
+- Create a `DashboardVariable` type stored in `dashboardViewConfig`:
+  - Name, label, type (query-based, custom list, text input)
+  - Query-based: runs a ClickHouse query to populate options (e.g., `SELECT DISTINCT service FROM MetricItem WHERE projectId = {pid}`)
+  - Custom list: manually defined options
+  - Multi-value selection support
+- Render variables as dropdown selectors in the dashboard toolbar
+- Variables can be referenced in metric queries as `$variable_name`
+- When a variable changes, all widgets re-query with the new value
+- Support cascading variables (variable B's query depends on variable A's value)
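+
+The `$variable_name` substitution described above could be sketched as follows. This is a minimal illustration, not the planned implementation: `resolveVariables`, `quoteValue`, and the `VariableValues` shape are hypothetical names, and the backslash-style quoting is an assumption. A production version should prefer parameterized ClickHouse queries over string substitution.
+
+```typescript
+// Hypothetical shape: selected values keyed by variable name (multi-value = array).
+type VariableValues = Record<string, string | string[]>;
+
+// Quote one value as a SQL string literal (assumed escaping: backslash and single quote).
+function quoteValue(value: string): string {
+  return "'" + value.replace(/\\/g, "\\\\").replace(/'/g, "\\'") + "'";
+}
+
+// Replace each $name with its quoted value. Multi-value selections become a
+// comma-separated list suitable for IN (...). Unknown names are left untouched.
+function resolveVariables(query: string, values: VariableValues): string {
+  return query.replace(
+    /\$([A-Za-z_][A-Za-z0-9_]*)/g,
+    (match: string, name: string): string => {
+      const value = values[name];
+      if (value === undefined) {
+        return match;
+      }
+      return Array.isArray(value)
+        ? value.map(quoteValue).join(", ")
+        : quoteValue(value);
+    },
+  );
+}
+```
+
+For example, `resolveVariables("service IN ($service)", { service: ["api", "web"] })` yields `service IN ('api', 'web')`.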
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardVariable.ts` (new)
+- `Common/Types/Dashboard/DashboardViewConfig.ts` (add variables array)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (render variable dropdowns)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass variable values to widgets)
+- `Common/Server/Services/MetricService.ts` (resolve variable references in queries)
+
+### 1.3 Auto-Refresh
+
+**Current**: Data goes stale after initial load.
+**Target**: Configurable auto-refresh intervals.
+
+**Implementation**:
+
+- Add auto-refresh dropdown in toolbar with options: Off, 5s, 10s, 30s, 1m, 5m, 15m
+- Store selected interval in dashboard config and URL state
+- Use `setInterval` to trigger re-fetch on all metric widgets
+- Show a subtle refresh indicator when data is being updated
+- Pause auto-refresh when the dashboard is in edit mode
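+
+A rough sketch of the refresh timer, assuming the dropdown labels listed above map directly to millisecond intervals. `REFRESH_INTERVALS_MS` and `startAutoRefresh` are hypothetical names, and the edit-mode check is passed in as a callback rather than read from real dashboard state.
+
+```typescript
+// Hypothetical mapping from toolbar option label to interval in milliseconds.
+const REFRESH_INTERVALS_MS: Record<string, number | null> = {
+  Off: null,
+  "5s": 5000,
+  "10s": 10000,
+  "30s": 30000,
+  "1m": 60000,
+  "5m": 300000,
+  "15m": 900000,
+};
+
+// Starts the timer and returns a stop function. The timer keeps ticking in
+// edit mode, but the refetch callback is skipped while isEditMode() is true.
+function startAutoRefresh(
+  option: string,
+  isEditMode: () => boolean,
+  refetch: () => void,
+): () => void {
+  const intervalMs = REFRESH_INTERVALS_MS[option] ?? null;
+  if (intervalMs === null) {
+    return () => {}; // "Off" or an unknown option: nothing to stop
+  }
+  const timer = setInterval(() => {
+    if (!isEditMode()) {
+      refetch();
+    }
+  }, intervalMs);
+  return () => clearInterval(timer);
+}
+```
+
+In a React component the returned stop function would typically be called from the effect cleanup so the timer does not outlive the dashboard view.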
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Toolbar/DashboardToolbar.tsx` (add refresh dropdown)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (implement refresh timer)
+- `Common/Types/Dashboard/DashboardViewConfig.ts` (store refresh interval)
+
+### 1.4 Multiple Queries per Chart
+
+**Current**: Single `MetricQueryConfigData` per chart.
+**Target**: Overlay multiple metric series on a single chart for correlation.
+
+**Implementation**:
+
+- Change chart component's data source from single `MetricQueryConfigData` to `MetricQueryConfigData[]`
+- Each query gets its own alias and legend entry
+- Support formula references across queries (e.g., `a / b * 100`)
+- Y-axis: support dual Y-axes for metrics with different scales
+
+**Files to modify**:
+- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (change to array)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render multiple series)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (multi-query config UI)
+
+### 1.5 Full Markdown Support for Text Widget
+
+**Current**: Only bold, italic, underline formatting.
+**Target**: Full markdown rendering including headers, links, lists, code blocks, tables, and images.
+
+**Implementation**:
+
+- Replace the current custom formatting with a markdown renderer (e.g., `react-markdown` or `marked`)
+- Support: headers (h1-h6), links, ordered/unordered lists, code blocks with syntax highlighting, tables, images, blockquotes
+- Edit mode: raw markdown text area with preview toggle
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTextComponent.tsx` (replace with markdown renderer)
+- `Common/Utils/Dashboard/Components/DashboardTextComponent.ts` (store raw markdown)
+
+### 1.6 Threshold Lines & Color Coding
+
+**Current**: No threshold visualization.
+**Target**: Configurable warning/critical thresholds on charts with color-coded regions.
+
+**Implementation**:
+
+- Add threshold configuration to chart settings: value, label, color (default: yellow for warning, red for critical)
+- Render horizontal lines on the chart at threshold values
+- Optionally fill regions above/below thresholds with translucent color
+- For value/billboard widgets: change background color based on which threshold range the value falls in (green/yellow/red)
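+
+One way the value-widget color selection could work, assuming thresholds are "value is at or above" boundaries and that higher values are worse. The `Threshold` interface and `colorForValue` are illustrative names only.
+
+```typescript
+// Hypothetical threshold config: value, label, and color as listed above.
+interface Threshold {
+  value: number;
+  label: string;
+  color: string;
+}
+
+// Returns the color of the highest threshold the value has reached,
+// or the base color when the value is below every threshold.
+function colorForValue(
+  value: number,
+  thresholds: Threshold[],
+  baseColor: string = "green",
+): string {
+  const sorted = [...thresholds].sort((a, b) => a.value - b.value);
+  let color = baseColor;
+  for (const threshold of sorted) {
+    if (value >= threshold.value) {
+      color = threshold.color;
+    }
+  }
+  return color;
+}
+```
+
+Metrics where lower is worse (for example, availability) would need an inverted comparison, which this sketch does not cover.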
+
+**Files to modify**:
+- `Common/Utils/Dashboard/Components/DashboardChartComponent.ts` (add thresholds config)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (render threshold lines)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardValueComponent.tsx` (color coding)
+
+### 1.7 Legend Interaction (Show/Hide Series)
+
+**Current**: Legends are display-only.
+**Target**: Click legend items to toggle series visibility.
+
+**Implementation**:
+
+- Add click handler on legend items to toggle series visibility
+- Clicked-off series should be visually dimmed in the legend and removed from the chart
+- Support "isolate" mode: Ctrl+Click shows only that series and hides all others
+- Persist toggled state during the session (reset on page reload)
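+
+The toggle and isolate behavior can be expressed as a pure function over the set of hidden series, which keeps it easy to test outside the chart component. `onLegendClick` is a hypothetical helper, not an existing `MetricGraph` API.
+
+```typescript
+// Returns the new set of hidden series names after a legend click.
+function onLegendClick(
+  allSeries: string[],
+  hidden: Set<string>,
+  clicked: string,
+  ctrlKey: boolean,
+): Set<string> {
+  if (ctrlKey) {
+    // Isolate: hide every series except the clicked one.
+    return new Set(allSeries.filter((name) => name !== clicked));
+  }
+  // Plain click: flip visibility of the clicked series only.
+  const next = new Set(hidden);
+  if (next.has(clicked)) {
+    next.delete(clicked);
+  } else {
+    next.add(clicked);
+  }
+  return next;
+}
+```
+
+Holding the result in component state gives the session-only persistence described above, since the state is lost on page reload.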
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add legend click handlers)
+
+### 1.8 Chart Zoom (Click-Drag Time Selection)
+
+**Current**: No zoom capability.
+**Target**: Click and drag on a time series chart to zoom into a time range.
+
+**Implementation**:
+
+- Enable brush/selection mode on time series charts
+- When user drags to select a range, update the global time range to the selected range
+- Show a "Reset zoom" button to return to the previous time range
+- Maintain a zoom stack so users can zoom in multiple times and zoom back out
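+
+A minimal sketch of the zoom stack, assuming a time range is a simple start/end pair. `TimeRange` and `ZoomStack` are illustrative names and are not taken from the existing codebase.
+
+```typescript
+interface TimeRange {
+  start: Date;
+  end: Date;
+}
+
+// Keeps the ranges the user zoomed out of, so each zoom can be undone in order.
+class ZoomStack {
+  private history: TimeRange[] = [];
+  private current: TimeRange;
+
+  constructor(initial: TimeRange) {
+    this.current = initial;
+  }
+
+  // Called when a brush selection completes.
+  zoomIn(range: TimeRange): TimeRange {
+    this.history.push(this.current);
+    this.current = range;
+    return this.current;
+  }
+
+  // Called by "Reset zoom"; a no-op when there is nothing to undo.
+  zoomOut(): TimeRange {
+    const previous = this.history.pop();
+    if (previous !== undefined) {
+      this.current = previous;
+    }
+    return this.current;
+  }
+
+  // Number of zoom levels that can still be undone.
+  get depth(): number {
+    return this.history.length;
+  }
+}
+```
+
+The "Reset zoom" button would be shown only while `depth` is greater than zero.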
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (add brush selection)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (handle time range updates from zoom)
+
+---
+
+## Phase 2: Observability Integration (P0-P1) — Leverage the Full Platform
+
+This is where OneUptime can differentiate: metrics, logs, and traces in one platform.
+
+### 2.1 Log Stream Widget
+
+**Current**: Dashboards can only show metrics.
+**Target**: Widget that displays a live log stream with filtering.
+
+**Implementation**:
+
+- New `DashboardComponentType.LogStream` widget type
+- Configuration: log query filter, severity filter, service filter, max rows
+- Renders as a scrolling log list with severity color coding, timestamp, and body
+- Click a log entry to expand and see full details
+- Respects dashboard time range and template variables
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add LogStream)
+- `Common/Utils/Dashboard/Components/DashboardLogStreamComponent.ts` (new - config)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardLogStreamComponent.tsx` (new - rendering)
+
+### 2.2 Trace List Widget
+
+**Current**: No trace visualization in dashboards.
+**Target**: Widget showing a filtered trace list with duration and status.
+
+**Implementation**:
+
+- New `DashboardComponentType.TraceList` widget type
+- Configuration: service filter, operation filter, status filter, min duration
+- Renders as a table: trace ID, operation, service, duration, status, timestamp
+- Click a row to navigate to the full trace view
+- Respects dashboard time range and template variables
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add TraceList)
+- `Common/Utils/Dashboard/Components/DashboardTraceListComponent.ts` (new)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardTraceListComponent.tsx` (new)
+
+### 2.3 Click-to-Correlate Across Signals
+
+**Current**: No cross-signal correlation in dashboards.
+**Target**: Click a point on a metric chart to instantly see related logs and traces from that timestamp.
+
+**Implementation**:
+
+- When clicking a data point on a metric chart, open a correlation panel showing:
+  - Logs from the same service and time window (+/- 5 minutes around the clicked point)
+  - Traces from the same service and time window
+  - Filtered by the same template variables
+- The correlation panel appears as a slide-over or split view below the chart
+- This is a major differentiator vs Grafana (which requires separate datasources) and ties into OneUptime's all-in-one advantage
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add click handler)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/CorrelationPanel.tsx` (new - shows correlated logs/traces)
+
+### 2.4 Annotations / Event Overlays
+
+**Current**: No event markers on charts.
+**Target**: Show deployment events, incidents, and alerts as vertical markers on time series charts.
+
+**Implementation**:
+
+- Query OneUptime's own data for events in the chart's time range:
+  - Incidents (from Incident model)
+  - Deployments (can be sent as OTLP resource attributes or a custom event API)
+  - Alert triggers (from monitor alert history)
+- Render as vertical dashed lines with icons on hover
+- Color-code by type: red for incidents, blue for deployments, yellow for alerts
+- Allow users to add manual annotations (text + timestamp)
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardAnnotation.ts` (new)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render annotation markers)
+- `Common/Server/API/DashboardAnnotationAPI.ts` (new - query events)
+
+### 2.5 Alert Integration
+
+**Current**: No connection between dashboards and alerting.
+**Target**: Create alerts from dashboard panels and display alert state on panels.
+
+**Implementation**:
+
+- "Create Alert" button in chart settings that pre-fills a metric monitor with the chart's query
+- Show alert state indicator on chart headers (green/yellow/red dot) based on associated monitor status
+- Alert status widget: shows a summary of all active alerts with severity and duration
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/ComponentSettingsSideOver.tsx` (add "Create Alert" button)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (show alert state)
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add AlertStatus widget type)
+
+### 2.6 SLO/SLI Widget
+
+**Current**: No SLO visualization.
+**Target**: Dedicated widget showing SLO status, error budget burn rate, and remaining budget.
+
+**Implementation** (depends on Metrics roadmap Phase 3.2 - SLO/SLI Tracking):
+
+- New `DashboardComponentType.SLO` widget type
+- Configuration: select an SLO definition
+- Displays: current attainment (%), target (%), error budget remaining (%), burn rate chart
+- Color-coded: green (healthy), yellow (burning fast), red (budget exhausted)
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add SLO)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSLOComponent.tsx` (new)
+
+---
+
+## Phase 3: Collaboration & Sharing (P1) — Production Workflows
+
+### 3.1 Public/Shared Dashboards
+
+**Current**: Dashboards require login.
+**Target**: Share dashboards with external stakeholders without requiring authentication.
+
+**Implementation**:
+
+- Add `isPublic` flag and `publicAccessToken` to Dashboard model
+- Generate a shareable URL with token: `/public/dashboard/{token}`
+- Public view is read-only with no editing controls
+- Option to restrict public access to specific IP ranges
+- Render without the OneUptime navigation chrome
+
+**Files to modify**:
+- `Common/Models/DatabaseModels/Dashboard.ts` (add isPublic, publicAccessToken)
+- `App/FeatureSet/Dashboard/src/Pages/Public/Dashboard.tsx` (new - public dashboard view)
+
+### 3.2 JSON Import/Export
+
+**Current**: No import/export capability.
+**Target**: Export dashboards as JSON and re-import for backup, migration, and dashboard-as-code.
+
+**Implementation**:
+
+- Export: serialize `dashboardViewConfig` + metadata (name, description, variables) as a JSON file download
+- Import: upload a JSON file, validate schema, create a new dashboard from the config
+- Handle version compatibility (include a schema version in the export)
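+
+Import validation might look like the sketch below. The `DashboardExport` shape, the `schemaVersion` field name, and `CURRENT_SCHEMA_VERSION` are assumptions made for illustration; the real export format is still to be defined.
+
+```typescript
+// Assumed: the newest export schema this build understands.
+const CURRENT_SCHEMA_VERSION = 1;
+
+// Hypothetical export envelope around the dashboard config.
+interface DashboardExport {
+  schemaVersion: number;
+  name: string;
+  dashboardViewConfig: object;
+}
+
+// Parses and validates an uploaded file; throws with a readable message on failure.
+function parseDashboardExport(json: string): DashboardExport {
+  const data: unknown = JSON.parse(json);
+  if (typeof data !== "object" || data === null) {
+    throw new Error("Export must be a JSON object");
+  }
+  const record = data as Record<string, unknown>;
+  const schemaVersion = record["schemaVersion"];
+  const name = record["name"];
+  const config = record["dashboardViewConfig"];
+  if (typeof schemaVersion !== "number") {
+    throw new Error("Missing schemaVersion");
+  }
+  if (schemaVersion > CURRENT_SCHEMA_VERSION) {
+    throw new Error("Export was created by a newer version");
+  }
+  if (typeof name !== "string" || name.length === 0) {
+    throw new Error("Missing dashboard name");
+  }
+  if (typeof config !== "object" || config === null) {
+    throw new Error("Missing dashboardViewConfig");
+  }
+  return { schemaVersion, name, dashboardViewConfig: config };
+}
+```
+
+Older schema versions would be migrated forward at this point before the new dashboard is created.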
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Dashboards.tsx` (add import button)
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/Settings.tsx` (add export button)
+- `Common/Server/API/DashboardImportExportAPI.ts` (new)
+
+### 3.3 Dashboard Versioning
+
+**Current**: No change history.
+**Target**: Track changes to dashboards over time with the ability to view history and revert.
+
+**Implementation**:
+
+- Create `DashboardVersion` model: dashboardId, version number, config snapshot, changedBy, timestamp
+- On each save, create a new version entry
+- UI: "Version History" in settings showing a list of versions with timestamps and authors
+- "Restore" button to revert to a previous version
+- Optional: diff view comparing two versions
+
+**Files to modify**:
+- `Common/Models/DatabaseModels/DashboardVersion.ts` (new)
+- `Common/Server/Services/DashboardService.ts` (create version on save)
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/View/VersionHistory.tsx` (new)
+
+### 3.4 Row/Section Grouping
+
+**Current**: Components placed freely with no grouping.
+**Target**: Collapsible rows/sections for organizing related panels.
+
+**Implementation**:
+
+- Add a "Section" component type that acts as a collapsible container
+- Section has a title bar that can be clicked to collapse/expand
+- When collapsed, hides all components within the section's vertical range
+- Sections can be nested one level deep
+
+**Files to modify**:
+- `Common/Types/Dashboard/DashboardComponentType.ts` (add Section)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardSectionComponent.tsx` (new)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Canvas/Index.tsx` (handle section collapse)
+
+### 3.5 TV/Kiosk Mode
+
+**Current**: Full-screen only.
+**Target**: Dedicated kiosk mode optimized for wall-mounted monitors with auto-cycling.
+
+**Implementation**:
+
+- Kiosk mode: hides all chrome (toolbar, navigation, URL bar), shows only the dashboard grid
+- Auto-cycle: rotate through a list of dashboards at a configurable interval (30s, 1m, 5m)
+- Dashboard playlist: define an ordered list of dashboards to cycle through
+- Support per-dashboard display duration
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Kiosk.tsx` (new - kiosk view)
+- `Common/Models/DatabaseModels/DashboardPlaylist.ts` (new - playlist model)
+
+### 3.6 CSV Export
+
+**Current**: No data export.
+**Target**: Export chart/table data as CSV for offline analysis.
+
+**Implementation**:
+
+- Add "Export CSV" option in chart/table context menu
+- Client-side: serialize the current rendered data to CSV format
+- Include column headers, timestamps, and values
+- Trigger browser download
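+
+The serialization step could be as small as the sketch below, which follows the common RFC 4180 quoting rules. `toCsv` and `escapeCsvField` are hypothetical helpers for the proposed `CSVExport.ts`, and the browser download itself is left out.
+
+```typescript
+// Quote a field only when it contains a comma, quote, or line break;
+// embedded quotes are doubled.
+function escapeCsvField(field: string): string {
+  return /[",\r\n]/.test(field)
+    ? '"' + field.replace(/"/g, '""') + '"'
+    : field;
+}
+
+// Serialize a header row plus data rows into a CSV string.
+function toCsv(headers: string[], rows: (string | number)[][]): string {
+  const all: (string | number)[][] = [headers, ...rows];
+  return all
+    .map((row) => row.map((cell) => escapeCsvField(String(cell))).join(","))
+    .join("\r\n");
+}
+```
+
+For example, `toCsv(["timestamp", "value"], [["2026-01-01T00:00:00Z", 1.5]])` produces a header line followed by `2026-01-01T00:00:00Z,1.5`.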
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/Components/DashboardChartComponent.tsx` (add export option)
+- `Common/Utils/Dashboard/CSVExport.ts` (new - CSV serialization)
+
+### 3.7 Custom Time Range per Widget
+
+**Current**: All widgets share the global time range.
+**Target**: Individual widgets can override the global time range.
+
+**Implementation**:
+
+- Add optional `timeRangeOverride` to each component's config
+- When set, the widget uses its own time range instead of the global one
+- Show a small clock icon on widgets with custom time ranges
+- Configuration in the component settings side panel
+
+**Files to modify**:
+- `Common/Utils/Dashboard/Components/DashboardBaseComponent.ts` (add timeRangeOverride)
+- `App/FeatureSet/Dashboard/src/Components/Dashboard/DashboardView.tsx` (pass per-widget time ranges)
+
+---
+
+## Phase 4: Differentiation (P2-P3) — Surpass Competition
+
+### 4.1 AI-Powered Dashboard Creation
+
+**Current**: Manual dashboard creation only.
+**Target**: Natural language dashboard creation - "Show me CPU usage by service for the last 24 hours" auto-creates the right widget.
+
+**Implementation**:
+
+- Natural language input in the "Add Widget" dialog
+- AI translates to: metric name, aggregation, group by, chart type, time range
+- Uses available MetricType metadata to match metric names
+- Preview the generated widget before adding to dashboard
+- This is a feature NO competitor has done well yet - major differentiator
+
+### 4.2 Pre-Built Dashboard Templates
+
+**Current**: No templates.
+**Target**: One-click dashboard templates for common stacks.
+
+**Implementation**:
+
+- Template library: Node.js, Python, Go, Java, Kubernetes, PostgreSQL, Redis, Nginx, MongoDB, etc.
+- Auto-detect relevant templates based on ingested telemetry data
+- "One-click create" instantiates a full dashboard from the template
+- Community template sharing (future)
+
+### 4.3 Auto-Generated Dashboards
+
+**Current**: Users must manually build dashboards.
+**Target**: When a service connects, auto-generate a relevant dashboard.
+
+**Implementation**:
+
+- On first telemetry ingest from a new service, analyze the metric names and types
+- Auto-create a service dashboard with relevant charts based on detected metrics
+- Include golden signals (latency, traffic, errors, saturation) where applicable
+- Notify the user and link to the auto-generated dashboard
+
+### 4.4 Customer-Facing Dashboards on Status Pages
+
+**Current**: Status pages and dashboards are separate.
+**Target**: Embed dashboard widgets on status pages for real-time performance visibility.
+
+**Implementation**:
+
+- Allow selecting specific dashboard widgets to embed on a status page
+- Render widgets in read-only mode without internal navigation
+- Respect public/private data boundaries (only show metrics the customer should see)
+- This is unique to OneUptime - no competitor has integrated observability dashboards with status pages
+
+### 4.5 Dashboard-as-Code SDK
+
+**Current**: No programmatic dashboard creation.
+**Target**: TypeScript SDK for defining dashboards as code.
+
+**Implementation**:
+
+```typescript
+const dashboard = new Dashboard("Service Health")
+  .addVariable("service", { type: "query", query: "SELECT DISTINCT service FROM MetricItem" })
+  .addRow("Latency")
+    .addChart({ metric: "http.server.duration", aggregation: "p99", groupBy: ["$service"] })
+    .addChart({ metric: "http.server.duration", aggregation: "p50", groupBy: ["$service"] })
+  .addRow("Throughput")
+    .addChart({ metric: "http.server.request.count", aggregation: "rate", groupBy: ["$service"] })
+```
+
+- SDK generates the JSON config and uses the Dashboard API to create/update
+- Git-based provisioning: store dashboard definitions in repo, CI/CD syncs to OneUptime
+
+### 4.6 Anomaly Detection Overlays
+
+**Current**: No anomaly visualization.
+**Target**: AI highlights anomalous data points on charts without manual threshold configuration.
+
+**Implementation** (depends on Metrics roadmap Phase 3.1 - Anomaly Detection):
+
+- Automatically overlay expected range bands (baseline +/- N sigma) on metric charts
+- Highlight data points outside the expected range with color indicators
+- Click an anomaly to see correlated changes across metrics, logs, and traces
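+
+As a rough illustration of the "baseline +/- N sigma" band, the sketch below uses a plain mean and population standard deviation over a window of historical values. This is a simplifying assumption; the anomaly detection in the Metrics roadmap may use a different baseline model. `expectedBand` and `isAnomalous` are hypothetical names.
+
+```typescript
+interface Band {
+  lower: number;
+  upper: number;
+}
+
+// Expected range: mean +/- sigmas * population standard deviation.
+function expectedBand(baseline: number[], sigmas: number): Band {
+  if (baseline.length === 0) {
+    throw new Error("Baseline must contain at least one value");
+  }
+  const mean = baseline.reduce((sum, v) => sum + v, 0) / baseline.length;
+  const variance =
+    baseline.reduce((sum, v) => sum + (v - mean) ** 2, 0) / baseline.length;
+  const deviation = Math.sqrt(variance);
+  return { lower: mean - sigmas * deviation, upper: mean + sigmas * deviation };
+}
+
+// A point is flagged when it falls outside the expected band.
+function isAnomalous(value: number, band: Band): boolean {
+  return value < band.lower || value > band.upper;
+}
+```
+
+With the baseline `[2, 4, 4, 4, 5, 5, 7, 9]` (mean 5, standard deviation 2) and `sigmas = 2`, the band is 1 to 9, so a value of 10 would be highlighted.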
+
+---
+
+## Quick Wins (Can Ship This Week)
+
+1. **Auto-refresh** - Add a simple `setInterval` refresh with dropdown selector in toolbar
+2. **Full markdown for text widget** - Replace custom formatting with a markdown renderer
+3. **Legend show/hide** - Add click handler on legend items to toggle series
+4. **Stacked area chart** - Simple extension of existing line chart with fill
+5. **Chart zoom** - Enable brush selection on time series charts
+
+---
+
+## Recommended Implementation Order
+
+1. **Quick Wins** - Auto-refresh, markdown, legend toggle, stacked area, chart zoom
+2. **Phase 1.1** - More chart types (Area, Pie, Table, Gauge)
+3. **Phase 1.2** - Template variables (highest-impact feature for dashboard usability)
+4. **Phase 1.4** - Multiple queries per chart
+5. **Phase 1.6** - Threshold lines & color coding
+6. **Phase 2.1** - Log stream widget (leverages all-in-one platform)
+7. **Phase 2.2** - Trace list widget
+8. **Phase 2.3** - Click-to-correlate (major differentiator)
+9. **Phase 2.4** - Annotations / event overlays
+10. **Phase 2.5** - Alert integration
+11. **Phase 3.1** - Public/shared dashboards
+12. **Phase 3.2** - JSON import/export
+13. **Phase 3.4** - Row/section grouping
+14. **Phase 3.5** - TV/Kiosk mode
+15. **Phase 3.3** - Dashboard versioning
+16. **Phase 2.6** - SLO widget (depends on SLO/SLI from Metrics roadmap)
+17. **Phase 4.2** - Pre-built dashboard templates
+18. **Phase 4.3** - Auto-generated dashboards
+19. **Phase 4.1** - AI-powered dashboard creation
+20. **Phase 4.4** - Customer-facing dashboards on status pages
+21. **Phase 4.5** - Dashboard-as-code SDK
+
+## Verification
+
+For each feature:
+1. Unit tests for new widget types, template variable resolution, CSV export logic
+2. Integration tests for new API endpoints (annotations, public dashboards, import/export)
+3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/dashboards`
+4. Visual regression testing for new chart types (ensure correct rendering across browsers)
+5. Performance testing: verify dashboards with 20+ widgets and auto-refresh don't cause excessive API load
+6. Test template variables with edge cases: empty results, special characters, multi-value selections
+7. Verify public dashboards don't leak private data
--- a/Internal/Roadmap/Logs.md
+++ b/Internal/Roadmap/Logs.md
@@ -20,125 +20,41 @@ The following features have been implemented and removed from this plan:
 - **Phase 2.2** - Log Analytics View (LogsAnalyticsView with timeseries, toplist, table charts; analytics endpoint)
 - **Phase 2.3** - Column Customization (ColumnSelector with dynamic columns from log attributes)
 - **Phase 5.8** - Store Missing OpenTelemetry Log Fields (observedTimeUnixNano, droppedAttributesCount, flags columns + ingestion + migration)
+- **Phase 3.1** - Log Context / Surrounding Logs (Context tab in LogDetailsPanel, context endpoint in TelemetryAPI)
+- **Phase 3.2** - Log Pipelines (LogPipeline + LogPipelineProcessor models, GrokParser/AttributeRemapper/SeverityRemapper/CategoryProcessor, pipeline execution service)
+- **Phase 3.3** - Drop Filters (LogDropFilter model, LogDropFilterService, dashboard UI for configuration)
+- **Phase 3.4** - Export to CSV/JSON (Export button in toolbar, LogExport utility with CSV and JSON support)
+- **Phase 4.2** - Keyboard Shortcuts (j/k navigation, Enter expand/collapse, Esc close, / focus search, Ctrl+Enter apply filters, ? help)
+- **Phase 4.3** - Sensitive Data Scrubbing (LogScrubRule model with PII patterns: Email, CreditCard, SSN, PhoneNumber, IPAddress, custom regex)

 ## Gap Analysis Summary

 | Feature | OneUptime | Datadog | New Relic | Priority |
 |---------|-----------|---------|-----------|----------|
 | Log Patterns (ML clustering) | None | Auto-clustering + Pattern Inspector | ML clustering + anomaly | **P1** |
-| Log context (surrounding logs) | None | Before/after from same host/service | Automatic via APM agent | **P2** |
-| Log Pipelines (server-side processing) | None (raw storage only) | 270+ OOTB, 14+ processor types | Grok parsing, built-in rules | **P2** |
 | Log-based Metrics | None | Count + Distribution, 15-month retention | Via NRQL | **P2** |
-| Drop Filters (pre-storage filtering) | None | Exclusion filters with sampling | Drop rules per NRQL | **P2** |
-| Export to CSV/JSON | None | CSV up to 100K rows | CSV/JSON up to 5K | **P2** |
-| Keyboard shortcuts | None | Full keyboard nav | Basic | **P3** |
-| Sensitive Data Scrubbing | None | Multi-layer (SaaS + agent + pipeline) | Auto-obfuscation + custom rules | **P3** |
 | Data retention config UI | Referenced but no UI | Multi-tier (Standard/Flex/Archive) | Partitions + Live Archives | **P3** |

 ---

-## Phase 3: Processing & Operations (P2) — Platform Capabilities
-
-### 3.1 Log Context (Surrounding Logs)
-
-**Current**: Clicking a log shows only that log's details.
-**Target**: A "Context" tab in the log detail panel showing N logs before/after from the same service.
-
-**Implementation**:
-
-- When a log is expanded, add a "Context" tab that queries ClickHouse:
-  ```sql
-  (SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time < {logTime} ORDER BY time DESC LIMIT 5)
-  UNION ALL
-  (SELECT * FROM LogItem WHERE projectId={pid} AND serviceId={sid} AND time >= {logTime} ORDER BY time ASC LIMIT 6)
-  ```
-- Display as a mini log list with the current log highlighted
-- Add to `LogDetailsPanel.tsx` as a tabbed section alongside the existing body/attributes view
-
-**Files to modify**:
-- `Common/Server/API/TelemetryAPI.ts` (add context endpoint)
-- `Common/UI/Components/LogsViewer/components/LogDetailsPanel.tsx` (add tabs + context view)
-
-### 3.2 Log Pipelines (Server-Side Processing)
-
-**Current**: Logs are stored raw as received (after OTLP normalization).
-**Target**: Configurable processing pipelines that transform logs at ingest time.
-
-**Implementation**:
-
-- Create `LogPipeline` and `LogPipelineProcessor` PostgreSQL models
-- Pipeline has: name, filter (which logs it applies to), enabled flag, sort order
-- Processor types (start with these 4):
-  - **Grok Parser**: Parse body text into structured attributes using Grok patterns
-  - **Attribute Remapper**: Rename/copy one attribute to another
-  - **Severity Remapper**: Map an attribute value to the severity field
-  - **Category Processor**: Assign a new attribute value based on if/else conditions
- Processing runs in the telemetry ingestion worker (`Telemetry/Jobs/TelemetryIngest/ProcessTelemetry.ts`) after normalization but before ClickHouse insert
- Pipeline configuration UI under Settings > Log Pipelines
-
-**Files to modify**:
- `Common/Models/DatabaseModels/LogPipeline.ts` (new)
- `Common/Models/DatabaseModels/LogPipelineProcessor.ts` (new)
- `Telemetry/Services/LogPipelineService.ts` (new - pipeline execution engine)
- `Telemetry/Services/OtelLogsIngestService.ts` (hook pipeline execution before insert)
- Dashboard: new Settings page for pipeline configuration
-
-### 3.3 Drop Filters (Pre-Storage Filtering)
-
-**Current**: All ingested logs are stored.
-**Target**: Configurable rules to drop or sample logs before storage.
-
-**Implementation**:
-
- Create `LogDropFilter` PostgreSQL model: name, filter query, action (drop or sample at N%), enabled
- Evaluate drop filters in the ingestion pipeline before ClickHouse insert
- UI under Settings > Log Configuration > Drop Filters
- Show estimated volume savings based on recent log volume
-
-### 3.4 Export to CSV/JSON
-
-**Current**: No export capability.
-**Target**: Download current filtered log results as CSV or JSON.
-
-**Implementation**:
-
- Add "Export" button in the toolbar
- Client-side: serialize current page of logs to CSV/JSON and trigger browser download
- Server-side (for large exports): new endpoint that streams results to a downloadable file (up to 10K rows)
-
-**Files to modify**:
- `Common/UI/Components/LogsViewer/components/LogsViewerToolbar.tsx` (add export button)
- `Common/UI/Utils/LogExport.ts` (new - CSV/JSON serialization)
- `Common/Server/API/TelemetryAPI.ts` (add export endpoint for large exports)
-
 ---

-## Phase 4: Advanced Features (P3) — Differentiation
+## Remaining Features

+### Log Patterns (ML Clustering) — P1

-### 4.2 Keyboard Shortcuts
+**Current**: No pattern detection.
+**Target**: Auto-cluster similar log messages and surface pattern groups with anomaly detection.

- `j`/`k` to navigate between log rows
- `Enter` to expand/collapse selected log
- `Escape` to close detail panel
- `/` to focus search bar
- `Ctrl+Enter` to apply filters
+### Log-based Metrics — P2

-### 4.3 Sensitive Data Scrubbing
+**Current**: No log-to-metric conversion.
+**Target**: Create count/distribution metrics from log queries with long-term retention.

- Auto-detect common PII patterns (credit cards, SSNs, emails) at ingest time
- Configurable scrubbing rules: mask, hash, or redact
- UI under Settings > Data Privacy
+### Data Retention Config UI — P3

---
-
-## Recommended Implementation Order
-
-1. **Phase 3.4** - Export CSV/JSON (small effort, table-stakes feature)
-2. **Phase 3.1** - Log Context (moderate effort, high debugging value)
-3. **Phase 3.2** - Log Pipelines (large effort, platform capability)
-4. **Phase 3.3** - Drop Filters (moderate effort, cost optimization)
-5. **Phase 4.x** - Patterns, Shortcuts, Data Scrubbing (future)
+**Current**: `retainTelemetryDataForDays` exists on the service model and is displayed in usage history, but there is no dedicated UI to configure retention settings.
+**Target**: Settings page for configuring per-service log data retention.

 ## Phase 5: ClickHouse Storage & Query Optimizations (P0) — Performance Foundation

@@ -205,18 +121,22 @@ These optimizations address fundamental storage and indexing gaps in the telemet
 | 5.3 DateTime64 time column | Sub-second log ordering | Correctness fix | Medium |
 | 5.7 Histogram projections | Histogram and severity aggregation | 5-10x | Medium |

-### 5.x Recommended Remaining Order
+---
+
+## Recommended Remaining Implementation Order

 1. **5.3** — DateTime64 upgrade (correctness)
 2. **5.7** — Projections (performance polish)
+3. **Log-based Metrics** (platform capability)
+4. **Data Retention Config UI** (operational)
+5. **Log Patterns / ML Clustering** (advanced, larger effort)

 ---

 ## Verification

-For each feature:
-1. Unit tests for new parsers/utilities (LogQueryParser, CSV export, etc.)
-2. Integration tests for new API endpoints (histogram, facets, analytics, context)
+For each remaining feature:
+1. Unit tests for new utilities
+2. Integration tests for new API endpoints
 3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/logs`
 4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries
-5. Verify real-time/live mode still works correctly with new UI components
--- a/Internal/Roadmap/Metrics.md
+++ b/Internal/Roadmap/Metrics.md
@@ -0,0 +1,468 @@
+# Plan: Bring OneUptime Metrics to Industry Parity and Beyond
+
+## Context
+
+OneUptime's metrics implementation provides OTLP ingestion (HTTP and gRPC), ClickHouse storage with support for Gauge, Sum, Histogram, and ExponentialHistogram metric types, basic aggregations (Avg, Sum, Min, Max, Count), single-attribute GROUP BY, formula support for calculated metrics, threshold-based metric monitors, and a metric explorer with line/bar charts. Auto-discovery creates MetricType metadata (name, description, unit) on first ingest. Data retention is configured per service via a TTL (default 15 days).
+
+This plan identifies the remaining gaps vs Datadog and New Relic, and proposes a phased implementation to close them and build a best-in-class metrics product.
+
+## Completed
+
+The following features have been implemented:
+- **OTLP Ingestion** - HTTP and gRPC metric ingestion with async queue-based batch processing
+- **Metric Types** - Gauge, Sum, Histogram, ExponentialHistogram support
+- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
+- **Aggregations** - Avg, Sum, Min, Max, Count
+- **Single-Attribute GROUP BY** - Group by one attribute at a time
+- **Formulas** - Calculated metrics using aliases (e.g., `a / b * 100`)
+- **Metric Explorer** - Time range selection, multiple queries with aliases, URL state persistence
+- **Threshold-Based Monitors** - Static threshold alerting on aggregated metric values
+- **MetricType Auto-Discovery** - Name, description, unit captured on first ingest
+- **Attribute Storage** - Full JSON with extracted `attributeKeys` array for fast enumeration
+- **BloomFilter index** on `name`, Set index on `serviceType`
+
+## Gap Analysis Summary
+
+| Feature | OneUptime | Datadog | New Relic | Priority |
+|---------|-----------|---------|-----------|----------|
+| Percentile aggregations (p50/p75/p90/p95/p99) | None | DDSketch distributions | NRQL percentile() | **P0** |
+| Rate/derivative calculations | None | Native Rate type + .as_rate() | rate() NRQL function | **P0** |
+| Multi-attribute GROUP BY | Single attribute only | Multiple tags | FACET on multiple attrs | **P0** |
+| Rollup/downsampling for long-range queries | None (raw data, 15-day TTL) | Automatic tiered rollups | 30-day raw + 13-month rollups | **P0** |
+| Anomaly detection | Static thresholds only | Watchdog + anomaly monitors | Anomaly detection + sigma bands | **P1** |
+| SLO/SLI tracking | None | Metric-based + Time Slice SLOs | One-click setup + error budgets | **P1** |
+| Heatmap visualization | None | Purpose-built for distributions | Built-in chart type | **P1** |
+| Time-over-time comparison | None | Yes | COMPARE WITH in NRQL | **P1** |
+| Summary metric type | Not supported | N/A (uses Distribution) | Yes | **P1** |
+| Query language | Form-based UI only | Graphing editor + NLQ | Full NRQL language | **P2** |
+| Predictive alerting | None | Watchdog forecasting | GA predictive alerting | **P2** |
+| Metric correlations | None | Auto-surfaces related metrics | Applied Intelligence correlation | **P2** |
+| Golden Signals dashboards | None | Available via APM | Pre-built with default alerts | **P2** |
+| Cardinality management | None | Metrics Without Limits + Explorer | Budget system + pruning rules | **P2** |
+| More chart types | Line and bar only | 12+ types | 10+ types with conditional coloring | **P2** |
+| Dashboard templates | None | Pre-built integration dashboards | Pre-built entity dashboards | **P2** |
+| Units on charts | Stored but not rendered | Auto-formatted by unit type | Y-axis unit customization | **P2** |
+| Natural language querying | None | NLQ translates English to queries | None | **P3** |
+| Metric cost/volume management | None | Cost attribution dashboards | Volume dashboards | **P3** |
+
+---
+
+## Phase 1: Foundation (P0) — Close Critical Gaps
+
+These are table-stakes features without which the metrics product is fundamentally limited.
+
+### 1.1 Percentile Aggregations (p50, p75, p90, p95, p99)
+
+**Current**: Only Avg, Sum, Min, Max, Count aggregations.
+**Target**: Support percentile aggregations on all metric data, especially histograms and distributions.
+
+**Implementation**:
+
+- Add `P50`, `P75`, `P90`, `P95`, `P99` to the `AggregationType` enum
+- For raw metric values: use ClickHouse `quantile(0.50)(value)`, `quantile(0.95)(value)`, etc.
+- For histogram data (with `bucketCounts` and `explicitBounds`): implement approximate percentile calculation from bucket data using linear interpolation between bucket boundaries
+- Update the metric query builder to include percentile options in the aggregation dropdown
+- Update chart rendering to display percentile series
+
+**Files to modify**:
+- `Common/Types/BaseDatabase/AggregationType.ts` (add P50, P75, P90, P95, P99)
+- `Common/Server/Services/MetricService.ts` (generate quantile SQL)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (add to dropdown)
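The histogram bullet above (linear interpolation between bucket boundaries) could look roughly like the sketch below. It is only an illustration: `estimateQuantile` is a hypothetical helper, not an existing OneUptime function, and it assumes the OTLP explicit-bounds layout where `bucketCounts` has one more entry than `explicitBounds`. Treating the first bucket as starting at 0 and capping the overflow bucket at the highest finite bound are assumptions, not part of the plan.

```typescript
// Hypothetical helper: estimate a quantile from an OTLP explicit-bounds histogram.
// bucketCounts has explicitBounds.length + 1 entries; bucket i covers
// (explicitBounds[i - 1], explicitBounds[i]], and the last bucket is open-ended.
function estimateQuantile(
  bucketCounts: number[],
  explicitBounds: number[],
  q: number,
): number | null {
  const total = bucketCounts.reduce((sum, count) => sum + count, 0);
  if (total === 0 || explicitBounds.length === 0 || q <= 0 || q > 1) {
    return null;
  }

  const targetRank = q * total;
  let cumulative = 0;

  for (let i = 0; i < bucketCounts.length; i++) {
    const count = bucketCounts[i] ?? 0;
    // Skip empty buckets and buckets that end before the target rank.
    if (count === 0 || cumulative + count < targetRank) {
      cumulative += count;
      continue;
    }

    // The overflow bucket has no upper bound: fall back to the highest finite bound.
    if (i >= explicitBounds.length) {
      return explicitBounds[explicitBounds.length - 1] ?? null;
    }

    const upper = explicitBounds[i] ?? 0;
    // Assumption: the first bucket starts at 0 when its upper bound is positive.
    const lower = i === 0 ? Math.min(0, upper) : (explicitBounds[i - 1] ?? 0);
    const fraction = (targetRank - cumulative) / count;
    return lower + (upper - lower) * fraction;
  }

  return explicitBounds[explicitBounds.length - 1] ?? null;
}
```

For example, `estimateQuantile([0, 10, 0], [100, 200], 0.5)` lands halfway through the 100-200 bucket and returns 150. The estimate is only as precise as the bucket layout, so wide buckets give coarse percentiles.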
+
+### 1.2 Rate/Derivative Calculations
+
+**Current**: No rate or delta computation. Raw cumulative counters are meaningless without rate calculation.
+**Target**: Compute per-second rates and deltas from counter/sum metrics.
+
+**Implementation**:
+
+- Add `Rate` and `Delta` as aggregation options
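+- For cumulative sums: compute `(value[t] - value[t-1]) / (time[t] - time[t-1])` per series using ClickHouse `runningDifference()` or a window function such as `lagInFrame()`; note that `runningDifference()` only compares rows within a single data block, so rows must be ordered per series and results checked at block boundaries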
+- Handle counter resets (when value decreases, treat as reset and skip that interval)
+- For delta temporality sums: rate is simply `value / interval_seconds`
+- Display rate with appropriate units (e.g., "req/s", "bytes/s")
+
+**Files to modify**:
+- `Common/Types/BaseDatabase/AggregationType.ts` (add Rate, Delta)
+- `Common/Server/Services/MetricService.ts` (generate rate SQL with runningDifference)
+- `Common/Types/Metrics/MetricsQuery.ts` (support rate in query config)
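The rate and counter-reset rules above can be illustrated outside of SQL. The sketch below is a hypothetical in-memory version (the `CounterPoint` and `cumulativeToRate` names are invented for illustration) and assumes the points belong to one series and are already sorted by time.

```typescript
interface CounterPoint {
  timeMs: number; // sample timestamp in milliseconds
  value: number; // cumulative counter value
}

interface RatePoint {
  timeMs: number;
  perSecond: number;
}

// Convert a cumulative counter series into per-second rates.
// Intervals where the counter decreases are treated as resets and skipped.
function cumulativeToRate(points: CounterPoint[]): RatePoint[] {
  const rates: RatePoint[] = [];
  for (let i = 1; i < points.length; i++) {
    const previous = points[i - 1];
    const current = points[i];
    if (!previous || !current) {
      continue;
    }
    const elapsedSeconds = (current.timeMs - previous.timeMs) / 1000;
    const delta = current.value - previous.value;
    // Skip counter resets (negative delta) and duplicate or out-of-order timestamps.
    if (elapsedSeconds <= 0 || delta < 0) {
      continue;
    }
    rates.push({ timeMs: current.timeMs, perSecond: delta / elapsedSeconds });
  }
  return rates;
}
```

Skipping the reset interval loses one data point but avoids a large negative spike; an alternative (not shown) is to treat the post-reset value itself as the delta.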
+
+### 1.3 Multi-Attribute GROUP BY
+
+**Current**: Single `groupByAttribute: string` field.
+**Target**: Group by multiple attributes simultaneously (e.g., by region AND status_code).
+
+**Implementation**:
+
+- Change `groupByAttribute` from `string` to `string[]` in `MetricsQuery`
+- Update ClickHouse query generation to GROUP BY multiple extracted JSON attributes
+- Update chart rendering to handle multi-dimensional grouping (composite legend labels)
+- Update the UI to allow selecting multiple group-by attributes
+
+**Files to modify**:
+- `Common/Types/Metrics/MetricsQuery.ts` (change type)
+- `Common/Server/Services/MetricService.ts` (update query generation)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryConfig.tsx` (multi-select UI)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (composite legends)
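As one possible reading of "composite legend labels", the sketch below groups result rows by several attributes at once and derives a label per series. `MetricRow`, `seriesLabel`, and `groupIntoSeries` are hypothetical names used only for illustration, and the `"(unset)"` placeholder for missing attributes is an assumption.

```typescript
interface MetricRow {
  attributes: Record<string, string>;
  value: number;
}

// Build a composite legend label such as "region=us, status_code=200".
function seriesLabel(attributes: Record<string, string>, groupBy: string[]): string {
  return groupBy
    .map((key) => key + "=" + (attributes[key] ?? "(unset)"))
    .join(", ");
}

// Bucket rows into one series per unique combination of group-by attribute values.
function groupIntoSeries(rows: MetricRow[], groupBy: string[]): Record<string, number[]> {
  const series: Record<string, number[]> = {};
  for (const row of rows) {
    const label = seriesLabel(row.attributes, groupBy);
    const values = series[label] ?? [];
    values.push(row.value);
    series[label] = values;
  }
  return series;
}
```

Keeping the attribute order from the `groupBy` array makes labels stable between refreshes, which matters if chart colors are keyed by label.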
+
+### 1.4 Rollups / Downsampling
+
+**Current**: Raw data only with 15-day default TTL. Without rollups, long-range queries are slow and historical analysis is limited to the retention window.
+**Target**: Pre-aggregated rollups at multiple resolutions with extended retention.
+
+**Implementation**:
+
+- Create ClickHouse materialized views for automatic rollup:
+  ```
+  Raw Data (1s resolution) -> 15-day retention
+    |-> Materialized View -> 1-min rollups -> 90-day retention
+    |-> Materialized View -> 1-hour rollups -> 13-month retention
+    |-> Materialized View -> 1-day rollups -> 3-year retention
+  ```
+- Each rollup table stores: min, max, sum, count, avg (derivable as sum / count when merging buckets), and quantile sketches per metric name + attributes
+- Route queries based on time range:
+  - < 6 hours: raw data
+  - 6 hours - 7 days: 1-min rollups
+  - 7 days - 30 days: 1-hour rollups
+  - 30+ days: 1-day rollups
+- Automatic query routing in the metric service layer
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/MetricRollup1Min.ts` (new)
+- `Common/Models/AnalyticsModels/MetricRollup1Hour.ts` (new)
+- `Common/Models/AnalyticsModels/MetricRollup1Day.ts` (new)
+- `Common/Server/Services/MetricService.ts` (query routing by time range)
+- `Worker/DataMigrations/` (new migration to create materialized views)
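The routing thresholds and retention tiers listed above might be combined as in the sketch below. The tier names, the `pickMetricSource` function, and the choice to also check retention (so that a short window over old data skips the raw table) are assumptions for illustration; 13 months is approximated as 395 days.

```typescript
type MetricSource = "raw" | "rollup_1min" | "rollup_1hour" | "rollup_1day";

interface Tier {
  source: MetricSource;
  maxRangeMs: number; // use this tier only for ranges shorter than this
  retentionMs: number; // data older than this no longer exists in the tier
}

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Ordered from finest to coarsest resolution, mirroring the plan's thresholds.
const TIERS: Tier[] = [
  { source: "raw", maxRangeMs: 6 * HOUR_MS, retentionMs: 15 * DAY_MS },
  { source: "rollup_1min", maxRangeMs: 7 * DAY_MS, retentionMs: 90 * DAY_MS },
  { source: "rollup_1hour", maxRangeMs: 30 * DAY_MS, retentionMs: 395 * DAY_MS },
  { source: "rollup_1day", maxRangeMs: Infinity, retentionMs: 3 * 365 * DAY_MS },
];

// Pick the finest tier that fits the query range and still retains the start time.
function pickMetricSource(startMs: number, endMs: number, nowMs: number): MetricSource {
  const rangeMs = endMs - startMs;
  const ageMs = nowMs - startMs;
  for (const tier of TIERS) {
    if (rangeMs < tier.maxRangeMs && ageMs <= tier.retentionMs) {
      return tier.source;
    }
  }
  return "rollup_1day";
}
```

With this shape, a one-hour window from 20 days ago resolves to the 1-minute rollups rather than the raw table, because raw data would already have expired under the 15-day TTL.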
+
+---
+
+## Phase 2: Visualization & UX (P1) — Match Industry Standard
+
+### 2.1 More Chart Types
+
+**Current**: Line and bar charts only.
+**Target**: Add Heatmap, Stacked Area, Pie/Donut, Scatter, Single-Value Billboard, and Gauge.
+
+**Implementation**:
+
+- **Heatmap**: Essential for histogram/distribution data. Use a heatmap library that renders time on X-axis, bucket values on Y-axis, and color intensity for count
+- **Stacked Area**: Extension of existing line chart with fill and stacking
+- **Pie/Donut**: For showing proportional breakdowns (e.g., request distribution by service)
+- **Scatter**: For correlation analysis between two metrics
+- **Billboard**: Large single-value display with configurable thresholds for color coding (green/yellow/red)
+- **Gauge**: Circular gauge showing a value against a min/max range
+
+**Files to modify**:
+- `Common/Types/Dashboard/Chart/ChartType.ts` (add new chart types)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render new chart types)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricCharts.tsx` (chart type selector)
+
+### 2.2 Time-Over-Time Comparison
+
+**Current**: No comparison capability.
+**Target**: Overlay current metric data with data from a previous period (1h ago, 1d ago, 1w ago).
+
+**Implementation**:
+
+- Add a "Compare with" dropdown in the metric explorer toolbar (options: 1 hour ago, 1 day ago, 1 week ago, custom)
+- Execute the same query twice with shifted time ranges
+- Render the comparison series as a dashed/translucent overlay on the same chart
+- Show the delta (absolute and percentage) in tooltips
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricExplorer.tsx` (add compare dropdown)
+- `Common/Types/Metrics/MetricsQuery.ts` (add compareWith field)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render comparison series)
+
+### 2.3 Render Metric Units on Charts
+
+**Current**: Units stored in MetricType but not rendered on chart axes.
+**Target**: Display units on Y-axis labels and tooltips with smart formatting.
+
+**Implementation**:
+
+- Pass `MetricType.unit` through to chart rendering
+- Implement unit-aware formatting:
+  - Bytes: auto-convert to KB/MB/GB/TB
+  - Duration: auto-convert ns/us/ms/s
+  - Percentage: append `%`
+  - Rate: append `/s`
+- Display formatted unit on Y-axis label and in tooltip values
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (Y-axis unit formatting)
+- `Common/Utils/Metrics/UnitFormatter.ts` (new - unit formatting logic)
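A minimal sketch of the unit-aware formatting described above follows. The function names are hypothetical, the 1024 step for bytes is an assumption (the plan does not say whether KB means 1000 or 1024), and the unit strings `By`, `ns`, `us`, `ms`, `s`, and `%` are assumed to be what `MetricType.unit` carries.

```typescript
// Scale a value through a list of units, dividing by `step` at each level.
function scaleWithUnits(value: number, step: number, units: string[]): string {
  let scaled = Math.abs(value);
  let index = 0;
  while (scaled >= step && index < units.length - 1) {
    scaled /= step;
    index++;
  }
  const sign = value < 0 ? "-" : "";
  // Show one decimal place only when the scaled value is not a whole number.
  const digits = scaled % 1 === 0 ? 0 : 1;
  return sign + scaled.toFixed(digits) + " " + (units[index] ?? "");
}

function formatBytes(bytes: number): string {
  return scaleWithUnits(bytes, 1024, ["B", "KB", "MB", "GB", "TB"]);
}

function formatDurationNs(nanoseconds: number): string {
  return scaleWithUnits(nanoseconds, 1000, ["ns", "us", "ms", "s"]);
}

// Dispatch on the unit string stored with the metric (assumed values).
function formatMetricValue(value: number, unit: string): string {
  switch (unit) {
    case "By":
      return formatBytes(value);
    case "ns":
      return formatDurationNs(value);
    case "us":
      return formatDurationNs(value * 1e3);
    case "ms":
      return formatDurationNs(value * 1e6);
    case "s":
      return formatDurationNs(value * 1e9);
    case "%":
      return value + "%";
    default:
      return unit ? value + " " + unit : String(value);
  }
}
```

Normalizing every duration to nanoseconds first keeps a single scaling path, at the cost of losing precision for extremely large second-based values.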
+
+### 2.4 Dashboard Templates
+
+**Current**: No templates.
+**Target**: Pre-built dashboards for common scenarios that auto-populate based on detected metrics.
+
+**Implementation**:
+
+- Create MetricsViewConfig templates for:
+  - HTTP Service Health (request rate, error rate, latency percentiles)
+  - Database Performance (query duration, connection pool, error rate)
+  - Kubernetes Metrics (CPU, memory, pod restarts, network)
+  - Host Metrics (CPU, memory, disk, network)
+  - Runtime Metrics (GC, heap, threads - per language)
+- Auto-detect which templates are relevant based on ingested metric names
+- "One-click apply" creates a dashboard from the template
+
+**Files to modify**:
+- `Common/Types/Metrics/DashboardTemplates/` (new directory with template definitions)
+- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (new - template gallery)
+
+### 2.5 Summary Metric Type Support
+
+**Current**: Summary type not supported.
+**Target**: Ingest and store Summary metrics from OTLP.
+
+**Implementation**:
+
+- Add `Summary` to the metric point type enum
+- Store quantile values from summary data points
+- Display summary quantiles in the metric explorer
+
+**Files to modify**:
+- `Telemetry/Services/OtelMetricsIngestService.ts` (handle summary type)
+- `Common/Models/AnalyticsModels/Metric.ts` (add summary-specific columns if needed)
+
+---
+
+## Phase 3: Alerting & Intelligence (P1-P2) — Smart Monitoring
+
+### 3.1 Anomaly Detection
+
+**Current**: Static threshold alerting only.
+**Target**: Detect metrics deviating from expected patterns using statistical methods.
+
+**Implementation**:
+
+- Start with rolling mean + N standard deviations (configurable sensitivity: low/medium/high)
+- Account for daily/weekly seasonality by comparing to same-time-last-week baselines
+- Store baselines in ClickHouse (periodic computation job, hourly)
+- Baseline table: metric name, service, hour_of_week, mean, stddev
+- On each evaluation: compare the current value to the baseline, and alert if it deviates from the mean by more than the configured number of standard deviations (sigma)
+- Surface anomalies as visual highlights on metric charts (shaded band showing expected range)
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/MetricBaseline.ts` (new - baseline storage)
+- `Worker/Jobs/Metrics/ComputeMetricBaselines.ts` (new - periodic baseline computation)
+- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add anomaly detection)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render anomaly bands)
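The baseline comparison above can be sketched as follows. All names here are hypothetical, the baseline uses the population standard deviation, and `hourOfWeek` is computed in UTC; whether baselines should follow the project's local time zone is left open by the plan.

```typescript
interface Baseline {
  mean: number;
  stddev: number;
}

// Bucket index 0..167: hour of the week in UTC, with Sunday 00:00 as 0.
function hourOfWeek(date: Date): number {
  return date.getUTCDay() * 24 + date.getUTCHours();
}

// Mean and population standard deviation of historical samples for one bucket.
function computeBaseline(samples: number[]): Baseline | null {
  if (samples.length === 0) {
    return null;
  }
  const mean = samples.reduce((sum, v) => sum + v, 0) / samples.length;
  const variance =
    samples.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / samples.length;
  return { mean, stddev: Math.sqrt(variance) };
}

// Expected range to shade on the chart for a given sensitivity.
function expectedBand(baseline: Baseline, sigma: number): { lower: number; upper: number } {
  return {
    lower: baseline.mean - sigma * baseline.stddev,
    upper: baseline.mean + sigma * baseline.stddev,
  };
}

function isAnomalous(value: number, baseline: Baseline, sigma: number): boolean {
  const band = expectedBand(baseline, sigma);
  return value < band.lower || value > band.upper;
}
```

A low/medium/high sensitivity setting could map to sigma values such as 4, 3, and 2; those numbers are illustrative and would need tuning against real traffic.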
+
+### 3.2 SLO/SLI Tracking
+
+**Current**: No SLO support.
+**Target**: Define Service Level Objectives based on metric queries, track attainment over rolling windows, calculate error budgets.
+
+**Implementation**:
+
+- Create `SLO` PostgreSQL model:
+  - Name, description, target percentage (e.g., 99.9%)
+  - SLI definition: good events query / total events query (both metric queries)
+  - Time window: 7-day, 28-day, or 30-day rolling
+  - Alert thresholds: error budget remaining %, burn rate
+- SLO dashboard page showing:
+  - Current attainment vs target (e.g., 99.85% / 99.9%)
+  - Error budget remaining (absolute and percentage)
+  - Burn rate chart (current burn rate vs sustainable burn rate)
+  - SLI time series chart
+- Alert when error budget drops below threshold or burn rate exceeds sustainable rate
+- Integrate with existing monitor/incident system
+
+**Files to modify**:
+- `Common/Models/DatabaseModels/SLO.ts` (new)
+- `Common/Server/Services/SLOService.ts` (new - SLI computation, budget calculation)
+- `Worker/Jobs/SLO/EvaluateSLOs.ts` (new - periodic SLO evaluation)
+- `App/FeatureSet/Dashboard/src/Pages/SLO/` (new - SLO list, detail, creation pages)
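To make the attainment, error budget, and burn rate terms above concrete, here is a small sketch. The `evaluateSlo` and `burnRate` names and the result shape are hypothetical; the arithmetic follows the common definition where burn rate is the observed error rate divided by the error rate the target allows.

```typescript
interface SloStatus {
  attainment: number; // good / total, e.g. 0.9985
  budgetConsumed: number; // fraction of the error budget used (can exceed 1)
  budgetRemaining: number; // 1 - budgetConsumed (negative when overspent)
}

// Observed error rate divided by the allowed error rate (1 - target).
// Returns null when there is no traffic or the target leaves no budget.
function burnRate(goodEvents: number, totalEvents: number, target: number): number | null {
  const allowedErrorRate = 1 - target;
  if (totalEvents <= 0 || allowedErrorRate <= 0) {
    return null;
  }
  const observedErrorRate = (totalEvents - goodEvents) / totalEvents;
  return observedErrorRate / allowedErrorRate;
}

// Evaluate counts that span the full SLO window (e.g. 30 rolling days).
function evaluateSlo(goodEvents: number, totalEvents: number, target: number): SloStatus | null {
  const consumed = burnRate(goodEvents, totalEvents, target);
  if (consumed === null) {
    return null;
  }
  return {
    attainment: goodEvents / totalEvents,
    budgetConsumed: consumed,
    budgetRemaining: 1 - consumed,
  };
}
```

Using the plan's own example, 99.85% attainment against a 99.9% target consumes about 150% of the budget. Calling `burnRate` on a short lookback (say the last hour) rather than the full window gives the value to compare against the sustainable rate of 1.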
+
+### 3.3 Metric Correlations
+
+**Current**: No correlation capability.
+**Target**: When an anomaly is detected, automatically identify other metrics that changed around the same time.
+
+**Implementation**:
+
+- When an anomaly is detected on a metric, query all metrics for the same service/project in the surrounding time window (e.g., +/- 30 minutes)
+- Compute Pearson correlation coefficient between the anomalous metric and each candidate
+- Rank by absolute correlation value
+- Surface top 5-10 correlated metrics in the alert/incident view
+- Show correlation chart: anomalous metric overlaid with top correlated metrics
+
+**Files to modify**:
+- `Common/Server/Services/MetricCorrelationService.ts` (new)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/CorrelatedMetrics.tsx` (new - correlation view)
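The Pearson ranking step above might look like this sketch. `pearson` and `rankByCorrelation` are hypothetical names, and the code assumes every series has already been resampled onto the same time buckets; aligning the series is not shown.

```typescript
// Pearson correlation coefficient of two equally sampled series.
// Returns null when there are fewer than two points or either series is constant.
function pearson(xs: number[], ys: number[]): number | null {
  const n = Math.min(xs.length, ys.length);
  if (n < 2) {
    return null;
  }
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i] ?? 0;
    sumY += ys[i] ?? 0;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = (xs[i] ?? 0) - meanX;
    const dy = (ys[i] ?? 0) - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  if (varianceX === 0 || varianceY === 0) {
    return null;
  }
  return covariance / Math.sqrt(varianceX * varianceY);
}

interface CorrelatedMetric {
  name: string;
  correlation: number;
}

// Rank candidate metrics by absolute correlation with the anomalous series.
function rankByCorrelation(
  anomalous: number[],
  candidates: Record<string, number[]>,
  topN: number,
): CorrelatedMetric[] {
  const ranked: CorrelatedMetric[] = [];
  for (const name of Object.keys(candidates)) {
    const r = pearson(anomalous, candidates[name] ?? []);
    if (r !== null) {
      ranked.push({ name: name, correlation: r });
    }
  }
  ranked.sort((a, b) => Math.abs(b.correlation) - Math.abs(a.correlation));
  return ranked.slice(0, topN);
}
```

Pearson only captures linear, same-time relationships, so a metric that leads the anomaly by a few minutes can score low; lagged correlation would be a separate extension.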
+
+---
+
+## Phase 4: Scale & Power Features (P2-P3) — Differentiation
+
+### 4.1 Cardinality Management
+
+**Current**: No cardinality visibility or controls.
+**Target**: Track unique series count, alert on spikes, allow attribute allowlist/blocklist.
+
+**Implementation**:
+
+- Track unique series count per metric name (via periodic ClickHouse `uniq()` queries)
+- Store in a dedicated cardinality tracking table
+- Dashboard showing: top metrics by cardinality, cardinality trend over time, per-attribute breakdown
+- Allow configuring attribute allowlists/blocklists per metric (applied at ingest time)
+- Alert when cardinality exceeds configured budget
+
+**Files to modify**:
+- `Worker/Jobs/Metrics/TrackMetricCardinality.ts` (new - periodic cardinality computation)
+- `Common/Models/DatabaseModels/MetricCardinalityConfig.ts` (new - allowlist/blocklist)
+- `Telemetry/Services/OtelMetricsIngestService.ts` (apply attribute filtering)
+- `App/FeatureSet/Dashboard/src/Pages/Settings/MetricCardinality.tsx` (new - cardinality dashboard)
+
+### 4.2 Query Language
+
+**Current**: Form-based UI only.
+**Target**: Text-based metrics query language inspired by PromQL/NRQL for advanced users.
+
+**Implementation**:
+
+- Define a grammar supporting:
+  ```
+  metric_name{attribute="value", attribute2=~"regex"}
+    | aggregation(duration)
+    by (attribute1, attribute2)
+  ```
+- Build a parser that translates to the existing ClickHouse query builder
+- Offer both UI builder and text modes (toggle like New Relic's basic/advanced)
+- Syntax highlighting and autocomplete in the text editor (metric names, attribute keys, functions)
+- Functions: `rate()`, `delta()`, `avg()`, `sum()`, `min()`, `max()`, `p50()`, `p95()`, `p99()`, `count()`, `topk()`, `bottomk()`
+
+**Files to modify**:
+- `Common/Utils/Metrics/MetricsQueryLanguage.ts` (new - parser and translator)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricQueryEditor.tsx` (new - text editor with autocomplete)
+
+### 4.3 Golden Signals Dashboards
+
+**Current**: No auto-generated dashboards.
+**Target**: Auto-generated dashboards showing Latency, Traffic, Errors, Saturation for each service.
+
+**Implementation**:
+
+- Detect common OpenTelemetry metric names per service:
+  - Latency: `http.server.duration`, `http.server.request.duration`
+  - Traffic: `http.server.request.count`, `http.server.active_requests`
+  - Errors: `http.server.request.count` where status_code >= 500
+  - Saturation: `process.runtime.*.memory`, `system.cpu.utilization`
+- Auto-create a Golden Signals dashboard for each service with detected metrics
+- Include default alert thresholds
+
+**Files to modify**:
+- `Worker/Jobs/Metrics/GenerateGoldenSignalsDashboards.ts` (new)
+- `Common/Utils/Metrics/GoldenSignalsDetector.ts` (new - metric name pattern matching)
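The metric-name matching could be as simple as the sketch below. The pattern table copies the names listed above; the function names and the rule that `*` matches exactly one dot-separated segment are assumptions. Errors are not detected separately here because, per the list above, they reuse the traffic metric with a `status_code >= 500` filter applied at query time.

```typescript
type GoldenSignal = "latency" | "traffic" | "saturation";

const SIGNAL_PATTERNS: Record<GoldenSignal, string[]> = {
  latency: ["http.server.duration", "http.server.request.duration"],
  traffic: ["http.server.request.count", "http.server.active_requests"],
  saturation: ["process.runtime.*.memory", "system.cpu.utilization"],
};

// Convert a glob where "*" stands for one dot-separated segment into a RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, "[^.]+");
  return new RegExp("^" + escaped + "$");
}

// Map each golden signal to the ingested metric names that match its patterns.
function detectGoldenSignals(metricNames: string[]): Record<GoldenSignal, string[]> {
  const detected: Record<GoldenSignal, string[]> = {
    latency: [],
    traffic: [],
    saturation: [],
  };
  const signals: GoldenSignal[] = ["latency", "traffic", "saturation"];
  for (const signal of signals) {
    const matchers = SIGNAL_PATTERNS[signal].map((pattern) => globToRegExp(pattern));
    for (const metricName of metricNames) {
      if (matchers.some((matcher) => matcher.test(metricName))) {
        detected[signal].push(metricName);
      }
    }
  }
  return detected;
}
```

A service would qualify for an auto-generated dashboard once at least one signal has a match; how many signals are required is a product decision the plan leaves open.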
+
+### 4.4 Predictive Alerting
+
+**Current**: No forecasting capability.
+**Target**: Forecast metric values and alert before thresholds are breached.
+
+**Implementation**:
+
+- Use linear regression or Holt-Winters on recent data to project forward
+- Alert if projected value crosses threshold within configurable forecast window (e.g., "disk full in 4 hours")
+- Particularly valuable for capacity planning metrics (disk, memory, connection pools)
+- Show forecast as a dashed line extension on metric charts
+
+**Files to modify**:
+- `Common/Server/Utils/Monitor/Criteria/MetricMonitorCriteria.ts` (add predictive evaluation)
+- `Common/Utils/Metrics/Forecasting.ts` (new - regression/Holt-Winters)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render forecast line)
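The linear-regression option above can be sketched with ordinary least squares. `fitLine` and `timeUntilThreshold` are hypothetical names, the code only handles thresholds approached from below (for example disk usage rising toward 100%), and Holt-Winters is not covered.

```typescript
interface Sample {
  t: number; // time in any consistent unit, e.g. hours
  value: number;
}

interface Trend {
  slope: number;
  intercept: number;
}

// Ordinary least-squares fit of value = slope * t + intercept.
function fitLine(samples: Sample[]): Trend | null {
  const n = samples.length;
  if (n < 2) {
    return null;
  }
  let sumT = 0;
  let sumV = 0;
  let sumTV = 0;
  let sumTT = 0;
  for (const sample of samples) {
    sumT += sample.t;
    sumV += sample.value;
    sumTV += sample.t * sample.value;
    sumTT += sample.t * sample.t;
  }
  const denominator = n * sumTT - sumT * sumT;
  if (denominator === 0) {
    return null; // all samples share the same timestamp
  }
  const slope = (n * sumTV - sumT * sumV) / denominator;
  const intercept = (sumV - slope * sumT) / n;
  return { slope: slope, intercept: intercept };
}

// Time remaining (same unit as t) until the fitted line reaches the threshold,
// measured from the last sample. Returns null for flat or falling trends.
function timeUntilThreshold(samples: Sample[], threshold: number): number | null {
  const trend = fitLine(samples);
  const last = samples[samples.length - 1];
  if (!trend || !last || trend.slope <= 0) {
    return null;
  }
  const crossingTime = (threshold - trend.intercept) / trend.slope;
  return Math.max(0, crossingTime - last.t);
}
```

A monitor could then alert when the returned value is at or below the configured forecast window, for example when disk usage of 50%, 60%, 70% over three hourly samples projects to reach 100% in 3 hours and the window is 4 hours.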
+
+---
+
+## ClickHouse Storage Improvements
+
+### S.1 Fix Sort Key Order (CRITICAL)
+
+**Current**: Sort key is `(projectId, time, serviceId)`.
+**Target**: Change to `(projectId, name, serviceId, time)`.
+
+**Impact**: Estimated ~100x improvement for name-filtered queries (to be confirmed by benchmarking). Virtually every metric query filters by `name`, but with the current key ClickHouse must scan all metric names within the time range.
+
+**Migration**: Requires creating `MetricItem_v2` with the new sort key and migrating data. ClickHouse's `ALTER TABLE ... MODIFY ORDER BY` can only append newly added columns to the end of the sorting key; it cannot reorder existing key columns.
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/Metric.ts` (change sort key)
+- `Worker/DataMigrations/` (new migration - create v2 table, backfill, swap)
+
+### S.2 Upgrade time to DateTime64 (HIGH)
+
+**Current**: `DateTime` with second precision.
+**Target**: `DateTime64(3)` or `DateTime64(6)` for sub-second precision.
+
+**Impact**: Correct sub-second metric ordering. May also reduce reliance on the separate `timeUnixNano`/`startTimeUnixNano` columns, although only `DateTime64(9)` preserves full nanosecond precision.
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/Metric.ts` (change column type)
+- `Common/Types/AnalyticsDatabase/TableColumnType.ts` (add DateTime64 type if not present)
+- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle DateTime64)
+- `Worker/DataMigrations/` (migration)
+
+### S.3 Add Skip Index on metricPointType (MEDIUM)
+
+**Current**: No index support for metric type filtering.
+**Target**: Set skip index on `metricPointType`.
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/Metric.ts` (add skip index)
+
+### S.4 Evaluate Map Type for Attributes (MEDIUM)
+
+**Current**: Attributes stored as JSON.
+**Target**: Evaluate `Map(LowCardinality(String), String)` for faster attribute-based filtering.
+
+### S.5 Upgrade count/bucketCounts to Int64 (LOW)
+
+**Current**: `Int32` for count and `Array(Int32)` for bucketCounts.
+**Target**: `Int64` / `Array(Int64)` to prevent overflow in high-throughput systems.
+
+---
+
+## Quick Wins (Can Ship This Week)
+
+1. **Display units on chart Y-axes** - Data exists in MetricType, just needs wiring to chart rendering
+2. **Add p50/p95/p99 to aggregation dropdown** - ClickHouse `quantile()` is straightforward to add
+3. **Extend default retention** - 15 days is too short; increase default to 30 days
+4. **Multi-attribute GROUP BY** - Change `groupByAttribute: string` to `groupByAttribute: string[]`
+5. **Add stacked area chart type** - Simple extension of existing line chart
+6. **Add skip index on metricPointType** - Low effort, faster type-filtered queries
+
+---
+
+## Recommended Implementation Order
+
+1. **Quick Wins** - Ship units on charts, p50/p95/p99, multi-attribute GROUP BY, stacked area
+2. **Phase 1.1** - Percentile aggregations (full implementation beyond quick win)
+3. **Phase 1.2** - Rate/derivative calculations
+4. **S.1** - Fix sort key order (critical performance improvement)
+5. **Phase 1.4** - Rollups/downsampling (enables long-range queries)
+6. **Phase 2.1** - More chart types (heatmap, pie, gauge, billboard)
+7. **Phase 2.2** - Time-over-time comparison
+8. **Phase 1.3** - Multi-attribute GROUP BY (full implementation)
+9. **S.2** - Upgrade time to DateTime64
+10. **Phase 3.1** - Anomaly detection
+11. **Phase 3.2** - SLO/SLI tracking
+12. **Phase 2.4** - Dashboard templates
+13. **Phase 4.1** - Cardinality management
+14. **Phase 4.2** - Query language
+15. **Phase 4.3** - Golden Signals dashboards
+16. **Phase 4.4** - Predictive alerting
+17. **Phase 3.3** - Metric correlations
+
+## Verification
+
+For each feature:
+1. Unit tests for new aggregation types, rate calculations, unit formatting, query language parser
+2. Integration tests for new API endpoints (percentiles, rollup queries, SLO evaluation)
+3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/metrics`
+4. Check ClickHouse query performance with `EXPLAIN` for new query patterns
+5. Verify rollup accuracy by comparing rollup results to raw data results for overlapping time ranges
+6. Load test cardinality tracking and anomaly detection jobs to ensure they don't impact ingestion
--- a/Internal/Roadmap/Traces.md
+++ b/Internal/Roadmap/Traces.md
@@ -0,0 +1,412 @@
+# Plan: Bring OneUptime Traces to Industry Parity and Beyond
+
+## Context
+
+OneUptime's trace implementation provides OTLP-native ingestion (HTTP and gRPC), ClickHouse storage with a full OpenTelemetry span model (events, links, status, attributes, resources, scope), a Gantt/waterfall visualization, trace-to-log and trace-to-exception correlation, a basic service dependency graph, queue-based async ingestion, and per-service data retention with TTL. ClickHouse schema has been optimized with BloomFilter indexes on traceId/spanId/parentSpanId, Set indexes on statusCode/kind/hasException, TokenBF on name, and ZSTD compression on key columns.
+
+This plan identifies the remaining gaps vs Datadog, New Relic, Honeycomb, and Grafana Tempo, and proposes a phased implementation to close them and surpass the competition.
+
+## Completed
+
+The following features have been implemented:
+- **OTLP Ingestion** - HTTP and gRPC trace ingestion with async queue-based processing
+- **ClickHouse Storage** - MergeTree with `sipHash64(projectId) % 16` partitioning, per-service TTL
+- **Gantt/Waterfall View** - Hierarchical span visualization with color-coded services, time-unit auto-scaling, error indicators
+- **Trace-to-Log Correlation** - Log model has traceId/spanId columns; SpanViewer shows associated logs
+- **Trace-to-Exception Correlation** - ExceptionInstance model links to traceId/spanId with stack trace parsing and fingerprinting
+- **Span Detail Panel** - Side-over with tabs for Basic Info, Logs, Attributes, Events, Exceptions
+- **BloomFilter indexes** on traceId, spanId, parentSpanId
+- **Set indexes** on statusCode, kind, hasException
+- **TokenBF index** on name
+- **ZSTD compression** on time/ID/attribute columns
+- **hasException boolean column** for fast error span filtering
+- **links default value** corrected to `[]`
+
+## Gap Analysis Summary
+
+| Feature | OneUptime | Datadog | New Relic | Tempo/Honeycomb | Priority |
+|---------|-----------|---------|----------|-----------------|----------|
+| Trace analytics / aggregation engine | None | Trace Explorer with COUNT/percentiles | NRQL on span data | TraceQL rate/count/quantile | **P0** |
+| RED metrics from traces | None | Auto-computed on 100% traffic | Derived golden signals | Metrics-generator to Prometheus | **P0** |
+| Trace-based alerting | None | APM Monitors (p50-p99, error rate, Apdex) | NRQL alert conditions | Via Grafana alerting / Triggers | **P0** |
+| Sampling controls | None (100% ingestion) | Head-based adaptive + retention filters | Infinite Tracing (tail-based) | Refinery (rules/dynamic/tail) | **P0** |
+| Flame graph view | None | Yes (default view) | No | No | **P1** |
+| Latency breakdown / critical path | None | Per-hop latency, bottleneck detection | No | BubbleUp (Honeycomb) | **P1** |
+| In-trace search | None | Yes | No | No | **P1** |
+| Per-trace service map | None | Yes (Map view) | No | No | **P1** |
+| Trace-to-metric exemplars | None | Pivot from metric graph to traces | Metric-to-trace linking | Prometheus exemplars | **P1** |
+| Custom metrics from spans | None | Generate count/distribution/gauge from tags | Via NRQL | SLOs from span data | **P2** |
+| Structural trace queries | None | Trace Queries (multi-span relationships) | Via NRQL | TraceQL spanset pipelines | **P2** |
+| Trace comparison / diffing | None | Partial | Side-by-side comparison | compare() in TraceQL | **P2** |
+| AI/ML on traces | None | Watchdog (auto anomaly + RCA) | NRAI | BubbleUp (pattern detection) | **P3** |
+| RUM correlation | None | Frontend-to-backend trace linking | Yes | Faro / frontend observability | **P3** |
+| Continuous profiling | None | Code Hotspots (span-to-profile) | Partial | Pyroscope | **P3** |
+
+---
+
+## Phase 1: Analytics & Alerting Foundation (P0) — Highest Impact
+
+Without these, users cannot answer basic questions like "is my service healthy?" from trace data.
+
+### 1.1 Trace Analytics / Aggregation Engine
+
+**Current**: Can list/filter individual spans and view individual traces. No way to aggregate or compute statistics.
+**Target**: Full trace analytics supporting COUNT, AVG, SUM, MIN, MAX, P50/P75/P90/P95/P99 aggregations with GROUP BY on any span attribute and time-series bucketing.
+
+**Implementation**:
+
+- Build a trace analytics API endpoint that translates query configs into ClickHouse aggregation queries
+- Use ClickHouse's native functions: `quantile(0.99)(durationUnixNano)`, `countIf(statusCode = 2)`, `toStartOfInterval(startTime, INTERVAL 1 MINUTE)`
+- Support GROUP BY on service, span name, kind, status, and any custom attribute (via JSON extraction)
+- Frontend: Add an "Analytics" tab to the Traces page with chart types (timeseries, top list, table) similar to the existing LogsAnalyticsView
+- Support switching between "List" view (current) and "Analytics" view
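+
+As a rough illustration of the translation step, the sketch below turns a query config into a ClickHouse aggregation query. The `TraceAnalyticsQuery` shape, the column allow-list, and the `{projectId:String}` style query parameters are assumptions made for this example, not the final interface.
+
+```typescript
+// Hypothetical query config; field names are illustrative only.
+type Aggregation = "count" | "avg" | "sum" | "min" | "max" | "p50" | "p75" | "p90" | "p95" | "p99";
+
+interface TraceAnalyticsQuery {
+  aggregation: Aggregation;
+  groupBy: string[]; // plain column names, e.g. ["serviceId", "name"]
+  bucketMinutes: number; // width of each time-series bucket
+}
+
+const QUANTILES: Record<string, string> = { p50: "0.5", p75: "0.75", p90: "0.9", p95: "0.95", p99: "0.99" };
+
+// Assumed allow-list: only known columns may appear in the generated SQL.
+const ALLOWED_COLUMNS = new Set(["serviceId", "name", "kind", "statusCode"]);
+
+function aggregationExpr(agg: Aggregation): string {
+  if (agg === "count") return "count()";
+  if (agg in QUANTILES) return `quantile(${QUANTILES[agg]})(durationUnixNano)`;
+  return `${agg}(durationUnixNano)`; // avg, sum, min, max
+}
+
+function buildTraceAnalyticsSql(q: TraceAnalyticsQuery): string {
+  for (const col of q.groupBy) {
+    if (!ALLOWED_COLUMNS.has(col)) throw new Error(`Unsupported GROUP BY column: ${col}`);
+  }
+  if (!Number.isInteger(q.bucketMinutes) || q.bucketMinutes < 1) {
+    throw new Error("bucketMinutes must be a positive integer");
+  }
+  const bucket = `toStartOfInterval(startTime, INTERVAL ${q.bucketMinutes} MINUTE) AS bucket`;
+  const selected = [...q.groupBy, bucket, `${aggregationExpr(q.aggregation)} AS value`];
+  return [
+    `SELECT ${selected.join(", ")}`,
+    "FROM SpanItem",
+    "WHERE projectId = {projectId:String} AND startTime >= {from:DateTime} AND startTime < {to:DateTime}",
+    `GROUP BY ${[...q.groupBy, "bucket"].join(", ")}`,
+    "ORDER BY bucket",
+  ].join("\n");
+}
+```
+
+Custom attributes would need an extra code path (JSON extraction today, Map access after S.1) and are left out of this sketch.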
+
+**Files to modify**:
+- `Common/Server/API/TelemetryAPI.ts` (add trace analytics endpoint)
+- `Common/Server/Services/SpanService.ts` (add aggregation query methods)
+- `Common/Types/Traces/TraceAnalyticsQuery.ts` (new - query interface)
+- `App/FeatureSet/Dashboard/src/Pages/Traces/Index.tsx` (add analytics view toggle)
+- `App/FeatureSet/Dashboard/src/Components/Traces/TraceAnalyticsView.tsx` (new - analytics UI)
+
+### 1.2 RED Metrics from Traces (Request Rate, Error Rate, Duration)
+
+**Current**: No automatic computation of service-level metrics from trace data.
+**Target**: Auto-computed per-service, per-operation RED metrics displayed on a Service Overview page.
+
+**Implementation**:
+
+- Create a ClickHouse materialized view that aggregates spans into per-service, per-operation metrics at 1-minute intervals:
+  ```sql
+  -- Each aggregate column stores a partial state (-State combinator),
+  -- so readers must finalize it with the matching -Merge combinator
+  -- (for example countMerge or quantileMerge).
+  CREATE MATERIALIZED VIEW span_red_metrics
+  ENGINE = AggregatingMergeTree()
+  ORDER BY (projectId, serviceId, name, minute)
+  AS SELECT
+    projectId, serviceId, name,
+    toStartOfMinute(startTime) AS minute,         -- 1-minute bucket
+    countState() AS request_count,
+    countIfState(statusCode = 2) AS error_count,  -- statusCode 2 = error
+    quantileState(0.50)(durationUnixNano) AS p50_duration,
+    quantileState(0.95)(durationUnixNano) AS p95_duration,
+    quantileState(0.99)(durationUnixNano) AS p99_duration
+  FROM SpanItem
+  GROUP BY projectId, serviceId, name, minute
+  ```
+- Build a Service Overview page showing: request rate chart, error rate chart, p50/p95/p99 latency charts
+- Add an API endpoint to query the materialized view
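+
+Once the view has been read (with the aggregate states finalized by `-Merge` combinators such as `countMerge` and `quantileMerge`), the per-minute rows still need converting into chartable RED values. The sketch below shows one way to do that; the `RedRow` and `RedPoint` shapes are hypothetical and assume the numeric columns have already been parsed to numbers.
+
+```typescript
+// Hypothetical shape of one finalized row per 1-minute bucket.
+interface RedRow {
+  minute: string;
+  requests: number;
+  errors: number;
+  p50Ns: number;
+  p95Ns: number;
+  p99Ns: number;
+}
+
+// Hypothetical shape consumed by the Service Overview charts.
+interface RedPoint {
+  minute: string;
+  requestsPerSecond: number;
+  errorRatePercent: number;
+  p50Ms: number;
+  p95Ms: number;
+  p99Ms: number;
+}
+
+const NANOS_PER_MILLI = 1e6;
+const SECONDS_PER_BUCKET = 60; // the view aggregates at 1-minute intervals
+
+function toRedPoint(row: RedRow): RedPoint {
+  return {
+    minute: row.minute,
+    requestsPerSecond: row.requests / SECONDS_PER_BUCKET,
+    // Avoid dividing by zero for buckets without traffic.
+    errorRatePercent: row.requests === 0 ? 0 : (row.errors / row.requests) * 100,
+    p50Ms: row.p50Ns / NANOS_PER_MILLI,
+    p95Ms: row.p95Ns / NANOS_PER_MILLI,
+    p99Ms: row.p99Ns / NANOS_PER_MILLI,
+  };
+}
+```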
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/SpanRedMetrics.ts` (new - materialized view model)
+- `Telemetry/Services/SpanRedMetricsService.ts` (new - query service)
+- `App/FeatureSet/Dashboard/src/Pages/Service/View/Overview.tsx` (new or enhanced - RED dashboard)
+- `Worker/DataMigrations/` (new migration to create materialized view)
+
+### 1.3 Trace-Based Alerting
+
+**Current**: No ability to alert on trace data.
+**Target**: Create alerts on p50/p75/p90/p95/p99 latency thresholds, error rate thresholds, and request rate anomalies per service/operation.
+
+**Implementation**:
+
+- Extend the existing monitor system to add a `TraceMonitor` type
+- Monitor evaluates against the RED metrics materialized view (depends on 1.2)
+- Alert conditions: latency exceeds threshold, error rate exceeds threshold, request rate drops below threshold
+- Integrate with existing OneUptime alerting/incident system
+- UI: Add "Trace Monitor" as a new monitor type in the monitor creation wizard
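+
+The three alert conditions above reduce to a comparison between a RED value and a threshold. A minimal sketch of that evaluation follows; the type names and the `above`/`below` vocabulary are assumptions for illustration and do not reflect the existing monitor criteria classes.
+
+```typescript
+type TraceMetric = "p50Ms" | "p95Ms" | "p99Ms" | "errorRatePercent" | "requestsPerSecond";
+type Comparison = "above" | "below";
+
+// Hypothetical alert condition, e.g. "p99Ms above 500".
+interface TraceAlertCondition {
+  metric: TraceMetric;
+  comparison: Comparison;
+  threshold: number;
+}
+
+// Latest RED values for one service/operation, read from the materialized view.
+type RedSnapshot = Record<TraceMetric, number>;
+
+function isBreached(condition: TraceAlertCondition, snapshot: RedSnapshot): boolean {
+  const value = snapshot[condition.metric];
+  return condition.comparison === "above" ? value > condition.threshold : value < condition.threshold;
+}
+
+// Returns the subset of conditions that should raise an alert.
+function breachedConditions(conditions: TraceAlertCondition[], snapshot: RedSnapshot): TraceAlertCondition[] {
+  return conditions.filter((condition) => isBreached(condition, snapshot));
+}
+```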
+
+**Files to modify**:
+- `Common/Types/Monitor/MonitorType.ts` (add Trace monitor type)
+- `Common/Types/Monitor/MonitorStepTraceMonitor.ts` (new - trace monitor config)
+- `Common/Server/Utils/Monitor/Criteria/TraceMonitorCriteria.ts` (new - evaluation logic)
+- `App/FeatureSet/Dashboard/src/Components/Form/Monitor/TraceMonitor/` (new - monitor form UI)
+
+### 1.4 Head-Based Probabilistic Sampling
+
+**Current**: Ingests 100% of received traces.
+**Target**: Configurable per-service probabilistic sampling with rules to always keep errors and slow traces.
+
+**Implementation**:
+
+- Create `TraceSamplingRule` PostgreSQL model: service filter, sample rate (0-100%), conditions to always keep (error status, duration > threshold)
+- Evaluate sampling rules in `OtelTracesIngestService.ts` before ClickHouse insert
+- Use deterministic sampling based on traceId hash (so all spans from the same trace are kept or dropped together)
+- UI under Settings > Trace Configuration > Sampling Rules
+- Show estimated storage savings
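+
+A minimal sketch of the deterministic decision, assuming a 32-bit FNV-1a hash of the traceId and hypothetical `SamplingRule` and `SpanForSampling` shapes. One caveat worth noting: the always-keep overrides are evaluated per span here, so a kept error span can belong to a trace whose other spans were dropped by the hash check.
+
+```typescript
+interface SpanForSampling {
+  traceId: string;
+  isError: boolean;
+  durationMs: number;
+}
+
+// Hypothetical rule shape; the real TraceSamplingRule model may differ.
+interface SamplingRule {
+  sampleRatePercent: number; // 0-100
+  alwaysKeepErrors: boolean;
+  alwaysKeepSlowerThanMs?: number;
+}
+
+// 32-bit FNV-1a hash; any stable hash works, this one needs no dependencies.
+function fnv1a(text: string): number {
+  let hash = 0x811c9dc5;
+  for (let i = 0; i < text.length; i++) {
+    hash ^= text.charCodeAt(i);
+    hash = Math.imul(hash, 0x01000193) >>> 0;
+  }
+  return hash;
+}
+
+function shouldKeepSpan(span: SpanForSampling, rule: SamplingRule): boolean {
+  if (rule.alwaysKeepErrors && span.isError) return true;
+  if (rule.alwaysKeepSlowerThanMs !== undefined && span.durationMs > rule.alwaysKeepSlowerThanMs) return true;
+  // Same traceId always maps to the same bucket (0..9999), so the hash-based
+  // decision is identical for every span of a trace.
+  const bucket = fnv1a(span.traceId) % 10000;
+  return bucket < rule.sampleRatePercent * 100;
+}
+```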
+
+**Files to modify**:
+- `Common/Models/DatabaseModels/TraceSamplingRule.ts` (new)
+- `Telemetry/Services/OtelTracesIngestService.ts` (add sampling logic)
+- Dashboard: new Settings page for sampling configuration
+
+---
+
+## Phase 2: Visualization & Debugging UX (P1) — Industry-Standard Features
+
+### 2.1 Flame Graph View
+
+**Current**: Only Gantt/waterfall view.
+**Target**: Flame graph visualization showing proportional time spent in each span, with service color coding.
+
+**Implementation**:
+
+- Build a flame graph component that renders spans as horizontally stacked rectangles proportional to duration
+- Allow switching between Waterfall and Flame Graph views in TraceExplorer
+- Color-code by service (consistent with waterfall view)
+- Click a span rectangle to focus/zoom into that subtree
+- Show tooltip with span name, service, duration, self-time on hover
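+
+The layout step behind such a component could look like the sketch below: each span becomes a rectangle whose row is its depth in the span tree and whose horizontal position and width are percentages of the trace duration. The `FlameSpan` and `FlameRect` shapes are assumptions, and times are taken as millisecond offsets to keep the arithmetic within double precision.
+
+```typescript
+interface FlameSpan {
+  spanId: string;
+  parentSpanId?: string;
+  startMs: number;
+  endMs: number;
+}
+
+interface FlameRect {
+  spanId: string;
+  depth: number; // row index: 0 for root spans
+  leftPercent: number;
+  widthPercent: number;
+}
+
+function layoutFlameGraph(spans: FlameSpan[]): FlameRect[] {
+  if (spans.length === 0) return [];
+  const byId = new Map(spans.map((s) => [s.spanId, s] as const));
+  const traceStart = Math.min(...spans.map((s) => s.startMs));
+  const traceEnd = Math.max(...spans.map((s) => s.endMs));
+  const total = Math.max(traceEnd - traceStart, 1); // avoid division by zero
+
+  const depthOf = (span: FlameSpan): number => {
+    let depth = 0;
+    let current = span;
+    const seen = new Set<string>([span.spanId]); // guards against cyclic parent references
+    while (current.parentSpanId !== undefined) {
+      const parent = byId.get(current.parentSpanId);
+      if (!parent || seen.has(parent.spanId)) break; // missing parent: treat as a root
+      seen.add(parent.spanId);
+      depth++;
+      current = parent;
+    }
+    return depth;
+  };
+
+  return spans.map((s) => ({
+    spanId: s.spanId,
+    depth: depthOf(s),
+    leftPercent: ((s.startMs - traceStart) / total) * 100,
+    widthPercent: ((s.endMs - s.startMs) / total) * 100,
+  }));
+}
+```
+
+This keeps spans in time order along the horizontal axis; a variant that merges identical call stacks would need an extra grouping pass.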
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Traces/FlameGraph.tsx` (new)
+- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view toggle)
+
+### 2.2 Latency Breakdown / Critical Path Analysis
+
+**Current**: Shows individual span durations but no automated analysis.
+**Target**: Compute and display critical path, self-time vs child-time, and bottleneck identification.
+
+**Implementation**:
+
+- Compute critical path: the longest sequential chain of spans through the trace (accounts for parallelism)
+- Calculate "self time" per span: `span.duration - sum(child.duration)`, clamped to 0 because overlapping (parallel) children can sum to more than the parent's duration
+- Display latency breakdown by service: percentage of total trace time spent in each service
+- Highlight bottleneck spans (spans contributing most to critical path duration)
+- Add "Critical Path" toggle in TraceExplorer that highlights the critical path spans
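+
+The sketch below illustrates both calculations under assumed types. `selfTimeClamped` follows the formula given above. `selfTimeUnion` is an alternative, not part of the plan as written, that subtracts the union of child intervals so parallel children are not double counted. `criticalPath` uses one common heuristic: walk backwards from the end of a span and repeatedly pick the child that finished last before the current cursor.
+
+```typescript
+interface PathSpan {
+  spanId: string;
+  parentSpanId?: string;
+  startMs: number;
+  endMs: number;
+}
+
+function groupChildren(spans: PathSpan[]): Map<string, PathSpan[]> {
+  const children = new Map<string, PathSpan[]>();
+  for (const span of spans) {
+    if (span.parentSpanId === undefined) continue;
+    const siblings = children.get(span.parentSpanId) ?? [];
+    siblings.push(span);
+    children.set(span.parentSpanId, siblings);
+  }
+  return children;
+}
+
+// Formula from the plan: duration minus summed child durations, clamped to 0.
+function selfTimeClamped(span: PathSpan, children: PathSpan[]): number {
+  const childTotal = children.reduce((sum, c) => sum + (c.endMs - c.startMs), 0);
+  return Math.max(0, span.endMs - span.startMs - childTotal);
+}
+
+// Alternative: subtract only the time covered by at least one child.
+function selfTimeUnion(span: PathSpan, children: PathSpan[]): number {
+  const intervals = children
+    .map((c) => ({ start: Math.max(c.startMs, span.startMs), end: Math.min(c.endMs, span.endMs) }))
+    .filter((i) => i.end > i.start)
+    .sort((a, b) => a.start - b.start);
+  let covered = 0;
+  let cursor = span.startMs;
+  for (const interval of intervals) {
+    const from = Math.max(interval.start, cursor);
+    if (interval.end > from) {
+      covered += interval.end - from;
+      cursor = interval.end;
+    }
+  }
+  return span.endMs - span.startMs - covered;
+}
+
+// Returns span IDs on the critical path, starting with the given root.
+function criticalPath(root: PathSpan, childMap: Map<string, PathSpan[]>): string[] {
+  const path = [root.spanId];
+  let cursor = root.endMs;
+  const byEndDesc = (childMap.get(root.spanId) ?? []).slice().sort((a, b) => b.endMs - a.endMs);
+  for (const child of byEndDesc) {
+    if (child.endMs <= cursor) {
+      path.push(...criticalPath(child, childMap));
+      cursor = child.startMs; // earlier children must finish before this one started
+    }
+  }
+  return path;
+}
+```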
+
+**Files to modify**:
+- `Common/Utils/Traces/CriticalPath.ts` (new - critical path algorithm)
+- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (show self-time)
+- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add critical path view)
+
+### 2.3 In-Trace Span Search
+
+**Current**: TraceExplorer shows all spans with service filtering and error toggle, but no text search.
+**Target**: Search box to filter spans by name, attribute values, or status within the current trace.
+
+**Implementation**:
+
+- Add a search input in TraceExplorer toolbar
+- Client-side filtering: match span name, service name, attribute keys/values against search text
+- Highlight matching spans in the waterfall/flame graph
+- Show match count (e.g., "3 of 47 spans")
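+
+The client-side filter can be small. The sketch below assumes a simplified `SearchableSpan` shape with string attributes and performs a case-insensitive substring match, returning both the matching IDs (for highlighting) and the match-count label.
+
+```typescript
+// Simplified span shape for illustration; the real span model has more fields.
+interface SearchableSpan {
+  spanId: string;
+  name: string;
+  serviceName: string;
+  attributes: Record<string, string>;
+}
+
+function spanMatches(span: SearchableSpan, query: string): boolean {
+  const needle = query.trim().toLowerCase();
+  if (needle === "") return true; // an empty search matches everything
+  const haystack: string[] = [span.name, span.serviceName];
+  for (const [key, value] of Object.entries(span.attributes)) {
+    haystack.push(key, value);
+  }
+  return haystack.some((text) => text.toLowerCase().includes(needle));
+}
+
+function searchSpans(spans: SearchableSpan[], query: string): { matchingIds: Set<string>; label: string } {
+  const matches = spans.filter((span) => spanMatches(span, query));
+  return {
+    matchingIds: new Set(matches.map((span) => span.spanId)),
+    label: `${matches.length} of ${spans.length} spans`,
+  };
+}
+```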
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add search bar and filtering)
+
+### 2.4 Per-Trace Service Flow Map
+
+**Current**: Service dependency graph exists globally but not per-trace.
+**Target**: Per-trace visualization showing the path of a request through services with latency annotations.
+
+**Implementation**:
+
+- Build a directed graph from the spans in a single trace (services as nodes, calls as edges)
+- Annotate edges with call count and latency
+- Color-code nodes by error status
+- Add as a new view tab alongside Waterfall and Flame Graph
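+
+Building the graph is a single pass over the spans of the trace. The sketch below assumes each span already carries its service name and an error flag; the `MapSpan`, `ServiceEdge`, and `TraceServiceGraph` types are hypothetical.
+
+```typescript
+interface MapSpan {
+  spanId: string;
+  parentSpanId?: string;
+  serviceName: string;
+  durationMs: number;
+  isError: boolean;
+}
+
+interface ServiceEdge {
+  from: string;
+  to: string;
+  callCount: number;
+  totalDurationMs: number;
+}
+
+interface TraceServiceGraph {
+  nodes: Map<string, { hasError: boolean }>;
+  edges: ServiceEdge[];
+}
+
+function buildTraceServiceGraph(spans: MapSpan[]): TraceServiceGraph {
+  const byId = new Map(spans.map((s) => [s.spanId, s] as const));
+  const nodes = new Map<string, { hasError: boolean }>();
+  const edges = new Map<string, ServiceEdge>();
+
+  for (const span of spans) {
+    // A service node is marked as failing if any of its spans errored.
+    const node = nodes.get(span.serviceName) ?? { hasError: false };
+    node.hasError = node.hasError || span.isError;
+    nodes.set(span.serviceName, node);
+
+    // Only parent/child pairs that cross a service boundary become edges.
+    const parent = span.parentSpanId !== undefined ? byId.get(span.parentSpanId) : undefined;
+    if (!parent || parent.serviceName === span.serviceName) continue;
+    const key = `${parent.serviceName}->${span.serviceName}`;
+    const edge = edges.get(key) ?? { from: parent.serviceName, to: span.serviceName, callCount: 0, totalDurationMs: 0 };
+    edge.callCount += 1;
+    edge.totalDurationMs += span.durationMs;
+    edges.set(key, edge);
+  }
+
+  return { nodes, edges: Array.from(edges.values()) };
+}
+```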
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Traces/TraceServiceMap.tsx` (new)
+- `App/FeatureSet/Dashboard/src/Components/Traces/TraceExplorer.tsx` (add view tab)
+
+### 2.5 Span Link Navigation
+
+**Current**: Links data is stored in spans but not navigable in the UI.
+**Target**: Clickable links in the span detail panel that navigate to related traces/spans.
+
+**Implementation**:
+
+- In the SpanViewer detail panel, render the `links` array as clickable items
+- Each link shows the linked traceId, spanId, and relationship type
+- Clicking navigates to the linked trace view
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Span/SpanViewer.tsx` (render clickable links)
+
+---
+
+## Phase 3: Advanced Analytics & Correlation (P2) — Power Features
+
+### 3.1 Trace-to-Metric Exemplars
+
+**Current**: Metric model has no traceId/spanId fields.
+**Target**: Link metric data points to trace IDs; show exemplar dots on metric charts that navigate to traces.
+
+**Implementation**:
+
+- Add optional `traceId` and `spanId` columns to the Metric ClickHouse model
+- During metric ingestion, extract exemplar trace/span IDs from OTLP exemplar fields
+- On metric charts, render exemplar dots at data points that have associated traces
+- Clicking an exemplar dot navigates to the trace view
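+
+For the ingestion step, a sketch of the extraction is shown below. It assumes OTLP/JSON input, where an exemplar carries `timeUnixNano`, a value in `asDouble` or `asInt`, and hex-encoded `traceId` and `spanId`. The `ExemplarLink` type and the function name are hypothetical. Since a data point can carry several exemplars while the plan adds a single `traceId`/`spanId` pair per metric row, the caller would still need to choose which one to store.
+
+```typescript
+// Subset of the OTLP/JSON exemplar fields used by this sketch.
+interface OtlpExemplar {
+  timeUnixNano: string;
+  asDouble?: number;
+  asInt?: string; // 64-bit integers are encoded as strings in OTLP/JSON
+  traceId?: string;
+  spanId?: string;
+}
+
+// Hypothetical internal representation of a metric-to-trace link.
+interface ExemplarLink {
+  timeUnixNano: string;
+  value: number;
+  traceId: string;
+  spanId: string;
+}
+
+function extractExemplarLinks(exemplars: OtlpExemplar[] | undefined): ExemplarLink[] {
+  const links: ExemplarLink[] = [];
+  for (const exemplar of exemplars ?? []) {
+    // Exemplars without trace context cannot be navigated to, so skip them.
+    if (!exemplar.traceId || !exemplar.spanId) continue;
+    const value = exemplar.asDouble ?? Number(exemplar.asInt ?? "NaN");
+    if (Number.isNaN(value)) continue;
+    links.push({
+      timeUnixNano: exemplar.timeUnixNano,
+      value,
+      traceId: exemplar.traceId,
+      spanId: exemplar.spanId,
+    });
+  }
+  return links;
+}
+```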
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/Metric.ts` (add traceId/spanId columns)
+- `Telemetry/Services/OtelMetricsIngestService.ts` (extract exemplars)
+- `App/FeatureSet/Dashboard/src/Components/Metrics/MetricGraph.tsx` (render exemplar dots)
+
+### 3.2 Custom Metrics from Spans
+
+**Current**: No way to create persistent metrics from trace data.
+**Target**: Users define custom metrics from span attributes that are computed via ClickHouse materialized views and available for alerting and dashboards.
+
+**Implementation**:
+
+- Create `SpanDerivedMetric` model: name, filter query (which spans), aggregation (count/avg/p99 of what field), GROUP BY attributes
+- Use ClickHouse materialized views for efficient computation
+- Surface derived metrics in the metric explorer and alerting system
+
+**Files to modify**:
+- `Common/Models/DatabaseModels/SpanDerivedMetric.ts` (new)
+- `Common/Server/Services/SpanDerivedMetricService.ts` (new)
+- Dashboard: UI for defining derived metrics
+
+### 3.3 Structural Trace Queries
+
+**Current**: Can only filter on individual span attributes.
+**Target**: Query traces based on properties of multiple spans and their relationships (e.g., "find traces where service A called service B and B returned an error").
+
+**Implementation**:
+
+- Design a visual query builder for structural queries (easier adoption than a query language)
+- Translate structural queries to ClickHouse subqueries with JOINs on traceId
+- Example: "Find traces where span with service=frontend has child span with service=database AND duration > 500ms"
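+  ```sql
+  -- {pid} stands for the project ID parameter.
+  SELECT DISTINCT s1.traceId FROM SpanItem s1
+  JOIN SpanItem s2
+    ON s1.traceId = s2.traceId
+    AND s1.spanId = s2.parentSpanId   -- s2 is a direct child of s1
+    AND s1.projectId = s2.projectId   -- keep the join inside one project
+  WHERE s1.projectId = {pid}
+    AND JSONExtractString(s1.attributes, 'service.name') = 'frontend'
+    AND JSONExtractString(s2.attributes, 'service.name') = 'database'
+    AND s2.durationUnixNano > 500000000 -- 500 ms in nanoseconds
+  ```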
+
+**Files to modify**:
+- `Common/Types/Traces/StructuralTraceQuery.ts` (new - query model)
+- `Common/Server/Services/SpanService.ts` (add structural query execution)
+- `App/FeatureSet/Dashboard/src/Components/Traces/StructuralQueryBuilder.tsx` (new - visual builder)
+
+### 3.4 Trace Comparison / Diffing
+
+**Current**: No way to compare traces.
+**Target**: Side-by-side comparison of two traces of the same operation, highlighting differences in span count, latency, and structure.
+
+**Implementation**:
+
+- Add "Compare" action to trace list (select two traces)
+- Build a diff view showing: added/removed spans, latency differences per span, structural changes
+- Useful for comparing a slow trace to a fast trace of the same operation
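+
+One simple way to compute such a diff is to summarize each trace per operation and compare the summaries. The sketch below does this with hypothetical `DiffSpan` and `OperationDiff` types, keying operations by service and span name; a structural diff that aligns individual spans by tree position would need a more involved matching step.
+
+```typescript
+interface DiffSpan {
+  serviceName: string;
+  name: string;
+  durationMs: number;
+}
+
+interface OperationDiff {
+  operation: string;
+  countA: number;
+  countB: number;
+  durationDeltaMs: number; // total duration in B minus total duration in A
+  change: "added" | "removed" | "common";
+}
+
+function summarize(spans: DiffSpan[]): Map<string, { count: number; totalMs: number }> {
+  const summary = new Map<string, { count: number; totalMs: number }>();
+  for (const span of spans) {
+    const key = `${span.serviceName}: ${span.name}`;
+    const entry = summary.get(key) ?? { count: 0, totalMs: 0 };
+    entry.count += 1;
+    entry.totalMs += span.durationMs;
+    summary.set(key, entry);
+  }
+  return summary;
+}
+
+function diffTraces(traceA: DiffSpan[], traceB: DiffSpan[]): OperationDiff[] {
+  const a = summarize(traceA);
+  const b = summarize(traceB);
+  const operations = Array.from(new Set([...Array.from(a.keys()), ...Array.from(b.keys())])).sort();
+  return operations.map((operation): OperationDiff => {
+    const inA = a.get(operation) ?? { count: 0, totalMs: 0 };
+    const inB = b.get(operation) ?? { count: 0, totalMs: 0 };
+    const change: OperationDiff["change"] = inA.count === 0 ? "added" : inB.count === 0 ? "removed" : "common";
+    return {
+      operation,
+      countA: inA.count,
+      countB: inB.count,
+      durationDeltaMs: inB.totalMs - inA.totalMs,
+      change,
+    };
+  });
+}
+```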
+
+**Files to modify**:
+- `App/FeatureSet/Dashboard/src/Components/Traces/TraceComparison.tsx` (new)
+- `App/FeatureSet/Dashboard/src/Pages/Traces/Compare.tsx` (new page)
+
+---
+
+## Phase 4: Competitive Differentiation (P3) — Long-Term
+
+### 4.1 Rules-Based and Tail-Based Sampling
+
+**Current**: Phase 1 adds head-based probabilistic sampling.
+**Target**: Rules-based sampling (always keep errors/slow traces, sample successes) and eventually tail-based sampling (buffer complete traces, decide after seeing all spans).
+
+**Implementation**:
+
+- Rules engine: configurable conditions (service, status, duration, attributes) with per-rule sample rates
+- Tail-based: buffer spans for a configurable window (30s), assemble complete traces, then apply retention decisions
+- Tail-based is complex; consider integrating with OpenTelemetry Collector's tail sampling processor as an alternative
+
+### 4.2 AI/ML on Trace Data
+
+- **Anomaly detection** on RED metrics (statistical deviation from baseline)
+- **Auto-surfacing correlated attributes** when latency spikes (similar to Honeycomb BubbleUp)
+- **Natural language trace queries** ("show me slow database calls from the last hour")
+- **Automatic root cause analysis** from trace data during incidents
+
+### 4.3 RUM (Real User Monitoring) Correlation
+
+- Browser SDK that propagates W3C trace context from frontend to backend
+- Link frontend page loads, interactions, and web vitals to backend traces
+- Show end-to-end user experience from browser to backend services
+
+### 4.4 Continuous Profiling Integration
+
+- Integrate with a profiling backend (e.g., Pyroscope)
+- Link profile data to span time windows
+- Show "Code Hotspots" within spans (similar to Datadog)
+
+---
+
+## ClickHouse Storage Improvements
+
+### S.1 Migrate `attributes` to Map(String, String) (HIGH)
+
+**Current**: `attributes` is stored as opaque `String` (JSON). Querying by attribute value requires `LIKE` or `JSONExtract()` scans.
+**Target**: `Map(String, String)` type enabling `attributes['http.method'] = 'GET'` without JSON parsing.
+
+**Impact**: Significant query speedup for attribute-based span filtering, the most common query pattern after time-range filtering.
+
+**Files to modify**:
+- `Common/Models/AnalyticsModels/Span.ts` (change column type)
+- `Common/Server/Utils/AnalyticsDatabase/StatementGenerator.ts` (handle Map type)
+- `Telemetry/Services/OtelTracesIngestService.ts` (write Map format)
+- `Worker/DataMigrations/` (new migration)
+
+### S.2 Add Aggregation Projection (MEDIUM)
+
+**Current**: `projections: []` is empty.
+**Target**: Pre-aggregation projection for common dashboard queries.
+
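+```sql
+-- projectId is included so that project-scoped queries can use the projection.
+PROJECTION agg_by_service (
+  SELECT
+    projectId,
+    serviceId,
+    toStartOfMinute(startTime) AS minute,
+    count(),
+    avg(durationUnixNano),
+    quantile(0.99)(durationUnixNano)
+  GROUP BY projectId, serviceId, minute
+)
+```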
+
+**Impact**: Estimated 5-10x faster aggregation queries for service overview dashboards (to be confirmed by benchmarking).
+
+### S.3 Add Trace-by-ID Projection (LOW)
+
+**Current**: Trace detail view relies on BloomFilter skip index for traceId lookups.
+**Target**: Projection sorted by `(projectId, traceId, startTime)` for faster trace-by-ID queries.
+
+---
+
+## Quick Wins (Can Ship This Week)
+
+1. **In-trace span search** - Add a text filter in TraceExplorer (few hours of work)
+2. **Self-time calculation** - Show "self time" (span duration minus child durations) in SpanViewer
+3. **Span link navigation** - Links data is stored but not clickable in UI
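+4. **Top-N slowest operations** - Simple ClickHouse query: group by span `name`, then `ORDER BY max(durationUnixNano) DESC LIMIT N` (without the grouping, the query returns the slowest individual spans rather than operations)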
+5. **Error rate by service** - Aggregate `statusCode=2` counts grouped by serviceId
+6. **Trace duration distribution histogram** - Use ClickHouse `histogram()` on durationUnixNano
+7. **Span count per service display** - Already tracked in `servicesInTrace`, just needs better display
+
+---
+
+## Recommended Implementation Order
+
+1. **Phase 1.1** - Trace Analytics Engine (highest impact, unlocks everything else)
+2. **Phase 1.2** - RED Metrics from Traces (prerequisite for alerting, service overview)
+3. **Quick Wins** - Ship in-trace search, self-time, span links, top-N operations
+4. **Phase 1.3** - Trace-Based Alerting (core observability workflow)
+5. **Phase 2.1** - Flame Graph View (industry-standard visualization)
+6. **Phase 2.2** - Critical Path Analysis (key debugging capability)
+7. **Phase 1.4** - Head-Based Sampling (essential for high-volume users)
+8. **S.1** - Migrate attributes to Map type (storage optimization)
+9. **Phase 2.3-2.5** - Per-trace service map, plus any in-trace search and span link work not already shipped with the Quick Wins
+10. **Phase 3.1** - Trace-to-Metric Exemplars
+11. **Phase 3.2-3.4** - Custom metrics, structural queries, comparison
+12. **Phase 4.x** - AI/ML, RUM, profiling (long-term)
+
+## Verification
+
+For each feature:
+1. Unit tests for new query builders, critical path algorithm, sampling logic
+2. Integration tests for new API endpoints (analytics, RED metrics, sampling)
+3. Manual verification via the dev server at `https://oneuptimedev.genosyn.com/dashboard/{projectId}/traces`
+4. Check ClickHouse query performance with `EXPLAIN` for new aggregation queries
+5. Verify trace correlation (logs, exceptions, metrics) still works correctly with new features
+6. Load test sampling logic to ensure it doesn't add ingestion latency
--- a/2
+++ b/2
@@ -1 +1 @@
-10.0.32
+10.0.33