feat: add plans for Docker and Kubernetes monitoring implementation

Author: Nawaz Dhandala
Date: 2026-03-17 12:05:36 +00:00
Parent: 7741bebe31
Commit: 9b380d424d
2 changed files with 819 additions and 0 deletions

# Plan: Docker Container Monitoring for OneUptime
## Context
OneUptime's infrastructure monitoring currently supports Server/VM monitoring via the InfrastructureAgent (a Go-based agent that collects CPU, memory, disk, and process metrics) and has a commented-out Kubernetes monitor type. There is **no Docker container monitoring** today. Users running containerized workloads — whether on bare-metal Docker hosts, Docker Compose, or Docker Swarm — have no visibility into container-level health, resource consumption, or lifecycle events.
Docker container monitoring is a critical gap: Docker remains the dominant container runtime, and many users run workloads on Docker without Kubernetes. Competitors (Datadog, New Relic, Grafana Cloud) all provide first-class Docker monitoring. This plan proposes a phased implementation to close this gap.
## Gap Analysis Summary
| Feature | OneUptime | Datadog | New Relic | Priority |
|---------|-----------|---------|-----------|----------|
| Container discovery & inventory | None | Auto-discovery via agent | Auto-discovery via infra agent | **P0** |
| Container CPU/memory/network/disk metrics | None | Full metrics via cgroups | Full metrics via cgroups | **P0** |
| Container lifecycle events (start/stop/restart/OOM) | None | Event stream + alerts | Event stream + alerts | **P0** |
| Container health status monitoring | None | Health check integration | Health check integration | **P0** |
| Container log collection | None (generic OTLP only) | Auto-collected per container | Auto-collected per container | **P1** |
| Docker Compose service grouping | None | Auto-detection via labels | Label-based grouping | **P1** |
| Container image vulnerability scanning | None | Integrated via Snyk/Trivy | None | **P2** |
| Docker Swarm service monitoring | None | Full Swarm support | Limited | **P2** |
| Container resource limit alerts | None | OOM/throttle alerts | Threshold alerts | **P1** |
| Container networking (inter-container traffic) | None | Network map + flow data | Limited | **P2** |
| Live container exec / inspect | None | None | None | **P3** |
---
## Phase 1: Foundation (P0) — Container Discovery & Core Metrics
These are table-stakes features required for any Docker monitoring product.
### 1.1 Docker Monitor Type
**Current**: No Docker monitor type exists. Kubernetes is defined but commented out.
**Target**: Add a `Docker` monitor type with full UI integration.
**Implementation**:
- Add `Docker = "Docker"` to the `MonitorType` enum
- Add Docker to the "Infrastructure" monitor type category alongside Server and SNMP
- Add monitor type props (title: "Docker Container", description, icon: `IconProp.Cube`)
- Create `DockerMonitorResponse` interface for container metric reporting
- Add Docker to `getActiveMonitorTypes()` and relevant helper methods
**Files to modify**:
- `Common/Types/Monitor/MonitorType.ts` (add enum value, category, props)
- `Common/Types/Monitor/DockerMonitor/DockerMonitorResponse.ts` (new)
- `Common/Types/Monitor/DockerMonitor/DockerContainerMetrics.ts` (new)
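The enum and props additions in 1.1 might look roughly like the following sketch. The `MonitorTypeProps` shape and field names here are assumptions for illustration; the actual interface in `Common/Types/Monitor/MonitorType.ts` may differ.

```typescript
// Sketch: adding Docker to the MonitorType enum plus its display props.
// Only Docker is new; the other members mirror existing entries.
enum MonitorType {
  Server = "Server",
  SNMP = "SNMP",
  Docker = "Docker", // new
}

// Assumed props shape, not the actual OneUptime interface.
interface MonitorTypeProps {
  title: string;
  description: string;
  category: string;
}

const dockerMonitorProps: MonitorTypeProps = {
  title: "Docker Container",
  description:
    "Monitor Docker containers on hosts running the infrastructure agent.",
  category: "Infrastructure", // alongside Server and SNMP
};
```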
### 1.2 Container Metrics Collection in InfrastructureAgent
**Current**: The Go-based InfrastructureAgent collects host-level CPU, memory, disk, and process metrics.
**Target**: Extend the agent to discover and collect metrics from all running Docker containers on the host.
**Implementation**:
- Add a Docker collector module to the InfrastructureAgent that uses the Docker Engine API (via `/var/run/docker.sock` or configurable endpoint)
- Discover all running containers via `GET /containers/json`
- For each container, collect metrics via `GET /containers/{id}/stats?stream=false`:
- **CPU**: `cpu_stats.cpu_usage.total_usage`, `cpu_stats.system_cpu_usage`, per-core usage, throttled periods/time
- **Memory**: `memory_stats.usage`, `memory_stats.limit`, `memory_stats.stats.cache`, RSS, swap, working set, OOM kill count
- **Network**: `networks.*.rx_bytes`, `tx_bytes`, `rx_packets`, `tx_packets`, `rx_errors`, `tx_errors`, `rx_dropped`, `tx_dropped` (per interface)
- **Block I/O**: `blkio_stats.io_service_bytes_recursive` (read/write bytes), `io_serviced_recursive` (read/write ops)
- **PIDs**: `pids_stats.current`, `pids_stats.limit`
- Collect container metadata: name, image, image ID, labels, created time, status, health check status, restart count, ports, mounts, environment (filtered for sensitive values)
- Report interval: configurable, default 30 seconds (matching existing server monitor interval)
**Files to modify**:
- `InfrastructureAgent/collector/docker.go` (new - Docker metrics collector)
- `InfrastructureAgent/model/docker.go` (new - Docker metric data models)
- `InfrastructureAgent/agent.go` (add Docker collection to the main loop)
- `InfrastructureAgent/config.go` (add Docker-related configuration: socket path, collection enabled/disabled)
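The CPU figure in 1.2 is derived, not read directly: the stats endpoint returns cumulative counters, so the collector computes a percentage from the delta between two samples, the same way `docker stats` does. The agent itself is Go; the arithmetic is sketched here in TypeScript with field names following the Docker Engine API stats response.

```typescript
// Sketch: container CPU percent from two consecutive
// GET /containers/{id}/stats samples (delta method used by `docker stats`).
interface CpuSample {
  totalUsage: number;  // cpu_stats.cpu_usage.total_usage (nanoseconds)
  systemUsage: number; // cpu_stats.system_cpu_usage (nanoseconds)
  onlineCpus: number;  // cpu_stats.online_cpus
}

function cpuPercent(prev: CpuSample, curr: CpuSample): number {
  const cpuDelta: number = curr.totalUsage - prev.totalUsage;
  const systemDelta: number = curr.systemUsage - prev.systemUsage;
  if (systemDelta <= 0 || cpuDelta < 0) {
    return 0; // counters reset (e.g. container restarted): report zero
  }
  return (cpuDelta / systemDelta) * curr.onlineCpus * 100;
}
```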
### 1.3 Container Inventory & Discovery
**Current**: No container awareness.
**Target**: Auto-discover containers on monitored hosts and maintain a live inventory.
**Implementation**:
- Create a `DockerContainer` PostgreSQL model to store discovered containers:
- `containerId` (Docker container ID)
- `containerName`
- `imageName`, `imageId`, `imageTag`
- `status` (running, paused, stopped, restarting, dead, created)
- `healthStatus` (healthy, unhealthy, starting, none)
- `labels` (JSON)
- `createdAt` (container creation time)
- `startedAt`
- `hostMonitorId` (reference to the Server monitor for the host)
- `projectId`
- `restartCount`
- `ports` (JSON - exposed ports mapping)
- `mounts` (JSON - volume mounts)
- `cpuLimit`, `memoryLimit` (resource constraints)
- On each agent report, upsert container records (create new, update existing, mark removed containers as stopped)
- Container inventory page in the dashboard showing all containers across all monitored hosts
**Files to modify**:
- `Common/Models/DatabaseModels/DockerContainer.ts` (new)
- `Common/Server/Services/DockerContainerService.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainers.tsx` (new - container list page)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (new - single container detail)
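The upsert step in 1.3 is a reconciliation pass: everything the agent reports is created or updated, and anything previously stored but no longer reported is marked stopped. A minimal sketch, with illustrative shapes rather than the actual `DockerContainerService` API:

```typescript
// Sketch: reconcile an agent's reported containers against stored inventory.
interface ContainerRecord {
  containerId: string;
  status: string;
}

function reconcile(
  stored: ContainerRecord[],
  reported: ContainerRecord[],
): ContainerRecord[] {
  const reportedIds: Set<string> = new Set(
    reported.map((c: ContainerRecord) => c.containerId),
  );
  // Containers no longer reported by the agent are marked stopped.
  const stopped: ContainerRecord[] = stored
    .filter((c: ContainerRecord) => !reportedIds.has(c.containerId))
    .map((c: ContainerRecord) => ({ ...c, status: "stopped" }));
  // Reported containers are upserted as-is (create new, update existing).
  return [...reported, ...stopped];
}
```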
### 1.4 Container Lifecycle Events
**Current**: No container event tracking.
**Target**: Capture and surface container lifecycle events (start, stop, restart, OOM kill, health check failures).
**Implementation**:
- In the InfrastructureAgent, subscribe to Docker events via `GET /events?filters={"type":["container"]}` (long-poll/streaming)
- Capture events: `start`, `stop`, `die`, `kill`, `oom`, `restart`, `pause`, `unpause`, `health_status`
- Include exit code, OOM killed flag, and signal information for `die` events
- Report events to OneUptime alongside metric data
- Store events in the existing telemetry pipeline (as structured logs or a dedicated events table)
- Surface events as an overlay on container metric charts (vertical markers)
- Enable alerting on lifecycle events (e.g., alert on OOM kill, alert on restart count > N in time window)
**Files to modify**:
- `InfrastructureAgent/collector/docker_events.go` (new - event listener)
- `Common/Types/Monitor/DockerMonitor/DockerContainerEvent.ts` (new)
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (new - process container reports, evaluate criteria)
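The event listener in 1.4 mostly translates raw Docker events into the internal event shape. A sketch of that mapping, where the raw field names (`Type`, `Action`, `Actor.Attributes.exitCode`) follow the Docker Engine API events stream and `DockerContainerEvent` is an assumed shape:

```typescript
// Sketch: classify a raw Docker event (GET /events) into a lifecycle event.
interface RawDockerEvent {
  Type: string;   // "container", "network", "image", ...
  Action: string; // "start", "die", "oom", "restart", ...
  Actor: { ID: string; Attributes: Record<string, string> };
}

interface DockerContainerEvent {
  containerId: string;
  action: string;
  exitCode?: number; // present on "die" events
  oomKilled: boolean;
}

function toContainerEvent(raw: RawDockerEvent): DockerContainerEvent | null {
  if (raw.Type !== "container") {
    return null; // only container events are of interest here
  }
  return {
    containerId: raw.Actor.ID,
    action: raw.Action,
    exitCode: raw.Actor.Attributes["exitCode"]
      ? Number(raw.Actor.Attributes["exitCode"])
      : undefined,
    oomKilled: raw.Action === "oom",
  };
}
```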
---
## Phase 2: Alerting & Monitoring Rules (P0-P1) — Actionable Monitoring
### 2.1 Container Health Check Monitoring
**Current**: No health check awareness.
**Target**: Monitor Docker health check status and alert on unhealthy containers.
**Implementation**:
- Extract health check status from container inspect data (`State.Health.Status`, `State.Health.FailingStreak`, `State.Health.Log`)
- Add monitor criteria for health check status:
- Alert when container health transitions to `unhealthy`
- Alert when failing streak exceeds threshold
- Surface health check log output in alert details
- Add health status column to the container inventory table
**Files to modify**:
- `Common/Server/Utils/Monitor/Criteria/DockerContainerCriteria.ts` (new)
- `Common/Types/Monitor/CriteriaFilter.ts` (add Docker-specific filter types)
### 2.2 Container Resource Threshold Alerts
**Current**: Server monitor supports CPU/memory threshold alerts at the host level.
**Target**: Per-container resource threshold alerting with limit-aware thresholds.
**Implementation**:
- Add monitor criteria for container-level metrics:
- CPU usage % (of container limit or host total)
- Memory usage % (of container limit)
- Memory usage absolute (approaching container limit)
- Network error rate
- Block I/O throughput
- Restart count in time window
- PID count approaching limit
- **Limit-aware alerting**: When a container has resource limits set, calculate usage as a percentage of the limit rather than host total
- E.g., a container with a 2GB memory limit using 1.8GB is at 90% of its limit (alert-worthy), not at 1.8GB/64GB of host memory = 2.8% (misleading)
- Support compound criteria (e.g., CPU > 80% AND memory > 90% for 5 minutes)
**Files to modify**:
- `Common/Server/Utils/Monitor/Criteria/DockerContainerCriteria.ts` (extend)
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (criteria evaluation)
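The limit-aware calculation in 2.2 reduces to choosing the right denominator. A minimal sketch, with illustrative names:

```typescript
// Sketch: memory usage percent against the container limit when one is set,
// falling back to host memory when unlimited.
function memoryUsagePercent(
  usageBytes: number,
  containerLimitBytes: number | null, // null = no limit configured
  hostTotalBytes: number,
): number {
  const denominator: number =
    containerLimitBytes !== null && containerLimitBytes > 0
      ? containerLimitBytes
      : hostTotalBytes;
  return (usageBytes / denominator) * 100;
}
```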
### 2.3 Container Auto-Restart Detection
**Current**: No restart tracking.
**Target**: Detect and alert on container restart loops (CrashLoopBackOff equivalent for Docker).
**Implementation**:
- Track restart count per container over sliding time windows (5 min, 15 min, 1 hour)
- Alert when restart count exceeds configurable threshold
- Include container exit code and last log lines in the alert context
- Dashboard widget showing containers with highest restart frequency
**Files to modify**:
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (add restart loop detection)
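The sliding-window check in 2.3 can be sketched as follows; in practice the worker job would read restart timestamps from stored lifecycle events, and the names here are illustrative:

```typescript
// Sketch: restart-loop detection over a sliding time window.
function isRestartLoop(
  restartTimesMs: number[], // timestamps of observed restarts
  nowMs: number,
  windowMs: number,         // e.g. 5 min = 300_000
  threshold: number,        // alert when restarts in window >= threshold
): boolean {
  const inWindow: number = restartTimesMs.filter(
    (t: number) => nowMs - t <= windowMs,
  ).length;
  return inWindow >= threshold;
}
```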
---
## Phase 3: Visualization & UX (P1) — Container Dashboard
### 3.1 Container Overview Dashboard
**Current**: No container UI.
**Target**: Dedicated container monitoring pages with rich visualizations.
**Implementation**:
- **Container List Page**: Table with columns for name, image, status, health, CPU%, memory%, network I/O, uptime, restart count. Sortable, filterable, searchable
- **Container Detail Page**: Single-container view with:
- Header: container name, image, status badge, health badge, uptime
- Metrics charts: CPU, memory, network, block I/O (time series, matching existing metric chart style)
- Events timeline: lifecycle events overlaid on charts
- Container metadata: labels, ports, mounts, environment variables (filtered), resource limits
- Processes: top processes inside the container (if available via `docker top`)
- Logs: recent container logs (linked to log management if available)
- **Host-Container Relationship**: From the existing Server monitor detail page, add a "Containers" tab showing all containers on that host
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainers.tsx` (new - list view)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (new - detail view)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerMetricsCharts.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerEventsTimeline.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerMetadataPanel.tsx` (new)
### 3.2 Container Map / Topology View
**Current**: No topology visualization.
**Target**: Visual map showing containers, their host, and network relationships.
**Implementation**:
- Show containers grouped by host
- Color-code by status (green=healthy, yellow=warning, red=unhealthy/stopped)
- Show network links between containers on the same Docker network
- Click to drill into container detail
- Show Docker Compose project grouping via labels
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerTopology.tsx` (new)
---
## Phase 4: Container Log Collection (P1) — Unified Observability
### 4.1 Automatic Container Log Collection
**Current**: Log collection requires explicit OTLP/Fluentd/Syslog integration per application.
**Target**: Automatically collect logs from all Docker containers via the InfrastructureAgent.
**Implementation**:
- Add a log collector to the InfrastructureAgent using `GET /containers/{id}/logs?stdout=true&stderr=true&follow=true&tail=100`
- Automatically enrich logs with container metadata:
- `container.id`, `container.name`, `container.image.name`, `container.image.tag`
- Host information (hostname, OS)
- Docker labels as log attributes
- Forward logs to OneUptime's telemetry ingestion endpoint (OTLP format)
- Configurable:
- Enable/disable per container (via label `oneuptime.logs.enabled=true/false`)
- Max log line size
- Log rate limiting (to prevent noisy container flooding)
- Include/exclude containers by name pattern or label selector
**Files to modify**:
- `InfrastructureAgent/collector/docker_logs.go` (new - log collector)
- `InfrastructureAgent/config.go` (add log collection config)
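The per-container decision logic in 4.1 combines the label override with the global config. A sketch, assuming the `oneuptime.logs.enabled` label key proposed above and an illustrative config shape:

```typescript
// Sketch: should the agent collect logs for this container?
// An explicit label wins over exclusion patterns and the global default.
interface LogCollectionConfig {
  enabledByDefault: boolean;
  excludeNamePattern?: RegExp; // e.g. /^tmp-/ to skip one-off containers
}

function shouldCollectLogs(
  containerName: string,
  labels: Record<string, string>,
  config: LogCollectionConfig,
): boolean {
  const label: string | undefined = labels["oneuptime.logs.enabled"];
  if (label === "true") return true;
  if (label === "false") return false;
  if (config.excludeNamePattern?.test(containerName)) return false;
  return config.enabledByDefault;
}
```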
### 4.2 Container Log Correlation
**Current**: No automatic correlation between container logs and container metrics.
**Target**: Link container logs to container metrics and events for unified troubleshooting.
**Implementation**:
- Automatically tag container logs with `container.id` and `container.name` attributes
- In the container detail page, add a "Logs" tab that pre-filters the log viewer to the container's logs
- When viewing a metric anomaly or event, show a link to "View logs around this time"
- In the log detail view, show a link to "View container metrics" when `container.id` attribute is present
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (add Logs tab)
- `App/FeatureSet/Dashboard/src/Components/Logs/LogDetailsPanel.tsx` (add container link)
---
## Phase 5: Docker Compose & Swarm Support (P1-P2) — Multi-Container Orchestration
### 5.1 Docker Compose Project Grouping
**Current**: Containers are flat, no grouping.
**Target**: Automatically detect Docker Compose projects and group containers by service.
**Implementation**:
- Detect Compose projects via standard labels:
- `com.docker.compose.project` (project name)
- `com.docker.compose.service` (service name)
- `com.docker.compose.container-number` (replica number)
- `com.docker.compose.oneoff` (one-off vs service container)
- Create a Compose project view showing:
- Project name with list of services
- Per-service status (all replicas healthy, degraded, down)
- Per-service aggregated metrics (total CPU, memory across replicas)
- Service dependency visualization (if depends_on info is available via labels)
- Alert at the service level (e.g., "all replicas of service X are down")
**Files to modify**:
- `Common/Models/DatabaseModels/DockerComposeProject.ts` (new)
- `Common/Server/Services/DockerComposeProjectService.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerComposeProjects.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerComposeProjectDetail.tsx` (new)
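The grouping step in 5.1 buckets containers into project and then service using the standard Compose labels listed above. A sketch with illustrative shapes:

```typescript
// Sketch: group containers by Compose project -> service via labels.
interface LabeledContainer {
  name: string;
  labels: Record<string, string>;
}

type ComposeGroups = Map<string, Map<string, LabeledContainer[]>>;

function groupByComposeService(containers: LabeledContainer[]): ComposeGroups {
  const groups: ComposeGroups = new Map();
  for (const c of containers) {
    const project: string | undefined = c.labels["com.docker.compose.project"];
    const service: string | undefined = c.labels["com.docker.compose.service"];
    if (!project || !service) continue; // not managed by Compose
    if (!groups.has(project)) groups.set(project, new Map());
    const services = groups.get(project)!;
    if (!services.has(service)) services.set(service, []);
    services.get(service)!.push(c);
  }
  return groups;
}
```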
### 5.2 Docker Swarm Monitoring
**Current**: No Swarm support.
**Target**: Monitor Docker Swarm services, tasks, and nodes.
**Implementation**:
- Detect if the Docker host is a Swarm manager node
- Collect Swarm-specific data:
- Services: `GET /services` (desired/running replicas, update status)
- Tasks: `GET /tasks` (task state, assigned node, error messages)
- Nodes: `GET /nodes` (availability, status, resource capacity)
- Surface Swarm service health: desired replicas vs running replicas
- Alert when a service is degraded (running < desired) or when tasks fail
- Swarm-specific dashboard showing cluster overview
**Files to modify**:
- `InfrastructureAgent/collector/docker_swarm.go` (new)
- `Common/Types/Monitor/DockerMonitor/DockerSwarmMetrics.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerSwarm.tsx` (new)
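The desired-vs-running comparison in 5.2 can be reduced to a small state function. The replica counts would come from `GET /services` and `GET /tasks`; this is an illustrative sketch, not the actual criteria engine:

```typescript
// Sketch: classify a Swarm service from desired replicas vs running tasks.
type SwarmServiceState = "healthy" | "degraded" | "down";

function swarmServiceState(
  desiredReplicas: number,
  runningTasks: number,
): SwarmServiceState {
  if (runningTasks === 0 && desiredReplicas > 0) return "down";
  if (runningTasks < desiredReplicas) return "degraded";
  return "healthy";
}
```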
---
## Phase 6: Advanced Features (P2-P3) — Differentiation
### 6.1 Container Image Analysis
**Current**: No image awareness beyond name/tag.
**Target**: Track image versions across containers and optionally scan for known vulnerabilities.
**Implementation**:
- Maintain an image registry (image name, tag, digest, size, creation date)
- Show which containers are running outdated images (when a newer tag is available locally)
- Optional: integrate with Trivy or Grype for vulnerability scanning of local images
- Dashboard showing image inventory, version distribution, and vulnerability summary
### 6.2 Container Resource Recommendations
**Current**: No resource guidance.
**Target**: Recommend CPU/memory limits based on observed usage patterns.
**Implementation**:
- Analyze historical container metrics (p95 CPU, p99 memory over 7 days)
- Compare actual usage to configured limits
- Flag over-provisioned containers (limit >> usage) and under-provisioned containers (usage approaching limit)
- Generate recommendations: "Container X uses max 256MB, but has a 4GB limit — consider reducing to 512MB"
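A sketch of the heuristic behind 6.2, matching the 256MB/4GB/512MB example above. The 90% pressure cutoff, the 4x over-provisioning factor, and the 2x headroom multiplier are assumptions for illustration, not tuned values:

```typescript
// Sketch: flag containers whose memory limit is far above (or dangerously
// close to) observed p99 usage, and suggest a new limit with 2x headroom.
interface LimitRecommendation {
  kind: "over-provisioned" | "under-provisioned" | "ok";
  suggestedLimitBytes?: number;
}

function recommendMemoryLimit(
  p99UsageBytes: number,
  limitBytes: number,
): LimitRecommendation {
  if (p99UsageBytes > limitBytes * 0.9) {
    // Usage approaching the limit: suggest doubling it.
    return { kind: "under-provisioned", suggestedLimitBytes: limitBytes * 2 };
  }
  if (limitBytes > p99UsageBytes * 4) {
    // Limit far above observed peak: suggest 2x headroom over peak.
    return { kind: "over-provisioned", suggestedLimitBytes: p99UsageBytes * 2 };
  }
  return { kind: "ok" };
}
```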
### 6.3 Container Diff / Change Detection
**Current**: No change tracking.
**Target**: Detect when container configuration changes (image update, env var change, port mapping change).
**Implementation**:
- Store container configuration snapshots on each agent report
- Diff against previous snapshot and generate change events
- Alert on unexpected configuration changes
- Show change history in the container detail page
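The diff step in 6.3 can be sketched as a comparison of two flattened configuration snapshots, emitting one change event per differing key (shapes are illustrative):

```typescript
// Sketch: diff two container config snapshots into change events.
type Snapshot = Record<string, string>;

interface ChangeEvent {
  key: string;
  before?: string;
  after?: string;
}

function diffSnapshots(prev: Snapshot, curr: Snapshot): ChangeEvent[] {
  const changes: ChangeEvent[] = [];
  const keys: Set<string> = new Set([
    ...Object.keys(prev),
    ...Object.keys(curr),
  ]);
  for (const key of keys) {
    if (prev[key] !== curr[key]) {
      changes.push({ key, before: prev[key], after: curr[key] });
    }
  }
  return changes;
}
```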
---
## Quick Wins (Can Ship First)
1. **Add Docker monitor type** — Add the enum value, category, and props (no collection yet, but enables the UI scaffolding)
2. **Basic container discovery** — Extend InfrastructureAgent to list running containers and report names, images, status
3. **Container CPU/memory metrics** — Collect basic cgroup stats via Docker stats API
4. **Container inventory page** — Simple table showing discovered containers across hosts
---
## Recommended Implementation Order
1. **Phase 1.1** — Docker monitor type (enum, UI scaffolding)
2. **Phase 1.2** — Container metrics collection in InfrastructureAgent
3. **Phase 1.3** — Container inventory & discovery
4. **Phase 3.1** — Container overview dashboard (list + detail pages)
5. **Phase 1.4** — Container lifecycle events
6. **Phase 2.1** — Container health check monitoring
7. **Phase 2.2** — Container resource threshold alerts
8. **Phase 4.1** — Automatic container log collection
9. **Phase 5.1** — Docker Compose project grouping
10. **Phase 2.3** — Container auto-restart detection
11. **Phase 3.2** — Container map / topology view
12. **Phase 4.2** — Container log correlation
13. **Phase 5.2** — Docker Swarm monitoring
14. **Phase 6.1** — Container image analysis
15. **Phase 6.2** — Container resource recommendations
16. **Phase 6.3** — Container diff / change detection
## Verification
For each feature:
1. Unit tests for new Docker metric collection, parsing, and criteria evaluation
2. Integration tests for container discovery, metric ingestion, and alerting APIs
3. Manual verification with a Docker host running multiple containers (various states: healthy, unhealthy, restarting, OOM)
4. Test with Docker Compose multi-service applications
5. Performance test: verify agent overhead is minimal (< 1% CPU) when monitoring 50+ containers
6. Verify container metrics accuracy by comparing agent-reported values to `docker stats` output
7. Test graceful handling of Docker daemon unavailability (agent should not crash, should report connection failure)

# Plan: Kubernetes Monitoring for OneUptime
## Context
OneUptime has foundational infrastructure for Kubernetes monitoring: OTLP ingestion (HTTP and gRPC), ClickHouse metric/log/trace storage, telemetry-based monitors (Metrics, Logs, Traces), and a Helm chart for deploying OneUptime itself on Kubernetes. A `Kubernetes` monitor type exists in the `MonitorType` enum but is currently disabled and has no implementation. The OpenTelemetry Collector config supports OTLP receivers but has no Kubernetes-specific receivers (kubelet, kube-state-metrics, Prometheus). Server monitoring exists but is limited to basic VM-level checks.
This plan proposes a phased implementation to deliver first-class Kubernetes monitoring — from cluster health and workload observability to intelligent alerting — leveraging OneUptime's all-in-one observability platform (metrics, logs, traces, incidents, status pages).
## Completed
- **OTLP Metric Ingestion** - HTTP and gRPC metric ingestion with async queue-based batch processing
- **ClickHouse Metric Storage** - MergeTree with partitioning, per-service TTL
- **Telemetry-Based Monitors** - Metric, Log, Trace, and Exception monitors with configurable criteria
- **Helm Chart** - OneUptime deploys on Kubernetes with KEDA auto-scaling support
- **OpenTelemetry Collector** - Deployed via Helm, accepts OTLP on ports 4317/4318
- **MonitorType.Kubernetes** - Enum value defined (but disabled and unimplemented)
## Gap Analysis Summary
| Feature | OneUptime | Datadog | New Relic | Grafana/Prometheus | Priority |
|---------|-----------|---------|-----------|-------------------|----------|
| K8s metric collection (kubelet, kube-state-metrics) | None | Agent auto-discovery | K8s integration | Prometheus + kube-state-metrics | **P0** |
| Cluster overview dashboard | None | Out-of-box | Pre-built | Pre-built via mixins | **P0** |
| Pod/Container resource metrics | None | Live Containers | K8s cluster explorer | cAdvisor + Grafana | **P0** |
| Node health monitoring | None | Host Map + agent | Infrastructure UI | node-exporter + Grafana | **P0** |
| Kubernetes event ingestion | None | Auto-collected | K8s events integration | Eventrouter/Exporter | **P0** |
| Workload health alerts (CrashLoopBackOff, OOMKilled, etc.) | None | Auto-monitors | Pre-built alerts | PrometheusRule CRDs | **P1** |
| Namespace/workload cost attribution | None | Container cost allocation | None | Kubecost integration | **P1** |
| K8s resource inventory (deployments, services, ingresses) | None | Orchestrator Explorer | Cluster explorer | None native | **P1** |
| HPA/VPA monitoring | None | Yes | Partial | Prometheus metrics | **P1** |
| Multi-cluster support | None | Yes | Yes | Thanos/Cortex | **P2** |
| K8s log collection (pod stdout/stderr) | Via Fluentd example | DaemonSet agent | Fluent Bit integration | Loki + Promtail | **P2** |
| Service mesh observability (Istio, Linkerd) | None | Yes | Yes | Partial | **P2** |
| Network policy monitoring | None | NPM | None | Cilium Hubble | **P3** |
| eBPF-based deep observability | None | Universal Service Monitoring | Pixie | Cilium/Tetragon | **P3** |
---
## Phase 1: Foundation (P0) — Kubernetes Metric Collection & Visibility
Without these, OneUptime cannot monitor any Kubernetes cluster. This phase makes K8s metrics flow into the platform and provides basic visibility.
### 1.1 OpenTelemetry Collector Kubernetes Receivers
**Current**: OTel Collector only has OTLP receivers. No Kubernetes-specific metric collection.
**Target**: Pre-configured OTel Collector with receivers for kubelet, kube-state-metrics, and Kubernetes events.
**Implementation**:
- Add `kubeletstats` receiver to the OTel Collector config for node and pod resource metrics:
- CPU, memory, filesystem, network per node and per pod/container
- Collection interval: 30s
- Auth via serviceAccount token
- Add `k8s_cluster` receiver for cluster-level metrics from the Kubernetes API:
- Deployment, ReplicaSet, StatefulSet, DaemonSet replica counts and status
- Pod phase, container states (waiting/running/terminated with reasons)
- Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
- Namespace resource quotas and limit ranges
- HPA current/desired replicas
- Add `k8sobjects` receiver for Kubernetes events:
- Watch Events API for Warning and Normal events
- Ingest as logs with structured attributes (reason, involvedObject, message)
- Add the `k8s_events` receiver as a lightweight alternative for event collection
- Configure `k8sattributes` processor to enrich all telemetry with K8s metadata:
- Pod name, namespace, node, deployment, replicaset, labels, annotations
- Provide Helm values to enable/disable K8s monitoring and configure which namespaces to monitor
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add kubeletstats, k8s_cluster, k8sobjects receivers and k8sattributes processor)
- `HelmChart/Public/oneuptime/templates/otel-collector.yaml` (restore and configure OTel Collector deployment with proper RBAC)
- `HelmChart/Public/oneuptime/templates/otel-collector-rbac.yaml` (new - ClusterRole, ClusterRoleBinding, ServiceAccount for K8s API access)
- `HelmChart/Public/oneuptime/values.yaml` (add kubernetesMonitoring config section)
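The receiver and processor additions above might look roughly like this in `otel-collector-config.template.yaml`. The component names (`kubeletstats`, `k8s_cluster`, `k8sobjects`, `k8sattributes`) are real OpenTelemetry Collector Contrib components; the specific option values shown are an illustrative sketch, and the `service.pipelines` wiring plus RBAC are omitted:

```yaml
receivers:
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount
    endpoint: ${env:K8S_NODE_NAME}:10250
  k8s_cluster:
    collection_interval: 30s
  k8sobjects:
    objects:
      - name: events
        mode: watch

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.deployment.name
```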
### 1.2 Kubernetes Cluster Overview Dashboard Template
**Current**: No pre-built Kubernetes dashboards.
**Target**: Auto-generated cluster overview dashboard showing key health indicators.
**Implementation**:
- Create a dashboard template with the following panels:
- **Cluster Summary**: Total nodes, pods (running/pending/failed), namespaces, deployments
- **Node Health**: CPU and memory utilization per node, node conditions
- **Pod Status**: Pod phase distribution (Running/Pending/Succeeded/Failed/Unknown)
- **Resource Utilization**: Cluster-wide CPU and memory usage vs capacity (requests, limits, actual)
- **Top Consumers**: Top 10 pods by CPU usage, top 10 by memory usage
- **Recent Events**: Kubernetes Warning events stream
- **Container Restarts**: Pods with highest restart counts
- Auto-detect K8s metrics and offer dashboard creation during onboarding
- Use template variables for namespace and node filtering
**Files to modify**:
- `Common/Types/Dashboard/Templates/KubernetesCluster.ts` (new - cluster overview template)
- `Common/Types/Dashboard/Templates/KubernetesWorkload.ts` (new - per-namespace workload template)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (add K8s templates to gallery)
### 1.3 Pod and Container Resource Metrics
**Current**: No container-level visibility.
**Target**: Detailed resource metrics for every pod and container with drill-down.
**Implementation**:
- Ensure the following kubeletstats metrics are ingested and queryable:
- `k8s.pod.cpu.utilization`, `k8s.pod.memory.usage`, `k8s.pod.memory.rss`
- `k8s.pod.network.io` (rx/tx bytes), `k8s.pod.filesystem.usage`
- `container.cpu.utilization`, `container.memory.usage`, `container.restarts`
- Create a "Kubernetes" section in the dashboard navigation:
- Cluster > Namespace > Workload > Pod > Container drill-down hierarchy
- Pod detail page showing: resource usage over time, container statuses, events, logs (linked), traces (linked)
- Calculate resource efficiency: actual usage vs requests vs limits
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/` (new directory)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/ClusterOverview.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Namespaces.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Pods.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/PodDetail.tsx` (new)
### 1.4 Node Health Monitoring
**Current**: No node-level metrics.
**Target**: Per-node resource utilization, conditions, and capacity tracking.
**Implementation**:
- Ingest node metrics via kubeletstats receiver:
- `k8s.node.cpu.utilization`, `k8s.node.memory.usage`, `k8s.node.memory.available`
- `k8s.node.filesystem.usage`, `k8s.node.filesystem.capacity`
- `k8s.node.network.io`, `k8s.node.condition` (Ready, MemoryPressure, etc.)
- Node list page: table with all nodes showing CPU%, memory%, disk%, conditions, pod count
- Node detail page: time-series charts for resource usage, pod list on node, events
- Node capacity planning: show allocatable vs requested vs used per node
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Nodes.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/NodeDetail.tsx` (new)
### 1.5 Kubernetes Event Ingestion
**Current**: No Kubernetes event collection.
**Target**: Ingest and surface Kubernetes events as structured logs with correlation to resources.
**Implementation**:
- Configure `k8sobjects` receiver to watch Kubernetes Events
- Map events to structured log entries:
- `severity` from event type (Warning -> WARN, Normal -> INFO)
- `body` from event message
- Attributes: `k8s.event.reason`, `k8s.event.count`, `k8s.object.kind`, `k8s.object.name`, `k8s.namespace.name`
- Create a dedicated "Kubernetes Events" view:
- Filterable by namespace, event reason, object kind
- Timeline visualization showing event frequency
- Link events to related pods/deployments/nodes
- Alert on specific event patterns (e.g., repeated FailedScheduling, FailedMount)
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add k8sobjects receiver)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Events.tsx` (new)
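The mapping in 1.5 is mechanical enough to sketch directly. The input shape paraphrases the core/v1 Event object, the output attribute keys follow the plan above, and both shapes are illustrative:

```typescript
// Sketch: convert a Kubernetes Event into a structured log entry.
interface K8sEvent {
  type: "Normal" | "Warning";
  reason: string;
  message: string;
  count: number;
  involvedObject: { kind: string; name: string; namespace: string };
}

interface LogEntry {
  severity: "INFO" | "WARN";
  body: string;
  attributes: Record<string, string | number>;
}

function eventToLog(event: K8sEvent): LogEntry {
  return {
    severity: event.type === "Warning" ? "WARN" : "INFO",
    body: event.message,
    attributes: {
      "k8s.event.reason": event.reason,
      "k8s.event.count": event.count,
      "k8s.object.kind": event.involvedObject.kind,
      "k8s.object.name": event.involvedObject.name,
      "k8s.namespace.name": event.involvedObject.namespace,
    },
  };
}
```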
---
## Phase 2: Intelligent Alerting & Workload Health (P1) — Actionable Monitoring
### 2.1 Kubernetes-Aware Alert Templates
**Current**: Generic metric threshold alerts only. Users must manually configure alerts for K8s failure modes.
**Target**: Pre-built alert templates for common Kubernetes failure patterns.
**Implementation**:
- Create alert templates for critical K8s conditions:
- **CrashLoopBackOff**: Alert when `k8s.container.restarts` increases rapidly (> N restarts in M minutes)
- **OOMKilled**: Alert on container termination reason = OOMKilled
- **Pod Pending**: Alert when pods remain in Pending phase for > N minutes
- **Node NotReady**: Alert when node condition transitions to NotReady
- **High Resource Utilization**: Alert when node CPU > 90% or memory > 85% sustained
- **Deployment Replica Mismatch**: Alert when available replicas < desired replicas for > N minutes
- **PVC Disk Full**: Alert when PV usage > 90% capacity
- **Failed Scheduling**: Alert on repeated FailedScheduling events
- **Image Pull Failures**: Alert on ErrImagePull/ImagePullBackOff events
- **Job/CronJob Failures**: Alert when job completion fails
- One-click enable for each alert template during K8s monitoring setup
- Auto-route alerts to the OneUptime incident management system
**Files to modify**:
- `Common/Types/Monitor/Templates/KubernetesAlertTemplates.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/AlertSetup.tsx` (new - guided alert configuration)
- `Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts` (support K8s-specific criteria evaluation)
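The CrashLoopBackOff template above hinges on one subtlety: `k8s.container.restarts` is a cumulative counter, so the criterion should fire on counter growth within a window, not on the raw value. A sketch of that evaluation, with illustrative shapes rather than the actual criteria engine:

```typescript
// Sketch: detect rapid restart growth from a cumulative counter series.
interface CounterPoint {
  timestampMs: number;
  value: number; // cumulative k8s.container.restarts reading
}

function crashLoopDetected(
  points: CounterPoint[], // ordered oldest -> newest
  windowMs: number,       // M minutes
  maxRestarts: number,    // N restarts
): boolean {
  if (points.length < 2) return false;
  const newest: CounterPoint = points[points.length - 1];
  // Earliest point still inside the window serves as the baseline.
  const base: CounterPoint =
    points.find((p: CounterPoint) => newest.timestampMs - p.timestampMs <= windowMs) ??
    newest;
  return newest.value - base.value > maxRestarts;
}
```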
### 2.2 Kubernetes Resource Inventory
**Current**: No visibility into K8s resource state.
**Target**: Live inventory of Kubernetes resources with health status.
**Implementation**:
- Create a `KubernetesResource` model stored in ClickHouse (or PostgreSQL depending on query patterns):
- Kind, name, namespace, labels, annotations, status, conditions, timestamps
- Updated via the `k8s_cluster` receiver or periodic API sync
- Resource pages:
- **Deployments**: List with replica status (ready/desired), last update, strategy
- **StatefulSets**: Ordered pod status, PVC bindings
- **DaemonSets**: Node coverage, desired vs current vs ready
- **Services**: Type (ClusterIP/NodePort/LoadBalancer), endpoints, selector
- **Ingresses**: Host rules, backend services, TLS status
- **ConfigMaps/Secrets**: List with last-modified (secrets show metadata only, never values)
- **PVCs**: Bound PV, capacity, access modes, storage class
- Drill-down from any resource to its associated pods, events, and telemetry
**Files to modify**:
- `Common/Models/AnalyticsModels/KubernetesResource.ts` (new)
- `Telemetry/Services/KubernetesResourceService.ts` (new - sync K8s resources)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Resources/` (new - pages for each resource kind)
### 2.3 HPA and VPA Monitoring
**Current**: No autoscaler visibility.
**Target**: Track HPA/VPA behavior and scaling events.
**Implementation**:
- Ingest HPA metrics from `k8s_cluster` receiver:
- `k8s.hpa.current_replicas`, `k8s.hpa.desired_replicas`, `k8s.hpa.min_replicas`, `k8s.hpa.max_replicas`
- Target metric values vs actual
- HPA overview page:
- List all HPAs with current/desired/min/max replicas
- Time-series chart showing scaling events overlaid with the target metric
- Alert when an HPA stays pinned at max replicas for a sustained period (capacity ceiling)
- Alert when scale-up frequency is abnormally high (thrashing)
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Autoscaling.tsx` (new)
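The "thrashing" heuristic above can be sketched as a sliding-window count of scale-up events — window size and event limit here are illustrative defaults, not tuned recommendations:

```typescript
// Returns true when scale-up events within the window exceed the limit.
// Defaults (15 min window, >5 events) are illustrative, not tuned values.
function isHpaThrashing(
  scaleEventTimestamps: Array<number>, // epoch millis of scale-up events
  windowMs: number = 15 * 60 * 1000,
  maxEventsPerWindow: number = 5,
  now: number = Date.now(),
): boolean {
  const recent: Array<number> = scaleEventTimestamps.filter(
    (t: number) => now - t <= windowMs,
  );
  return recent.length > maxEventsPerWindow;
}
```

In practice the event timestamps would come from ingested `SuccessfulRescale` Kubernetes events or from deltas in `k8s.hpa.current_replicas`.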
### 2.4 Namespace Resource Quota Monitoring
**Current**: No quota tracking.
**Target**: Track resource quota usage per namespace and alert on approaching limits.
**Implementation**:
- Ingest quota metrics from `k8s_cluster` receiver:
  - `k8s.resource_quota.hard_limit` and `k8s.resource_quota.used` (the receiver emits these two metrics per quota, with a `resource` attribute distinguishing `cpu`, `memory`, `pods`, etc.)
- Namespace detail page showing quota utilization gauges
- Alert when any quota usage exceeds 80% (configurable threshold)
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/NamespaceDetail.tsx` (new)
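The quota-alert check itself is a simple ratio against the configurable threshold — a minimal sketch:

```typescript
// Fires when used/hard crosses the threshold (default 80%, per the plan).
function quotaBreached(
  used: number,
  hard: number,
  threshold: number = 0.8,
): boolean {
  if (hard <= 0) {
    return false; // no quota set for this resource, nothing to alert on
  }
  return used / hard >= threshold;
}
```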
---
## Phase 3: Advanced Observability (P2) — Correlation & Deep Visibility
### 3.1 Kubernetes Log Collection
**Current**: Users can manually configure Fluentd to send logs. No built-in K8s log collection.
**Target**: Automated pod log collection via OTel Collector with K8s metadata enrichment.
**Implementation**:
- Add `filelog` receiver to OTel Collector for collecting container logs from `/var/log/pods/`:
- Parse container runtime log format (Docker JSON, CRI)
- Extract pod name, namespace, container name from file path
- Enrich with K8s metadata via `k8sattributes` processor
- Deploy OTel Collector as a DaemonSet (in addition to existing Deployment) for log collection
- Helm values to configure:
- Namespace inclusion/exclusion filters
- Log level filtering (e.g., only collect WARN and above)
- Container name exclusion patterns
- Link pod logs in the Kubernetes pod detail page
**Files to modify**:
- `HelmChart/Public/oneuptime/templates/otel-collector-daemonset.yaml` (new - DaemonSet for log collection)
- `OTelCollector/otel-collector-daemonset-config.template.yaml` (new - DaemonSet-specific config with filelog receiver)
- `HelmChart/Public/oneuptime/values.yaml` (add DaemonSet configuration options)
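A sketch of the DaemonSet collector config, assuming the standard OTel Collector Contrib components; the exclude pattern and exporter name are illustrative and would be templated from Helm values:

```yaml
# Illustrative fragment for otel-collector-daemonset-config.template.yaml.
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # Don't ingest the collector's own logs.
      - /var/log/pods/*/otel-collector/*.log
    operators:
      # The container operator parses both Docker JSON and CRI log formats
      # and extracts pod, namespace, and container name from the file path.
      - type: container
processors:
  k8sattributes: {}
exporters:
  otlphttp:
    endpoint: ${ONEUPTIME_OTLP_ENDPOINT}
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes]
      exporters: [otlphttp]
```

Namespace filters and log-level filtering from the Helm values would be layered in as additional `filelog` operators or a `filter` processor.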
### 3.2 Multi-Cluster Support
**Current**: Single-cluster assumption.
**Target**: Monitor multiple Kubernetes clusters from a single OneUptime project.
**Implementation**:
- Add `cluster` attribute to all K8s metrics via OTel Collector resource processor
- Cluster registration: each cluster gets a unique name and OneUptime API key
- Helm install per cluster with cluster-specific configuration
- Cluster selector in the K8s monitoring UI (template variable)
- Cross-cluster comparison views (e.g., resource utilization across clusters)
- Unified alerting: same alert rules applied across all clusters or cluster-specific
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add resource processor with cluster name)
- `HelmChart/Public/oneuptime/values.yaml` (add clusterName config)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Clusters.tsx` (new - multi-cluster view)
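The cluster attribute is a one-block addition with the standard `resource` processor — `${CLUSTER_NAME}` here stands in for whatever templating the Helm chart's proposed `clusterName` value would use:

```yaml
# Illustrative fragment: stamp every signal with the cluster name.
processors:
  resource/cluster:
    attributes:
      - key: k8s.cluster.name
        value: ${CLUSTER_NAME}
        action: upsert
```

Using the semantic-convention key `k8s.cluster.name` keeps the attribute queryable alongside the other `k8s.*` attributes across metrics, logs, and traces.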
### 3.3 Service Mesh Observability
**Current**: No service mesh integration.
**Target**: Ingest and visualize service mesh metrics from Istio, Linkerd, or similar.
**Implementation**:
- Add Prometheus receiver to OTel Collector for scraping service mesh metrics:
- Istio: `istio_requests_total`, `istio_request_duration_milliseconds`, `istio_tcp_connections_opened_total`
- Linkerd: `request_total`, `response_latency_ms`
- Service-to-service traffic map from mesh metrics
- mTLS status visibility
- Circuit breaker and retry metrics
- Dashboard templates for Istio and Linkerd
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add prometheus receiver for mesh metrics)
- `Common/Types/Dashboard/Templates/ServiceMesh.ts` (new - mesh dashboard templates)
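A sketch of the Istio scrape job — the port name `http-envoy-prom` follows Istio's default sidecar metrics port, but would need adjusting for nonstandard mesh installs:

```yaml
# Illustrative fragment: scrape Istio sidecar metrics via the
# prometheus receiver (which embeds full Prometheus scrape config).
receivers:
  prometheus/istio:
    config:
      scrape_configs:
        - job_name: istio-mesh
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods exposing the Envoy sidecar metrics port.
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: http-envoy-prom
```

A parallel `prometheus/linkerd` job would scrape the Linkerd proxy's admin port instead.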
### 3.4 Kubernetes-to-Telemetry Correlation
**Current**: K8s resources and telemetry (metrics, logs, traces) are separate.
**Target**: Click on any K8s resource to see correlated telemetry.
**Implementation**:
- From any pod/deployment/service page, show:
- **Metrics**: CPU, memory, network filtered to that resource
- **Logs**: Logs from containers in that pod, filtered by K8s metadata attributes
- **Traces**: Traces originating from or passing through that service
- **Events**: Kubernetes events for that resource
- Use `k8sattributes` processor enrichment to correlate:
- `k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name` across all signals
- Deep link from incident timeline to K8s resource view
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/PodDetail.tsx` (add telemetry correlation tabs)
- `App/FeatureSet/Dashboard/src/Components/Kubernetes/ResourceTelemetryPanel.tsx` (new - reusable correlation panel)
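The correlation keys come straight from the `k8sattributes` processor's metadata extraction — a minimal fragment showing the attributes the panel would filter on:

```yaml
# Illustrative fragment: extract the attributes used to join metrics,
# logs, and traces back to a specific pod or workload.
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
```

With these present on every signal, the `ResourceTelemetryPanel` reduces to one filtered query per tab (e.g. `k8s.pod.name = X AND k8s.namespace.name = Y`).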
---
## Phase 4: Intelligence & Differentiation (P3) — Long-Term
### 4.1 Kubernetes Cost Attribution
- Track CPU and memory usage per namespace, workload, and label
- Calculate cost based on node instance pricing (configurable per cluster)
- Show cost trends over time, cost per team/project (via labels)
- Identify idle resources (requested but unused capacity)
- Recommendations: right-size requests/limits based on actual usage
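The cost calculation itself is back-of-envelope — a sketch, where the per-core and per-GB prices are the configurable per-cluster inputs mentioned above and the numbers are illustrative:

```typescript
// Illustrative cost-attribution math; prices and usage are examples,
// not real cloud pricing.
interface WorkloadUsage {
  namespace: string;
  cpuCoreHours: number;
  memoryGbHours: number;
}

function attributeCost(
  usage: WorkloadUsage,
  pricePerCoreHour: number,
  pricePerGbHour: number,
): number {
  return (
    usage.cpuCoreHours * pricePerCoreHour +
    usage.memoryGbHours * pricePerGbHour
  );
}
```

Idle-resource detection would compare the same usage numbers against requested capacity (request minus actual usage, priced the same way).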
### 4.2 Network Policy Monitoring
- Visualize network policies and their effect on pod communication
- Alert on denied network connections
- Integration with Cilium Hubble or Calico for deep network visibility
- Service dependency map derived from actual network traffic
### 4.3 eBPF-Based Deep Observability
- Kernel-level visibility without application instrumentation
- Automatic service discovery and dependency mapping
- DNS monitoring and latency
- TCP connection tracking and retransmit analysis
- Integration with tools like Tetragon, Pixie, or custom eBPF probes
### 4.4 Kubernetes Compliance and Security Monitoring
- Pod security standards compliance tracking
- RBAC audit logging and visualization
- Image vulnerability scanning status
- Network policy coverage analysis
- CIS Kubernetes Benchmark compliance scoring
### 4.5 GitOps Integration
- Track ArgoCD/Flux deployments as annotations on metric charts
- Correlate deployment events with performance changes
- Show deployment history per workload with rollback status
- Alert when deployment sync fails or drift is detected
---
## Quick Wins (Can Ship This Week)
1. **Enable Kubernetes MonitorType** - Uncomment the Kubernetes entry in `getAllMonitorTypeProps()` and wire it to existing telemetry monitors
2. **Add k8sattributes processor** - Enrich all existing OTLP data with K8s metadata for free
3. **Kubernetes dashboard template** - Create a basic cluster health dashboard using standard OTel K8s metric names
4. **K8s event alerting** - Use existing log monitors to alert on K8s Warning events once event ingestion is configured
5. **Document OTel Collector K8s setup** - Guide for users to configure their own OTel Collector with K8s receivers pointing to OneUptime
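For quick win 5, the documented user-side config could be as small as the fragment below. The endpoint URL and token header name are assumptions for illustration — they must be confirmed against OneUptime's actual telemetry ingestion docs before publishing:

```yaml
# Illustrative user-side collector config; endpoint and header name
# are placeholders, not confirmed OneUptime ingestion values.
receivers:
  k8s_cluster: {}
processors:
  k8sattributes: {}
exporters:
  otlphttp:
    endpoint: https://oneuptime.example.com/otlp
    headers:
      x-oneuptime-token: ${ONEUPTIME_TELEMETRY_TOKEN}
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster]
      processors: [k8sattributes]
      exporters: [otlphttp]
```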
---
## Recommended Implementation Order
1. **Quick Wins** - Enable MonitorType, k8sattributes processor, documentation
2. **Phase 1.1** - OTel Collector K8s receivers (prerequisite for everything else)
3. **Phase 1.5** - Kubernetes event ingestion (high value, uses existing log infrastructure)
4. **Phase 1.2** - Cluster overview dashboard template
5. **Phase 1.3** - Pod and container resource metrics pages
6. **Phase 1.4** - Node health monitoring pages
7. **Phase 2.1** - K8s-aware alert templates (makes monitoring actionable)
8. **Phase 2.2** - Resource inventory pages
9. **Phase 2.4** - Namespace quota monitoring
10. **Phase 2.3** - HPA/VPA monitoring
11. **Phase 3.1** - K8s log collection via DaemonSet
12. **Phase 3.4** - K8s-to-telemetry correlation
13. **Phase 3.2** - Multi-cluster support
14. **Phase 3.3** - Service mesh observability
15. **Phase 4.x** - Cost attribution, network policies, eBPF, compliance, GitOps
## Verification
For each feature:
1. Unit tests for new K8s metric query builders, resource models, and alert template logic
2. Integration tests for OTel Collector K8s receivers (use minikube or kind in CI)
3. Manual verification on a test cluster (minikube/kind) with representative workloads
4. Verify K8s metadata enrichment via `k8sattributes` processor across metrics, logs, and traces
5. Check ClickHouse query performance for K8s-specific queries (namespace filtering, resource correlation)
6. Load test with realistic cluster sizes (100+ nodes, 1000+ pods) to validate metric volume handling
7. Verify RBAC permissions are minimal (principle of least privilege for ClusterRole)
8. Test Helm chart upgrades to ensure K8s monitoring can be enabled without disruption