feat: add plans for Docker and Kubernetes monitoring implementation

Author: Nawaz Dhandala
Date: 2026-03-17 12:05:36 +00:00
Parent: 7741bebe31
Commit: 9b380d424d
2 changed files with 819 additions and 0 deletions

# Plan: Docker Container Monitoring for OneUptime
## Context
OneUptime's infrastructure monitoring currently supports Server/VM monitoring via the InfrastructureAgent (a Go-based agent that collects CPU, memory, disk, and process metrics) and has a commented-out Kubernetes monitor type. There is **no Docker container monitoring** today. Users running containerized workloads — whether on bare-metal Docker hosts, Docker Compose, or Docker Swarm — have no visibility into container-level health, resource consumption, or lifecycle events.
Docker container monitoring is a critical gap: Docker remains the dominant container runtime, and many users run workloads on Docker without Kubernetes. Competitors (Datadog, New Relic, Grafana Cloud) all provide first-class Docker monitoring. This plan proposes a phased implementation to close this gap.
## Gap Analysis Summary
| Feature | OneUptime | Datadog | New Relic | Priority |
|---------|-----------|---------|-----------|----------|
| Container discovery & inventory | None | Auto-discovery via agent | Auto-discovery via infra agent | **P0** |
| Container CPU/memory/network/disk metrics | None | Full metrics via cgroups | Full metrics via cgroups | **P0** |
| Container lifecycle events (start/stop/restart/OOM) | None | Event stream + alerts | Event stream + alerts | **P0** |
| Container health status monitoring | None | Health check integration | Health check integration | **P0** |
| Container log collection | None (generic OTLP only) | Auto-collected per container | Auto-collected per container | **P1** |
| Docker Compose service grouping | None | Auto-detection via labels | Label-based grouping | **P1** |
| Container image vulnerability scanning | None | Integrated via Snyk/Trivy | None | **P2** |
| Docker Swarm service monitoring | None | Full Swarm support | Limited | **P2** |
| Container resource limit alerts | None | OOM/throttle alerts | Threshold alerts | **P1** |
| Container networking (inter-container traffic) | None | Network map + flow data | Limited | **P2** |
| Live container exec / inspect | None | None | None | **P3** |
---
## Phase 1: Foundation (P0) — Container Discovery & Core Metrics
These are table-stakes features required for any Docker monitoring product.
### 1.1 Docker Monitor Type
**Current**: No Docker monitor type exists. Kubernetes is defined but commented out.
**Target**: Add a `Docker` monitor type with full UI integration.
**Implementation**:
- Add `Docker = "Docker"` to the `MonitorType` enum
- Add Docker to the "Infrastructure" monitor type category alongside Server and SNMP
- Add monitor type props (title: "Docker Container", description, icon: `IconProp.Cube`)
- Create `DockerMonitorResponse` interface for container metric reporting
- Add Docker to `getActiveMonitorTypes()` and relevant helper methods
**Files to modify**:
- `Common/Types/Monitor/MonitorType.ts` (add enum value, category, props)
- `Common/Types/Monitor/DockerMonitor/DockerMonitorResponse.ts` (new)
- `Common/Types/Monitor/DockerMonitor/DockerContainerMetrics.ts` (new)
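The enum and props additions in 1.1 might look roughly like the following sketch. The `MonitorTypeProps` shape and field names here are assumptions for illustration; the actual interface in `Common/Types/Monitor/MonitorType.ts` may differ.

```typescript
// Sketch: adding Docker to the MonitorType enum plus its display props.
// Only Docker is new; the other members mirror existing entries.
enum MonitorType {
  Server = "Server",
  SNMP = "SNMP",
  Docker = "Docker", // new
}

// Assumed props shape, not the actual OneUptime interface.
interface MonitorTypeProps {
  title: string;
  description: string;
  category: string;
}

const dockerMonitorProps: MonitorTypeProps = {
  title: "Docker Container",
  description:
    "Monitor Docker containers on hosts running the infrastructure agent.",
  category: "Infrastructure", // alongside Server and SNMP
};
```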
### 1.2 Container Metrics Collection in InfrastructureAgent
**Current**: The Go-based InfrastructureAgent collects host-level CPU, memory, disk, and process metrics.
**Target**: Extend the agent to discover and collect metrics from all running Docker containers on the host.
**Implementation**:
- Add a Docker collector module to the InfrastructureAgent that uses the Docker Engine API (via `/var/run/docker.sock` or configurable endpoint)
- Discover all running containers via `GET /containers/json`
- For each container, collect metrics via `GET /containers/{id}/stats?stream=false`:
- **CPU**: `cpu_stats.cpu_usage.total_usage`, `cpu_stats.system_cpu_usage`, per-core usage, throttled periods/time
- **Memory**: `memory_stats.usage`, `memory_stats.limit`, `memory_stats.stats.cache`, RSS, swap, working set, OOM kill count
- **Network**: `networks.*.rx_bytes`, `tx_bytes`, `rx_packets`, `tx_packets`, `rx_errors`, `tx_errors`, `rx_dropped`, `tx_dropped` (per interface)
- **Block I/O**: `blkio_stats.io_service_bytes_recursive` (read/write bytes), `io_serviced_recursive` (read/write ops)
- **PIDs**: `pids_stats.current`, `pids_stats.limit`
- Collect container metadata: name, image, image ID, labels, created time, status, health check status, restart count, ports, mounts, environment (filtered for sensitive values)
- Report interval: configurable, default 30 seconds (matching existing server monitor interval)
**Files to modify**:
- `InfrastructureAgent/collector/docker.go` (new - Docker metrics collector)
- `InfrastructureAgent/model/docker.go` (new - Docker metric data models)
- `InfrastructureAgent/agent.go` (add Docker collection to the main loop)
- `InfrastructureAgent/config.go` (add Docker-related configuration: socket path, collection enabled/disabled)
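The CPU figure in 1.2 is derived, not read directly: the stats endpoint returns cumulative counters, so the collector computes a percentage from the delta between two samples, the same way `docker stats` does. The agent itself is Go; the arithmetic is sketched here in TypeScript with field names following the Docker Engine API stats response.

```typescript
// Sketch: container CPU percent from two consecutive
// GET /containers/{id}/stats samples (delta method used by `docker stats`).
interface CpuSample {
  totalUsage: number;  // cpu_stats.cpu_usage.total_usage (nanoseconds)
  systemUsage: number; // cpu_stats.system_cpu_usage (nanoseconds)
  onlineCpus: number;  // cpu_stats.online_cpus
}

function cpuPercent(prev: CpuSample, curr: CpuSample): number {
  const cpuDelta: number = curr.totalUsage - prev.totalUsage;
  const systemDelta: number = curr.systemUsage - prev.systemUsage;
  if (systemDelta <= 0 || cpuDelta < 0) {
    return 0; // counters reset (e.g. container restarted): report zero
  }
  return (cpuDelta / systemDelta) * curr.onlineCpus * 100;
}
```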
### 1.3 Container Inventory & Discovery
**Current**: No container awareness.
**Target**: Auto-discover containers on monitored hosts and maintain a live inventory.
**Implementation**:
- Create a `DockerContainer` PostgreSQL model to store discovered containers:
- `containerId` (Docker container ID)
- `containerName`
- `imageName`, `imageId`, `imageTag`
- `status` (running, paused, stopped, restarting, dead, created)
- `healthStatus` (healthy, unhealthy, starting, none)
- `labels` (JSON)
- `createdAt` (container creation time)
- `startedAt`
- `hostMonitorId` (reference to the Server monitor for the host)
- `projectId`
- `restartCount`
- `ports` (JSON - exposed ports mapping)
- `mounts` (JSON - volume mounts)
- `cpuLimit`, `memoryLimit` (resource constraints)
- On each agent report, upsert container records (create new, update existing, mark removed containers as stopped)
- Container inventory page in the dashboard showing all containers across all monitored hosts
**Files to modify**:
- `Common/Models/DatabaseModels/DockerContainer.ts` (new)
- `Common/Server/Services/DockerContainerService.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainers.tsx` (new - container list page)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (new - single container detail)
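The upsert step in 1.3 is a reconciliation pass: everything the agent reports is created or updated, and anything previously stored but no longer reported is marked stopped. A minimal sketch, with illustrative shapes rather than the actual `DockerContainerService` API:

```typescript
// Sketch: reconcile an agent's reported containers against stored inventory.
interface ContainerRecord {
  containerId: string;
  status: string;
}

function reconcile(
  stored: ContainerRecord[],
  reported: ContainerRecord[],
): ContainerRecord[] {
  const reportedIds: Set<string> = new Set(
    reported.map((c: ContainerRecord) => c.containerId),
  );
  // Containers no longer reported by the agent are marked stopped.
  const stopped: ContainerRecord[] = stored
    .filter((c: ContainerRecord) => !reportedIds.has(c.containerId))
    .map((c: ContainerRecord) => ({ ...c, status: "stopped" }));
  // Reported containers are upserted as-is (create new, update existing).
  return [...reported, ...stopped];
}
```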
### 1.4 Container Lifecycle Events
**Current**: No container event tracking.
**Target**: Capture and surface container lifecycle events (start, stop, restart, OOM kill, health check failures).
**Implementation**:
- In the InfrastructureAgent, subscribe to Docker events via `GET /events?filters={"type":["container"]}` (long-poll/streaming)
- Capture events: `start`, `stop`, `die`, `kill`, `oom`, `restart`, `pause`, `unpause`, `health_status`
- Include exit code, OOM killed flag, and signal information for `die` events
- Report events to OneUptime alongside metric data
- Store events in the existing telemetry pipeline (as structured logs or a dedicated events table)
- Surface events as an overlay on container metric charts (vertical markers)
- Enable alerting on lifecycle events (e.g., alert on OOM kill, alert on restart count > N in time window)
**Files to modify**:
- `InfrastructureAgent/collector/docker_events.go` (new - event listener)
- `Common/Types/Monitor/DockerMonitor/DockerContainerEvent.ts` (new)
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (new - process container reports, evaluate criteria)
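The event listener in 1.4 mostly translates raw Docker events into the internal event shape. A sketch of that mapping, where the raw field names (`Type`, `Action`, `Actor.Attributes.exitCode`) follow the Docker Engine API events stream and `DockerContainerEvent` is an assumed shape:

```typescript
// Sketch: classify a raw Docker event (GET /events) into a lifecycle event.
interface RawDockerEvent {
  Type: string;   // "container", "network", "image", ...
  Action: string; // "start", "die", "oom", "restart", ...
  Actor: { ID: string; Attributes: Record<string, string> };
}

interface DockerContainerEvent {
  containerId: string;
  action: string;
  exitCode?: number; // present on "die" events
  oomKilled: boolean;
}

function toContainerEvent(raw: RawDockerEvent): DockerContainerEvent | null {
  if (raw.Type !== "container") {
    return null; // only container events are of interest here
  }
  return {
    containerId: raw.Actor.ID,
    action: raw.Action,
    exitCode: raw.Actor.Attributes["exitCode"]
      ? Number(raw.Actor.Attributes["exitCode"])
      : undefined,
    oomKilled: raw.Action === "oom",
  };
}
```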
---
## Phase 2: Alerting & Monitoring Rules (P0-P1) — Actionable Monitoring
### 2.1 Container Health Check Monitoring
**Current**: No health check awareness.
**Target**: Monitor Docker health check status and alert on unhealthy containers.
**Implementation**:
- Extract health check status from container inspect data (`State.Health.Status`, `State.Health.FailingStreak`, `State.Health.Log`)
- Add monitor criteria for health check status:
- Alert when container health transitions to `unhealthy`
- Alert when failing streak exceeds threshold
- Surface health check log output in alert details
- Add health status column to the container inventory table
**Files to modify**:
- `Common/Server/Utils/Monitor/Criteria/DockerContainerCriteria.ts` (new)
- `Common/Types/Monitor/CriteriaFilter.ts` (add Docker-specific filter types)
### 2.2 Container Resource Threshold Alerts
**Current**: Server monitor supports CPU/memory threshold alerts at the host level.
**Target**: Per-container resource threshold alerting with limit-aware thresholds.
**Implementation**:
- Add monitor criteria for container-level metrics:
- CPU usage % (of container limit or host total)
- Memory usage % (of container limit)
- Memory usage absolute (approaching container limit)
- Network error rate
- Block I/O throughput
- Restart count in time window
- PID count approaching limit
- **Limit-aware alerting**: When a container has resource limits set, calculate usage as a percentage of the limit rather than host total
- E.g., a container with a 2GB memory limit using 1.8GB is at 90% of its limit (alert-worthy), not at 1.8GB/64GB of host memory = 2.8% (misleading)
- Support compound criteria (e.g., CPU > 80% AND memory > 90% for 5 minutes)
**Files to modify**:
- `Common/Server/Utils/Monitor/Criteria/DockerContainerCriteria.ts` (extend)
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (criteria evaluation)
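The limit-aware calculation in 2.2 reduces to choosing the right denominator. A minimal sketch, with illustrative names:

```typescript
// Sketch: memory usage percent against the container limit when one is set,
// falling back to host memory when unlimited.
function memoryUsagePercent(
  usageBytes: number,
  containerLimitBytes: number | null, // null = no limit configured
  hostTotalBytes: number,
): number {
  const denominator: number =
    containerLimitBytes !== null && containerLimitBytes > 0
      ? containerLimitBytes
      : hostTotalBytes;
  return (usageBytes / denominator) * 100;
}
```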
### 2.3 Container Auto-Restart Detection
**Current**: No restart tracking.
**Target**: Detect and alert on container restart loops (CrashLoopBackOff equivalent for Docker).
**Implementation**:
- Track restart count per container over sliding time windows (5 min, 15 min, 1 hour)
- Alert when restart count exceeds configurable threshold
- Include container exit code and last log lines in the alert context
- Dashboard widget showing containers with highest restart frequency
**Files to modify**:
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (add restart loop detection)
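The sliding-window check in 2.3 can be sketched as follows; in practice the worker job would read restart timestamps from stored lifecycle events, and the names here are illustrative:

```typescript
// Sketch: restart-loop detection over a sliding time window.
function isRestartLoop(
  restartTimesMs: number[], // timestamps of observed restarts
  nowMs: number,
  windowMs: number,         // e.g. 5 min = 300_000
  threshold: number,        // alert when restarts in window >= threshold
): boolean {
  const inWindow: number = restartTimesMs.filter(
    (t: number) => nowMs - t <= windowMs,
  ).length;
  return inWindow >= threshold;
}
```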
---
## Phase 3: Visualization & UX (P1) — Container Dashboard
### 3.1 Container Overview Dashboard
**Current**: No container UI.
**Target**: Dedicated container monitoring pages with rich visualizations.
**Implementation**:
- **Container List Page**: Table with columns for name, image, status, health, CPU%, memory%, network I/O, uptime, restart count. Sortable, filterable, searchable
- **Container Detail Page**: Single-container view with:
- Header: container name, image, status badge, health badge, uptime
- Metrics charts: CPU, memory, network, block I/O (time series, matching existing metric chart style)
- Events timeline: lifecycle events overlaid on charts
- Container metadata: labels, ports, mounts, environment variables (filtered), resource limits
- Processes: top processes inside the container (if available via `docker top`)
- Logs: recent container logs (linked to log management if available)
- **Host-Container Relationship**: From the existing Server monitor detail page, add a "Containers" tab showing all containers on that host
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainers.tsx` (new - list view)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (new - detail view)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerMetricsCharts.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerEventsTimeline.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerMetadataPanel.tsx` (new)
### 3.2 Container Map / Topology View
**Current**: No topology visualization.
**Target**: Visual map showing containers, their host, and network relationships.
**Implementation**:
- Show containers grouped by host
- Color-code by status (green=healthy, yellow=warning, red=unhealthy/stopped)
- Show network links between containers on the same Docker network
- Click to drill into container detail
- Show Docker Compose project grouping via labels
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerTopology.tsx` (new)
---
## Phase 4: Container Log Collection (P1) — Unified Observability
### 4.1 Automatic Container Log Collection
**Current**: Log collection requires explicit OTLP/Fluentd/Syslog integration per application.
**Target**: Automatically collect logs from all Docker containers via the InfrastructureAgent.
**Implementation**:
- Add a log collector to the InfrastructureAgent using `GET /containers/{id}/logs?stdout=true&stderr=true&follow=true&tail=100`
- Automatically enrich logs with container metadata:
- `container.id`, `container.name`, `container.image.name`, `container.image.tag`
- Host information (hostname, OS)
- Docker labels as log attributes
- Forward logs to OneUptime's telemetry ingestion endpoint (OTLP format)
- Configurable:
- Enable/disable per container (via label `oneuptime.logs.enabled=true/false`)
- Max log line size
- Log rate limiting (to prevent noisy container flooding)
- Include/exclude containers by name pattern or label selector
**Files to modify**:
- `InfrastructureAgent/collector/docker_logs.go` (new - log collector)
- `InfrastructureAgent/config.go` (add log collection config)
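The per-container decision logic in 4.1 combines the label override with the global config. A sketch, assuming the `oneuptime.logs.enabled` label key proposed above and an illustrative config shape:

```typescript
// Sketch: should the agent collect logs for this container?
// An explicit label wins over exclusion patterns and the global default.
interface LogCollectionConfig {
  enabledByDefault: boolean;
  excludeNamePattern?: RegExp; // e.g. /^tmp-/ to skip one-off containers
}

function shouldCollectLogs(
  containerName: string,
  labels: Record<string, string>,
  config: LogCollectionConfig,
): boolean {
  const label: string | undefined = labels["oneuptime.logs.enabled"];
  if (label === "true") return true;
  if (label === "false") return false;
  if (config.excludeNamePattern?.test(containerName)) return false;
  return config.enabledByDefault;
}
```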
### 4.2 Container Log Correlation
**Current**: No automatic correlation between container logs and container metrics.
**Target**: Link container logs to container metrics and events for unified troubleshooting.
**Implementation**:
- Automatically tag container logs with `container.id` and `container.name` attributes
- In the container detail page, add a "Logs" tab that pre-filters the log viewer to the container's logs
- When viewing a metric anomaly or event, show a link to "View logs around this time"
- In the log detail view, show a link to "View container metrics" when `container.id` attribute is present
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (add Logs tab)
- `App/FeatureSet/Dashboard/src/Components/Logs/LogDetailsPanel.tsx` (add container link)
---
## Phase 5: Docker Compose & Swarm Support (P1-P2) — Multi-Container Orchestration
### 5.1 Docker Compose Project Grouping
**Current**: Containers are flat, no grouping.
**Target**: Automatically detect Docker Compose projects and group containers by service.
**Implementation**:
- Detect Compose projects via standard labels:
- `com.docker.compose.project` (project name)
- `com.docker.compose.service` (service name)
- `com.docker.compose.container-number` (replica number)
- `com.docker.compose.oneoff` (one-off vs service container)
- Create a Compose project view showing:
- Project name with list of services
- Per-service status (all replicas healthy, degraded, down)
- Per-service aggregated metrics (total CPU, memory across replicas)
- Service dependency visualization (if depends_on info is available via labels)
- Alert at the service level (e.g., "all replicas of service X are down")
**Files to modify**:
- `Common/Models/DatabaseModels/DockerComposeProject.ts` (new)
- `Common/Server/Services/DockerComposeProjectService.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerComposeProjects.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerComposeProjectDetail.tsx` (new)
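The grouping step in 5.1 buckets containers into project and then service using the standard Compose labels listed above. A sketch with illustrative shapes:

```typescript
// Sketch: group containers by Compose project -> service via labels.
interface LabeledContainer {
  name: string;
  labels: Record<string, string>;
}

type ComposeGroups = Map<string, Map<string, LabeledContainer[]>>;

function groupByComposeService(containers: LabeledContainer[]): ComposeGroups {
  const groups: ComposeGroups = new Map();
  for (const c of containers) {
    const project: string | undefined = c.labels["com.docker.compose.project"];
    const service: string | undefined = c.labels["com.docker.compose.service"];
    if (!project || !service) continue; // not managed by Compose
    if (!groups.has(project)) groups.set(project, new Map());
    const services = groups.get(project)!;
    if (!services.has(service)) services.set(service, []);
    services.get(service)!.push(c);
  }
  return groups;
}
```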
### 5.2 Docker Swarm Monitoring
**Current**: No Swarm support.
**Target**: Monitor Docker Swarm services, tasks, and nodes.
**Implementation**:
- Detect if the Docker host is a Swarm manager node
- Collect Swarm-specific data:
- Services: `GET /services` (desired/running replicas, update status)
- Tasks: `GET /tasks` (task state, assigned node, error messages)
- Nodes: `GET /nodes` (availability, status, resource capacity)
- Surface Swarm service health: desired replicas vs running replicas
- Alert when a service is degraded (running < desired) or when tasks fail
- Swarm-specific dashboard showing cluster overview
**Files to modify**:
- `InfrastructureAgent/collector/docker_swarm.go` (new)
- `Common/Types/Monitor/DockerMonitor/DockerSwarmMetrics.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerSwarm.tsx` (new)
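The desired-vs-running comparison in 5.2 can be reduced to a small state function. The replica counts would come from `GET /services` and `GET /tasks`; this is an illustrative sketch, not the actual criteria engine:

```typescript
// Sketch: classify a Swarm service from desired replicas vs running tasks.
type SwarmServiceState = "healthy" | "degraded" | "down";

function swarmServiceState(
  desiredReplicas: number,
  runningTasks: number,
): SwarmServiceState {
  if (runningTasks === 0 && desiredReplicas > 0) return "down";
  if (runningTasks < desiredReplicas) return "degraded";
  return "healthy";
}
```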
---
## Phase 6: Advanced Features (P2-P3) — Differentiation
### 6.1 Container Image Analysis
**Current**: No image awareness beyond name/tag.
**Target**: Track image versions across containers and optionally scan for known vulnerabilities.
**Implementation**:
- Maintain an image registry (image name, tag, digest, size, creation date)
- Show which containers are running outdated images (when a newer tag is available locally)
- Optional: integrate with Trivy or Grype for vulnerability scanning of local images
- Dashboard showing image inventory, version distribution, and vulnerability summary
### 6.2 Container Resource Recommendations
**Current**: No resource guidance.
**Target**: Recommend CPU/memory limits based on observed usage patterns.
**Implementation**:
- Analyze historical container metrics (p95 CPU, p99 memory over 7 days)
- Compare actual usage to configured limits
- Flag over-provisioned containers (limit >> usage) and under-provisioned containers (usage approaching limit)
- Generate recommendations: "Container X uses max 256MB, but has a 4GB limit — consider reducing to 512MB"
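A sketch of the heuristic behind 6.2, matching the 256MB/4GB/512MB example above. The 90% pressure cutoff, the 4x over-provisioning factor, and the 2x headroom multiplier are assumptions for illustration, not tuned values:

```typescript
// Sketch: flag containers whose memory limit is far above (or dangerously
// close to) observed p99 usage, and suggest a new limit with 2x headroom.
interface LimitRecommendation {
  kind: "over-provisioned" | "under-provisioned" | "ok";
  suggestedLimitBytes?: number;
}

function recommendMemoryLimit(
  p99UsageBytes: number,
  limitBytes: number,
): LimitRecommendation {
  if (p99UsageBytes > limitBytes * 0.9) {
    // Usage approaching the limit: suggest doubling it.
    return { kind: "under-provisioned", suggestedLimitBytes: limitBytes * 2 };
  }
  if (limitBytes > p99UsageBytes * 4) {
    // Limit far above observed peak: suggest 2x headroom over peak.
    return { kind: "over-provisioned", suggestedLimitBytes: p99UsageBytes * 2 };
  }
  return { kind: "ok" };
}
```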
### 6.3 Container Diff / Change Detection
**Current**: No change tracking.
**Target**: Detect when container configuration changes (image update, env var change, port mapping change).
**Implementation**:
- Store container configuration snapshots on each agent report
- Diff against previous snapshot and generate change events
- Alert on unexpected configuration changes
- Show change history in the container detail page
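The diff step in 6.3 can be sketched as a comparison of two flattened configuration snapshots, emitting one change event per differing key (shapes are illustrative):

```typescript
// Sketch: diff two container config snapshots into change events.
type Snapshot = Record<string, string>;

interface ChangeEvent {
  key: string;
  before?: string;
  after?: string;
}

function diffSnapshots(prev: Snapshot, curr: Snapshot): ChangeEvent[] {
  const changes: ChangeEvent[] = [];
  const keys: Set<string> = new Set([
    ...Object.keys(prev),
    ...Object.keys(curr),
  ]);
  for (const key of keys) {
    if (prev[key] !== curr[key]) {
      changes.push({ key, before: prev[key], after: curr[key] });
    }
  }
  return changes;
}
```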
---
## Quick Wins (Can Ship First)
1. **Add Docker monitor type** — Add the enum value, category, and props (no collection yet, but enables the UI scaffolding)
2. **Basic container discovery** — Extend InfrastructureAgent to list running containers and report names, images, status
3. **Container CPU/memory metrics** — Collect basic cgroup stats via Docker stats API
4. **Container inventory page** — Simple table showing discovered containers across hosts
---
## Recommended Implementation Order
1. **Phase 1.1** — Docker monitor type (enum, UI scaffolding)
2. **Phase 1.2** — Container metrics collection in InfrastructureAgent
3. **Phase 1.3** — Container inventory & discovery
4. **Phase 3.1** — Container overview dashboard (list + detail pages)
5. **Phase 1.4** — Container lifecycle events
6. **Phase 2.1** — Container health check monitoring
7. **Phase 2.2** — Container resource threshold alerts
8. **Phase 4.1** — Automatic container log collection
9. **Phase 5.1** — Docker Compose project grouping
10. **Phase 2.3** — Container auto-restart detection
11. **Phase 3.2** — Container map / topology view
12. **Phase 4.2** — Container log correlation
13. **Phase 5.2** — Docker Swarm monitoring
14. **Phase 6.1** — Container image analysis
15. **Phase 6.2** — Container resource recommendations
16. **Phase 6.3** — Container diff / change detection
## Verification
For each feature:
1. Unit tests for new Docker metric collection, parsing, and criteria evaluation
2. Integration tests for container discovery, metric ingestion, and alerting APIs
3. Manual verification with a Docker host running multiple containers (various states: healthy, unhealthy, restarting, OOM)
4. Test with Docker Compose multi-service applications
5. Performance test: verify agent overhead is minimal (< 1% CPU) when monitoring 50+ containers
6. Verify container metrics accuracy by comparing agent-reported values to `docker stats` output
7. Test graceful handling of Docker daemon unavailability (agent should not crash, should report connection failure)

# Plan: Kubernetes Monitoring for OneUptime
## Context
OneUptime has foundational infrastructure for Kubernetes monitoring: OTLP ingestion (HTTP and gRPC), ClickHouse metric/log/trace storage, telemetry-based monitors (Metrics, Logs, Traces), and a Helm chart for deploying OneUptime itself on Kubernetes. A `Kubernetes` monitor type exists in the `MonitorType` enum but is currently disabled and has no implementation. The OpenTelemetry Collector config supports OTLP receivers but has no Kubernetes-specific receivers (kubelet, kube-state-metrics, Prometheus). Server monitoring exists but is limited to basic VM-level checks.
This plan proposes a phased implementation to deliver first-class Kubernetes monitoring — from cluster health and workload observability to intelligent alerting — leveraging OneUptime's all-in-one observability platform (metrics, logs, traces, incidents, status pages).
## Completed
- **OTLP Metric Ingestion** - HTTP and gRPC metric ingestion with async queue-based batch processing
- **ClickHouse Metric Storage** - MergeTree with partitioning, per-service TTL
- **Telemetry-Based Monitors** - Metric, Log, Trace, and Exception monitors with configurable criteria
- **Helm Chart** - OneUptime deploys on Kubernetes with KEDA auto-scaling support
- **OpenTelemetry Collector** - Deployed via Helm, accepts OTLP on ports 4317/4318
- **MonitorType.Kubernetes** - Enum value defined (but disabled and unimplemented)
## Gap Analysis Summary
| Feature | OneUptime | Datadog | New Relic | Grafana/Prometheus | Priority |
|---------|-----------|---------|-----------|-------------------|----------|
| K8s metric collection (kubelet, kube-state-metrics) | None | Agent auto-discovery | K8s integration | Prometheus + kube-state-metrics | **P0** |
| Cluster overview dashboard | None | Out-of-box | Pre-built | Pre-built via mixins | **P0** |
| Pod/Container resource metrics | None | Live Containers | K8s cluster explorer | cAdvisor + Grafana | **P0** |
| Node health monitoring | None | Host Map + agent | Infrastructure UI | node-exporter + Grafana | **P0** |
| Kubernetes event ingestion | None | Auto-collected | K8s events integration | Eventrouter/Exporter | **P0** |
| Workload health alerts (CrashLoopBackOff, OOMKilled, etc.) | None | Auto-monitors | Pre-built alerts | PrometheusRule CRDs | **P1** |
| Namespace/workload cost attribution | None | Container cost allocation | None | Kubecost integration | **P1** |
| K8s resource inventory (deployments, services, ingresses) | None | Orchestrator Explorer | Cluster explorer | None native | **P1** |
| HPA/VPA monitoring | None | Yes | Partial | Prometheus metrics | **P1** |
| Multi-cluster support | None | Yes | Yes | Thanos/Cortex | **P2** |
| K8s log collection (pod stdout/stderr) | Via Fluentd example | DaemonSet agent | Fluent Bit integration | Loki + Promtail | **P2** |
| Service mesh observability (Istio, Linkerd) | None | Yes | Yes | Partial | **P2** |
| Network policy monitoring | None | NPM | None | Cilium Hubble | **P3** |
| eBPF-based deep observability | None | Universal Service Monitoring | Pixie | Cilium/Tetragon | **P3** |
---
## Phase 1: Foundation (P0) — Kubernetes Metric Collection & Visibility
Without these, OneUptime cannot monitor any Kubernetes cluster. This phase makes K8s metrics flow into the platform and provides basic visibility.
### 1.1 OpenTelemetry Collector Kubernetes Receivers
**Current**: OTel Collector only has OTLP receivers. No Kubernetes-specific metric collection.
**Target**: Pre-configured OTel Collector with receivers for kubelet, kube-state-metrics, and Kubernetes events.
**Implementation**:
- Add `kubeletstats` receiver to the OTel Collector config for node and pod resource metrics:
- CPU, memory, filesystem, network per node and per pod/container
- Collection interval: 30s
- Auth via serviceAccount token
- Add `k8s_cluster` receiver for cluster-level metrics from the Kubernetes API:
- Deployment, ReplicaSet, StatefulSet, DaemonSet replica counts and status
- Pod phase, container states (waiting/running/terminated with reasons)
- Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
- Namespace resource quotas and limit ranges
- HPA current/desired replicas
- Add `k8sobjects` receiver for Kubernetes events:
- Watch Events API for Warning and Normal events
- Ingest as logs with structured attributes (reason, involvedObject, message)
- Add the `k8s_events` receiver as a lightweight alternative for event collection
- Configure `k8sattributes` processor to enrich all telemetry with K8s metadata:
- Pod name, namespace, node, deployment, replicaset, labels, annotations
- Provide Helm values to enable/disable K8s monitoring and configure which namespaces to monitor
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add kubeletstats, k8s_cluster, k8sobjects receivers and k8sattributes processor)
- `HelmChart/Public/oneuptime/templates/otel-collector.yaml` (restore and configure OTel Collector deployment with proper RBAC)
- `HelmChart/Public/oneuptime/templates/otel-collector-rbac.yaml` (new - ClusterRole, ClusterRoleBinding, ServiceAccount for K8s API access)
- `HelmChart/Public/oneuptime/values.yaml` (add kubernetesMonitoring config section)
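The receiver and processor additions above might look roughly like this in `otel-collector-config.template.yaml`. The component names (`kubeletstats`, `k8s_cluster`, `k8sobjects`, `k8sattributes`) are real OpenTelemetry Collector Contrib components; the specific option values shown are an illustrative sketch, and the `service.pipelines` wiring plus RBAC are omitted:

```yaml
receivers:
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount
    endpoint: ${env:K8S_NODE_NAME}:10250
  k8s_cluster:
    collection_interval: 30s
  k8sobjects:
    objects:
      - name: events
        mode: watch

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.deployment.name
```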
### 1.2 Kubernetes Cluster Overview Dashboard Template
**Current**: No pre-built Kubernetes dashboards.
**Target**: Auto-generated cluster overview dashboard showing key health indicators.
**Implementation**:
- Create a dashboard template with the following panels:
- **Cluster Summary**: Total nodes, pods (running/pending/failed), namespaces, deployments
- **Node Health**: CPU and memory utilization per node, node conditions
- **Pod Status**: Pod phase distribution (Running/Pending/Succeeded/Failed/Unknown)
- **Resource Utilization**: Cluster-wide CPU and memory usage vs capacity (requests, limits, actual)
- **Top Consumers**: Top 10 pods by CPU usage, top 10 by memory usage
- **Recent Events**: Kubernetes Warning events stream
- **Container Restarts**: Pods with highest restart counts
- Auto-detect K8s metrics and offer dashboard creation during onboarding
- Use template variables for namespace and node filtering
**Files to modify**:
- `Common/Types/Dashboard/Templates/KubernetesCluster.ts` (new - cluster overview template)
- `Common/Types/Dashboard/Templates/KubernetesWorkload.ts` (new - per-namespace workload template)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (add K8s templates to gallery)
### 1.3 Pod and Container Resource Metrics
**Current**: No container-level visibility.
**Target**: Detailed resource metrics for every pod and container with drill-down.
**Implementation**:
- Ensure the following kubeletstats metrics are ingested and queryable:
- `k8s.pod.cpu.utilization`, `k8s.pod.memory.usage`, `k8s.pod.memory.rss`
- `k8s.pod.network.io` (rx/tx bytes), `k8s.pod.filesystem.usage`
- `container.cpu.utilization`, `container.memory.usage`, `container.restarts`
- Create a "Kubernetes" section in the dashboard navigation:
- Cluster > Namespace > Workload > Pod > Container drill-down hierarchy
- Pod detail page showing: resource usage over time, container statuses, events, logs (linked), traces (linked)
- Calculate resource efficiency: actual usage vs requests vs limits
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/` (new directory)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/ClusterOverview.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Namespaces.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Pods.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/PodDetail.tsx` (new)
### 1.4 Node Health Monitoring
**Current**: No node-level metrics.
**Target**: Per-node resource utilization, conditions, and capacity tracking.
**Implementation**:
- Ingest node metrics via kubeletstats receiver:
- `k8s.node.cpu.utilization`, `k8s.node.memory.usage`, `k8s.node.memory.available`
- `k8s.node.filesystem.usage`, `k8s.node.filesystem.capacity`
- `k8s.node.network.io`, `k8s.node.condition` (Ready, MemoryPressure, etc.)
- Node list page: table with all nodes showing CPU%, memory%, disk%, conditions, pod count
- Node detail page: time-series charts for resource usage, pod list on node, events
- Node capacity planning: show allocatable vs requested vs used per node
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Nodes.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/NodeDetail.tsx` (new)
### 1.5 Kubernetes Event Ingestion
**Current**: No Kubernetes event collection.
**Target**: Ingest and surface Kubernetes events as structured logs with correlation to resources.
**Implementation**:
- Configure `k8sobjects` receiver to watch Kubernetes Events
- Map events to structured log entries:
- `severity` from event type (Warning -> WARN, Normal -> INFO)
- `body` from event message
- Attributes: `k8s.event.reason`, `k8s.event.count`, `k8s.object.kind`, `k8s.object.name`, `k8s.namespace.name`
- Create a dedicated "Kubernetes Events" view:
- Filterable by namespace, event reason, object kind
- Timeline visualization showing event frequency
- Link events to related pods/deployments/nodes
- Alert on specific event patterns (e.g., repeated FailedScheduling, FailedMount)
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add k8sobjects receiver)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Events.tsx` (new)
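The mapping in 1.5 is mechanical enough to sketch directly. The input shape paraphrases the core/v1 Event object, the output attribute keys follow the plan above, and both shapes are illustrative:

```typescript
// Sketch: convert a Kubernetes Event into a structured log entry.
interface K8sEvent {
  type: "Normal" | "Warning";
  reason: string;
  message: string;
  count: number;
  involvedObject: { kind: string; name: string; namespace: string };
}

interface LogEntry {
  severity: "INFO" | "WARN";
  body: string;
  attributes: Record<string, string | number>;
}

function eventToLog(event: K8sEvent): LogEntry {
  return {
    severity: event.type === "Warning" ? "WARN" : "INFO",
    body: event.message,
    attributes: {
      "k8s.event.reason": event.reason,
      "k8s.event.count": event.count,
      "k8s.object.kind": event.involvedObject.kind,
      "k8s.object.name": event.involvedObject.name,
      "k8s.namespace.name": event.involvedObject.namespace,
    },
  };
}
```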
---
## Phase 2: Intelligent Alerting & Workload Health (P1) — Actionable Monitoring
### 2.1 Kubernetes-Aware Alert Templates
**Current**: Generic metric threshold alerts only. Users must manually configure alerts for K8s failure modes.
**Target**: Pre-built alert templates for common Kubernetes failure patterns.
**Implementation**:
- Create alert templates for critical K8s conditions:
- **CrashLoopBackOff**: Alert when `k8s.container.restarts` increases rapidly (> N restarts in M minutes)
- **OOMKilled**: Alert on container termination reason = OOMKilled
- **Pod Pending**: Alert when pods remain in Pending phase for > N minutes
- **Node NotReady**: Alert when node condition transitions to NotReady
- **High Resource Utilization**: Alert when node CPU > 90% or memory > 85% sustained
- **Deployment Replica Mismatch**: Alert when available replicas < desired replicas for > N minutes
- **PVC Disk Full**: Alert when PV usage > 90% capacity
- **Failed Scheduling**: Alert on repeated FailedScheduling events
- **Image Pull Failures**: Alert on ErrImagePull/ImagePullBackOff events
- **Job/CronJob Failures**: Alert when job completion fails
- One-click enable for each alert template during K8s monitoring setup
- Auto-route alerts to the OneUptime incident management system
**Files to modify**:
- `Common/Types/Monitor/Templates/KubernetesAlertTemplates.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/AlertSetup.tsx` (new - guided alert configuration)
- `Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts` (support K8s-specific criteria evaluation)
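The CrashLoopBackOff template above hinges on one subtlety: `k8s.container.restarts` is a cumulative counter, so the criterion should fire on counter growth within a window, not on the raw value. A sketch of that evaluation, with illustrative shapes rather than the actual criteria engine:

```typescript
// Sketch: detect rapid restart growth from a cumulative counter series.
interface CounterPoint {
  timestampMs: number;
  value: number; // cumulative k8s.container.restarts reading
}

function crashLoopDetected(
  points: CounterPoint[], // ordered oldest -> newest
  windowMs: number,       // M minutes
  maxRestarts: number,    // N restarts
): boolean {
  if (points.length < 2) return false;
  const newest: CounterPoint = points[points.length - 1];
  // Earliest point still inside the window serves as the baseline.
  const base: CounterPoint =
    points.find((p: CounterPoint) => newest.timestampMs - p.timestampMs <= windowMs) ??
    newest;
  return newest.value - base.value > maxRestarts;
}
```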
### 2.2 Kubernetes Resource Inventory
**Current**: No visibility into K8s resource state.
**Target**: Live inventory of Kubernetes resources with health status.
**Implementation**:
- Create a `KubernetesResource` model stored in ClickHouse (or PostgreSQL depending on query patterns):
- Kind, name, namespace, labels, annotations, status, conditions, timestamps
- Updated via the `k8s_cluster` receiver or periodic API sync
- Resource pages:
- **Deployments**: List with replica status (ready/desired), last update, strategy
- **StatefulSets**: Ordered pod status, PVC bindings
- **DaemonSets**: Node coverage, desired vs current vs ready
- **Services**: Type (ClusterIP/NodePort/LoadBalancer), endpoints, selector
- **Ingresses**: Host rules, backend services, TLS status
- **ConfigMaps/Secrets**: List with last-modified (secrets show metadata only, never values)
- **PVCs**: Bound PV, capacity, access modes, storage class
- Drill-down from any resource to its associated pods, events, and telemetry
**Files to modify**:
- `Common/Models/AnalyticsModels/KubernetesResource.ts` (new)
- `Telemetry/Services/KubernetesResourceService.ts` (new - sync K8s resources)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Resources/` (new - pages for each resource kind)
### 2.3 HPA and VPA Monitoring
**Current**: No autoscaler visibility.
**Target**: Track HPA/VPA behavior and scaling events.
**Implementation**:
- Ingest HPA metrics from `k8s_cluster` receiver:
- `k8s.hpa.current_replicas`, `k8s.hpa.desired_replicas`, `k8s.hpa.min_replicas`, `k8s.hpa.max_replicas`
- Target metric values vs actual
- HPA overview page:
- List all HPAs with current/desired/min/max replicas
- Time-series chart showing scaling events overlaid with the target metric
- Alert when an HPA stays pinned at max replicas for a sustained period (capacity ceiling)
- Alert when scale-up frequency is abnormally high (thrashing)
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Autoscaling.tsx` (new)
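The "thrashing" heuristic above can be sketched as a sliding-window count of scale-up events — window size and event limit here are illustrative defaults, not tuned recommendations:

```typescript
// Returns true when scale-up events within the window exceed the limit.
// Defaults (15 min window, >5 events) are illustrative, not tuned values.
function isHpaThrashing(
  scaleEventTimestamps: Array<number>, // epoch millis of scale-up events
  windowMs: number = 15 * 60 * 1000,
  maxEventsPerWindow: number = 5,
  now: number = Date.now(),
): boolean {
  const recent: Array<number> = scaleEventTimestamps.filter(
    (t: number) => now - t <= windowMs,
  );
  return recent.length > maxEventsPerWindow;
}
```

In practice the event timestamps would come from ingested `SuccessfulRescale` Kubernetes events or from deltas in `k8s.hpa.current_replicas`.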
### 2.4 Namespace Resource Quota Monitoring
**Current**: No quota tracking.
**Target**: Track resource quota usage per namespace and alert on approaching limits.
**Implementation**:
- Ingest quota metrics from `k8s_cluster` receiver:
  - `k8s.resource_quota.hard_limit` and `k8s.resource_quota.used` (the receiver emits these two metrics per quota, with a `resource` attribute distinguishing `cpu`, `memory`, `pods`, etc.)
- Namespace detail page showing quota utilization gauges
- Alert when any quota usage exceeds 80% (configurable threshold)
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/NamespaceDetail.tsx` (new)
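The quota-alert check itself is a simple ratio against the configurable threshold — a minimal sketch:

```typescript
// Fires when used/hard crosses the threshold (default 80%, per the plan).
function quotaBreached(
  used: number,
  hard: number,
  threshold: number = 0.8,
): boolean {
  if (hard <= 0) {
    return false; // no quota set for this resource, nothing to alert on
  }
  return used / hard >= threshold;
}
```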
---
## Phase 3: Advanced Observability (P2) — Correlation & Deep Visibility
### 3.1 Kubernetes Log Collection
**Current**: Users can manually configure Fluentd to send logs. No built-in K8s log collection.
**Target**: Automated pod log collection via OTel Collector with K8s metadata enrichment.
**Implementation**:
- Add `filelog` receiver to OTel Collector for collecting container logs from `/var/log/pods/`:
- Parse container runtime log format (Docker JSON, CRI)
- Extract pod name, namespace, container name from file path
- Enrich with K8s metadata via `k8sattributes` processor
- Deploy OTel Collector as a DaemonSet (in addition to existing Deployment) for log collection
- Helm values to configure:
- Namespace inclusion/exclusion filters
- Log level filtering (e.g., only collect WARN and above)
- Container name exclusion patterns
- Link pod logs in the Kubernetes pod detail page
**Files to modify**:
- `HelmChart/Public/oneuptime/templates/otel-collector-daemonset.yaml` (new - DaemonSet for log collection)
- `OTelCollector/otel-collector-daemonset-config.template.yaml` (new - DaemonSet-specific config with filelog receiver)
- `HelmChart/Public/oneuptime/values.yaml` (add DaemonSet configuration options)
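A sketch of the DaemonSet collector config, assuming the standard OTel Collector Contrib components; the exclude pattern and exporter name are illustrative and would be templated from Helm values:

```yaml
# Illustrative fragment for otel-collector-daemonset-config.template.yaml.
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # Don't ingest the collector's own logs.
      - /var/log/pods/*/otel-collector/*.log
    operators:
      # The container operator parses both Docker JSON and CRI log formats
      # and extracts pod, namespace, and container name from the file path.
      - type: container
processors:
  k8sattributes: {}
exporters:
  otlphttp:
    endpoint: ${ONEUPTIME_OTLP_ENDPOINT}
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes]
      exporters: [otlphttp]
```

Namespace filters and log-level filtering from the Helm values would be layered in as additional `filelog` operators or a `filter` processor.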
### 3.2 Multi-Cluster Support
**Current**: Single-cluster assumption.
**Target**: Monitor multiple Kubernetes clusters from a single OneUptime project.
**Implementation**:
- Add `cluster` attribute to all K8s metrics via OTel Collector resource processor
- Cluster registration: each cluster gets a unique name and OneUptime API key
- Helm install per cluster with cluster-specific configuration
- Cluster selector in the K8s monitoring UI (template variable)
- Cross-cluster comparison views (e.g., resource utilization across clusters)
- Unified alerting: same alert rules applied across all clusters or cluster-specific
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add resource processor with cluster name)
- `HelmChart/Public/oneuptime/values.yaml` (add clusterName config)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Clusters.tsx` (new - multi-cluster view)
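The cluster attribute is a one-block addition with the standard `resource` processor — `${CLUSTER_NAME}` here stands in for whatever templating the Helm chart's proposed `clusterName` value would use:

```yaml
# Illustrative fragment: stamp every signal with the cluster name.
processors:
  resource/cluster:
    attributes:
      - key: k8s.cluster.name
        value: ${CLUSTER_NAME}
        action: upsert
```

Using the semantic-convention key `k8s.cluster.name` keeps the attribute queryable alongside the other `k8s.*` attributes across metrics, logs, and traces.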
### 3.3 Service Mesh Observability
**Current**: No service mesh integration.
**Target**: Ingest and visualize service mesh metrics from Istio, Linkerd, or similar.
**Implementation**:
- Add Prometheus receiver to OTel Collector for scraping service mesh metrics:
- Istio: `istio_requests_total`, `istio_request_duration_milliseconds`, `istio_tcp_connections_opened_total`
- Linkerd: `request_total`, `response_latency_ms`
- Service-to-service traffic map from mesh metrics
- mTLS status visibility
- Circuit breaker and retry metrics
- Dashboard templates for Istio and Linkerd
**Files to modify**:
- `OTelCollector/otel-collector-config.template.yaml` (add prometheus receiver for mesh metrics)
- `Common/Types/Dashboard/Templates/ServiceMesh.ts` (new - mesh dashboard templates)
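A sketch of the Istio scrape job — the port name `http-envoy-prom` follows Istio's default sidecar metrics port, but would need adjusting for nonstandard mesh installs:

```yaml
# Illustrative fragment: scrape Istio sidecar metrics via the
# prometheus receiver (which embeds full Prometheus scrape config).
receivers:
  prometheus/istio:
    config:
      scrape_configs:
        - job_name: istio-mesh
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods exposing the Envoy sidecar metrics port.
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: http-envoy-prom
```

A parallel `prometheus/linkerd` job would scrape the Linkerd proxy's admin port instead.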
### 3.4 Kubernetes-to-Telemetry Correlation
**Current**: K8s resources and telemetry (metrics, logs, traces) are separate.
**Target**: Click on any K8s resource to see correlated telemetry.
**Implementation**:
- From any pod/deployment/service page, show:
- **Metrics**: CPU, memory, network filtered to that resource
- **Logs**: Logs from containers in that pod, filtered by K8s metadata attributes
- **Traces**: Traces originating from or passing through that service
- **Events**: Kubernetes events for that resource
- Use `k8sattributes` processor enrichment to correlate:
- `k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name` across all signals
- Deep link from incident timeline to K8s resource view
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/PodDetail.tsx` (add telemetry correlation tabs)
- `App/FeatureSet/Dashboard/src/Components/Kubernetes/ResourceTelemetryPanel.tsx` (new - reusable correlation panel)
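The correlation keys come straight from the `k8sattributes` processor's metadata extraction — a minimal fragment showing the attributes the panel would filter on:

```yaml
# Illustrative fragment: extract the attributes used to join metrics,
# logs, and traces back to a specific pod or workload.
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
```

With these present on every signal, the `ResourceTelemetryPanel` reduces to one filtered query per tab (e.g. `k8s.pod.name = X AND k8s.namespace.name = Y`).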
---
## Phase 4: Intelligence & Differentiation (P3) — Long-Term
### 4.1 Kubernetes Cost Attribution
- Track CPU and memory usage per namespace, workload, and label
- Calculate cost based on node instance pricing (configurable per cluster)
- Show cost trends over time, cost per team/project (via labels)
- Identify idle resources (requested but unused capacity)
- Recommendations: right-size requests/limits based on actual usage
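The cost calculation itself is back-of-envelope — a sketch, where the per-core and per-GB prices are the configurable per-cluster inputs mentioned above and the numbers are illustrative:

```typescript
// Illustrative cost-attribution math; prices and usage are examples,
// not real cloud pricing.
interface WorkloadUsage {
  namespace: string;
  cpuCoreHours: number;
  memoryGbHours: number;
}

function attributeCost(
  usage: WorkloadUsage,
  pricePerCoreHour: number,
  pricePerGbHour: number,
): number {
  return (
    usage.cpuCoreHours * pricePerCoreHour +
    usage.memoryGbHours * pricePerGbHour
  );
}
```

Idle-resource detection would compare the same usage numbers against requested capacity (request minus actual usage, priced the same way).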
### 4.2 Network Policy Monitoring
- Visualize network policies and their effect on pod communication
- Alert on denied network connections
- Integration with Cilium Hubble or Calico for deep network visibility
- Service dependency map derived from actual network traffic
### 4.3 eBPF-Based Deep Observability
- Kernel-level visibility without application instrumentation
- Automatic service discovery and dependency mapping
- DNS monitoring and latency
- TCP connection tracking and retransmit analysis
- Integration with tools like Tetragon, Pixie, or custom eBPF probes
### 4.4 Kubernetes Compliance and Security Monitoring
- Pod security standards compliance tracking
- RBAC audit logging and visualization
- Image vulnerability scanning status
- Network policy coverage analysis
- CIS Kubernetes Benchmark compliance scoring
### 4.5 GitOps Integration
- Track ArgoCD/Flux deployments as annotations on metric charts
- Correlate deployment events with performance changes
- Show deployment history per workload with rollback status
- Alert when deployment sync fails or drift is detected
---
## Quick Wins (Can Ship This Week)
1. **Enable Kubernetes MonitorType** - Uncomment the Kubernetes entry in `getAllMonitorTypeProps()` and wire it to existing telemetry monitors
2. **Add k8sattributes processor** - Enrich all existing OTLP data with K8s metadata for free
3. **Kubernetes dashboard template** - Create a basic cluster health dashboard using standard OTel K8s metric names
4. **K8s event alerting** - Use existing log monitors to alert on K8s Warning events once event ingestion is configured
5. **Document OTel Collector K8s setup** - Guide for users to configure their own OTel Collector with K8s receivers pointing to OneUptime
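For quick win 5, the documented user-side config could be as small as the fragment below. The endpoint URL and token header name are assumptions for illustration — they must be confirmed against OneUptime's actual telemetry ingestion docs before publishing:

```yaml
# Illustrative user-side collector config; endpoint and header name
# are placeholders, not confirmed OneUptime ingestion values.
receivers:
  k8s_cluster: {}
processors:
  k8sattributes: {}
exporters:
  otlphttp:
    endpoint: https://oneuptime.example.com/otlp
    headers:
      x-oneuptime-token: ${ONEUPTIME_TELEMETRY_TOKEN}
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster]
      processors: [k8sattributes]
      exporters: [otlphttp]
```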
---
## Recommended Implementation Order
1. **Quick Wins** - Enable MonitorType, k8sattributes processor, documentation
2. **Phase 1.1** - OTel Collector K8s receivers (prerequisite for everything else)
3. **Phase 1.5** - Kubernetes event ingestion (high value, uses existing log infrastructure)
4. **Phase 1.2** - Cluster overview dashboard template
5. **Phase 1.3** - Pod and container resource metrics pages
6. **Phase 1.4** - Node health monitoring pages
7. **Phase 2.1** - K8s-aware alert templates (makes monitoring actionable)
8. **Phase 2.2** - Resource inventory pages
9. **Phase 2.4** - Namespace quota monitoring
10. **Phase 2.3** - HPA/VPA monitoring
11. **Phase 3.1** - K8s log collection via DaemonSet
12. **Phase 3.4** - K8s-to-telemetry correlation
13. **Phase 3.2** - Multi-cluster support
14. **Phase 3.3** - Service mesh observability
15. **Phase 4.x** - Cost attribution, network policies, eBPF, compliance, GitOps
## Verification
For each feature:
1. Unit tests for new K8s metric query builders, resource models, and alert template logic
2. Integration tests for OTel Collector K8s receivers (use minikube or kind in CI)
3. Manual verification on a test cluster (minikube/kind) with representative workloads
4. Verify K8s metadata enrichment via `k8sattributes` processor across metrics, logs, and traces
5. Check ClickHouse query performance for K8s-specific queries (namespace filtering, resource correlation)
6. Load test with realistic cluster sizes (100+ nodes, 1000+ pods) to validate metric volume handling
7. Verify RBAC permissions are minimal (principle of least privilege for ClusterRole)
8. Test Helm chart upgrades to ensure K8s monitoring can be enabled without disruption