feat: enhance Kubernetes monitoring plan with comprehensive infrastructure details and implementation phases

Nawaz Dhandala
2026-03-24 18:44:57 +00:00
parent 24db673926
commit e12e3cfc08

@@ -2,7 +2,7 @@
## Context
OneUptime has foundational infrastructure for Kubernetes monitoring: OTLP ingestion (HTTP and gRPC), ClickHouse metric/log/trace storage, telemetry-based monitors (Metrics, Logs, Traces), and a Helm chart for deploying OneUptime itself on Kubernetes. A `Kubernetes` monitor type exists in the `MonitorType` enum but is currently disabled and has no implementation. The OpenTelemetry Collector config supports OTLP receivers but has no Kubernetes-specific receivers (kubelet, kube-state-metrics, Prometheus). Server monitoring exists but is limited to basic VM-level checks.
OneUptime has comprehensive Kubernetes monitoring infrastructure: OTLP ingestion (HTTP and gRPC), ClickHouse metric/log/trace storage, telemetry-based monitors (Metrics, Logs, Traces, Kubernetes), a kubernetes-agent Helm chart with kubeletstats/k8s_cluster/k8sobjects/prometheus receivers, 40+ dashboard pages for cluster/workload/node/pod/event exploration, HPA/VPA monitoring, service mesh observability, and 12 pre-built alert templates. The `MonitorType.Kubernetes` is fully enabled with monitor creation forms, metric catalog, and template picker.
This plan proposes a phased implementation to deliver first-class Kubernetes monitoring — from cluster health and workload observability to intelligent alerting — leveraging OneUptime's all-in-one observability platform (metrics, logs, traces, incidents, status pages).
@@ -13,7 +13,7 @@ This plan proposes a phased implementation to deliver first-class Kubernetes mon
- **Telemetry-Based Monitors** - Metric, Log, Trace, and Exception monitors with configurable criteria
- **Helm Chart** - OneUptime deploys on Kubernetes with KEDA auto-scaling support
- **OpenTelemetry Collector** - Deployed via Helm, accepts OTLP on ports 4317/4318
- **MonitorType.Kubernetes** - Enum value defined (but disabled and unimplemented)
- **MonitorType.Kubernetes** - Enum value defined, fully enabled and implemented
- **Phase 1.1: kubernetes-agent Helm Chart** - Full chart with kubeletstats, k8s_cluster, k8sobjects receivers, DaemonSet for logs, RBAC, configmaps, secret management
- **Phase 1.2: KubernetesCluster Database Model & Auto-Discovery** - Full model with provider detection, otelCollectorStatus, cached counts; auto-discovery via `findOrCreateByClusterIdentifier()`; disconnect detection after 5 min
- **Phase 1.3: Kubernetes Observability Product (Dashboard/Routes)** - Full sidebar navigation, 40+ routes, cluster list with onboarding guide, breadcrumbs
@@ -27,6 +27,10 @@ This plan proposes a phased implementation to deliver first-class Kubernetes mon
- **Phase 3.1: Kubernetes Log Collection** - DaemonSet with filelog receiver in kubernetes-agent Helm chart, logs tab on pod detail pages
- **Phase 3.3: Kubernetes-to-Telemetry Correlation** - Metrics, Logs, Events, YAML, and Containers tabs on all resource detail pages
- **Phase 4.10: Live YAML / Resource Inspection** - YAML tab component showing live resource specs
- **Phase 2.1: Enable MonitorType.Kubernetes** - Kubernetes enabled in `getAllMonitorTypeProps()`, `getActiveMonitorTypes()`, and `getMonitorTypeCategories()`; full monitor creation form with cluster/resource selection; `monitorKubernetes()` implemented in TelemetryMonitor worker
- **Phase 2.2: Kubernetes-Aware Alert Templates** - 12 alert templates (CrashLoopBackOff, Pod Pending, Node NotReady, High CPU/Memory/Disk, Replica Mismatch, Job Failures, etcd No Leader, API Server Throttling, Scheduler Backlog, DaemonSet Unavailable) with template picker UI and metric catalog
- **Phase 2.3: HPA and VPA Monitoring** - HPA and VPA list and detail pages, Helm chart config for HPA/VPA metrics collection via k8s_cluster receiver, metrics in KubernetesMetricCatalog
- **Phase 3.1: Service Mesh Observability** - Service mesh dashboard with Istio and Linkerd support, Prometheus scrape configs for service mesh metrics in kubernetes-agent Helm chart
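For orientation, the receiver set named above (kubeletstats, k8s_cluster, k8sobjects) typically takes the following shape in an OTel Collector config. This is an illustrative sketch only; the actual kubernetes-agent chart templates these values from chart settings:

```yaml
receivers:
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount
    # Each DaemonSet pod scrapes its own node's kubelet.
    endpoint: https://${env:K8S_NODE_NAME}:10250
    insecure_skip_verify: true
  k8s_cluster:
    collection_interval: 30s
  k8sobjects:
    objects:
      # Watch Kubernetes events for ingestion as logs.
      - name: events
        mode: watch
```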
---
@@ -171,19 +175,19 @@ There are two separate Helm charts. OneUptime's own OTel Collector is **not modi
| Feature | OneUptime | DataDog | New Relic | Grafana/Prometheus | Priority |
|---------|-----------|---------|-----------|-------------------|----------|
| K8s metric collection (kubelet, kube-state-metrics) | None | Agent auto-discovery | K8s integration | Prometheus + kube-state-metrics | **P0** |
| Cluster overview dashboard | None | Out-of-box | Pre-built | Pre-built via mixins | **P0** |
| Pod/Container resource metrics | None | Live Containers | K8s cluster explorer | cAdvisor + Grafana | **P0** |
| Node health monitoring | None | Host Map + agent | Infrastructure UI | node-exporter + Grafana | **P0** |
| Kubernetes event ingestion | None | Auto-collected | K8s events integration | Eventrouter/Exporter | **P0** |
| Workload health alerts (CrashLoopBackOff, OOMKilled, etc.) | None | Auto-monitors | Pre-built alerts | PrometheusRule CRDs | **P1** |
| Namespace/workload cost attribution | None | Container cost allocation | None | Kubecost integration | **P1** |
| K8s resource inventory (deployments, services, ingresses) | None | Orchestrator Explorer | Cluster explorer | None native | **P1** |
| HPA/VPA monitoring | None | Yes | Partial | Prometheus metrics | **P1** |
| Multi-cluster support | None | Yes | Yes | Thanos/Cortex | **P0** |
| K8s log collection (pod stdout/stderr) | Via Fluentd example | DaemonSet agent | Fluent Bit integration | Loki + Promtail | **P2** |
| Service mesh observability (Istio, Linkerd) | None | Yes | Yes | Partial | **P2** |
| Control plane monitoring (etcd, API server, scheduler, controller-manager) | None | Yes (Agent check) | K8s integration | Prometheus scrape + mixins | **P0** |
| K8s metric collection (kubelet, kube-state-metrics) | **Done** | Agent auto-discovery | K8s integration | Prometheus + kube-state-metrics | ~~P0~~ |
| Cluster overview dashboard | **Done** | Out-of-box | Pre-built | Pre-built via mixins | ~~P0~~ |
| Pod/Container resource metrics | **Done** | Live Containers | K8s cluster explorer | cAdvisor + Grafana | ~~P0~~ |
| Node health monitoring | **Done** | Host Map + agent | Infrastructure UI | node-exporter + Grafana | ~~P0~~ |
| Kubernetes event ingestion | **Done** | Auto-collected | K8s events integration | Eventrouter/Exporter | ~~P0~~ |
| Workload health alerts (CrashLoopBackOff, OOMKilled, etc.) | **Done** | Auto-monitors | Pre-built alerts | PrometheusRule CRDs | ~~P1~~ |
| Namespace/workload cost attribution | None | Container cost allocation | None | Kubecost integration | **P3** |
| K8s resource inventory (deployments, services, ingresses) | **Done** | Orchestrator Explorer | Cluster explorer | None native | ~~P1~~ |
| HPA/VPA monitoring | **Done** | Yes | Partial | Prometheus metrics | ~~P1~~ |
| Multi-cluster support | **Done** | Yes | Yes | Thanos/Cortex | ~~P0~~ |
| K8s log collection (pod stdout/stderr) | **Done** | DaemonSet agent | Fluent Bit integration | Loki + Promtail | ~~P2~~ |
| Service mesh observability (Istio, Linkerd) | **Done** | Yes | Yes | Partial | ~~P2~~ |
| Control plane monitoring (etcd, API server, scheduler, controller-manager) | **Done** | Yes (Agent check) | K8s integration | Prometheus scrape + mixins | ~~P0~~ |
| Network policy monitoring | None | NPM | None | Cilium Hubble | **P3** |
| eBPF-based deep observability | None | Universal Service Monitoring | Pixie | Cilium/Tetragon | **P3** |
@@ -191,108 +195,9 @@ There are two separate Helm charts. OneUptime's own OTel Collector is **not modi
---
## Phase 1: Foundation (P0) — COMPLETED
## Phases 1-3: Foundation, Alerting, and Advanced Observability — COMPLETED
All Phase 1 items have been implemented. See the Completed section above for details.
---
## Phase 2: Intelligent Alerting & Workload Health (P1) — Actionable Monitoring
### 2.1 Enable MonitorType.Kubernetes
**Current**: `MonitorType.Kubernetes` exists in the enum but is disabled.
**Target**: Enable it and wire it to cluster-scoped monitoring.
**Implementation**:
- Enable Kubernetes in `getAllMonitorTypeProps()` and `getActiveMonitorTypes()`
- Kubernetes monitor creation flow:
1. Select cluster (from auto-discovered clusters)
2. Select resource scope: Cluster / Namespace / Workload / Node / Pod
3. Configure conditions (metric thresholds, event patterns, state changes)
- Monitor evaluation: query ClickHouse for K8s metrics scoped to the selected cluster and resource
- Link monitors to the Kubernetes product pages (click a pod → see its monitors)
**Files to modify**:
- `Common/Types/Monitor/MonitorType.ts` (enable Kubernetes type)
- `App/FeatureSet/Dashboard/src/Pages/Monitor/MonitorCreate.tsx` (add K8s cluster/resource selection)
- `Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts` (support K8s-specific criteria evaluation)
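The cluster/resource scoping step above can be sketched as follows. This is a minimal illustration, assuming K8s metrics are tagged with OTel semantic-convention attributes at ingestion; `buildK8sMetricFilter` and its types are hypothetical names, not OneUptime's actual API:

```typescript
type ResourceScope = "Cluster" | "Namespace" | "Workload" | "Node" | "Pod";

interface KubernetesMonitorScope {
  clusterName: string;
  scope: ResourceScope;
  resourceName?: string; // required for scopes narrower than Cluster
}

function buildK8sMetricFilter(monitor: KubernetesMonitorScope): string {
  // All K8s metrics carry the cluster identifier as a resource attribute.
  const clauses: Array<string> = [
    `attributes['k8s.cluster.name'] = '${monitor.clusterName}'`,
  ];

  // Narrow by the selected resource scope using OTel semantic-convention keys.
  const scopeKey: Record<Exclude<ResourceScope, "Cluster">, string> = {
    Namespace: "k8s.namespace.name",
    Workload: "k8s.deployment.name",
    Node: "k8s.node.name",
    Pod: "k8s.pod.name",
  };

  if (monitor.scope !== "Cluster" && monitor.resourceName) {
    clauses.push(
      `attributes['${scopeKey[monitor.scope]}'] = '${monitor.resourceName}'`,
    );
  }

  return clauses.join(" AND ");
}
```

The resulting string would be appended to the `WHERE` clause of the ClickHouse metric query the telemetry monitor already issues.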
### 2.2 Kubernetes-Aware Alert Templates
**Current**: Generic metric threshold alerts only. Users must manually configure alerts for K8s failure modes.
**Target**: Pre-built alert templates for common Kubernetes failure patterns.
**Implementation**:
- Create alert templates for critical K8s conditions:
- **CrashLoopBackOff**: Alert when `k8s.container.restarts` increases rapidly (> N restarts in M minutes)
- **OOMKilled**: Alert on container termination reason = OOMKilled
- **Pod Pending**: Alert when pods remain in Pending phase for > N minutes
- **Node NotReady**: Alert when node condition transitions to NotReady
- **High Resource Utilization**: Alert when node CPU > 90% or memory > 85% sustained
- **Deployment Replica Mismatch**: Alert when available replicas < desired replicas for > N minutes
- **PVC Disk Full**: Alert when PV usage > 90% capacity
- **Failed Scheduling**: Alert on repeated FailedScheduling events
- **Image Pull Failures**: Alert on ErrImagePull/ImagePullBackOff events
- **Job/CronJob Failures**: Alert when job completion fails
- **etcd No Leader**: `etcd_server_has_leader == 0` — critical, immediate page
- **etcd Frequent Leader Elections**: rate of `etcd_server_leader_changes_seen_total` > 3/hour — warning
- **etcd High WAL Fsync Latency**: p99 > 100ms — warning; > 500ms — critical
- **etcd DB Size Near Quota**: > 80% of quota — warning; > 90% — critical
- **API Server Throttling**: `apiserver_dropped_requests_total` rate > 0 — critical
- **API Server Latency**: p99 > 1s for non-WATCH verbs — warning
- **Scheduler Backlog**: `scheduler_pending_pods` > 0 for > 5 minutes — warning
- One-click enable for each alert template during K8s monitoring setup
- Auto-route alerts to the OneUptime incident management system
**Files to modify**:
- `Common/Types/Monitor/Templates/KubernetesAlertTemplates.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/AlertSetup.tsx` (new - guided alert configuration)
- `Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts` (support K8s-specific criteria evaluation)
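A template in `KubernetesAlertTemplates.ts` might carry a shape like the sketch below. The interface and field names are assumptions for illustration, not the file's actual contents:

```typescript
type Severity = "warning" | "critical";

interface KubernetesAlertTemplate {
  name: string;
  metricName: string;
  condition: "GreaterThan" | "LessThan" | "EqualTo";
  threshold: number;
  forMinutes: number; // how long the condition must hold before firing
  severity: Severity;
}

const crashLoopBackOff: KubernetesAlertTemplate = {
  name: "CrashLoopBackOff",
  // Fires when container restarts climb rapidly within the window.
  metricName: "k8s.container.restarts",
  condition: "GreaterThan",
  threshold: 3,
  forMinutes: 5,
  severity: "critical",
};

const etcdNoLeader: KubernetesAlertTemplate = {
  name: "etcd No Leader",
  metricName: "etcd_server_has_leader",
  condition: "EqualTo",
  threshold: 0,
  forMinutes: 0, // immediate page
  severity: "critical",
};
```

One-click enablement then reduces to instantiating a monitor from the chosen template, pre-filled with these defaults.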
### 2.3 HPA and VPA Monitoring (was 2.4)
**Current**: No autoscaler visibility.
**Target**: Track HPA/VPA behavior and scaling events.
**Implementation**:
- Ingest HPA metrics from `k8s_cluster` receiver:
- `k8s.hpa.current_replicas`, `k8s.hpa.desired_replicas`, `k8s.hpa.min_replicas`, `k8s.hpa.max_replicas`
- Target metric values vs actual
- HPA overview page within cluster:
- List all HPAs with current/desired/min/max replicas
- Time-series chart showing scaling events overlaid with the target metric
- Alert when HPA is at max replicas sustained (capacity ceiling)
- Alert when scale-up frequency is abnormally high (thrashing)
**Files to modify**:
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Autoscaling.tsx` (new)
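The two alert conditions above (capacity ceiling and thrashing) can be evaluated over a window of HPA metric samples. A hedged sketch, with illustrative function names:

```typescript
interface HpaSample {
  timestampMs: number;
  currentReplicas: number; // k8s.hpa.current_replicas
  maxReplicas: number; // k8s.hpa.max_replicas
}

// Capacity ceiling: every sample in the window sits at maxReplicas.
function isAtCapacityCeiling(samples: HpaSample[]): boolean {
  return (
    samples.length > 0 &&
    samples.every((s) => s.currentReplicas >= s.maxReplicas)
  );
}

// Thrashing signal: count scale-up events (replica increases) in the window,
// to be compared against a per-monitor frequency threshold.
function scaleUpCount(samples: HpaSample[]): number {
  let count = 0;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i].currentReplicas > samples[i - 1].currentReplicas) {
      count++;
    }
  }
  return count;
}
```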
---
## Phase 3: Advanced Observability (P2) — Correlation & Deep Visibility
### 3.1 Service Mesh Observability
**Current**: No service mesh integration.
**Target**: Ingest and visualize service mesh metrics from Istio, Linkerd, or similar.
**Implementation**:
- Add Prometheus receiver to the `kubernetes-agent` OTel Collector config for scraping service mesh metrics:
- Istio: `istio_requests_total`, `istio_request_duration_milliseconds`, `istio_tcp_connections_opened_total`
- Linkerd: `request_total`, `response_latency_ms`
- Service-to-service traffic map from mesh metrics
- mTLS status visibility
- Circuit breaker and retry metrics
- Dashboard templates for Istio and Linkerd
**Files to modify**:
- `HelmChart/Public/kubernetes-agent/templates/configmap-deployment.yaml` (add prometheus receiver for mesh metrics)
- `Common/Types/Dashboard/Templates/ServiceMesh.ts` (new - mesh dashboard templates)
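The Prometheus receiver addition might look roughly like the fragment below, assuming the conventional Istio sidecar metrics endpoint (Envoy on port 15090 at `/stats/prometheus`). This is an illustrative sketch, not the chart's actual configmap:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: istio-proxies
          metrics_path: /stats/prometheus
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Scrape only pods that run an Istio sidecar.
            - source_labels: [__meta_kubernetes_pod_container_name]
              action: keep
              regex: istio-proxy
            # Point the scrape at the sidecar's Envoy metrics port.
            # Note: in OTel Collector configs, `$` must be escaped as `$$`.
            - source_labels: [__meta_kubernetes_pod_ip]
              action: replace
              regex: (.+)
              replacement: $$1:15090
              target_label: __address__
```

A parallel scrape job against the Linkerd proxy's metrics port would cover `request_total` and `response_latency_ms`.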
All Phase 1 (P0), Phase 2 (P1), and Phase 3 (P2) items have been implemented. See the Completed section above for details.
---
@@ -493,19 +398,9 @@ The roadmap achieves **feature parity** on core K8s monitoring by P2. This is ne
---
## Quick Wins (Can Ship This Week)
1. **Enable Kubernetes MonitorType** - Uncomment the Kubernetes entry in `getAllMonitorTypeProps()` and wire it to existing telemetry monitors
---
## Recommended Implementation Order
1. **Quick Win** - Enable MonitorType.Kubernetes (makes K8s data actionable)
2. **Phase 2.2** - K8s-aware alert templates, including etcd/control plane alerts
3. **Phase 2.3** - HPA/VPA monitoring
4. **Phase 3.1** - Service mesh observability
5. **Phase 4.x** - Cost attribution, network policies, eBPF, compliance, GitOps, AI RCA, incident automation, status page automation, topology map, managed provider integrations, deployment tracking
1. **Phase 4.x** - Cost attribution, network policies, eBPF, compliance, GitOps, AI RCA, incident automation, status page automation, topology map, managed provider integrations, deployment tracking
## Verification