mirror of https://github.com/OneUptime/oneuptime.git
synced 2026-04-06 00:32:12 +02:00

feat: add plans for Docker and Kubernetes monitoring implementation

New file: Internal/Roadmap/DockerContainerMonitoring.md (400 lines)
# Plan: Docker Container Monitoring for OneUptime

## Context

OneUptime's infrastructure monitoring currently supports Server/VM monitoring via the InfrastructureAgent (a Go-based agent that collects CPU, memory, disk, and process metrics) and has a commented-out Kubernetes monitor type. There is **no Docker container monitoring** today. Users running containerized workloads — whether on bare-metal Docker hosts, Docker Compose, or Docker Swarm — have no visibility into container-level health, resource consumption, or lifecycle events.

Docker container monitoring is a critical gap: Docker remains the dominant container runtime, and many users run workloads on Docker without Kubernetes. Competitors (Datadog, New Relic, Grafana Cloud) all provide first-class Docker monitoring. This plan proposes a phased implementation to close this gap.

## Gap Analysis Summary

| Feature | OneUptime | Datadog | New Relic | Priority |
|---------|-----------|---------|-----------|----------|
| Container discovery & inventory | None | Auto-discovery via agent | Auto-discovery via infra agent | **P0** |
| Container CPU/memory/network/disk metrics | None | Full metrics via cgroups | Full metrics via cgroups | **P0** |
| Container lifecycle events (start/stop/restart/OOM) | None | Event stream + alerts | Event stream + alerts | **P0** |
| Container health status monitoring | None | Health check integration | Health check integration | **P0** |
| Container log collection | None (generic OTLP only) | Auto-collected per container | Auto-collected per container | **P1** |
| Docker Compose service grouping | None | Auto-detection via labels | Label-based grouping | **P1** |
| Container image vulnerability scanning | None | Integrated via Snyk/Trivy | None | **P2** |
| Docker Swarm service monitoring | None | Full Swarm support | Limited | **P2** |
| Container resource limit alerts | None | OOM/throttle alerts | Threshold alerts | **P1** |
| Container networking (inter-container traffic) | None | Network map + flow data | Limited | **P2** |
| Live container exec / inspect | None | None | None | **P3** |

---

## Phase 1: Foundation (P0) — Container Discovery & Core Metrics

These are table-stakes features required for any Docker monitoring product.
### 1.1 Docker Monitor Type

**Current**: No Docker monitor type exists. Kubernetes is defined but commented out.

**Target**: Add a `Docker` monitor type with full UI integration.

**Implementation**:

- Add `Docker = "Docker"` to the `MonitorType` enum
- Add Docker to the "Infrastructure" monitor type category alongside Server and SNMP
- Add monitor type props (title: "Docker Container", description, icon: `IconProp.Cube`)
- Create `DockerMonitorResponse` interface for container metric reporting
- Add Docker to `getActiveMonitorTypes()` and relevant helper methods

**Files to modify**:

- `Common/Types/Monitor/MonitorType.ts` (add enum value, category, props)
- `Common/Types/Monitor/DockerMonitor/DockerMonitorResponse.ts` (new)
- `Common/Types/Monitor/DockerMonitor/DockerContainerMetrics.ts` (new)

### 1.2 Container Metrics Collection in InfrastructureAgent

**Current**: The Go-based InfrastructureAgent collects host-level CPU, memory, disk, and process metrics.

**Target**: Extend the agent to discover and collect metrics from all running Docker containers on the host.

**Implementation**:

- Add a Docker collector module to the InfrastructureAgent that uses the Docker Engine API (via `/var/run/docker.sock` or a configurable endpoint)
- Discover all running containers via `GET /containers/json`
- For each container, collect metrics via `GET /containers/{id}/stats?stream=false`:
  - **CPU**: `cpu_stats.cpu_usage.total_usage`, `cpu_stats.system_cpu_usage`, per-core usage, throttled periods/time
  - **Memory**: `memory_stats.usage`, `memory_stats.limit`, `memory_stats.stats.cache`, RSS, swap, working set, OOM kill count
  - **Network**: `networks.*.rx_bytes`, `tx_bytes`, `rx_packets`, `tx_packets`, `rx_errors`, `tx_errors`, `rx_dropped`, `tx_dropped` (per interface)
  - **Block I/O**: `blkio_stats.io_service_bytes_recursive` (read/write bytes), `io_serviced_recursive` (read/write ops)
  - **PIDs**: `pids_stats.current`, `pids_stats.limit`
- Collect container metadata: name, image, image ID, labels, created time, status, health check status, restart count, ports, mounts, environment (filtered for sensitive values)
- Report interval: configurable, default 30 seconds (matching the existing server monitor interval)
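A note on the CPU numbers: the stats endpoint returns cumulative nanosecond counters, so the agent must derive a percentage from the delta between two samples, the same way `docker stats` does. A minimal Go sketch of that calculation (the struct and function names are illustrative, not existing agent code):

```go
package main

import "fmt"

// DockerCPUStats holds the fields of cpu_stats / precpu_stats from
// GET /containers/{id}/stats that matter for the percentage calculation.
type DockerCPUStats struct {
	TotalUsage     uint64 // cpu_stats.cpu_usage.total_usage (ns)
	SystemCPUUsage uint64 // cpu_stats.system_cpu_usage (ns)
	OnlineCPUs     uint64 // cpu_stats.online_cpus
}

// CPUPercent computes container CPU usage from two consecutive samples:
// the container's consumed CPU time divided by the host's, scaled by the
// number of online CPUs.
func CPUPercent(prev, cur DockerCPUStats) float64 {
	cpuDelta := float64(cur.TotalUsage) - float64(prev.TotalUsage)
	sysDelta := float64(cur.SystemCPUUsage) - float64(prev.SystemCPUUsage)
	if cpuDelta <= 0 || sysDelta <= 0 {
		return 0 // first sample or counter reset: report nothing
	}
	return (cpuDelta / sysDelta) * float64(cur.OnlineCPUs) * 100.0
}

func main() {
	prev := DockerCPUStats{TotalUsage: 1_000_000, SystemCPUUsage: 100_000_000, OnlineCPUs: 4}
	cur := DockerCPUStats{TotalUsage: 2_000_000, SystemCPUUsage: 200_000_000, OnlineCPUs: 4}
	fmt.Printf("%.1f%%\n", CPUPercent(prev, cur)) // (1e6 / 1e8) * 4 * 100 = 4.0%
}
```

With `stream=false` the endpoint conveniently returns both `cpu_stats` and `precpu_stats` in one response, so the agent does not need to keep the previous sample itself.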
**Files to modify**:

- `InfrastructureAgent/collector/docker.go` (new - Docker metrics collector)
- `InfrastructureAgent/model/docker.go` (new - Docker metric data models)
- `InfrastructureAgent/agent.go` (add Docker collection to the main loop)
- `InfrastructureAgent/config.go` (add Docker-related configuration: socket path, collection enabled/disabled)

### 1.3 Container Inventory & Discovery

**Current**: No container awareness.

**Target**: Auto-discover containers on monitored hosts and maintain a live inventory.

**Implementation**:

- Create a `DockerContainer` PostgreSQL model to store discovered containers:
  - `containerId` (Docker container ID)
  - `containerName`
  - `imageName`, `imageId`, `imageTag`
  - `status` (running, paused, stopped, restarting, dead, created)
  - `healthStatus` (healthy, unhealthy, starting, none)
  - `labels` (JSON)
  - `createdAt` (container creation time)
  - `startedAt`
  - `hostMonitorId` (reference to the Server monitor for the host)
  - `projectId`
  - `restartCount`
  - `ports` (JSON - exposed ports mapping)
  - `mounts` (JSON - volume mounts)
  - `cpuLimit`, `memoryLimit` (resource constraints)
- On each agent report, upsert container records (create new, update existing, mark removed containers as stopped)
- Container inventory page in the dashboard showing all containers across all monitored hosts
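The upsert step can be sketched as a pure reconciliation of the stored inventory against the latest agent report; the function below is illustrative, not the actual `DockerContainerService` API:

```go
package main

import "fmt"

// reconcile marks previously known containers that are absent from the
// latest agent report as "stopped", and upserts reported ones as "running".
// A sketch of the inventory update described above; names are illustrative.
func reconcile(known map[string]string, reportedIDs []string) map[string]string {
	seen := make(map[string]bool, len(reportedIDs))
	next := make(map[string]string, len(known))
	for _, id := range reportedIDs {
		seen[id] = true
		next[id] = "running" // create new or update existing record
	}
	for id := range known {
		if !seen[id] {
			next[id] = "stopped" // present in a prior report, gone now
		}
	}
	return next
}

func main() {
	known := map[string]string{"abc": "running", "def": "running"}
	// "def" vanished from the report, "ghi" is newly discovered.
	fmt.Println(reconcile(known, []string{"abc", "ghi"}))
}
```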
**Files to modify**:

- `Common/Models/DatabaseModels/DockerContainer.ts` (new)
- `Common/Server/Services/DockerContainerService.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainers.tsx` (new - container list page)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (new - single container detail)

### 1.4 Container Lifecycle Events

**Current**: No container event tracking.

**Target**: Capture and surface container lifecycle events (start, stop, restart, OOM kill, health check failures).

**Implementation**:

- In the InfrastructureAgent, subscribe to Docker events via `GET /events?filters={"type":["container"]}` (long-poll/streaming)
- Capture events: `start`, `stop`, `die`, `kill`, `oom`, `restart`, `pause`, `unpause`, `health_status`
- Include exit code, OOM killed flag, and signal information for `die` events
- Report events to OneUptime alongside metric data
- Store events in the existing telemetry pipeline (as structured logs or a dedicated events table)
- Surface events as an overlay on container metric charts (vertical markers)
- Enable alerting on lifecycle events (e.g., alert on OOM kill, alert on restart count > N in time window)
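The `/events` endpoint streams newline-delimited JSON objects. A Go sketch of decoding the stream and filtering to the actions listed above (the struct shape follows the Docker Events API, where a `die` event's exit code arrives in `Actor.Attributes["exitCode"]`; `readEvents` is a hypothetical helper):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// dockerEvent mirrors the subset of the /events payload the agent cares about.
type dockerEvent struct {
	Action string `json:"Action"`
	Actor  struct {
		ID         string            `json:"ID"`
		Attributes map[string]string `json:"Attributes"`
	} `json:"Actor"`
}

// captured is the set of lifecycle actions worth reporting.
var captured = map[string]bool{
	"start": true, "stop": true, "die": true, "kill": true, "oom": true,
	"restart": true, "pause": true, "unpause": true, "health_status": true,
}

// readEvents decodes a stream of concatenated JSON event objects and keeps
// only the captured lifecycle actions.
func readEvents(stream string) []dockerEvent {
	dec := json.NewDecoder(strings.NewReader(stream))
	var out []dockerEvent
	for {
		var ev dockerEvent
		if err := dec.Decode(&ev); err != nil {
			break // io.EOF at end of stream
		}
		if captured[ev.Action] {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	stream := `{"Action":"die","Actor":{"ID":"abc","Attributes":{"exitCode":"137"}}}
{"Action":"exec_create","Actor":{"ID":"abc"}}`
	evs := readEvents(stream)
	fmt.Println(len(evs), evs[0].Actor.Attributes["exitCode"])
}
```

In the real collector the reader would be the long-lived HTTP response body rather than a string, with reconnect logic around daemon restarts.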
**Files to modify**:

- `InfrastructureAgent/collector/docker_events.go` (new - event listener)
- `Common/Types/Monitor/DockerMonitor/DockerContainerEvent.ts` (new)
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (new - process container reports, evaluate criteria)

---

## Phase 2: Alerting & Monitoring Rules (P0-P1) — Actionable Monitoring

### 2.1 Container Health Check Monitoring

**Current**: No health check awareness.

**Target**: Monitor Docker health check status and alert on unhealthy containers.

**Implementation**:

- Extract health check status from container inspect data (`State.Health.Status`, `State.Health.FailingStreak`, `State.Health.Log`)
- Add monitor criteria for health check status:
  - Alert when container health transitions to `unhealthy`
  - Alert when failing streak exceeds threshold
- Surface health check log output in alert details
- Add health status column to the container inventory table

**Files to modify**:

- `Common/Server/Utils/Monitor/Criteria/DockerContainerCriteria.ts` (new)
- `Common/Types/Monitor/CriteriaFilter.ts` (add Docker-specific filter types)

### 2.2 Container Resource Threshold Alerts

**Current**: Server monitor supports CPU/memory threshold alerts at the host level.

**Target**: Per-container resource threshold alerting with limit-aware thresholds.

**Implementation**:

- Add monitor criteria for container-level metrics:
  - CPU usage % (of container limit or host total)
  - Memory usage % (of container limit)
  - Memory usage absolute (approaching container limit)
  - Network error rate
  - Block I/O throughput
  - Restart count in time window
  - PID count approaching limit
- **Limit-aware alerting**: When a container has resource limits set, calculate usage as a percentage of the limit rather than host total
  - E.g., a container with a 2GB memory limit using 1.8GB = 90% (alert), not 1.8/64GB = 2.8% (misleading)
- Support compound criteria (e.g., CPU > 80% AND memory > 90% for 5 minutes)
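The limit-aware rule reduces to picking the right denominator. One wrinkle worth encoding: when no limit is configured, Docker reports `memory_stats.limit` as the host total, which the fallback below relies on (the function name is illustrative):

```go
package main

import "fmt"

// memoryPercent implements the limit-aware rule above: when the container
// has a real memory limit, usage is reported against that limit; otherwise
// (Docker reports limit == host total when unset) it falls back to host
// memory.
func memoryPercent(usageBytes, limitBytes, hostTotalBytes uint64) float64 {
	denom := hostTotalBytes
	if limitBytes > 0 && limitBytes < hostTotalBytes {
		denom = limitBytes // a genuine container limit is configured
	}
	if denom == 0 {
		return 0
	}
	return float64(usageBytes) / float64(denom) * 100.0
}

func main() {
	gib := uint64(1 << 30)
	// 1.8 GiB used with a 2 GiB limit on a 64 GiB host: ~90%, not ~2.8%.
	fmt.Printf("%.1f%%\n", memoryPercent(18*gib/10, 2*gib, 64*gib))
}
```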
**Files to modify**:

- `Common/Server/Utils/Monitor/Criteria/DockerContainerCriteria.ts` (extend)
- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (criteria evaluation)

### 2.3 Container Auto-Restart Detection

**Current**: No restart tracking.

**Target**: Detect and alert on container restart loops (CrashLoopBackOff equivalent for Docker).

**Implementation**:

- Track restart count per container over sliding time windows (5 min, 15 min, 1 hour)
- Alert when restart count exceeds configurable threshold
- Include container exit code and last log lines in the alert context
- Dashboard widget showing containers with highest restart frequency
**Files to modify**:

- `Worker/Jobs/Monitors/DockerContainerMonitor.ts` (add restart loop detection)

---

## Phase 3: Visualization & UX (P1) — Container Dashboard

### 3.1 Container Overview Dashboard

**Current**: No container UI.

**Target**: Dedicated container monitoring pages with rich visualizations.

**Implementation**:

- **Container List Page**: Table with columns for name, image, status, health, CPU%, memory%, network I/O, uptime, restart count. Sortable, filterable, searchable
- **Container Detail Page**: Single-container view with:
  - Header: container name, image, status badge, health badge, uptime
  - Metrics charts: CPU, memory, network, block I/O (time series, matching existing metric chart style)
  - Events timeline: lifecycle events overlaid on charts
  - Container metadata: labels, ports, mounts, environment variables (filtered), resource limits
  - Processes: top processes inside the container (if available via `docker top`)
  - Logs: recent container logs (linked to log management if available)
- **Host-Container Relationship**: From the existing Server monitor detail page, add a "Containers" tab showing all containers on that host

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainers.tsx` (new - list view)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (new - detail view)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerMetricsCharts.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerEventsTimeline.tsx` (new)
- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerMetadataPanel.tsx` (new)

### 3.2 Container Map / Topology View

**Current**: No topology visualization.

**Target**: Visual map showing containers, their host, and network relationships.

**Implementation**:

- Show containers grouped by host
- Color-code by status (green = healthy, yellow = warning, red = unhealthy/stopped)
- Show network links between containers on the same Docker network
- Click to drill into container detail
- Show Docker Compose project grouping via labels

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Components/Docker/ContainerTopology.tsx` (new)

---
## Phase 4: Container Log Collection (P1) — Unified Observability

### 4.1 Automatic Container Log Collection

**Current**: Log collection requires explicit OTLP/Fluentd/Syslog integration per application.

**Target**: Automatically collect logs from all Docker containers via the InfrastructureAgent.

**Implementation**:

- Add a log collector to the InfrastructureAgent using `GET /containers/{id}/logs?stdout=true&stderr=true&follow=true&tail=100`
- Automatically enrich logs with container metadata:
  - `container.id`, `container.name`, `container.image.name`, `container.image.tag`
  - Host information (hostname, OS)
  - Docker labels as log attributes
- Forward logs to OneUptime's telemetry ingestion endpoint (OTLP format)
- Configurable:
  - Enable/disable per container (via label `oneuptime.logs.enabled=true/false`)
  - Max log line size
  - Log rate limiting (to prevent noisy container flooding)
  - Include/exclude containers by name pattern or label selector
**Files to modify**:

- `InfrastructureAgent/collector/docker_logs.go` (new - log collector)
- `InfrastructureAgent/config.go` (add log collection config)

### 4.2 Container Log Correlation

**Current**: No automatic correlation between container logs and container metrics.

**Target**: Link container logs to container metrics and events for unified troubleshooting.

**Implementation**:

- Automatically tag container logs with `container.id` and `container.name` attributes
- In the container detail page, add a "Logs" tab that pre-filters the log viewer to the container's logs
- When viewing a metric anomaly or event, show a link to "View logs around this time"
- In the log detail view, show a link to "View container metrics" when the `container.id` attribute is present

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerContainerDetail.tsx` (add Logs tab)
- `App/FeatureSet/Dashboard/src/Components/Logs/LogDetailsPanel.tsx` (add container link)

---

## Phase 5: Docker Compose & Swarm Support (P1-P2) — Multi-Container Orchestration

### 5.1 Docker Compose Project Grouping

**Current**: Containers are flat, with no grouping.

**Target**: Automatically detect Docker Compose projects and group containers by service.

**Implementation**:

- Detect Compose projects via the standard labels:
  - `com.docker.compose.project` (project name)
  - `com.docker.compose.service` (service name)
  - `com.docker.compose.container-number` (replica number)
  - `com.docker.compose.oneoff` (one-off vs service container)
- Create a Compose project view showing:
  - Project name with list of services
  - Per-service status (all replicas healthy, degraded, down)
  - Per-service aggregated metrics (total CPU, memory across replicas)
  - Service dependency visualization (if depends_on info is available via labels)
- Alert at the service level (e.g., "all replicas of service X are down")
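Grouping on the Compose labels above is a straightforward bucketing step; a Go sketch with hypothetical helper names:

```go
package main

import "fmt"

// composeKey derives the project/service grouping from the standard Compose
// labels; containers without both labels are left ungrouped.
func composeKey(labels map[string]string) (project, service string, ok bool) {
	project = labels["com.docker.compose.project"]
	service = labels["com.docker.compose.service"]
	return project, service, project != "" && service != ""
}

// groupByService buckets container names under a "project/service" key,
// so replicas of one Compose service land in the same bucket.
func groupByService(containers map[string]map[string]string) map[string][]string {
	groups := map[string][]string{}
	for name, labels := range containers {
		if p, s, ok := composeKey(labels); ok {
			key := p + "/" + s
			groups[key] = append(groups[key], name)
		}
	}
	return groups
}

func main() {
	containers := map[string]map[string]string{
		"shop-web-1": {"com.docker.compose.project": "shop", "com.docker.compose.service": "web"},
		"shop-web-2": {"com.docker.compose.project": "shop", "com.docker.compose.service": "web"},
		"adhoc":      {}, // not a Compose container: stays ungrouped
	}
	fmt.Println(len(groupByService(containers)["shop/web"])) // two web replicas
}
```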
**Files to modify**:

- `Common/Models/DatabaseModels/DockerComposeProject.ts` (new)
- `Common/Server/Services/DockerComposeProjectService.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerComposeProjects.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerComposeProjectDetail.tsx` (new)

### 5.2 Docker Swarm Monitoring

**Current**: No Swarm support.

**Target**: Monitor Docker Swarm services, tasks, and nodes.

**Implementation**:

- Detect whether the Docker host is a Swarm manager node
- Collect Swarm-specific data:
  - Services: `GET /services` (desired/running replicas, update status)
  - Tasks: `GET /tasks` (task state, assigned node, error messages)
  - Nodes: `GET /nodes` (availability, status, resource capacity)
- Surface Swarm service health: desired replicas vs running replicas
- Alert when a service is degraded (running < desired) or when tasks fail
- Swarm-specific dashboard showing a cluster overview
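Service health from the replica counts reduces to a small classification; a Go sketch (the function name and exact state labels are illustrative):

```go
package main

import "fmt"

// swarmServiceHealth classifies a Swarm service from the desired/running
// replica counts surfaced by GET /services, per the rule above.
func swarmServiceHealth(desired, running uint64) string {
	switch {
	case desired > 0 && running == 0:
		return "down" // nothing running at all
	case running < desired:
		return "degraded" // some replicas missing
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(swarmServiceHealth(3, 3), swarmServiceHealth(3, 1), swarmServiceHealth(3, 0))
}
```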
**Files to modify**:

- `InfrastructureAgent/collector/docker_swarm.go` (new)
- `Common/Types/Monitor/DockerMonitor/DockerSwarmMetrics.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Infrastructure/DockerSwarm.tsx` (new)

---

## Phase 6: Advanced Features (P2-P3) — Differentiation

### 6.1 Container Image Analysis

**Current**: No image awareness beyond name/tag.

**Target**: Track image versions across containers and optionally scan for known vulnerabilities.

**Implementation**:

- Maintain an image registry (image name, tag, digest, size, creation date)
- Show which containers are running outdated images (when a newer tag is available locally)
- Optional: integrate with Trivy or Grype for vulnerability scanning of local images
- Dashboard showing image inventory, version distribution, and vulnerability summary

### 6.2 Container Resource Recommendations

**Current**: No resource guidance.

**Target**: Recommend CPU/memory limits based on observed usage patterns.

**Implementation**:

- Analyze historical container metrics (p95 CPU, p99 memory over 7 days)
- Compare actual usage to configured limits
- Flag over-provisioned containers (limit >> usage) and under-provisioned containers (usage approaching limit)
- Generate recommendations: "Container X uses max 256MB, but has a 4GB limit — consider reducing to 512MB"
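The recommendation logic can be sketched as a percentile computation plus a headroom check; the nearest-rank percentile and the 4x over-provisioning factor below are illustrative defaults, not OneUptime settings:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the pth percentile of samples using the nearest-rank
// method (no interpolation).
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...) // copy before sorting
	sort.Float64s(s)
	rank := int(p/100*float64(len(s))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(s) {
		rank = len(s) - 1
	}
	return s[rank]
}

// recommend flags containers whose limit is far above observed usage, or
// whose usage is approaching the limit. The 4x headroom factor and the
// "suggest 2x p99" rule are illustrative defaults.
func recommend(memSamplesMB []float64, limitMB float64) string {
	p99 := percentile(memSamplesMB, 99)
	if limitMB > 4*p99 {
		return fmt.Sprintf("over-provisioned: p99=%.0fMB, limit=%.0fMB, consider ~%.0fMB", p99, limitMB, 2*p99)
	}
	if p99 > 0.9*limitMB {
		return "under-provisioned: usage approaching limit"
	}
	return "ok"
}

func main() {
	samples := make([]float64, 100)
	for i := range samples {
		samples[i] = 200 + float64(i%10) // usage hovers around 200-209 MB
	}
	fmt.Println(recommend(samples, 4096)) // 4 GB limit vs ~209 MB p99
}
```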
### 6.3 Container Diff / Change Detection

**Current**: No change tracking.

**Target**: Detect when container configuration changes (image update, env var change, port mapping change).

**Implementation**:

- Store container configuration snapshots on each agent report
- Diff against the previous snapshot and generate change events
- Alert on unexpected configuration changes
- Show change history in the container detail page

---

## Quick Wins (Can Ship First)

1. **Add Docker monitor type** — Add the enum value, category, and props (no collection yet, but enables the UI scaffolding)
2. **Basic container discovery** — Extend InfrastructureAgent to list running containers and report names, images, status
3. **Container CPU/memory metrics** — Collect basic cgroup stats via the Docker stats API
4. **Container inventory page** — Simple table showing discovered containers across hosts

---
## Recommended Implementation Order

1. **Phase 1.1** — Docker monitor type (enum, UI scaffolding)
2. **Phase 1.2** — Container metrics collection in InfrastructureAgent
3. **Phase 1.3** — Container inventory & discovery
4. **Phase 3.1** — Container overview dashboard (list + detail pages)
5. **Phase 1.4** — Container lifecycle events
6. **Phase 2.1** — Container health check monitoring
7. **Phase 2.2** — Container resource threshold alerts
8. **Phase 4.1** — Automatic container log collection
9. **Phase 5.1** — Docker Compose project grouping
10. **Phase 2.3** — Container auto-restart detection
11. **Phase 3.2** — Container map / topology view
12. **Phase 4.2** — Container log correlation
13. **Phase 5.2** — Docker Swarm monitoring
14. **Phase 6.1** — Container image analysis
15. **Phase 6.2** — Container resource recommendations
16. **Phase 6.3** — Container diff / change detection

## Verification

For each feature:

1. Unit tests for new Docker metric collection, parsing, and criteria evaluation
2. Integration tests for container discovery, metric ingestion, and alerting APIs
3. Manual verification with a Docker host running multiple containers (in various states: healthy, unhealthy, restarting, OOM)
4. Test with Docker Compose multi-service applications
5. Performance test: verify agent overhead is minimal (< 1% CPU) when monitoring 50+ containers
6. Verify container metric accuracy by comparing agent-reported values to `docker stats` output
7. Test graceful handling of Docker daemon unavailability (the agent should not crash and should report a connection failure)
New file: Internal/Roadmap/KubernetesMonitoring.md (419 lines)
# Plan: Kubernetes Monitoring for OneUptime

## Context

OneUptime has foundational infrastructure for Kubernetes monitoring: OTLP ingestion (HTTP and gRPC), ClickHouse metric/log/trace storage, telemetry-based monitors (Metrics, Logs, Traces), and a Helm chart for deploying OneUptime itself on Kubernetes. A `Kubernetes` monitor type exists in the `MonitorType` enum but is currently disabled and has no implementation. The OpenTelemetry Collector config supports OTLP receivers but has no Kubernetes-specific receivers (kubelet, kube-state-metrics, Prometheus). Server monitoring exists but is limited to basic VM-level checks.

This plan proposes a phased implementation to deliver first-class Kubernetes monitoring — from cluster health and workload observability to intelligent alerting — leveraging OneUptime's all-in-one observability platform (metrics, logs, traces, incidents, status pages).

## Completed

- **OTLP Metric Ingestion** - HTTP and gRPC metric ingestion with async queue-based batch processing
- **ClickHouse Metric Storage** - MergeTree with partitioning, per-service TTL
- **Telemetry-Based Monitors** - Metric, Log, Trace, and Exception monitors with configurable criteria
- **Helm Chart** - OneUptime deploys on Kubernetes with KEDA auto-scaling support
- **OpenTelemetry Collector** - Deployed via Helm, accepts OTLP on ports 4317/4318
- **MonitorType.Kubernetes** - Enum value defined (but disabled and unimplemented)

## Gap Analysis Summary

| Feature | OneUptime | Datadog | New Relic | Grafana/Prometheus | Priority |
|---------|-----------|---------|-----------|-------------------|----------|
| K8s metric collection (kubelet, kube-state-metrics) | None | Agent auto-discovery | K8s integration | Prometheus + kube-state-metrics | **P0** |
| Cluster overview dashboard | None | Out-of-box | Pre-built | Pre-built via mixins | **P0** |
| Pod/Container resource metrics | None | Live Containers | K8s cluster explorer | cAdvisor + Grafana | **P0** |
| Node health monitoring | None | Host Map + agent | Infrastructure UI | node-exporter + Grafana | **P0** |
| Kubernetes event ingestion | None | Auto-collected | K8s events integration | Eventrouter/Exporter | **P0** |
| Workload health alerts (CrashLoopBackOff, OOMKilled, etc.) | None | Auto-monitors | Pre-built alerts | PrometheusRule CRDs | **P1** |
| Namespace/workload cost attribution | None | Container cost allocation | None | Kubecost integration | **P1** |
| K8s resource inventory (deployments, services, ingresses) | None | Orchestrator Explorer | Cluster explorer | None native | **P1** |
| HPA/VPA monitoring | None | Yes | Partial | Prometheus metrics | **P1** |
| Multi-cluster support | None | Yes | Yes | Thanos/Cortex | **P2** |
| K8s log collection (pod stdout/stderr) | Via Fluentd example | DaemonSet agent | Fluent Bit integration | Loki + Promtail | **P2** |
| Service mesh observability (Istio, Linkerd) | None | Yes | Yes | Partial | **P2** |
| Network policy monitoring | None | NPM | None | Cilium Hubble | **P3** |
| eBPF-based deep observability | None | Universal Service Monitoring | Pixie | Cilium/Tetragon | **P3** |

---
## Phase 1: Foundation (P0) — Kubernetes Metric Collection & Visibility

Without these, OneUptime cannot monitor any Kubernetes cluster. This phase makes K8s metrics flow into the platform and provides basic visibility.

### 1.1 OpenTelemetry Collector Kubernetes Receivers

**Current**: OTel Collector only has OTLP receivers. No Kubernetes-specific metric collection.

**Target**: Pre-configured OTel Collector with receivers for kubelet, kube-state-metrics, and Kubernetes events.

**Implementation**:

- Add the `kubeletstats` receiver to the OTel Collector config for node and pod resource metrics:
  - CPU, memory, filesystem, network per node and per pod/container
  - Collection interval: 30s
  - Auth via serviceAccount token
- Add the `k8s_cluster` receiver for cluster-level metrics from the Kubernetes API:
  - Deployment, ReplicaSet, StatefulSet, DaemonSet replica counts and status
  - Pod phase, container states (waiting/running/terminated with reasons)
  - Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
  - Namespace resource quotas and limit ranges
  - HPA current/desired replicas
- Add the `k8sobjects` receiver for Kubernetes events:
  - Watch the Events API for Warning and Normal events
  - Ingest as logs with structured attributes (reason, involvedObject, message)
- Add the `k8s_events` receiver as an alternative, lightweight event-collection option
- Configure the `k8sattributes` processor to enrich all telemetry with K8s metadata:
  - Pod name, namespace, node, deployment, replicaset, labels, annotations
- Provide Helm values to enable/disable K8s monitoring and configure which namespaces to monitor
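The receiver wiring above can be sketched as a collector config fragment. The receiver and processor names follow the opentelemetry-collector-contrib components; the kubelet endpoint, env var, and `otlphttp` exporter wiring are illustrative assumptions, not the actual OneUptime template:

```yaml
receivers:
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250   # kubelet on the local node (illustrative)
  k8s_cluster:
    collection_interval: 30s
  k8sobjects:
    objects:
      - name: events
        mode: watch

processors:
  k8sattributes: {}   # enrich telemetry with pod/namespace/node metadata

service:
  pipelines:
    metrics:
      receivers: [kubeletstats, k8s_cluster]
      processors: [k8sattributes]
      exporters: [otlphttp]   # forward to OneUptime's OTLP endpoint (illustrative)
    logs:
      receivers: [k8sobjects]
      processors: [k8sattributes]
      exporters: [otlphttp]
```

The receivers need RBAC (ClusterRole with read access to nodes, pods, events, and workload resources), which is what the new `otel-collector-rbac.yaml` template would provide.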
**Files to modify**:

- `OTelCollector/otel-collector-config.template.yaml` (add kubeletstats, k8s_cluster, k8sobjects receivers and k8sattributes processor)
- `HelmChart/Public/oneuptime/templates/otel-collector.yaml` (restore and configure OTel Collector deployment with proper RBAC)
- `HelmChart/Public/oneuptime/templates/otel-collector-rbac.yaml` (new - ClusterRole, ClusterRoleBinding, ServiceAccount for K8s API access)
- `HelmChart/Public/oneuptime/values.yaml` (add kubernetesMonitoring config section)

### 1.2 Kubernetes Cluster Overview Dashboard Template

**Current**: No pre-built Kubernetes dashboards.

**Target**: Auto-generated cluster overview dashboard showing key health indicators.

**Implementation**:

- Create a dashboard template with the following panels:
  - **Cluster Summary**: Total nodes, pods (running/pending/failed), namespaces, deployments
  - **Node Health**: CPU and memory utilization per node, node conditions
  - **Pod Status**: Pod phase distribution (Running/Pending/Succeeded/Failed/Unknown)
  - **Resource Utilization**: Cluster-wide CPU and memory usage vs capacity (requests, limits, actual)
  - **Top Consumers**: Top 10 pods by CPU usage, top 10 by memory usage
  - **Recent Events**: Kubernetes Warning events stream
  - **Container Restarts**: Pods with highest restart counts
- Auto-detect K8s metrics and offer dashboard creation during onboarding
- Use template variables for namespace and node filtering

**Files to modify**:

- `Common/Types/Dashboard/Templates/KubernetesCluster.ts` (new - cluster overview template)
- `Common/Types/Dashboard/Templates/KubernetesWorkload.ts` (new - per-namespace workload template)
- `App/FeatureSet/Dashboard/src/Pages/Dashboards/Templates.tsx` (add K8s templates to gallery)

### 1.3 Pod and Container Resource Metrics

**Current**: No container-level visibility.

**Target**: Detailed resource metrics for every pod and container with drill-down.

**Implementation**:

- Ensure the following kubeletstats metrics are ingested and queryable:
  - `k8s.pod.cpu.utilization`, `k8s.pod.memory.usage`, `k8s.pod.memory.rss`
  - `k8s.pod.network.io` (rx/tx bytes), `k8s.pod.filesystem.usage`
  - `container.cpu.utilization`, `container.memory.usage`, `container.restarts`
- Create a "Kubernetes" section in the dashboard navigation:
  - Cluster > Namespace > Workload > Pod > Container drill-down hierarchy
  - Pod detail page showing: resource usage over time, container statuses, events, logs (linked), traces (linked)
- Calculate resource efficiency: actual usage vs requests vs limits
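The efficiency calculation is just a pair of ratios against requests and limits; a minimal Go sketch (names illustrative):

```go
package main

import "fmt"

// efficiency computes usage as a percentage of the configured request and
// limit, the two numbers behind the resource-efficiency column described
// above. Zero means the corresponding constraint is not set.
func efficiency(usage, request, limit float64) (vsRequest, vsLimit float64) {
	if request > 0 {
		vsRequest = usage / request * 100
	}
	if limit > 0 {
		vsLimit = usage / limit * 100
	}
	return
}

func main() {
	// Pod using 150m CPU with a 500m request and a 1000m limit.
	vsReq, vsLim := efficiency(150, 500, 1000)
	fmt.Printf("%.0f%% of request, %.0f%% of limit\n", vsReq, vsLim)
}
```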
|
||||
|
||||
**Files to modify**:
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/` (new directory)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/ClusterOverview.tsx` (new)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Namespaces.tsx` (new)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Pods.tsx` (new)
|
||||
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/PodDetail.tsx` (new)
|
||||
|
||||
### 1.4 Node Health Monitoring
|
||||
|
||||
**Current**: No node-level metrics.
|
||||
**Target**: Per-node resource utilization, conditions, and capacity tracking.
|
||||
|
||||
**Implementation**:
|
||||
|
||||
- Ingest node metrics via kubeletstats receiver:
|
||||
- `k8s.node.cpu.utilization`, `k8s.node.memory.usage`, `k8s.node.memory.available`
|
||||
- `k8s.node.filesystem.usage`, `k8s.node.filesystem.capacity`
|
||||
- `k8s.node.network.io`, `k8s.node.condition` (Ready, MemoryPressure, etc.)
|
||||
- Node list page: table with all nodes showing CPU%, memory%, disk%, conditions, pod count
|
||||
- Node detail page: time-series charts for resource usage, pod list on node, events
|
||||
- Node capacity planning: show allocatable vs requested vs used per node
|
||||
|
||||
**Files to modify**:

- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Nodes.tsx` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/NodeDetail.tsx` (new)

### 1.5 Kubernetes Event Ingestion

**Current**: No Kubernetes event collection.

**Target**: Ingest and surface Kubernetes events as structured logs with correlation to resources.

**Implementation**:

- Configure the `k8sobjects` receiver to watch Kubernetes Events
- Map events to structured log entries:
  - `severity` from event type (Warning -> WARN, Normal -> INFO)
  - `body` from the event message
  - Attributes: `k8s.event.reason`, `k8s.event.count`, `k8s.object.kind`, `k8s.object.name`, `k8s.namespace.name`
- Create a dedicated "Kubernetes Events" view:
  - Filterable by namespace, event reason, and object kind
  - Timeline visualization showing event frequency
  - Link events to related pods/deployments/nodes
- Alert on specific event patterns (e.g., repeated FailedScheduling, FailedMount)
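The event watch above maps to the contrib `k8sobjects` receiver, which emits matched objects on a logs pipeline. A minimal sketch (the pipeline name is illustrative):

```yaml
receivers:
  k8sobjects:
    auth_type: serviceAccount
    objects:
      - name: events
        mode: watch                 # stream events as they occur, rather than polling
        group: events.k8s.io

service:
  pipelines:
    logs/k8s_events:                # illustrative pipeline name
      receivers: [k8sobjects]
      processors: [batch]
      exporters: [otlphttp]         # existing exporter names may differ
```

Severity and attribute mapping from the raw event objects would happen in a processor (or at ingestion), since the receiver emits the event bodies as-is.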
**Files to modify**:

- `OTelCollector/otel-collector-config.template.yaml` (add k8sobjects receiver)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Events.tsx` (new)

---

## Phase 2: Intelligent Alerting & Workload Health (P1) — Actionable Monitoring

### 2.1 Kubernetes-Aware Alert Templates

**Current**: Generic metric threshold alerts only. Users must manually configure alerts for K8s failure modes.

**Target**: Pre-built alert templates for common Kubernetes failure patterns.

**Implementation**:

- Create alert templates for critical K8s conditions:
  - **CrashLoopBackOff**: Alert when `k8s.container.restarts` increases rapidly (> N restarts in M minutes)
  - **OOMKilled**: Alert on container termination reason = OOMKilled
  - **Pod Pending**: Alert when pods remain in the Pending phase for > N minutes
  - **Node NotReady**: Alert when a node condition transitions to NotReady
  - **High Resource Utilization**: Alert when node CPU > 90% or memory > 85% sustained
  - **Deployment Replica Mismatch**: Alert when available replicas < desired replicas for > N minutes
  - **PVC Disk Full**: Alert when PV usage > 90% of capacity
  - **Failed Scheduling**: Alert on repeated FailedScheduling events
  - **Image Pull Failures**: Alert on ErrImagePull/ImagePullBackOff events
  - **Job/CronJob Failures**: Alert when job completion fails
- One-click enable for each alert template during K8s monitoring setup
- Auto-route alerts to the OneUptime incident management system
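To make the template idea concrete, the CrashLoopBackOff case might be declared as data along these lines. Every field name here is hypothetical, a sketch of what `KubernetesAlertTemplates.ts` could encode rather than an existing OneUptime schema:

```yaml
# Hypothetical template shape — illustrative only, not OneUptime's actual schema.
- id: k8s-crash-loop-backoff
  displayName: "CrashLoopBackOff detected"
  metric: k8s.container.restarts
  condition:
    type: rate-of-increase
    threshold: 3            # the "N restarts" from the bullet above
    windowMinutes: 10       # the "M minutes"
  severity: critical
  groupBy: [k8s.namespace.name, k8s.pod.name]
```

Keeping templates as declarative data would let the guided setup page render and enable them without template-specific code.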
**Files to modify**:

- `Common/Types/Monitor/Templates/KubernetesAlertTemplates.ts` (new)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/AlertSetup.tsx` (new - guided alert configuration)
- `Worker/Jobs/TelemetryMonitor/MonitorTelemetryMonitor.ts` (support K8s-specific criteria evaluation)

### 2.2 Kubernetes Resource Inventory

**Current**: No visibility into K8s resource state.

**Target**: Live inventory of Kubernetes resources with health status.

**Implementation**:

- Create a `KubernetesResource` model stored in ClickHouse (or PostgreSQL, depending on query patterns):
  - Kind, name, namespace, labels, annotations, status, conditions, timestamps
  - Updated via the `k8s_cluster` receiver or periodic API sync
- Resource pages:
  - **Deployments**: List with replica status (ready/desired), last update, strategy
  - **StatefulSets**: Ordered pod status, PVC bindings
  - **DaemonSets**: Node coverage, desired vs current vs ready
  - **Services**: Type (ClusterIP/NodePort/LoadBalancer), endpoints, selector
  - **Ingresses**: Host rules, backend services, TLS status
  - **ConfigMaps/Secrets**: List with last-modified (secrets show metadata only, never values)
  - **PVCs**: Bound PV, capacity, access modes, storage class
- Drill-down from any resource to its associated pods, events, and telemetry

**Files to modify**:

- `Common/Models/AnalyticsModels/KubernetesResource.ts` (new)
- `Telemetry/Services/KubernetesResourceService.ts` (new - sync K8s resources)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Resources/` (new - pages for each resource kind)

### 2.3 HPA and VPA Monitoring

**Current**: No autoscaler visibility.

**Target**: Track HPA/VPA behavior and scaling events.

**Implementation**:

- Ingest HPA metrics from the `k8s_cluster` receiver:
  - `k8s.hpa.current_replicas`, `k8s.hpa.desired_replicas`, `k8s.hpa.min_replicas`, `k8s.hpa.max_replicas`
  - Target metric values vs actual
- HPA overview page:
  - List all HPAs with current/desired/min/max replicas
  - Time-series chart showing scaling events overlaid with the target metric
- Alert when an HPA sits at max replicas for a sustained period (capacity ceiling)
- Alert when scale-up frequency is abnormally high (thrashing)

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Autoscaling.tsx` (new)

### 2.4 Namespace Resource Quota Monitoring

**Current**: No quota tracking.

**Target**: Track resource quota usage per namespace and alert on approaching limits.

**Implementation**:

- Ingest quota metrics from the `k8s_cluster` receiver:
  - `k8s.resource_quota.hard.cpu`, `k8s.resource_quota.used.cpu`
  - `k8s.resource_quota.hard.memory`, `k8s.resource_quota.used.memory`
  - `k8s.resource_quota.hard.pods`, `k8s.resource_quota.used.pods`
- Namespace detail page showing quota utilization gauges
- Alert when any quota usage exceeds 80% (configurable threshold)

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/NamespaceDetail.tsx` (new)

---

## Phase 3: Advanced Observability (P2) — Correlation & Deep Visibility

### 3.1 Kubernetes Log Collection

**Current**: Users can manually configure Fluentd to send logs. No built-in K8s log collection.

**Target**: Automated pod log collection via the OTel Collector with K8s metadata enrichment.

**Implementation**:

- Add a `filelog` receiver to the OTel Collector for collecting container logs from `/var/log/pods/`:
  - Parse container runtime log formats (Docker JSON, CRI)
  - Extract pod name, namespace, and container name from the file path
  - Enrich with K8s metadata via the `k8sattributes` processor
- Deploy the OTel Collector as a DaemonSet (in addition to the existing Deployment) for log collection
- Helm values to configure:
  - Namespace inclusion/exclusion filters
  - Log level filtering (e.g., only collect WARN and above)
  - Container name exclusion patterns
- Link pod logs in the Kubernetes pod detail page
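A minimal sketch of the DaemonSet collector's log pipeline pieces (the paths and exclusion pattern are illustrative, and the `container` operator assumes a recent collector-contrib release):

```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/otel-collector/*.log   # avoid ingesting the collector's own logs
    operators:
      - type: container    # parses Docker JSON and CRI log formats, and extracts
                           # pod, namespace, and container name from the file path

processors:
  k8sattributes:           # enrich records with workload metadata from the API server
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
```

Namespace filters and log-level thresholds would then be layered on via `filter`/`transform` processors driven by Helm values.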
**Files to modify**:

- `HelmChart/Public/oneuptime/templates/otel-collector-daemonset.yaml` (new - DaemonSet for log collection)
- `OTelCollector/otel-collector-daemonset-config.template.yaml` (new - DaemonSet-specific config with filelog receiver)
- `HelmChart/Public/oneuptime/values.yaml` (add DaemonSet configuration options)

### 3.2 Multi-Cluster Support

**Current**: Single-cluster assumption.

**Target**: Monitor multiple Kubernetes clusters from a single OneUptime project.

**Implementation**:

- Add a `cluster` attribute to all K8s metrics via the OTel Collector resource processor
- Cluster registration: each cluster gets a unique name and OneUptime API key
- Helm install per cluster with cluster-specific configuration
- Cluster selector in the K8s monitoring UI (template variable)
- Cross-cluster comparison views (e.g., resource utilization across clusters)
- Unified alerting: the same alert rules applied across all clusters, or cluster-specific rules
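Stamping the cluster attribute is a one-block change with the `resource` processor; a minimal sketch (the env var name is illustrative and would come from Helm values):

```yaml
processors:
  resource/cluster:
    attributes:
      - key: k8s.cluster.name
        value: ${env:CLUSTER_NAME}   # set per cluster, e.g. from Helm's clusterName value
        action: upsert               # overwrite any existing value on the resource
```

Adding `resource/cluster` to every pipeline guarantees the attribute is present on metrics, logs, and traces alike, which is what the cluster-selector template variable would filter on.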
**Files to modify**:

- `OTelCollector/otel-collector-config.template.yaml` (add resource processor with cluster name)
- `HelmChart/Public/oneuptime/values.yaml` (add clusterName config)
- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/Clusters.tsx` (new - multi-cluster view)

### 3.3 Service Mesh Observability

**Current**: No service mesh integration.

**Target**: Ingest and visualize service mesh metrics from Istio, Linkerd, or similar.

**Implementation**:

- Add a Prometheus receiver to the OTel Collector for scraping service mesh metrics:
  - Istio: `istio_requests_total`, `istio_request_duration_milliseconds`, `istio_tcp_connections_opened_total`
  - Linkerd: `request_total`, `response_latency_ms`
- Service-to-service traffic map from mesh metrics
- mTLS status visibility
- Circuit breaker and retry metrics
- Dashboard templates for Istio and Linkerd
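Scraping mesh sidecars can be sketched with the collector's `prometheus` receiver; the port-name filter below follows the Istio sidecar convention and would need adjusting for Linkerd:

```yaml
receivers:
  prometheus/mesh:
    config:
      scrape_configs:
        - job_name: istio-sidecars
          metrics_path: /stats/prometheus      # Istio's merged telemetry endpoint
          kubernetes_sd_configs:
            - role: pod                        # discover scrape targets from pod metadata
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              regex: http-envoy-prom           # keep only Envoy sidecar metric ports
              action: keep
```

Running this as a separate named receiver (`prometheus/mesh`) keeps mesh scraping isolated from any other Prometheus scrape jobs the collector may carry.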
**Files to modify**:

- `OTelCollector/otel-collector-config.template.yaml` (add prometheus receiver for mesh metrics)
- `Common/Types/Dashboard/Templates/ServiceMesh.ts` (new - mesh dashboard templates)

### 3.4 Kubernetes-to-Telemetry Correlation

**Current**: K8s resources and telemetry (metrics, logs, traces) are separate.

**Target**: Click on any K8s resource to see correlated telemetry.

**Implementation**:

- From any pod/deployment/service page, show:
  - **Metrics**: CPU, memory, and network filtered to that resource
  - **Logs**: Logs from containers in that pod, filtered by K8s metadata attributes
  - **Traces**: Traces originating from or passing through that service
  - **Events**: Kubernetes events for that resource
- Use `k8sattributes` processor enrichment to correlate:
  - `k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name` across all signals
- Deep link from the incident timeline to the K8s resource view

**Files to modify**:

- `App/FeatureSet/Dashboard/src/Pages/Kubernetes/PodDetail.tsx` (add telemetry correlation tabs)
- `App/FeatureSet/Dashboard/src/Components/Kubernetes/ResourceTelemetryPanel.tsx` (new - reusable correlation panel)

---

## Phase 4: Intelligence & Differentiation (P3) — Long-Term

### 4.1 Kubernetes Cost Attribution

- Track CPU and memory usage per namespace, workload, and label
- Calculate cost based on node instance pricing (configurable per cluster)
- Show cost trends over time and cost per team/project (via labels)
- Identify idle resources (requested but unused capacity)
- Recommendations: right-size requests/limits based on actual usage

### 4.2 Network Policy Monitoring

- Visualize network policies and their effect on pod communication
- Alert on denied network connections
- Integration with Cilium Hubble or Calico for deep network visibility
- Service dependency map derived from actual network traffic

### 4.3 eBPF-Based Deep Observability

- Kernel-level visibility without application instrumentation
- Automatic service discovery and dependency mapping
- DNS monitoring and latency tracking
- TCP connection tracking and retransmit analysis
- Integration with tools like Tetragon, Pixie, or custom eBPF probes

### 4.4 Kubernetes Compliance and Security Monitoring

- Pod security standards compliance tracking
- RBAC audit logging and visualization
- Image vulnerability scanning status
- Network policy coverage analysis
- CIS Kubernetes Benchmark compliance scoring

### 4.5 GitOps Integration

- Track ArgoCD/Flux deployments as annotations on metric charts
- Correlate deployment events with performance changes
- Show deployment history per workload with rollback status
- Alert when deployment sync fails or drift is detected

---

## Quick Wins (Can Ship This Week)

1. **Enable Kubernetes MonitorType** - Uncomment the Kubernetes entry in `getAllMonitorTypeProps()` and wire it to existing telemetry monitors
2. **Add k8sattributes processor** - Enrich all existing OTLP data with K8s metadata for free
3. **Kubernetes dashboard template** - Create a basic cluster health dashboard using standard OTel K8s metric names
4. **K8s event alerting** - Use existing log monitors to alert on K8s Warning events once event ingestion is configured
5. **Document OTel Collector K8s setup** - Guide for users to configure their own OTel Collector with K8s receivers pointing to OneUptime
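Quick win #2 amounts to inserting one processor into the existing pipelines; a minimal sketch (receiver/exporter names are illustrative placeholders for whatever the current config uses):

```yaml
processors:
  k8sattributes:
    auth_type: serviceAccount      # use the collector's service account against the API server
    extract:
      metadata:                    # attributes attached to every incoming signal
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]        # existing exporter names may differ
```

The processor associates incoming data with pods by source IP (or explicit resource attributes), so no application changes are needed, which is what makes this a free enrichment win.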
---

## Recommended Implementation Order

1. **Quick Wins** - Enable MonitorType, k8sattributes processor, documentation
2. **Phase 1.1** - OTel Collector K8s receivers (prerequisite for everything else)
3. **Phase 1.5** - Kubernetes event ingestion (high value, uses existing log infrastructure)
4. **Phase 1.2** - Cluster overview dashboard template
5. **Phase 1.3** - Pod and container resource metrics pages
6. **Phase 1.4** - Node health monitoring pages
7. **Phase 2.1** - K8s-aware alert templates (makes monitoring actionable)
8. **Phase 2.2** - Resource inventory pages
9. **Phase 2.4** - Namespace quota monitoring
10. **Phase 2.3** - HPA/VPA monitoring
11. **Phase 3.1** - K8s log collection via DaemonSet
12. **Phase 3.4** - K8s-to-telemetry correlation
13. **Phase 3.2** - Multi-cluster support
14. **Phase 3.3** - Service mesh observability
15. **Phase 4.x** - Cost attribution, network policies, eBPF, compliance, GitOps

## Verification

For each feature:

1. Unit tests for new K8s metric query builders, resource models, and alert template logic
2. Integration tests for OTel Collector K8s receivers (use minikube or kind in CI)
3. Manual verification on a test cluster (minikube/kind) with representative workloads
4. Verify K8s metadata enrichment via the `k8sattributes` processor across metrics, logs, and traces
5. Check ClickHouse query performance for K8s-specific queries (namespace filtering, resource correlation)
6. Load test with realistic cluster sizes (100+ nodes, 1000+ pods) to validate metric volume handling
7. Verify RBAC permissions are minimal (principle of least privilege for the ClusterRole)
8. Test Helm chart upgrades to ensure K8s monitoring can be enabled without disruption