Merge pull request #339 from gyptazy/feature/337-add-pressure-based-balancing

feature(balancing): Add pressure (memory, cpu, disk) based balancing
This commit is contained in:
gyptazy
2025-10-23 11:41:04 +02:00
committed by GitHub
9 changed files with 516 additions and 36 deletions


@@ -0,0 +1,5 @@
added:
- Add pressure (PSI) based balancing for memory, cpu, disk (req. PVE9 or greater) (@gyptazy). [#337]
- Pressure (PSI) based balancing for nodes
- Pressure (PSI) based balancing for guests
- Add PVE version evaluation


@@ -0,0 +1 @@
date: TBD


@@ -54,6 +54,10 @@ ProxLB's key features are by enabling automatic rebalancing of VMs and CTs acros
* Memory
* Disk (only local storage)
* CPU
* Rebalance by different modes:
* Used resources
* Assigned resources
* PSI (Pressure) of resources
* Get best nodes for further automation
* Supported Guest Types
* VMs
@@ -278,7 +282,8 @@ The following options can be set in the configuration file `proxlb.yaml`:
| | max_job_validation | | 1800 | `Int` | How long a job validation may take in seconds. (default: 1800) |
| | balanciness | | 10 | `Int` | The maximum delta of resource usage between node with highest and lowest usage. |
| | method | | memory | `Str` | The balancing method that should be used. [values: `memory` (default), `cpu`, `disk`]|
| | mode | | used | `Str` | The balancing mode that should be used. [values: `used` (default), `assigned`] |
| | mode | | used | `Str` | The balancing mode that should be used. [values: `used` (default), `assigned`, `psi` (pressure)] |
| | psi | | { nodes: { memory: { pressure_full: 0.20, pressure_some: 0.20, pressure_spikes: 1.00 } } } | `Dict` | A dict of PSI based thresholds for nodes and guests |
| `service` | | | | | |
| | daemon | | True | `Bool` | If daemon mode should be activated. |
| | `schedule` | | | `Dict` | Schedule config block for rebalancing. |
@@ -323,6 +328,35 @@ balancing:
balanciness: 5
method: memory
mode: used
# # PSI thresholds only apply when using mode 'psi'
# # PSI based balancing is currently in beta and req. PVE >= 9
# psi:
# nodes:
# memory:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# cpu:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# disk:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# guests:
# memory:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# cpu:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# disk:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
service:
daemon: True


@@ -26,11 +26,39 @@ balancing:
live: True
with_local_disks: True
with_conntrack_state: True
balance_types: ['vm', 'ct']
balance_types: ['vm', 'ct'] # 'vm' | 'ct'
max_job_validation: 1800
balanciness: 5
method: memory
mode: used
method: memory # 'memory' | 'cpu' | 'disk'
mode: used # 'assigned' | 'used' | 'psi'
# # PSI thresholds only apply when using mode 'psi'
# psi:
# nodes:
# memory:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# cpu:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# disk:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# guests:
# memory:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# cpu:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
# disk:
# pressure_full: 0.20
# pressure_some: 0.20
# pressure_spikes: 1.00
service:
daemon: True


@@ -20,6 +20,10 @@
7. [Run as a Systemd-Service](#run-as-a-systemd-service)
8. [SSL Self-Signed Certificates](#ssl-self-signed-certificates)
9. [Node Maintenances](#node-maintenances)
10. [Balancing Methods](#balancing-methods)
1. [Used Resources](#used-resources)
2. [Assigned Resources](#assigned-resources)
3. [Pressure (PSI) based Resources](#pressure-psi-based-resources)
## Authentication / User Accounts / Permissions
### Authentication
@@ -235,4 +239,115 @@ The maintenance_nodes key must be defined as a list, even if it only includes a
* No new workloads will be balanced or migrated onto it.
* Any existing workloads currently running on the node will be migrated away in accordance with the configured balancing strategies, assuming resources on other nodes allow.
This feature is particularly useful during planned maintenance, upgrades, or troubleshooting, ensuring that services continue to run with minimal disruption while the specified node is being worked on.
## 10. Balancing Methods
ProxLB provides multiple balancing modes that define *how* resources are evaluated and compared during cluster balancing.
Each mode reflects a different strategy for determining load and distributing guests (VMs or containers) between nodes.
Depending on your environment, provisioning strategy, and performance goals, you can choose between:
| Mode | Description | Typical Use Case |
|------|--------------|------------------|
| `used` | Uses the *actual runtime resource usage* (e.g. CPU, memory, disk). | Dynamic or lab environments with frequent workload changes and tolerance for overprovisioning. |
| `assigned` | Uses the *statically defined resource allocations* from guest configurations. | Production or SLA-driven clusters that require guaranteed resources and predictable performance. |
| `psi` | Uses Linux *Pressure Stall Information (PSI)* metrics to evaluate real system contention and pressure. | Advanced clusters that require pressure-aware decisions for proactive rebalancing. |
### 10.1 Used Resources
When **mode: `used`** is configured, ProxLB evaluates the *real usage metrics* of guest objects (VMs and CTs).
It collects the current CPU, memory, and disk usage directly from the Proxmox API to determine the *actual consumption* of each guest and node.
This mode is ideal for **dynamic environments** where workloads frequently change and **overprovisioning is acceptable**. It provides the most reactive balancing behavior, since decisions are based on live usage instead of static assignment.
Typical scenarios include:
- Production environments where workloads should be distributed evenly across the nodes.
- Test or development clusters with frequent VM changes.
- Clusters where resource spikes are short-lived.
- Environments where slight resource contention is tolerable.
#### Example Configuration
```yaml
balancing:
mode: used
```
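The trigger condition in `used` mode can be sketched as follows; the function, node names, and values are illustrative and not taken from ProxLB's internals, but follow the documented rule that balancing starts once the usage delta between the busiest and the most idle node exceeds `balanciness`:

```python
# Illustrative sketch of a 'used'-mode balanciness check (not ProxLB's actual code).
# Balancing is triggered when the spread between the node with the highest and the
# node with the lowest usage exceeds the configured 'balanciness' threshold.

def needs_rebalance(memory_used_percent: dict[str, float], balanciness: int) -> bool:
    delta = max(memory_used_percent.values()) - min(memory_used_percent.values())
    return delta > balanciness

nodes = {"node01": 78.0, "node02": 41.0, "node03": 55.0}
print(needs_rebalance(nodes, balanciness=5))  # delta = 37.0 -> True
```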
### 10.2 Assigned Resources
When **mode: `assigned`** is configured, ProxLB evaluates the *provisioned or allocated resources* of each guest (VM or CT) instead of their runtime usage.
It uses data such as **CPU cores**, **memory limits**, and **disk allocations** defined in Proxmox to calculate how much of each node's capacity is reserved.
This mode is ideal for **production clusters** where:
- Overcommitment is *not allowed or only minimally tolerated*.
- Each node's workload is planned based on the assigned capacities.
- Administrators want predictable resource distribution aligned with provisioning policies.
Unlike the `used` mode, `assigned` focuses purely on the *declared configuration* of guests and remains stable even if actual usage varies temporarily.
Typical scenarios include:
- Enterprise environments with SLA or QoS requirements.
- Clusters where workloads are sized deterministically.
- Situations where consistent node utilization and capacity awareness are crucial.
#### Example Configuration
```yaml
balancing:
mode: assigned
```
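In `assigned` mode the accounting can be pictured like this; a minimal sketch with made-up guest sizes, assuming memory is counted by its configured maximum rather than live usage:

```python
# Illustrative sketch of 'assigned'-mode accounting (not ProxLB's actual code).
# Each guest's configured memory counts against its node's capacity,
# regardless of how much the guest currently uses.

def memory_assigned_percent(guests: list[dict], node_memory_total: int) -> float:
    assigned = sum(g["memory_total"] for g in guests)
    return assigned / node_memory_total * 100

guests = [
    {"name": "vm01", "memory_total": 8 * 1024**3},   # 8 GiB assigned
    {"name": "vm02", "memory_total": 16 * 1024**3},  # 16 GiB assigned
]
print(memory_assigned_percent(guests, node_memory_total=64 * 1024**3))  # 37.5
```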
### 10.3 Pressure (PSI) based Resources
> [!IMPORTANT]
> PSI based balancing is still in beta! If you find any bugs, please raise an issue including metrics of all nodes and affected guests. You can provide metrics directly from PVE or Grafana (via node_exporter or pve_exporter).
When **mode: `psi`** is configured, ProxLB uses the **Linux Pressure Stall Information (PSI)** interface to measure the *real-time pressure* on system resources such as **CPU**, **memory**, and **disk I/O**.
Unlike the `used` or `assigned` modes, which rely on static or average metrics, PSI provides *direct insight into how often and how long tasks are stalled* because of insufficient resources.
This enables ProxLB to make **proactive balancing decisions** — moving workloads *before* performance degradation becomes visible to the user.
**IMPORTANT**: Predictively redistributing workloads is risky and might not result in the expected state. Therefore, ProxLB migrates only a single instance every 60 minutes to obtain fresh real-world metrics and to validate whether further changes are required. Keep in mind that migrations are also costly and should be avoided as much as possible.
PSI metrics are available for both **nodes** and **guest objects**, allowing fine-grained balancing decisions:
- **Node-level PSI:** Detects cluster nodes under systemic load or contention.
- **Guest-level PSI:** Identifies individual guests suffering from memory, CPU, or I/O stalls.
### PSI Metrics Explained
Each monitored resource defines three pressure thresholds:
| Key | Description |
|-----|--------------|
| `pressure_some` | Indicates partial stall conditions where some tasks are waiting for a resource. |
| `pressure_full` | Represents complete stall conditions where *all* tasks are blocked waiting for a resource. |
| `pressure_spikes` | Defines short-term burst conditions that may signal saturation spikes. |
These thresholds are expressed in **percentages** and represent how much time the kernel reports stalls over specific averaging windows (e.g. 10s, 60s, 300s).
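For background, the raw PSI values originate from the kernel's `/proc/pressure/<resource>` files; ProxLB consumes them through the Proxmox RRD API, but a quick parser for the kernel's own format illustrates what the averaging windows look like:

```python
# Sketch: parsing the Linux PSI format as exposed in /proc/pressure/{cpu,memory,io}.
# Each line reports 'some' or 'full' stall percentages over 10s/60s/300s windows.

def parse_psi(text: str) -> dict[str, dict[str, float]]:
    result = {}
    for line in text.splitlines():
        kind, *fields = line.split()  # kind is 'some' or 'full'
        result[kind] = {key: float(value)
                        for key, value in (field.split("=") for field in fields)}
    return result

sample = ("some avg10=1.53 avg60=0.87 avg300=0.21 total=123456\n"
          "full avg10=0.34 avg60=0.12 avg300=0.05 total=65432")
psi = parse_psi(sample)
print(psi["some"]["avg10"])  # 1.53
```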
### Example Configuration
```yaml
balancing:
mode: psi
psi:
nodes:
memory:
pressure_full: 0.20
pressure_some: 0.20
pressure_spikes: 1.00
cpu:
pressure_full: 0.20
pressure_some: 0.20
pressure_spikes: 1.00
disk:
pressure_full: 0.20
pressure_some: 0.20
pressure_spikes: 1.00
guests:
memory:
pressure_full: 0.20
pressure_some: 0.20
pressure_spikes: 1.00
cpu:
pressure_full: 0.20
pressure_some: 0.20
pressure_spikes: 1.00
disk:
pressure_full: 0.20
pressure_some: 0.20
pressure_spikes: 1.00
```
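A minimal sketch of how thresholds like these could be applied (names are illustrative; the shipped logic lives in `Calculations.set_node_hot` / `set_guest_hot`): a resource is flagged hot when both averaged pressures exceed their thresholds, or when a short-term spike alone crosses the spike threshold.

```python
# Illustrative evaluation of PSI thresholds against measured pressure values.
# Hot = sustained pressure (both 'full' and 'some' over threshold) OR a spike.

def is_hot(metrics: dict[str, float], thresholds: dict[str, float]) -> bool:
    sustained = (metrics["pressure_full"] >= thresholds["pressure_full"]
                 and metrics["pressure_some"] >= thresholds["pressure_some"])
    spiking = metrics["pressure_spikes"] >= thresholds["pressure_spikes"]
    return sustained or spiking

thresholds = {"pressure_full": 0.20, "pressure_some": 0.20, "pressure_spikes": 1.00}
print(is_hot({"pressure_full": 0.25, "pressure_some": 0.30,
              "pressure_spikes": 0.40}, thresholds))  # True (sustained pressure)
```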


@@ -81,6 +81,8 @@ def main():
# Update the initial node resource assignments
# by the previously created groups.
Calculations.set_node_assignments(proxlb_data)
Calculations.set_node_hot(proxlb_data)
Calculations.set_guest_hot(proxlb_data)
Calculations.get_most_free_node(proxlb_data, cli_args.best_node)
Calculations.relocate_guests_on_maintenance_nodes(proxlb_data)
Calculations.get_balanciness(proxlb_data)


@@ -80,7 +80,7 @@ class Calculations:
for guest_name in group_meta["guests"]:
guest_node_current = proxlb_data["guests"][guest_name]["node_current"]
# Update Hardware assignments
# Update resource assignments
# Update assigned values for the current node
logger.debug(f"set_node_assignment of guest {guest_name} on node {guest_node_current} with cpu_total: {proxlb_data['guests'][guest_name]['cpu_total']}, memory_total: {proxlb_data['guests'][guest_name]['memory_total']}, disk_total: {proxlb_data['guests'][guest_name]['disk_total']}.")
proxlb_data["nodes"][guest_node_current]["cpu_assigned"] += proxlb_data["guests"][guest_name]["cpu_total"]
@@ -93,6 +93,83 @@ class Calculations:
logger.debug("Finished: set_node_assignments.")
@staticmethod
def set_node_hot(proxlb_data: Dict[str, Any]) -> Dict[str, Any]:
"""
Evaluates node 'full' pressure metrics for memory, cpu, and io
against defined thresholds and sets <metric>_pressure_hot = True
when a node is considered HOT.
Returns the modified proxlb_data dict.
"""
logger.debug("Starting: set_node_hot.")
balancing_cfg = proxlb_data.get("meta", {}).get("balancing", {})
thresholds = balancing_cfg.get("psi_thresholds", balancing_cfg.get("psi", {}).get("nodes", {}))
nodes = proxlb_data.get("nodes", {})
for node_name, node in nodes.items():
if node.get("maintenance"):
continue
if node.get("ignore"):
continue
# PSI metrics are only available on Proxmox VE 9.0 and higher.
if proxlb_data["meta"]["balancing"].get("mode", "used") == "psi":
if tuple(map(int, proxlb_data["nodes"][node["name"]]["pve_version"].split('.'))) < (9, 0):
logger.critical(f"Proxmox node {node['name']} runs Proxmox VE version {proxlb_data['nodes'][node['name']]['pve_version']}."
" PSI metrics require Proxmox VE 9.0 or higher. Balancing deactivated!")
for metric, threshold in thresholds.items():
pressure_full = node.get(f"{metric}_pressure_full_percent", 0.0)
pressure_some = node.get(f"{metric}_pressure_some_percent", 0.0)
pressure_spikes = node.get(f"{metric}_pressure_full_spikes_percent", 0.0)
is_hot = (pressure_full >= threshold["pressure_full"] and pressure_some >= threshold["pressure_some"]) or (pressure_spikes >= threshold["pressure_spikes"])
if is_hot:
logger.debug(f"Set node {node['name']} as hot based on {metric} pressure metrics.")
proxlb_data["nodes"][node["name"]][f"{metric}_pressure_hot"] = True
proxlb_data["nodes"][node["name"]]["pressure_hot"] = True
else:
logger.debug(f"Node {node['name']} is not hot based on {metric} pressure metrics.")
logger.debug("Finished: set_node_hot.")
return proxlb_data
@staticmethod
def set_guest_hot(proxlb_data: Dict[str, Any]) -> Dict[str, Any]:
"""
Evaluates guest 'full' pressure metrics for memory, cpu, and io
against defined thresholds and sets <metric>_pressure_hot = True
when a guest is considered HOT.
Returns the modified proxlb_data dict.
"""
logger.debug("Starting: set_guest_hot.")
balancing_cfg = proxlb_data.get("meta", {}).get("balancing", {})
thresholds = balancing_cfg.get("psi_thresholds", balancing_cfg.get("psi", {}).get("guests", {}))
guests = proxlb_data.get("guests", {})
for guest_name, guest in guests.items():
if guest.get("ignore"):
continue
for metric, threshold in thresholds.items():
pressure_full = guest.get(f"{metric}_pressure_full_percent", 0.0)
pressure_some = guest.get(f"{metric}_pressure_some_percent", 0.0)
pressure_spikes = guest.get(f"{metric}_pressure_full_spikes_percent", 0.0)
is_hot = (pressure_full >= threshold["pressure_full"] and pressure_some >= threshold["pressure_some"]) or (pressure_spikes >= threshold["pressure_spikes"])
if is_hot:
logger.debug(f"Set guest {guest['name']} as hot based on {metric} pressure metrics.")
proxlb_data["guests"][guest["name"]][f"{metric}_pressure_hot"] = True
proxlb_data["guests"][guest["name"]]["pressure_hot"] = True
else:
logger.debug(f"Guest {guest['name']} is not hot based on {metric} pressure metrics.")
logger.debug("Finished: set_guest_hot.")
return proxlb_data
@staticmethod
def get_balanciness(proxlb_data: Dict[str, Any]) -> Dict[str, Any]:
"""
@@ -113,7 +190,36 @@ class Calculations:
method = proxlb_data["meta"]["balancing"].get("method", "memory")
mode = proxlb_data["meta"]["balancing"].get("mode", "used")
balanciness = proxlb_data["meta"]["balancing"].get("balanciness", 10)
method_value = [node_meta[f"{method}_{mode}_percent"] for node_meta in proxlb_data["nodes"].values()]
if mode in ("assigned", "used"):
method_value = [node_meta[f"{method}_{mode}_percent"] for node_meta in proxlb_data["nodes"].values()]
elif mode == "psi":
method_value = [node_meta[f"{method}_pressure_full_spikes_percent"] for node_meta in proxlb_data["nodes"].values()]
any_node_hot = any(node.get(f"{method}_pressure_hot", False) for node in proxlb_data["nodes"].values())
any_guest_hot = any(guest.get(f"{method}_pressure_hot", False) for guest in proxlb_data["guests"].values())
if any_node_hot:
logger.debug(f"Guest balancing is required. A node is marked as HOT based on {method} pressure metrics.")
proxlb_data["meta"]["balancing"]["balance"] = True
else:
logger.debug(f"Guest balancing is ok. No node is marked as HOT based on {method} pressure metrics.")
if any_guest_hot:
logger.debug(f"Guest balancing is required. A guest is marked as HOT based on {method} pressure metrics.")
proxlb_data["meta"]["balancing"]["balance"] = True
else:
logger.debug(f"Guest balancing is ok. No guest is marked as HOT based on {method} pressure metrics.")
return proxlb_data
else:
logger.critical(f"Unknown balancing mode: {mode} provided. Cannot get balanciness.")
sys.exit(1)
method_value_highest = max(method_value)
method_value_lowest = min(method_value)
@@ -159,7 +265,23 @@ class Calculations:
# Filter by the defined methods and modes for balancing
method = proxlb_data["meta"]["balancing"].get("method", "memory")
mode = proxlb_data["meta"]["balancing"].get("mode", "used")
lowest_usage_node = min(filtered_nodes, key=lambda x: x[f"{method}_{mode}_percent"])
if mode in ("assigned", "used"):
logger.debug(f"Get best node for balancing by {mode} {method} resources.")
lowest_usage_node = min(filtered_nodes, key=lambda x: x[f"{method}_{mode}_percent"])
elif mode == "psi":
logger.debug(f"Get best node for balancing by pressure of {method} resources.")
lowest_usage_node = min(filtered_nodes, key=lambda x: x[f"{method}_pressure_full_spikes_percent"])
else:
logger.critical(f"Unknown balancing mode: {mode} provided. Cannot get best node.")
sys.exit(1)
proxlb_data["meta"]["balancing"]["balance_reason"] = 'resources'
proxlb_data["meta"]["balancing"]["balance_next_node"] = lowest_usage_node["name"]
@@ -188,7 +310,7 @@ class Calculations:
Returns:
None
"""
logger.debug("Starting: get_most_free_node.")
logger.debug("Starting: relocate_guests_on_maintenance_nodes.")
proxlb_data["meta"]["balancing"]["balance_next_guest"] = ""
for guest_name in proxlb_data["groups"]["maintenance"]:
@@ -199,7 +321,7 @@ class Calculations:
Calculations.update_node_resources(proxlb_data)
logger.warning(f"Warning: Balancing may not be perfect because guest {guest_name} was located on a node which is in maintenance mode.")
logger.debug("Finished: get_most_free_node.")
logger.debug("Finished: relocate_guests_on_maintenance_nodes.")
@staticmethod
def relocate_guests(proxlb_data: Dict[str, Any]):
@@ -233,7 +355,26 @@ class Calculations:
Calculations.get_most_free_node(proxlb_data)
for guest_name in proxlb_data["groups"]["affinity"][group_name]["guests"]:
proxlb_data["meta"]["balancing"]["balance_next_guest"] = guest_name
mode = proxlb_data["meta"]["balancing"].get("mode", "used")
if mode == 'psi':
logger.debug(f"Evaluating guest relocation based on {mode} mode.")
method = proxlb_data["meta"]["balancing"].get("method", "memory")
processed_guests_psi = proxlb_data["meta"]["balancing"].setdefault("processed_guests_psi", [])
unprocessed_guests_psi = [guest for guest in proxlb_data["guests"].values() if guest["name"] not in processed_guests_psi]
# Filter by the defined methods and modes for balancing
highest_usage_guest = max(unprocessed_guests_psi, key=lambda x: x[f"{method}_pressure_full_spikes_percent"])
# Append guest to the psi based processed list of guests
if highest_usage_guest["name"] == guest_name and guest_name not in proxlb_data["meta"]["balancing"]["processed_guests_psi"]:
proxlb_data["meta"]["balancing"]["processed_guests_psi"].append(guest_name)
proxlb_data["meta"]["balancing"]["balance_next_guest"] = guest_name
else:
logger.debug(f"Evaluating guest relocation based on {mode} mode.")
proxlb_data["meta"]["balancing"]["balance_next_guest"] = guest_name
Calculations.val_anti_affinity(proxlb_data, guest_name)
Calculations.val_node_relationships(proxlb_data, guest_name)
Calculations.update_node_resources(proxlb_data)
@@ -348,6 +489,11 @@ class Calculations:
"""
logger.debug("Starting: update_node_resources.")
guest_name = proxlb_data["meta"]["balancing"]["balance_next_guest"]
if guest_name == "":
logger.debug("No guest defined to update node resources for.")
return
node_current = proxlb_data["guests"][guest_name]["node_current"]
node_target = proxlb_data["meta"]["balancing"]["balance_next_node"]


@@ -62,19 +62,34 @@ class Guests:
# resource metrics for rebalancing to ensure that we do not overprovision the node.
for guest in proxmox_api.nodes(node).qemu.get():
if guest['status'] == 'running':
guests['guests'][guest['name']] = {}
guests['guests'][guest['name']]['name'] = guest['name']
guests['guests'][guest['name']]['cpu_total'] = int(guest['cpus'])
guests['guests'][guest['name']]['cpu_used'] = Guests.get_guest_cpu_usage(proxmox_api, node, guest['vmid'], guest['name'])
guests['guests'][guest['name']]['cpu_used'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', None)
guests['guests'][guest['name']]['cpu_pressure_some_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'some')
guests['guests'][guest['name']]['cpu_pressure_full_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'full')
guests['guests'][guest['name']]['cpu_pressure_some_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'some', spikes=True)
guests['guests'][guest['name']]['cpu_pressure_full_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'full', spikes=True)
guests['guests'][guest['name']]['cpu_pressure_hot'] = False
guests['guests'][guest['name']]['memory_total'] = guest['maxmem']
guests['guests'][guest['name']]['memory_used'] = guest['mem']
guests['guests'][guest['name']]['memory_pressure_some_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'some')
guests['guests'][guest['name']]['memory_pressure_full_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'full')
guests['guests'][guest['name']]['memory_pressure_some_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'some', spikes=True)
guests['guests'][guest['name']]['memory_pressure_full_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'full', spikes=True)
guests['guests'][guest['name']]['memory_pressure_hot'] = False
guests['guests'][guest['name']]['disk_total'] = guest['maxdisk']
guests['guests'][guest['name']]['disk_used'] = guest['disk']
guests['guests'][guest['name']]['disk_pressure_some_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'some')
guests['guests'][guest['name']]['disk_pressure_full_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'full')
guests['guests'][guest['name']]['disk_pressure_some_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'some', spikes=True)
guests['guests'][guest['name']]['disk_pressure_full_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'full', spikes=True)
guests['guests'][guest['name']]['disk_pressure_hot'] = False
guests['guests'][guest['name']]['id'] = guest['vmid']
guests['guests'][guest['name']]['node_current'] = node
guests['guests'][guest['name']]['node_target'] = node
guests['guests'][guest['name']]['processed'] = False
guests['guests'][guest['name']]['pressure_hot'] = False
guests['guests'][guest['name']]['tags'] = Tags.get_tags_from_guests(proxmox_api, node, guest['vmid'], 'vm')
guests['guests'][guest['name']]['affinity_groups'] = Tags.get_affinity_groups(guests['guests'][guest['name']]['tags'])
guests['guests'][guest['name']]['anti_affinity_groups'] = Tags.get_anti_affinity_groups(guests['guests'][guest['name']]['tags'])
@@ -94,15 +109,31 @@ class Guests:
guests['guests'][guest['name']] = {}
guests['guests'][guest['name']]['name'] = guest['name']
guests['guests'][guest['name']]['cpu_total'] = int(guest['cpus'])
guests['guests'][guest['name']]['cpu_used'] = Guests.get_guest_cpu_usage(proxmox_api, node, guest['vmid'], guest['name'])
guests['guests'][guest['name']]['cpu_used'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', None)
guests['guests'][guest['name']]['cpu_pressure_some_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'some')
guests['guests'][guest['name']]['cpu_pressure_full_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'full')
guests['guests'][guest['name']]['cpu_pressure_some_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'some', spikes=True)
guests['guests'][guest['name']]['cpu_pressure_full_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'cpu', 'full', spikes=True)
guests['guests'][guest['name']]['cpu_pressure_hot'] = False
guests['guests'][guest['name']]['memory_total'] = guest['maxmem']
guests['guests'][guest['name']]['memory_used'] = guest['mem']
guests['guests'][guest['name']]['memory_pressure_some_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'some')
guests['guests'][guest['name']]['memory_pressure_full_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'full')
guests['guests'][guest['name']]['memory_pressure_some_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'some', spikes=True)
guests['guests'][guest['name']]['memory_pressure_full_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'memory', 'full', spikes=True)
guests['guests'][guest['name']]['memory_pressure_hot'] = False
guests['guests'][guest['name']]['disk_total'] = guest['maxdisk']
guests['guests'][guest['name']]['disk_used'] = guest['disk']
guests['guests'][guest['name']]['disk_pressure_some_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'some')
guests['guests'][guest['name']]['disk_pressure_full_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'full')
guests['guests'][guest['name']]['disk_pressure_some_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'some', spikes=True)
guests['guests'][guest['name']]['disk_pressure_full_spikes_percent'] = Guests.get_guest_rrd_data(proxmox_api, node, guest['vmid'], guest['name'], 'disk', 'full', spikes=True)
guests['guests'][guest['name']]['disk_pressure_hot'] = False
guests['guests'][guest['name']]['id'] = guest['vmid']
guests['guests'][guest['name']]['node_current'] = node
guests['guests'][guest['name']]['node_target'] = node
guests['guests'][guest['name']]['processed'] = False
guests['guests'][guest['name']]['pressure_hot'] = False
guests['guests'][guest['name']]['tags'] = Tags.get_tags_from_guests(proxmox_api, node, guest['vmid'], 'ct')
guests['guests'][guest['name']]['affinity_groups'] = Tags.get_affinity_groups(guests['guests'][guest['name']]['tags'])
guests['guests'][guest['name']]['anti_affinity_groups'] = Tags.get_anti_affinity_groups(guests['guests'][guest['name']]['tags'])
@@ -118,36 +149,55 @@ class Guests:
return guests
@staticmethod
def get_guest_cpu_usage(proxmox_api, node_name: str, vm_id: int, vm_name: str) -> float:
def get_guest_rrd_data(proxmox_api, node_name: str, vm_id: int, vm_name: str, object_name: str, object_type: str, spikes=False) -> float:
"""
Retrieve the average CPU usage of a guest instance (VM/CT) over the past hour.
This method queries the Proxmox VE API for RRD (Round-Robin Database) data
related to CPU usage of a specific guest instance and calculates the average CPU usage
over the last hour using the "AVERAGE" consolidation function.
Retrieves the rrd data metrics for a specific resource (CPU, memory, disk) of a guest VM or CT.
Args:
proxmox_api: An instance of the Proxmox API client.
node_name (str): The name of the Proxmox node hosting the VM.
vm_id (int): The unique identifier of the guest instance (VM/CT).
vm_name (str): The name of the guest instance (VM/CT).
proxmox_api (Any): The Proxmox API client instance.
node_name (str): The name of the node hosting the guest.
vm_id (int): The ID of the guest VM or CT.
vm_name (str): The name of the guest VM or CT.
object_name (str): The resource type to query (e.g., 'cpu', 'memory', 'disk').
object_type (str, optional): The pressure type ('some', 'full') or None for average usage.
spikes (bool, optional): Whether to consider spikes in the calculation. Defaults to False.
Returns:
float: The average CPU usage as a fraction (0.0 to 1.0) over the past hour.
Returns 0.0 if no data is available.
float: The calculated average usage value for the specified resource.
"""
logger.debug("Finished: get_guest_cpu_usage.")
logger.debug("Starting: get_guest_rrd_data.")
time.sleep(0.1)
try:
logger.debug(f"Getting RRD data for guest: {vm_name}.")
guest_data_rrd = proxmox_api.nodes(node_name).qemu(vm_id).rrddata.get(timeframe="hour", cf="AVERAGE")
if spikes:
logger.debug(f"Getting spike RRD data for {object_name} from guest: {vm_name}.")
guest_data_rrd = proxmox_api.nodes(node_name).qemu(vm_id).rrddata.get(timeframe="hour", cf="MAX")
else:
logger.debug(f"Getting average RRD data for {object_name} from guest: {vm_name}.")
guest_data_rrd = proxmox_api.nodes(node_name).qemu(vm_id).rrddata.get(timeframe="hour", cf="AVERAGE")
except Exception:
logger.error(f"Failed to retrieve RRD data for guest: {vm_name} (ID: {vm_id}) on node: {node_name}. Using 0.0 as CPU usage.")
logger.debug("Finished: get_guest_cpu_usage.")
return 0.0
logger.error(f"Failed to retrieve RRD data for guest: {vm_name} (ID: {vm_id}) on node: {node_name}. Using 0.0 as value.")
logger.debug("Finished: get_guest_rrd_data.")
return 0.0
cpu_usage = sum(entry.get("cpu", 0.0) for entry in guest_data_rrd) / len(guest_data_rrd)
logger.debug(f"CPU RRD data for guest: {vm_name}: {cpu_usage}")
logger.debug("Finished: get_guest_cpu_usage.")
return cpu_usage
if object_type:
lookup_key = f"pressure{object_name}{object_type}"
if spikes:
# RRD data is collected every minute, so we look at the last 6 entries
# and take the maximum value to represent the spike
logger.debug(f"Getting RRD data (spike: {spikes}) of pressure for {object_name} {object_type} from guest: {vm_name}.")
rrd_data_value = [row.get(lookup_key) for row in guest_data_rrd if row.get(lookup_key) is not None]
rrd_data_value = max(rrd_data_value[-6:], default=0.0)
else:
# Calculate the average value from the RRD data entries
logger.debug(f"Getting RRD data (spike: {spikes}) of pressure for {object_name} {object_type} from guest: {vm_name}.")
rrd_data_value = sum(entry.get(lookup_key, 0.0) for entry in guest_data_rrd) / len(guest_data_rrd)
else:
logger.debug(f"Getting RRD data of cpu usage from guest: {vm_name}.")
rrd_data_value = sum(entry.get("cpu", 0.0) for entry in guest_data_rrd) / len(guest_data_rrd)
logger.debug(f"RRD data (spike: {spikes}) for {object_name} from guest: {vm_name}: {rrd_data_value}")
logger.debug("Finished: get_guest_rrd_data.")
return rrd_data_value


@@ -21,6 +21,7 @@ __copyright__ = "Copyright (C) 2025 Florian Paul Azim Hoberg (@gyptazy)"
__license__ = "GPL-3.0"
import time
from typing import Dict, Any
from utils.logger import SystemdLogger
@@ -60,6 +61,8 @@ class Nodes:
if node["status"] == "online" and not Nodes.set_node_ignore(proxlb_config, node["node"]):
nodes["nodes"][node["node"]] = {}
nodes["nodes"][node["node"]]["name"] = node["node"]
nodes["nodes"][node["node"]]["pve_version"] = Nodes.get_node_pve_version(proxmox_api, node["node"])
nodes["nodes"][node["node"]]["pressure_hot"] = False
nodes["nodes"][node["node"]]["maintenance"] = False
nodes["nodes"][node["node"]]["cpu_total"] = node["maxcpu"]
nodes["nodes"][node["node"]]["cpu_assigned"] = 0
@@ -68,6 +71,11 @@ class Nodes:
nodes["nodes"][node["node"]]["cpu_assigned_percent"] = nodes["nodes"][node["node"]]["cpu_assigned"] / nodes["nodes"][node["node"]]["cpu_total"] * 100
nodes["nodes"][node["node"]]["cpu_free_percent"] = nodes["nodes"][node["node"]]["cpu_free"] / node["maxcpu"] * 100
nodes["nodes"][node["node"]]["cpu_used_percent"] = nodes["nodes"][node["node"]]["cpu_used"] / node["maxcpu"] * 100
nodes["nodes"][node["node"]]["cpu_pressure_some_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "cpu", "some")
nodes["nodes"][node["node"]]["cpu_pressure_full_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "cpu", "full")
nodes["nodes"][node["node"]]["cpu_pressure_some_spikes_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "cpu", "some", spikes=True)
nodes["nodes"][node["node"]]["cpu_pressure_full_spikes_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "cpu", "full", spikes=True)
nodes["nodes"][node["node"]]["cpu_pressure_hot"] = False
nodes["nodes"][node["node"]]["memory_total"] = node["maxmem"]
nodes["nodes"][node["node"]]["memory_assigned"] = 0
nodes["nodes"][node["node"]]["memory_used"] = node["mem"]
@@ -75,6 +83,11 @@ class Nodes:
nodes["nodes"][node["node"]]["memory_assigned_percent"] = nodes["nodes"][node["node"]]["memory_assigned"] / nodes["nodes"][node["node"]]["memory_total"] * 100
nodes["nodes"][node["node"]]["memory_free_percent"] = nodes["nodes"][node["node"]]["memory_free"] / node["maxmem"] * 100
nodes["nodes"][node["node"]]["memory_used_percent"] = nodes["nodes"][node["node"]]["memory_used"] / node["maxmem"] * 100
nodes["nodes"][node["node"]]["memory_pressure_some_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "memory", "some")
nodes["nodes"][node["node"]]["memory_pressure_full_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "memory", "full")
nodes["nodes"][node["node"]]["memory_pressure_some_spikes_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "memory", "some", spikes=True)
nodes["nodes"][node["node"]]["memory_pressure_full_spikes_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "memory", "full", spikes=True)
nodes["nodes"][node["node"]]["memory_pressure_hot"] = False
nodes["nodes"][node["node"]]["disk_total"] = node["maxdisk"]
nodes["nodes"][node["node"]]["disk_assigned"] = 0
nodes["nodes"][node["node"]]["disk_used"] = node["disk"]
@@ -82,11 +95,17 @@ class Nodes:
nodes["nodes"][node["node"]]["disk_assigned_percent"] = nodes["nodes"][node["node"]]["disk_assigned"] / nodes["nodes"][node["node"]]["disk_total"] * 100
nodes["nodes"][node["node"]]["disk_free_percent"] = nodes["nodes"][node["node"]]["disk_free"] / node["maxdisk"] * 100
nodes["nodes"][node["node"]]["disk_used_percent"] = nodes["nodes"][node["node"]]["disk_used"] / node["maxdisk"] * 100
nodes["nodes"][node["node"]]["disk_pressure_some_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "disk", "some")
nodes["nodes"][node["node"]]["disk_pressure_full_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "disk", "full")
nodes["nodes"][node["node"]]["disk_pressure_some_spikes_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "disk", "some", spikes=True)
nodes["nodes"][node["node"]]["disk_pressure_full_spikes_percent"] = Nodes.get_node_rrd_data(proxmox_api, node["node"], "disk", "full", spikes=True)
nodes["nodes"][node["node"]]["disk_pressure_hot"] = False
# Evaluate if node should be set to maintenance mode
if Nodes.set_node_maintenance(proxmox_api, proxlb_config, node["node"]):
nodes["nodes"][node["node"]]["maintenance"] = True
logger.debug(f"Node metrics collected: {nodes}")
logger.debug("Finished: get_nodes.")
return nodes
@@ -153,3 +172,83 @@ class Nodes:
return True
logger.debug("Finished: set_node_ignore.")
@staticmethod
def get_node_rrd_data(proxmox_api, node_name: str, object_name: str, object_type: str, spikes=False) -> float:
"""
Retrieves the RRD data metrics for a specific resource (CPU, memory, disk) of a node.
Args:
proxmox_api (Any): The Proxmox API client instance.
node_name (str): The name of the node hosting the guest.
object_name (str): The resource type to query (e.g., 'cpu', 'memory', 'disk').
object_type (str): The pressure type to query ('some' or 'full').
spikes (bool, optional): Whether to consider spikes in the calculation. Defaults to False.
Returns:
float: The average pressure value, or the recent maximum when spikes is True.
"""
logger.debug("Starting: get_node_rrd_data.")
time.sleep(0.1)
try:
if spikes:
logger.debug(f"Getting spike RRD data for {object_name} from node: {node_name}.")
node_data_rrd = proxmox_api.nodes(node_name).rrddata.get(timeframe="hour", cf="MAX")
else:
logger.debug(f"Getting average RRD data for {object_name} from node: {node_name}.")
node_data_rrd = proxmox_api.nodes(node_name).rrddata.get(timeframe="hour", cf="AVERAGE")
except Exception:
logger.error(f"Failed to retrieve RRD data for node: {node_name}. Using 0.0 as value.")
logger.debug("Finished: get_node_rrd_data.")
return 0.0
lookup_key = f"pressure{object_name}{object_type}"
if spikes:
# RRD data is collected every minute, so we look at the last 6 entries
# and take the maximum value to represent the spike
rrd_data_value = [row.get(lookup_key) for row in node_data_rrd if row.get(lookup_key) is not None]
rrd_data_value = max(rrd_data_value[-6:], default=0.0)
else:
# Calculate the average value from the RRD data entries
rrd_data_value = sum(entry.get(lookup_key, 0.0) for entry in node_data_rrd) / max(len(node_data_rrd), 1)
logger.debug(f"RRD data (spike: {spikes}) for {object_name} from node: {node_name}: {rrd_data_value}")
logger.debug("Finished: get_node_rrd_data.")
return rrd_data_value
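The spike handling above can be sketched in isolation: rows without the pressure key are dropped, only the last six samples (roughly the last six minutes, since RRD entries arrive per minute) are considered, and the maximum is taken. The `rows` data below is illustrative, not real API output; the key `pressurecpusome` follows the `f"pressure{object_name}{object_type}"` scheme used in the code.

```python
rows = [
    {"pressurecpusome": 0.05},
    {"pressurecpusome": 0.10},
    {},                          # entries may lack the key entirely
    {"pressurecpusome": 0.80},   # a short spike
    {"pressurecpusome": 0.12},
]

lookup_key = "pressurecpusome"
# Drop rows without the key, keep the last 6 values, take the maximum.
values = [row.get(lookup_key) for row in rows if row.get(lookup_key) is not None]
spike = max(values[-6:], default=0.0)
print(spike)  # 0.8
```

The `default=0.0` argument also keeps the calculation safe when the API returns no usable entries at all.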
@staticmethod
def get_node_pve_version(proxmox_api, node_name: str) -> str:
"""
Return the Proxmox VE (PVE) version for a given node by querying the Proxmox API.
This function calls proxmox_api.nodes(node_name).version.get() and extracts the
'version' field from the returned mapping, a dotted version string such as '9.0.3'.
Args:
proxmox_api (Any): The Proxmox API client instance.
node_name (str): The name of the node to query.
Returns:
str: The PVE version string for the specified node, or '0' if the lookup fails.
"""
logger.debug("Starting: get_node_pve_version.")
time.sleep(0.1)
try:
logger.debug(f"Trying to get PVE version for node: {node_name}.")
version = proxmox_api.nodes(node_name).version.get()
except Exception:
logger.error(f"Failed to get PVE version for node: {node_name}. Using '0' as value.")
logger.debug("Finished: get_node_pve_version.")
return "0"
logger.debug(f"Got version {version['version']} for node {node_name}.")
logger.debug("Finished: get_node_pve_version.")
return version["version"]
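Because the API returns the version as a dotted string (e.g. `"9.0.3"`), the "requires PVE 9 or greater" check mentioned in the changelog needs parsing rather than a plain string comparison. A minimal sketch of such a check; the helper name `pve_version_at_least` is hypothetical and not part of ProxLB:

```python
def pve_version_at_least(version: str, minimum_major: int) -> bool:
    """Return True if a dotted PVE version string meets a minimum major version."""
    try:
        return int(version.split(".")[0]) >= minimum_major
    except (ValueError, IndexError):
        # Unparsable version strings are treated as not meeting the requirement.
        return False

# PSI-based balancing requires PVE 9 or greater.
print(pve_version_at_least("9.0.3", 9))  # True
print(pve_version_at_least("8.2.4", 9))  # False
```

Comparing only the major component is enough here; a lexicographic string comparison would misorder versions such as `"10.0"` vs. `"9.0"`.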