Add DPM (Dynamic Power Management) feature for Proxmox cluster nodes

Fixes: #141
This commit is contained in:
Florian Paul Azim Hoberg
2025-05-08 15:12:44 +02:00
parent 1e096e1aae
commit c56d465f90
9 changed files with 443 additions and 28 deletions


@@ -0,0 +1,2 @@
added:
- Add power management feature for cluster nodes (by @gyptazy) [#141]


@@ -0,0 +1 @@
date: TBD


@@ -46,28 +46,29 @@ Overall, ProxLB significantly enhances resource management by intelligently dist
<img src="https://cdn.gyptazy.com/images/proxlb-rebalancing-demo.gif"/>
## Features
ProxLB's key features include automatic rebalancing of VMs and CTs across a Proxmox cluster based on memory, CPU, and local disk usage while identifying optimal nodes for automation. It supports maintenance mode, affinity rules, and seamless Proxmox API integration with ACL support, offering flexible usage as a one-time operation, a daemon, or through the Proxmox Web GUI. In addition, ProxLB supports enterprise-like features such as power management for nodes (often known as DPM), where nodes can be turned on or off on demand when workloads are higher or lower than usual. Automated security patching of nodes within the cluster (known as ASPM) can also reduce manual work for cluster admins: nodes install patches, guests are moved across the cluster, the node reboots, and the cluster is rebalanced again.
**Features**
* Rebalance VMs/CTs in the cluster by:
* Memory
* Disk (only local storage)
* CPU
* Get best nodes for further automation
* Supported Guest Types
* VMs
* CTs
* Re-Balancing (DRS)
* Supporting VMs & CTs
* Balancing by:
* CPU
* Memory
* Disk
* Affinity / Anti-Affinity Rules
* Affinity: Groups guests together
* Anti-Affinity: Ensuring guests run on different nodes
* Best node evaluation
* Get the best node for guest placement (e.g., CI/CD)
* Maintenance Mode
* Set node(s) into maintenance
* Move all workloads to different nodes
* Affinity / Anti-Affinity Rules
* Evacuating a single or multiple nodes
* Node Power Management (DPM)
* Auto Node Security-Patch-Management (ASPM)
* Fully based on Proxmox API
* Fully integrated into the Proxmox ACL
* No SSH required
* Usage
* One-Time
* Daemon
* Proxmox Web GUI Integration
* Utilizing Proxmox user authentication
* Supporting API tokens
* No SSH or Agents required
* Can run everywhere
## How does it work?
ProxLB is a load-balancing system designed to optimize the distribution of virtual machines (VMs) and containers (CTs) across a cluster. It works by first gathering resource usage metrics from all nodes in the cluster through the Proxmox API. This includes detailed resource metrics for each VM and CT on every node. ProxLB then evaluates the difference between the maximum and minimum resource usage of the nodes, referred to as "Balanciness." If this difference exceeds a predefined threshold (which is configurable), the system initiates the rebalancing process.
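The "Balanciness" check described above reduces to a few lines. The following is a minimal illustration only, not ProxLB's actual implementation; the node names and the default threshold of 10 are assumptions taken from the configuration table:

```python
def needs_rebalancing(usage_percent_by_node, balanciness=10):
    """Return True when the spread between the busiest and the most
    idle node exceeds the configured balanciness threshold."""
    usage = usage_percent_by_node.values()
    return max(usage) - min(usage) > balanciness

# node01 at 80% memory usage, node02 at 55%: delta 25 exceeds 10
print(needs_rebalancing({"node01": 80, "node02": 55}))
```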
@@ -261,6 +262,12 @@ The following options can be set in the configuration file `proxlb.yaml`:
| | balanciness | | 10 | `Int` | The maximum delta of resource usage between the nodes with the highest and lowest usage. |
| | method | | memory | `Str` | The balancing method that should be used. [values: `memory` (default), `cpu`, `disk`]|
| | mode | | used | `Str` | The balancing mode that should be used. [values: `used` (default), `assigned`] |
| `dpm` | | | | | |
| | enable | | True | `Bool` | Enables the Dynamic Power Management functions.|
| | method | | memory | `Str` | The balancing method that should be used. [values: `memory` (default), `cpu`, `disk`]|
| | mode | | static | `Str` | The balancing mode that should be used. [values: `static` (default), `auto`] |
| | cluster_min_free_resources | | 60 | `Int` | The minimum required free resources in percent within the cluster. [values: `60`% (default)] |
| | cluster_min_nodes | | 3 | `Int` | The minimum of required nodes that should remain in a cluster. [values: `3` (default)] |
| `service` | | | | | |
| | daemon | | True | `Bool` | If daemon mode should be activated. |
| | `schedule` | | | `Dict` | Schedule config block for rebalancing. |
@@ -301,6 +308,15 @@ balancing:
method: memory
mode: used
dpm:
# DPM requires you to define the WOL (Wake-on-Lan)
# MAC address for each node in Proxmox.
enable: True
method: memory
mode: static
cluster_min_free_resources: 60
cluster_min_nodes: 1
service:
daemon: True
schedule:


@@ -28,6 +28,13 @@ balancing:
method: memory
mode: used
dpm:
enable: True
method: memory
mode: static
cluster_min_free_resources: 60
cluster_min_nodes: 1
service:
daemon: True
schedule:


@@ -19,6 +19,7 @@
6. [Parallel Migrations](#parallel-migrations)
7. [Run as a Systemd-Service](#run-as-a-systemd-service)
8. [SSL Self-Signed Certificates](#ssl-self-signed-certificates)
9. [Dynamic Power Management (DPM)](#dynamic-power-management-dpm)
## Authentication / User Accounts / Permissions
### Authentication
@@ -207,4 +208,34 @@ proxmox_api:
ssl_verification: False
```
*Note: Disabling SSL certificate validation is not recommended.*
### Dynamic Power Management (DPM)
<img align="left" src="https://cdn.gyptazy.com/images/proxlb-proxmox-node-wakeonlan-wol-mac-dpm.jpg"/> Configuring Dynamic Power Management (DPM) in ProxLB within a Proxmox cluster involves a few critical steps to ensure proper operation. The first consideration is that any node intended for automatic shutdown and startup must support Wake-on-LAN (WOL). This is essential because DPM relies on the ability to power nodes back on remotely. For this to work, the ProxLB instance must be able to reach the target node's MAC address directly over the network.
To make this possible, you must configure the correct MAC address for WOL within the Proxmox web interface. This is done by selecting the node, going to the “System” section, then “Options,” and finally setting the “MAC address for Wake-on-LAN.” Alternatively, this value can also be submitted using the Proxmox API. Without this MAC address in place, ProxLB will not allow the node to be shut down. This restriction is in place to prevent nodes from being turned off without a way to bring them back online, which could lead to service disruption. By ensuring that each node has a valid WOL MAC address configured, DPM can operate safely and effectively, allowing ProxLB to manage the cluster's power consumption dynamically.
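With a proxmoxer client, the same setting can be submitted through the API instead of the web interface. The helper below is a hedged sketch: the `wakeonlan` property on the node config endpoint mirrors the read call used by this commit's validation code, but `set_wol_mac` itself is a hypothetical convenience wrapper, not part of ProxLB:

```python
import re

def set_wol_mac(proxmox_api, node: str, mac: str) -> None:
    """Hypothetical helper: validate a MAC address and store it as the
    node's Wake-on-LAN address via PUT /nodes/{node}/config."""
    if not re.fullmatch(r"([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}", mac):
        raise ValueError(f"Invalid MAC address: {mac}")
    proxmox_api.nodes(node).config.put(wakeonlan=mac)
```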
#### Requirements
Using the power management feature within clusters comes with several requirements:
* ProxLB needs to reach the WOL MAC address of the node (plain network)
* WOL must be enabled on the node in general (BIOS/UEFI)
* The related WOL network interface must be defined
* The related WOL network interface MAC address must be defined in Proxmox for the node
#### Options
| Section | Option | Sub Option | Example | Type | Description |
|---------|:------:|:----------:|:-------:|:----:|:-----------:|
| `dpm` | | | | | |
| | enable | | True | `Bool` | Enables the Dynamic Power Management functions.|
| | method | | memory | `Str` | The balancing method that should be used. [values: `memory` (default), `cpu`, `disk`]|
| | mode | | static | `Str` | The balancing mode that should be used. [values: `static` (default), `auto`] |
| | cluster_min_free_resources | | 60 | `Int` | The minimum required free resources in percent within the cluster. [values: `60`% (default)] |
| | cluster_min_nodes | | 3 | `Int` | The minimum of required nodes that should remain in a cluster. [values: `3` (default)] |
#### DPM Modes
##### Static
Static mode in DPM lets you set a fixed number of nodes that should always stay powered on in a Proxmox cluster. This is important to keep the cluster working properly, since you need at least three nodes to maintain quorum. The system won't let you go below that limit to avoid breaking cluster functionality.
Besides the minimum number of active nodes, you can also define a baseline for how many free resources (such as CPU or RAM) should always be available while the virtual machines are running. If the available resources drop below that level, ProxLB will try to power on more nodes, as long as they're available and can be started. On the other hand, if the cluster has more than enough resources, ProxLB will begin to shut down nodes again, but only until the free resource threshold is reached.
This mode gives you a more stable setup by always keeping a minimum number of nodes ready while still adjusting the rest of the cluster based on resource usage, but in a controlled and predictable way.
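The static-mode behaviour above can be sketched as a simple planning loop. This is an illustration of the idea, not ProxLB's implementation; the memory-based accounting, the node dictionaries, and the assumption that a shut-down node's guests migrate onto the remaining nodes are all simplifications:

```python
def plan_shutdowns(nodes, min_free_percent=60, min_nodes=3):
    """Return node names that can be powered off, most idle first,
    while the remaining cluster keeps at least min_free_percent of
    its memory free and at least min_nodes nodes stay online.

    nodes: {name: {"memory_total": ..., "memory_free": ...}}
    """
    online = dict(nodes)
    moved_used = 0  # memory of guests migrated off shut-down nodes
    to_shutdown = []
    # Try the most idle nodes first to keep migrations low
    for name in sorted(nodes, key=lambda k: nodes[k]["memory_free"], reverse=True):
        if len(online) <= min_nodes:
            break
        candidate = online[name]
        rest = {k: v for k, v in online.items() if k != name}
        total = sum(v["memory_total"] for v in rest.values())
        used = sum(v["memory_total"] - v["memory_free"] for v in rest.values())
        # Guests from already-removed nodes and this candidate must fit
        used += moved_used + (candidate["memory_total"] - candidate["memory_free"])
        if total and (total - used) / total * 100 > min_free_percent:
            online = rest
            moved_used += candidate["memory_total"] - candidate["memory_free"]
            to_shutdown.append(name)
    return to_shutdown
```

With four equally idle nodes (100 GB total, 90 GB free each) and a minimum of two nodes, two of them qualify for shutdown; raising the free-resource threshold to 90% prevents any shutdown.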


@@ -17,6 +17,7 @@ from utils.logger import SystemdLogger
from utils.cli_parser import CliParser
from utils.config_parser import ConfigParser
from utils.proxmox_api import ProxmoxApi
from models.dpm import DPM
from models.nodes import Nodes
from models.guests import Guests
from models.groups import Groups
@@ -53,14 +54,17 @@ def main():
while True:
# Get all required objects from the Proxmox cluster
meta = {"meta": proxlb_config}
nodes = Nodes.get_nodes(proxmox_api, proxlb_config)
nodes, cluster = Nodes.get_nodes(proxmox_api, proxlb_config)
guests = Guests.get_guests(proxmox_api, nodes, meta)
groups = Groups.get_groups(guests, nodes)
# Merge obtained objects from the Proxmox cluster for further usage
proxlb_data = {**meta, **nodes, **guests, **groups}
proxlb_data = {**meta, **cluster, **nodes, **guests, **groups}
Helper.log_node_metrics(proxlb_data)
# Evaluate the dynamic power management for nodes in the cluster
DPM(proxlb_data)
# Update the initial node resource assignments
# by the previously created groups.
Calculations.set_node_assignments(proxlb_data)
@@ -70,10 +74,14 @@ def main():
Calculations.relocate_guests(proxlb_data)
Helper.log_node_metrics(proxlb_data, init=False)
# Perform balancing actions via Proxmox API
# Perform balancing
if not cli_args.dry_run and proxlb_data["meta"]["balancing"].get("enable", False):
Balancing(proxmox_api, proxlb_data)
# Perform DPM
if not cli_args.dry_run:
DPM.dpm_shutdown_nodes(proxmox_api, proxlb_data)
# Validate if the JSON output should be
# printed to stdout
Helper.print_json(proxlb_data, cli_args.json)


@@ -162,7 +162,7 @@ class Calculations:
logger.debug("Finished: get_most_free_node.")
@staticmethod
def relocate_guests_on_maintenance_nodes(proxlb_data: Dict[str, Any]):
def relocate_guests_on_maintenance_nodes(proxlb_data: Dict[str, Any]) -> None:
"""
Relocates guests that are currently on nodes marked for maintenance to
nodes with the most available resources.
@@ -192,7 +192,7 @@ class Calculations:
logger.debug("Finished: get_most_free_node.")
@staticmethod
def relocate_guests(proxlb_data: Dict[str, Any]):
def relocate_guests(proxlb_data: Dict[str, Any]) -> None:
"""
Relocates guests within the provided data structure to ensure affinity groups are
placed on nodes with the most free resources.
@@ -231,7 +231,7 @@ class Calculations:
logger.debug("Finished: relocate_guests.")
@staticmethod
def val_anti_affinity(proxlb_data: Dict[str, Any], guest_name: str):
def val_anti_affinity(proxlb_data: Dict[str, Any], guest_name: str) -> None:
"""
Validates and assigns nodes to guests based on anti-affinity rules.
@@ -280,7 +280,7 @@ class Calculations:
logger.debug("Finished: val_anti_affinity.")
@staticmethod
def val_node_relationship(proxlb_data: Dict[str, Any], guest_name: str):
def val_node_relationship(proxlb_data: Dict[str, Any], guest_name: str) -> None:
"""
Validates and assigns guests to nodes based on defined relationships based on tags.
@@ -311,7 +311,7 @@ class Calculations:
logger.debug("Finished: val_node_relationship.")
@staticmethod
def update_node_resources(proxlb_data):
def update_node_resources(proxlb_data: Dict[str, Any]) -> None:
"""
Updates the resource allocation and usage statistics for nodes when a guest
is moved from one node to another.
@@ -375,3 +375,68 @@ class Calculations:
logger.debug(f"Set guest {guest_name} from node {node_current} to node {node_target}.")
logger.debug("Finished: update_node_resources.")
@staticmethod
def update_cluster_resources(proxlb_data: Dict[str, Any], node: str, action: str) -> None:
"""
Updates the cluster resource statistics based on the specified action and node.
This method modifies the cluster-level resource data (such as CPU, memory, disk usage,
and node counts) based on the action performed ('add' or 'remove') for the specified node.
It calculates the updated statistics after adding or removing a node and logs the results.
Parameters:
proxlb_data (Dict[str, Any]): The data representing the current state of the cluster,
including node-level statistics for CPU, memory, and disk.
node (str): The identifier of the node whose resources are being added or removed from the cluster.
action (str): The action to perform, either 'add' or 'remove'. 'add' will include the node's
resources in the cluster, while 'remove' will exclude the node's resources.
Returns:
None: The function modifies the `proxlb_data` dictionary in place to update the cluster resources.
"""
logger.debug("Starting: update_cluster_resources.")
logger.debug(f"DPM: Updating cluster statistics by online node {node}. Action: {action}")
logger.debug(f"DPM: update_cluster_resources - Before {action}: {proxlb_data['cluster']['memory_free_percent']}")
if action == "add":
proxlb_data["cluster"]["node_count"] = proxlb_data["cluster"].get("node_count", 0) + 1
proxlb_data["cluster"]["cpu_total"] = proxlb_data["cluster"].get("cpu_total", 0) + proxlb_data["nodes"][node]["cpu_total"]
proxlb_data["cluster"]["cpu_used"] = proxlb_data["cluster"].get("cpu_used", 0) + proxlb_data["nodes"][node]["cpu_used"]
proxlb_data["cluster"]["cpu_free"] = proxlb_data["cluster"].get("cpu_free", 0) + proxlb_data["nodes"][node]["cpu_free"]
proxlb_data["cluster"]["cpu_free_percent"] = proxlb_data["cluster"].get("cpu_free", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
proxlb_data["cluster"]["cpu_used_percent"] = proxlb_data["cluster"].get("cpu_used", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
proxlb_data["cluster"]["memory_total"] = proxlb_data["cluster"].get("memory_total", 0) + proxlb_data["nodes"][node]["memory_total"]
proxlb_data["cluster"]["memory_used"] = proxlb_data["cluster"].get("memory_used", 0) + proxlb_data["nodes"][node]["memory_used"]
proxlb_data["cluster"]["memory_free"] = proxlb_data["cluster"].get("memory_free", 0) + proxlb_data["nodes"][node]["memory_free"]
proxlb_data["cluster"]["memory_free_percent"] = proxlb_data["cluster"].get("memory_free", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
proxlb_data["cluster"]["memory_used_percent"] = proxlb_data["cluster"].get("memory_used", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
proxlb_data["cluster"]["disk_total"] = proxlb_data["cluster"].get("disk_total", 0) + proxlb_data["nodes"][node]["disk_total"]
proxlb_data["cluster"]["disk_used"] = proxlb_data["cluster"].get("disk_used", 0) + proxlb_data["nodes"][node]["disk_used"]
proxlb_data["cluster"]["disk_free"] = proxlb_data["cluster"].get("disk_free", 0) + proxlb_data["nodes"][node]["disk_free"]
proxlb_data["cluster"]["disk_free_percent"] = proxlb_data["cluster"].get("disk_free", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
proxlb_data["cluster"]["disk_used_percent"] = proxlb_data["cluster"].get("disk_used", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
proxlb_data["cluster"]["node_count_available"] = proxlb_data["cluster"].get("node_count_available", 0) + 1
proxlb_data["cluster"]["node_count_overall"] = proxlb_data["cluster"].get("node_count_overall", 0) + 1
if action == "remove":
proxlb_data["cluster"]["node_count"] = proxlb_data["cluster"].get("node_count", 0) - 1
proxlb_data["cluster"]["cpu_total"] = proxlb_data["cluster"].get("cpu_total", 0) - proxlb_data["nodes"][node]["cpu_total"]
proxlb_data["cluster"]["cpu_used"] = proxlb_data["cluster"].get("cpu_used", 0) - proxlb_data["nodes"][node]["cpu_used"]
proxlb_data["cluster"]["cpu_free"] = proxlb_data["cluster"].get("cpu_free", 0) - proxlb_data["nodes"][node]["cpu_free"]
proxlb_data["cluster"]["cpu_free_percent"] = proxlb_data["cluster"].get("cpu_free", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
proxlb_data["cluster"]["cpu_used_percent"] = proxlb_data["cluster"].get("cpu_used", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
proxlb_data["cluster"]["memory_total"] = proxlb_data["cluster"].get("memory_total", 0) - proxlb_data["nodes"][node]["memory_total"]
proxlb_data["cluster"]["memory_used"] = proxlb_data["cluster"].get("memory_used", 0) - proxlb_data["nodes"][node]["memory_used"]
proxlb_data["cluster"]["memory_free"] = proxlb_data["cluster"].get("memory_free", 0) - proxlb_data["nodes"][node]["memory_free"]
proxlb_data["cluster"]["memory_free_percent"] = proxlb_data["cluster"].get("memory_free", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
proxlb_data["cluster"]["memory_used_percent"] = proxlb_data["cluster"].get("memory_used", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
proxlb_data["cluster"]["disk_total"] = proxlb_data["cluster"].get("disk_total", 0) - proxlb_data["nodes"][node]["disk_total"]
proxlb_data["cluster"]["disk_used"] = proxlb_data["cluster"].get("disk_used", 0) - proxlb_data["nodes"][node]["disk_used"]
proxlb_data["cluster"]["disk_free"] = proxlb_data["cluster"].get("disk_free", 0) - proxlb_data["nodes"][node]["disk_free"]
proxlb_data["cluster"]["disk_free_percent"] = proxlb_data["cluster"].get("disk_free", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
proxlb_data["cluster"]["disk_used_percent"] = proxlb_data["cluster"].get("disk_used", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
proxlb_data["cluster"]["node_count_available"] = proxlb_data["cluster"].get("node_count_available", 0) - 1
logger.debug(f"DPM: update_cluster_resources - After {action}: {proxlb_data['cluster']['memory_free_percent']}")
logger.debug("Finished: update_cluster_resources.")
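Since the `add` and `remove` branches of `update_cluster_resources` differ only in sign, the same bookkeeping can be condensed. The following is an illustrative sketch, not the committed implementation; it also adds a division-by-zero guard for the edge case where the last node is removed:

```python
def update_cluster(cluster, node_stats, action):
    """Add ('add') or subtract ('remove') one node's resources from
    the cluster totals and recompute the derived percentages."""
    sign = 1 if action == "add" else -1
    for res in ("cpu", "memory", "disk"):
        for kind in ("total", "used", "free"):
            key = f"{res}_{kind}"
            cluster[key] = cluster.get(key, 0) + sign * node_stats[key]
        total = cluster[f"{res}_total"]
        # Guard against division by zero when the last node is removed
        cluster[f"{res}_free_percent"] = cluster[f"{res}_free"] / total * 100 if total else 0
        cluster[f"{res}_used_percent"] = cluster[f"{res}_used"] / total * 100 if total else 0
    cluster["node_count_available"] = cluster.get("node_count_available", 0) + sign
    return cluster
```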

proxlb/models/dpm.py (new file)

@@ -0,0 +1,255 @@
"""
The DPM (Dynamic Power Management) class is responsible for the dynamic management
of nodes within a Proxmox cluster, optimizing resource utilization by controlling
node power states based on specified schedules and conditions.
This class provides functionality for:
- Tracking and validating schedules for dynamic power management.
- Shutting down nodes that are underutilized or not needed.
- Starting up nodes using Wake-on-LAN (WOL) based on certain conditions.
- Ensuring that nodes are properly flagged for maintenance and startup/shutdown actions.
The DPM class can operate in different modes, such as static and automatic,
to either perform predefined actions or dynamically adjust based on real-time resource usage.
"""
__author__ = "Florian Paul Azim Hoberg <gyptazy>"
__copyright__ = "Copyright (C) 2025 Florian Paul Azim Hoberg (@gyptazy)"
__license__ = "GPL-3.0"
import proxmoxer
from typing import Dict, Any
from models.calculations import Calculations
from utils.logger import SystemdLogger
logger = SystemdLogger()
class DPM:
"""
The DPM (Dynamic Power Management) class is responsible for the dynamic management
of nodes within a Proxmox cluster, optimizing resource utilization by controlling
node power states based on specified schedules and conditions.
This class provides functionality for:
- Tracking and validating schedules for dynamic power management.
- Shutting down nodes that are underutilized or not needed.
- Starting up nodes using Wake-on-LAN (WOL) based on certain conditions.
- Ensuring that nodes are properly flagged for maintenance and startup/shutdown actions.
The DPM class can operate in different modes, such as static and automatic,
to either perform predefined actions or dynamically adjust based on real-time resource usage.
Attributes:
None directly defined for the class; instead, all actions are based on input data
and interactions with the Proxmox API and other helper functions.
Methods:
__init__(proxlb_data: Dict[str, Any]):
Initializes the DPM class, checking whether DPM is enabled and operating in the
appropriate mode (static or auto).
dpm_static(proxlb_data: Dict[str, Any]) -> None:
Evaluates the cluster's resource availability and performs static power management
actions by removing nodes that are not required.
dpm_shutdown_nodes(proxmox_api, proxlb_data) -> None:
Shuts down nodes flagged for DPM shutdown by using the Proxmox API, ensuring
that Wake-on-LAN (WOL) is available for proper node recovery.
dpm_startup_nodes(proxmox_api, proxlb_data) -> None:
Powers on nodes that are flagged for startup and are not in maintenance mode,
leveraging Wake-on-LAN (WOL) functionality.
dpm_validate_wol_mac(proxmox_api, node) -> None:
Validates and retrieves the Wake-on-LAN (WOL) MAC address for a given node,
ensuring that a valid address is set for powering on the node remotely.
"""
def __init__(self, proxlb_data: Dict[str, Any]):
"""
Initializes the DPM class with the provided ProxLB data.
Args:
proxlb_data (dict): The data required for balancing VMs and CTs.
"""
logger.debug("Starting: dpm class.")
if proxlb_data["meta"].get("dpm", {}).get("enable", False):
logger.debug("DPM function is enabled.")
mode = proxlb_data["meta"].get("dpm", {}).get("mode", None)
if mode == "static":
self.dpm_static(proxlb_data)
if mode == "auto":
self.dpm_auto(proxlb_data)
else:
logger.debug("DPM function is not enabled.")
logger.debug("Finished: dpm class.")
def dpm_static(self, proxlb_data: Dict[str, Any]) -> None:
"""
Evaluates and performs static Distributed Power Management (DPM) actions based on current cluster state.
This method monitors cluster resource availability and attempts to reduce the number of active nodes
when sufficient free resources are available. It ensures a minimum number of nodes remains active
and prioritizes shutting down nodes with the least utilized resources to minimize impact. Nodes selected
for shutdown are marked for maintenance and flagged for DPM shutdown.
Parameters:
proxlb_data (Dict[str, Any]): A dictionary containing metadata, cluster status, and node-level information
including resource utilization, configuration settings, and DPM thresholds.
Returns:
None: Modifies the input dictionary in-place to reflect updated cluster state and node flags.
"""
logger.debug("Starting: dpm_static.")
method = proxlb_data["meta"].get("dpm", {}).get("method", "memory")
cluster_nodes_overall = proxlb_data["cluster"]["node_count_overall"]
cluster_nodes_available = proxlb_data["cluster"]["node_count_available"]
cluster_free_resources_percent = int(proxlb_data["cluster"][f"{method}_free_percent"])
cluster_free_resources_req_min = proxlb_data["meta"].get("dpm", {}).get("cluster_min_free_resources", 0)
cluster_min_nodes = proxlb_data["meta"].get("dpm", {}).get("cluster_min_nodes", 3)
logger.debug(f"DPM: Cluster Nodes: {cluster_nodes_overall} | Nodes available: {cluster_nodes_available} | Nodes offline: {cluster_nodes_overall - cluster_nodes_available}")
# Only proceed removing nodes if the cluster has enough resources
while cluster_free_resources_percent > cluster_free_resources_req_min:
logger.debug(f"DPM: More free resources {cluster_free_resources_percent}% available than required: {cluster_free_resources_req_min}%. DPM evaluation starting...")
# Ensure that we have at least a defined minimum of nodes left
if cluster_nodes_available > cluster_min_nodes:
logger.debug(f"DPM: A minimum of {cluster_min_nodes} nodes is required. {cluster_nodes_available} are available. Proceeding...")
# Get the node with the fewest used resources to keep migrations low
Calculations.get_most_free_node(proxlb_data, False)
dpm_node = proxlb_data["meta"]["balancing"]["balance_next_node"]
# Perform cluster calculation for evaluating how many nodes can safely leave
# the cluster. Further object calculations are processed afterwards by
# the calculation class.
logger.debug(f"DPM: Removing node {dpm_node} from cluster. Node will be turned off later.")
Calculations.update_cluster_resources(proxlb_data, dpm_node, "remove")
cluster_free_resources_percent = int(proxlb_data["cluster"][f"{method}_free_percent"])
# Refresh the available node count; update_cluster_resources decrements it
cluster_nodes_available = proxlb_data["cluster"]["node_count_available"]
logger.debug(f"DPM: Free cluster resources changed to: {cluster_free_resources_percent}%.")
# Set node to maintenance and DPM shutdown
proxlb_data["nodes"][dpm_node]["maintenance"] = True
proxlb_data["nodes"][dpm_node]["dpm_shutdown"] = True
else:
logger.warning(f"DPM: A minimum of {cluster_min_nodes} nodes is required but only {cluster_nodes_available} are available. Cannot proceed!")
# Break out to avoid looping forever once the node minimum is reached
break
logger.debug(f"DPM: Free resources {cluster_free_resources_percent}% no longer exceed the required minimum of {cluster_free_resources_req_min}%. DPM evaluation stopped.")
logger.debug("Finished: dpm_static.")
@staticmethod
def dpm_shutdown_nodes(proxmox_api, proxlb_data: Dict[str, Any]) -> None:
"""
Shuts down cluster nodes that are marked for maintenance and flagged for DPM shutdown.
This method iterates through the cluster nodes in the provided data and attempts to
power off any node that has both the 'maintenance' and 'dpm_shutdown' flags set.
It communicates with the Proxmox API to issue shutdown commands and logs any failures.
Parameters:
proxmox_api: An instance of the Proxmox API client used to issue node shutdown commands.
proxlb_data: A dictionary containing node status information, including flags for
maintenance and DPM shutdown readiness.
Returns:
None: Performs shutdown operations and logs outcomes; modifies no data directly.
"""
logger.debug("Starting: dpm_shutdown_nodes.")
for node, node_info in proxlb_data["nodes"].items():
if node_info["maintenance"] and node_info["dpm_shutdown"]:
logger.debug(f"DPM: Node: {node} is flagged as maintenance mode and to be powered off.")
# Ensure that the node has a valid WOL MAC defined. If not
# we would be unable to power on that system again
valid_wol_mac = DPM.dpm_validate_wol_mac(proxmox_api, node)
if valid_wol_mac:
try:
logger.debug(f"DPM: Shutting down node: {node}.")
job_id = proxmox_api.nodes(node).status.post(command="shutdown")
logger.debug(f"DPM: Shutdown of node {node} triggered. Job-id: {job_id}")
except proxmoxer.core.ResourceException as proxmox_api_error:
logger.critical(f"DPM: Error while powering off node {node}: {proxmox_api_error}")
else:
logger.critical(f"DPM: Node {node} cannot be powered off due to missing WOL MAC. Please define a valid WOL MAC for this node.")
logger.debug("Finished: dpm_shutdown_nodes.")
@staticmethod
def dpm_startup_nodes(proxmox_api, proxlb_data: Dict[str, Any]) -> None:
"""
Starts up cluster nodes that are marked for DPM startup.
This method iterates through the cluster nodes in the provided data and attempts to
power on any node that is not flagged as 'maintenance' but flagged as 'dpm_startup'.
It communicates with the Proxmox API to issue poweron commands and logs any failures.
Parameters:
proxmox_api: An instance of the Proxmox API client used to issue node startup commands.
proxlb_data: A dictionary containing node status information, including flags for
maintenance and DPM shutdown readiness.
Returns:
None: Performs poweron operations and logs outcomes; modifies no data directly.
"""
logger.debug("Starting: dpm_startup_nodes.")
for node, node_info in proxlb_data["nodes"].items():
if not node_info["maintenance"]:
logger.debug(f"DPM: Node: {node} is not in maintenance mode.")
if node_info["dpm_startup"]:
logger.debug(f"DPM: Node: {node} is flagged as to be started.")
try:
logger.debug(f"DPM: Powering on node: {node}.")
# Important: This requires Proxmox operators to define the
# WOL MAC address for each node within the Proxmox web interface.
job_id = proxmox_api.nodes(node).wakeonlan.post()
logger.debug(f"DPM: Power-on of node {node} triggered. Job-id: {job_id}")
except proxmoxer.core.ResourceException as proxmox_api_error:
logger.critical(f"DPM: Error while powering on node {node}: {proxmox_api_error}")
logger.debug("Finished: dpm_startup_nodes.")
@staticmethod
def dpm_validate_wol_mac(proxmox_api, node: str) -> str:
"""
Retrieves and validates the Wake-on-LAN (WOL) MAC address for a specified node.
This method fetches the MAC address configured for Wake-on-LAN (WOL) from the Proxmox API.
If the MAC address is found, it is logged. In case of failure to retrieve the address,
a critical log is generated indicating the absence of a WOL MAC address for the node.
Parameters:
proxmox_api: An instance of the Proxmox API client used to query node configurations.
node: The identifier (name or ID) of the node for which the WOL MAC address is to be validated.
Returns:
node_wol_mac_address: The WOL MAC address for the specified node if found, otherwise `None`.
"""
logger.debug("Starting: dpm_validate_wol_mac.")
try:
logger.debug(f"DPM: Getting WOL MAC address for node {node} from API.")
node_wol_mac_address = proxmox_api.nodes(node).config.get(property="wakeonlan")
node_wol_mac_address = node_wol_mac_address.get("wakeonlan")
logger.debug(f"DPM: Node {node} has MAC address: {node_wol_mac_address} for WOL.")
except proxmoxer.core.ResourceException as proxmox_api_error:
logger.debug(f"DPM: Failed to get WOL MAC address for node {node} from API: {proxmox_api_error}")
node_wol_mac_address = None
logger.critical(f"DPM: Node {node} has no MAC address defined for WOL.")
logger.debug("Finished: dpm_validate_wol_mac.")
return node_wol_mac_address
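ProxLB powers nodes on through the Proxmox `wakeonlan` API endpoint rather than crafting packets itself. For background on why the WOL MAC address and plain-network reachability matter, a magic packet is simply 6 bytes of 0xFF followed by the MAC repeated 16 times, broadcast over UDP; a minimal sketch:

```python
import socket

def wol_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF followed by
    the target MAC address repeated 16 times (102 bytes in total)."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"Invalid MAC address: {mac}")
    return b"\xff" * 6 + raw * 16

def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet on the local segment (UDP discard port 9)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(wol_magic_packet(mac), (broadcast, port))
```

Because the packet is a broadcast frame, the ProxLB host (or the Proxmox node issuing the wake) must sit on a segment from which the target's NIC is reachable, which is exactly the "plain network" requirement listed above.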


@@ -54,6 +54,7 @@ class Nodes:
"""
logger.debug("Starting: get_nodes.")
nodes = {"nodes": {}}
cluster = {"cluster": {}}
for node in proxmox_api.nodes.get():
# Ignoring a node results into ignoring all placed guests on the ignored node!
@@ -61,6 +62,8 @@ class Nodes:
nodes["nodes"][node["node"]] = {}
nodes["nodes"][node["node"]]["name"] = node["node"]
nodes["nodes"][node["node"]]["maintenance"] = False
nodes["nodes"][node["node"]]["dpm_shutdown"] = False
nodes["nodes"][node["node"]]["dpm_startup"] = False
nodes["nodes"][node["node"]]["cpu_total"] = node["maxcpu"]
nodes["nodes"][node["node"]]["cpu_assigned"] = 0
nodes["nodes"][node["node"]]["cpu_used"] = node["cpu"] * node["maxcpu"]
@@ -87,8 +90,35 @@ class Nodes:
if Nodes.set_node_maintenance(proxlb_config, node["node"]):
nodes["nodes"][node["node"]]["maintenance"] = True
# Generate the initial cluster statistics within the same loop to avoid a further one.
logger.debug(f"Updating cluster statistics by online node {node['node']}.")
cluster["cluster"]["node_count"] = cluster["cluster"].get("node_count", 0) + 1
cluster["cluster"]["cpu_total"] = cluster["cluster"].get("cpu_total", 0) + nodes["nodes"][node["node"]]["cpu_total"]
cluster["cluster"]["cpu_used"] = cluster["cluster"].get("cpu_used", 0) + nodes["nodes"][node["node"]]["cpu_used"]
cluster["cluster"]["cpu_free"] = cluster["cluster"].get("cpu_free", 0) + nodes["nodes"][node["node"]]["cpu_free"]
cluster["cluster"]["cpu_free_percent"] = cluster["cluster"].get("cpu_free", 0) / cluster["cluster"].get("cpu_total", 0) * 100
cluster["cluster"]["cpu_used_percent"] = cluster["cluster"].get("cpu_used", 0) / cluster["cluster"].get("cpu_total", 0) * 100
cluster["cluster"]["memory_total"] = cluster["cluster"].get("memory_total", 0) + nodes["nodes"][node["node"]]["memory_total"]
cluster["cluster"]["memory_used"] = cluster["cluster"].get("memory_used", 0) + nodes["nodes"][node["node"]]["memory_used"]
cluster["cluster"]["memory_free"] = cluster["cluster"].get("memory_free", 0) + nodes["nodes"][node["node"]]["memory_free"]
cluster["cluster"]["memory_free_percent"] = cluster["cluster"].get("memory_free", 0) / cluster["cluster"].get("memory_total", 0) * 100
cluster["cluster"]["memory_used_percent"] = cluster["cluster"].get("memory_used", 0) / cluster["cluster"].get("memory_total", 0) * 100
cluster["cluster"]["disk_total"] = cluster["cluster"].get("disk_total", 0) + nodes["nodes"][node["node"]]["disk_total"]
cluster["cluster"]["disk_used"] = cluster["cluster"].get("disk_used", 0) + nodes["nodes"][node["node"]]["disk_used"]
cluster["cluster"]["disk_free"] = cluster["cluster"].get("disk_free", 0) + nodes["nodes"][node["node"]]["disk_free"]
cluster["cluster"]["disk_free_percent"] = cluster["cluster"].get("disk_free", 0) / cluster["cluster"].get("disk_total", 0) * 100
cluster["cluster"]["disk_used_percent"] = cluster["cluster"].get("disk_used", 0) / cluster["cluster"].get("disk_total", 0) * 100
cluster["cluster"]["node_count_available"] = cluster["cluster"].get("node_count_available", 0) + 1
cluster["cluster"]["node_count_overall"] = cluster["cluster"].get("node_count_overall", 0) + 1
# Update the cluster statistics by offline nodes to have the overall count of nodes in the cluster
else:
logger.debug(f"Updating cluster statistics by offline node {node['node']}.")
cluster["cluster"]["node_count_overall"] = cluster["cluster"].get("node_count_overall", 0) + 1
logger.debug("Finished: get_nodes.")
return nodes
return nodes, cluster
@staticmethod
def set_node_maintenance(proxlb_config: Dict[str, Any], node_name: str) -> Dict[str, Any]: