mirror of
https://github.com/gyptazy/ProxLB.git
synced 2026-04-05 20:31:57 +02:00
Add DPM (Dynamic Power Management) feature for Proxmox cluster nodes
Fixes: #141
@@ -0,0 +1,2 @@
added:
- Add power management feature for cluster nodes (by @gyptazy) [#141]
1
.changelogs/1.2.0/release_meta.yml
Normal file
@@ -0,0 +1 @@
date: TBD
52
README.md
@@ -46,28 +46,29 @@ Overall, ProxLB significantly enhances resource management by intelligently dist
<img src="https://cdn.gyptazy.com/images/proxlb-rebalancing-demo.gif"/>

## Features
ProxLB's key features include automatic rebalancing of VMs and CTs across a Proxmox cluster based on memory, CPU, and local disk usage while identifying optimal nodes for automation. It supports maintenance mode, affinity rules, and seamless Proxmox API integration with ACL support, offering flexible usage as a one-time operation, a daemon, or through the Proxmox Web GUI.
ProxLB's key features include automatic rebalancing of VMs and CTs across a Proxmox cluster based on memory, CPU, and local disk usage while identifying optimal nodes for automation. It supports maintenance mode, affinity rules, and seamless Proxmox API integration with ACL support, offering flexible usage as a one-time operation, a daemon, or through the Proxmox Web GUI. In addition, ProxLB supports enterprise-like features such as power management for nodes (often known as DPM), where nodes can be turned on or off on demand when workloads are higher or lower than usual. The automated security patching of nodes within the cluster (known as ASPM) can also reduce manual work for cluster admins: nodes install patches, move guests across the cluster, reboot, and then rebalance the cluster again.

**Features**
* Rebalance VMs/CTs in the cluster by:
  * Memory
  * Disk (only local storage)
  * CPU
* Get best nodes for further automation
* Supported Guest Types
  * VMs
  * CTs
* Re-Balancing (DRS)
  * Supporting VMs & CTs
  * Balancing by:
    * CPU
    * Memory
    * Disk
  * Affinity / Anti-Affinity Rules
    * Affinity: Groups guests together
    * Anti-Affinity: Ensures guests run on different nodes
* Best node evaluation
  * Get the best node for guest placement (e.g., CI/CD)
* Maintenance Mode
  * Set node(s) into maintenance
  * Move all workloads to different nodes
  * Affinity / Anti-Affinity Rules
  * Evacuating a single node or multiple nodes
* Node Power Management (DPM)
* Auto Node Security-Patch-Management (ASPM)
* Fully based on Proxmox API
  * Fully integrated into the Proxmox ACL
  * No SSH required
* Usage
  * One-Time
  * Daemon
  * Proxmox Web GUI Integration
* Utilizing the Proxmox User Authentication
  * Supporting API tokens
* No SSH or Agents required
  * Can run everywhere

## How does it work?
ProxLB is a load-balancing system designed to optimize the distribution of virtual machines (VMs) and containers (CTs) across a cluster. It works by first gathering resource usage metrics from all nodes in the cluster through the Proxmox API, including detailed resource metrics for each VM and CT on every node. ProxLB then evaluates the difference between the maximum and minimum resource usage of the nodes, referred to as "Balanciness." If this difference exceeds a predefined (configurable) threshold, the system initiates the rebalancing process.
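The Balanciness check described above can be sketched in a few lines (an illustrative simplification, not ProxLB's actual implementation; the function name and the threshold default are hypothetical):

```python
def needs_rebalancing(node_usage_percent, threshold=10):
    """Return True when the spread between the busiest and the least
    busy node (the "Balanciness") exceeds the configured threshold."""
    balanciness = max(node_usage_percent) - min(node_usage_percent)
    return balanciness > threshold

# Three nodes at 70%, 40% and 55% memory usage: the spread is 30
# points, which exceeds the default threshold of 10.
print(needs_rebalancing([70, 40, 55]))  # True
```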
@@ -261,6 +262,12 @@ The following options can be set in the configuration file `proxlb.yaml`:
| | balanciness | | 10 | `Int` | The maximum delta of resource usage between the node with the highest and the node with the lowest usage. |
| | method | | memory | `Str` | The balancing method that should be used. [values: `memory` (default), `cpu`, `disk`]|
| | mode | | used | `Str` | The balancing mode that should be used. [values: `used` (default), `assigned`] |
| `dpm` | | | | | |
| | enable | | True | `Bool` | Enables the Dynamic Power Management functions.|
| | method | | memory | `Str` | The balancing method that should be used. [values: `memory` (default), `cpu`, `disk`]|
| | mode | | static | `Str` | The balancing mode that should be used. [values: `static` (default), `auto`] |
| | cluster_min_free_resources | | 60 | `Int` | The minimum required free resources in percent within the cluster. [values: `60`% (default)] |
| | cluster_min_nodes | | 3 | `Int` | The minimum number of nodes that should remain in the cluster. [values: `3` (default)] |
| `service` | | | | | |
| | daemon | | True | `Bool` | If daemon mode should be activated. |
| | `schedule` | | | `Dict` | Schedule config block for rebalancing. |
@@ -301,6 +308,15 @@ balancing:
  method: memory
  mode: used

dpm:
  # DPM requires you to define the WOL (Wake-on-LAN)
  # MAC address for each node in Proxmox.
  enable: True
  method: memory
  mode: static
  cluster_min_free_resources: 60
  cluster_min_nodes: 1

service:
  daemon: True
  schedule:
@@ -28,6 +28,13 @@ balancing:
  method: memory
  mode: used

dpm:
  enable: True
  method: memory
  mode: static
  cluster_min_free_resources: 60
  cluster_min_nodes: 1

service:
  daemon: True
  schedule:
@@ -19,6 +19,7 @@
6. [Parallel Migrations](#parallel-migrations)
7. [Run as a Systemd-Service](#run-as-a-systemd-service)
8. [SSL Self-Signed Certificates](#ssl-self-signed-certificates)
9. [Dynamic Power Management (DPM)](#dynamic-power-management)

## Authentication / User Accounts / Permissions
### Authentication
@@ -207,4 +208,34 @@ proxmox_api:
  ssl_verification: False
```

*Note: Disabling SSL certificate validation is not recommended.*

### Dynamic Power Management (DPM)
<img align="left" src="https://cdn.gyptazy.com/images/proxlb-proxmox-node-wakeonlan-wol-mac-dpm.jpg"/> Configuring Dynamic Power Management (DPM) in ProxLB within a Proxmox cluster involves a few critical steps to ensure proper operation. The first consideration is that any node intended for automatic shutdown and startup must support Wake-on-LAN (WOL). This is essential because DPM relies on the ability to power nodes back on remotely. For this to work, the ProxLB instance must be able to reach the target node’s MAC address directly over the network.

To make this possible, you must configure the correct MAC address for WOL within the Proxmox web interface. This is done by selecting the node, going to the “System” section, then “Options,” and finally setting the “MAC address for Wake-on-LAN.” Alternatively, this value can also be submitted using the Proxmox API. Without this MAC address in place, ProxLB will not allow the node to be shut down. This restriction is in place to prevent nodes from being turned off without a way to bring them back online, which could lead to service disruption. By ensuring that each node has a valid WOL MAC address configured, DPM can operate safely and effectively, allowing ProxLB to manage the cluster’s power consumption dynamically.
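Since the WOL MAC address can also be submitted via the API, here is a short sketch (assuming the third-party `proxmoxer` client and placeholder host, credentials, node name, and MAC address; the local format check is purely illustrative):

```python
import re

def is_valid_wol_mac(mac: str) -> bool:
    """Return True for the colon-separated MAC notation Proxmox expects."""
    return re.fullmatch(r"([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}", mac) is not None

print(is_valid_wol_mac("aa:bb:cc:dd:ee:ff"))  # True

# With a proxmoxer connection, the value could then be submitted to the
# node config endpoint (commented out, as it needs a live cluster):
# from proxmoxer import ProxmoxAPI
# api = ProxmoxAPI("pve01.example.com", user="root@pam", password="...", verify_ssl=False)
# api.nodes("pve02").config.put(wakeonlan="aa:bb:cc:dd:ee:ff")
```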
#### Requirements
Using the power management feature within clusters comes with several requirements:
* ProxLB needs to reach the node's WOL MAC address (plain network)
* WOL must be enabled on the node in general (BIOS/UEFI)
* The related WOL network interface must be defined
* The related WOL network interface MAC address must be defined in Proxmox for the node

#### Options
| Section | Option | Sub Option | Example | Type | Description |
|---------|:------:|:----------:|:-------:|:----:|:-----------:|
| `dpm` | | | | | |
| | enable | | True | `Bool` | Enables the Dynamic Power Management functions.|
| | method | | memory | `Str` | The balancing method that should be used. [values: `memory` (default), `cpu`, `disk`]|
| | mode | | static | `Str` | The balancing mode that should be used. [values: `static` (default), `auto`] |
| | cluster_min_free_resources | | 60 | `Int` | The minimum required free resources in percent within the cluster. [values: `60`% (default)] |
| | cluster_min_nodes | | 3 | `Int` | The minimum number of nodes that should remain in the cluster. [values: `3` (default)] |

#### DPM Modes
##### Static
Static mode in DPM lets you set a fixed number of nodes that should always stay powered on in a Proxmox cluster. This is important to keep the cluster working properly, since you need at least three nodes to maintain quorum. The system won’t let you go below that limit to avoid breaking cluster functionality.

Besides the minimum number of active nodes, you can also define a baseline for how many free resources, such as CPU or RAM, should always be available while the virtual machines are running. If the available resources drop below that level, ProxLB will try to power on more nodes, as long as they're available and can be started. On the other hand, if the cluster has more than enough resources, ProxLB will begin to shut down nodes again, but only until the free resource threshold is reached.

This mode gives you a more stable setup by always keeping a minimum number of nodes ready while still adjusting the rest of the cluster based on resource usage, in a controlled and predictable way.
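The static-mode behaviour described above can be sketched as follows (a simplified illustration, not ProxLB's actual code; all names are hypothetical, and "free resources" is modelled here as the average free percentage across the remaining nodes):

```python
def plan_static_dpm(free_percent_per_node, min_free_percent=60, min_nodes=3):
    """Return the indices of nodes that may be powered off.

    The most idle nodes (highest free percentage) are shut down first,
    as long as the remaining cluster keeps more than `min_free_percent`
    free on average and at least `min_nodes` nodes stay online.
    """
    # Node indices sorted from busiest to most idle.
    active = sorted(range(len(free_percent_per_node)),
                    key=lambda i: free_percent_per_node[i])
    shutdown = []
    while len(active) > min_nodes:
        remaining = [free_percent_per_node[i] for i in active[:-1]]
        if sum(remaining) / len(remaining) > min_free_percent:
            shutdown.append(active.pop())  # power off the most idle node
        else:
            break  # threshold would be violated, keep the rest online
    return shutdown

# Five nodes; the two most idle ones (indices 2 and 3) can be powered off
# while the remaining three still average above 40% free.
print(plan_static_dpm([20, 30, 90, 85, 80], min_free_percent=40, min_nodes=3))  # [2, 3]
```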
@@ -17,6 +17,7 @@ from utils.logger import SystemdLogger
from utils.cli_parser import CliParser
from utils.config_parser import ConfigParser
from utils.proxmox_api import ProxmoxApi
from models.dpm import DPM
from models.nodes import Nodes
from models.guests import Guests
from models.groups import Groups
@@ -53,14 +54,17 @@ def main():
    while True:
        # Get all required objects from the Proxmox cluster
        meta = {"meta": proxlb_config}
        nodes = Nodes.get_nodes(proxmox_api, proxlb_config)
        nodes, cluster = Nodes.get_nodes(proxmox_api, proxlb_config)
        guests = Guests.get_guests(proxmox_api, nodes, meta)
        groups = Groups.get_groups(guests, nodes)

        # Merge obtained objects from the Proxmox cluster for further usage
        proxlb_data = {**meta, **nodes, **guests, **groups}
        proxlb_data = {**meta, **cluster, **nodes, **guests, **groups}
        Helper.log_node_metrics(proxlb_data)

        # Evaluate the dynamic power management for nodes in the cluster
        DPM(proxlb_data)

        # Update the initial node resource assignments
        # by the previously created groups.
        Calculations.set_node_assignments(proxlb_data)
@@ -70,10 +74,14 @@ def main():
        Calculations.relocate_guests(proxlb_data)
        Helper.log_node_metrics(proxlb_data, init=False)

        # Perform balancing actions via Proxmox API
        # Perform balancing
        if not cli_args.dry_run and proxlb_data["meta"]["balancing"].get("enable", False):
            Balancing(proxmox_api, proxlb_data)

        # Perform DPM
        if not cli_args.dry_run:
            DPM.dpm_shutdown_nodes(proxmox_api, proxlb_data)

        # Validate if the JSON output should be
        # printed to stdout
        Helper.print_json(proxlb_data, cli_args.json)

@@ -162,7 +162,7 @@ class Calculations:
        logger.debug("Finished: get_most_free_node.")

    @staticmethod
    def relocate_guests_on_maintenance_nodes(proxlb_data: Dict[str, Any]):
    def relocate_guests_on_maintenance_nodes(proxlb_data: Dict[str, Any]) -> None:
        """
        Relocates guests that are currently on nodes marked for maintenance to
        nodes with the most available resources.
@@ -192,7 +192,7 @@ class Calculations:
        logger.debug("Finished: get_most_free_node.")

    @staticmethod
    def relocate_guests(proxlb_data: Dict[str, Any]):
    def relocate_guests(proxlb_data: Dict[str, Any]) -> None:
        """
        Relocates guests within the provided data structure to ensure affinity groups are
        placed on nodes with the most free resources.
@@ -231,7 +231,7 @@ class Calculations:
        logger.debug("Finished: relocate_guests.")

    @staticmethod
    def val_anti_affinity(proxlb_data: Dict[str, Any], guest_name: str):
    def val_anti_affinity(proxlb_data: Dict[str, Any], guest_name: str) -> None:
        """
        Validates and assigns nodes to guests based on anti-affinity rules.

@@ -280,7 +280,7 @@ class Calculations:
        logger.debug("Finished: val_anti_affinity.")

    @staticmethod
    def val_node_relationship(proxlb_data: Dict[str, Any], guest_name: str):
    def val_node_relationship(proxlb_data: Dict[str, Any], guest_name: str) -> None:
        """
        Validates and assigns guests to nodes based on relationships defined by tags.

@@ -311,7 +311,7 @@ class Calculations:
        logger.debug("Finished: val_node_relationship.")

    @staticmethod
    def update_node_resources(proxlb_data):
    def update_node_resources(proxlb_data: Dict[str, Any]) -> None:
        """
        Updates the resource allocation and usage statistics for nodes when a guest
        is moved from one node to another.
@@ -375,3 +375,68 @@ class Calculations:
        logger.debug(f"Set guest {guest_name} from node {node_current} to node {node_target}.")

        logger.debug("Finished: update_node_resources.")

    @staticmethod
    def update_cluster_resources(proxlb_data: Dict[str, Any], node: str, action: str) -> None:
        """
        Updates the cluster resource statistics based on the specified action and node.

        This method modifies the cluster-level resource data (such as CPU, memory, disk usage,
        and node counts) based on the action performed ('add' or 'remove') for the specified node.
        It calculates the updated statistics after adding or removing a node and logs the results.

        Parameters:
            proxlb_data (Dict[str, Any]): The data representing the current state of the cluster,
                including node-level statistics for CPU, memory, and disk.
            node (str): The identifier of the node whose resources are being added or removed from the cluster.
            action (str): The action to perform, either 'add' or 'remove'. 'add' will include the node's
                resources in the cluster, while 'remove' will exclude the node's resources.

        Returns:
            None: The function modifies the `proxlb_data` dictionary in place to update the cluster resources.
        """
        logger.debug("Starting: update_cluster_resources.")
        logger.debug(f"DPM: Updating cluster statistics for node {node}. Action: {action}")
        logger.debug(f"DPM: update_cluster_resources - Before {action}: {proxlb_data['cluster']['memory_free_percent']}")

        if action == "add":
            proxlb_data["cluster"]["node_count"] = proxlb_data["cluster"].get("node_count", 0) + 1
            proxlb_data["cluster"]["cpu_total"] = proxlb_data["cluster"].get("cpu_total", 0) + proxlb_data["nodes"][node]["cpu_total"]
            proxlb_data["cluster"]["cpu_used"] = proxlb_data["cluster"].get("cpu_used", 0) + proxlb_data["nodes"][node]["cpu_used"]
            proxlb_data["cluster"]["cpu_free"] = proxlb_data["cluster"].get("cpu_free", 0) + proxlb_data["nodes"][node]["cpu_free"]
            proxlb_data["cluster"]["cpu_free_percent"] = proxlb_data["cluster"].get("cpu_free", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
            proxlb_data["cluster"]["cpu_used_percent"] = proxlb_data["cluster"].get("cpu_used", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
            proxlb_data["cluster"]["memory_total"] = proxlb_data["cluster"].get("memory_total", 0) + proxlb_data["nodes"][node]["memory_total"]
            proxlb_data["cluster"]["memory_used"] = proxlb_data["cluster"].get("memory_used", 0) + proxlb_data["nodes"][node]["memory_used"]
            proxlb_data["cluster"]["memory_free"] = proxlb_data["cluster"].get("memory_free", 0) + proxlb_data["nodes"][node]["memory_free"]
            proxlb_data["cluster"]["memory_free_percent"] = proxlb_data["cluster"].get("memory_free", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
            proxlb_data["cluster"]["memory_used_percent"] = proxlb_data["cluster"].get("memory_used", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
            proxlb_data["cluster"]["disk_total"] = proxlb_data["cluster"].get("disk_total", 0) + proxlb_data["nodes"][node]["disk_total"]
            proxlb_data["cluster"]["disk_used"] = proxlb_data["cluster"].get("disk_used", 0) + proxlb_data["nodes"][node]["disk_used"]
            proxlb_data["cluster"]["disk_free"] = proxlb_data["cluster"].get("disk_free", 0) + proxlb_data["nodes"][node]["disk_free"]
            proxlb_data["cluster"]["disk_free_percent"] = proxlb_data["cluster"].get("disk_free", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
            proxlb_data["cluster"]["disk_used_percent"] = proxlb_data["cluster"].get("disk_used", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
            proxlb_data["cluster"]["node_count_available"] = proxlb_data["cluster"].get("node_count_available", 0) + 1
            proxlb_data["cluster"]["node_count_overall"] = proxlb_data["cluster"].get("node_count_overall", 0) + 1

        if action == "remove":
            proxlb_data["cluster"]["node_count"] = proxlb_data["cluster"].get("node_count", 0) - 1
            proxlb_data["cluster"]["cpu_total"] = proxlb_data["cluster"].get("cpu_total", 0) - proxlb_data["nodes"][node]["cpu_total"]
            proxlb_data["cluster"]["cpu_used"] = proxlb_data["cluster"].get("cpu_used", 0) - proxlb_data["nodes"][node]["cpu_used"]
            proxlb_data["cluster"]["cpu_free"] = proxlb_data["cluster"].get("cpu_free", 0) - proxlb_data["nodes"][node]["cpu_free"]
            proxlb_data["cluster"]["cpu_free_percent"] = proxlb_data["cluster"].get("cpu_free", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
            proxlb_data["cluster"]["cpu_used_percent"] = proxlb_data["cluster"].get("cpu_used", 0) / proxlb_data["cluster"].get("cpu_total", 0) * 100
            proxlb_data["cluster"]["memory_total"] = proxlb_data["cluster"].get("memory_total", 0) - proxlb_data["nodes"][node]["memory_total"]
            proxlb_data["cluster"]["memory_used"] = proxlb_data["cluster"].get("memory_used", 0) - proxlb_data["nodes"][node]["memory_used"]
            proxlb_data["cluster"]["memory_free"] = proxlb_data["cluster"].get("memory_free", 0) - proxlb_data["nodes"][node]["memory_free"]
            proxlb_data["cluster"]["memory_free_percent"] = proxlb_data["cluster"].get("memory_free", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
            proxlb_data["cluster"]["memory_used_percent"] = proxlb_data["cluster"].get("memory_used", 0) / proxlb_data["cluster"].get("memory_total", 0) * 100
            proxlb_data["cluster"]["disk_total"] = proxlb_data["cluster"].get("disk_total", 0) - proxlb_data["nodes"][node]["disk_total"]
            proxlb_data["cluster"]["disk_used"] = proxlb_data["cluster"].get("disk_used", 0) - proxlb_data["nodes"][node]["disk_used"]
            proxlb_data["cluster"]["disk_free"] = proxlb_data["cluster"].get("disk_free", 0) - proxlb_data["nodes"][node]["disk_free"]
            proxlb_data["cluster"]["disk_free_percent"] = proxlb_data["cluster"].get("disk_free", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
            proxlb_data["cluster"]["disk_used_percent"] = proxlb_data["cluster"].get("disk_used", 0) / proxlb_data["cluster"].get("disk_total", 0) * 100
            proxlb_data["cluster"]["node_count_available"] = proxlb_data["cluster"].get("node_count_available", 0) - 1

        logger.debug(f"DPM: update_cluster_resources - After {action}: {proxlb_data['cluster']['memory_free_percent']}")
        logger.debug("Finished: update_cluster_resources.")

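As a quick numeric illustration of the recomputation above (hypothetical numbers), removing an idle node lowers the cluster-wide free-memory percentage because the node's free capacity leaves the pool:

```python
# Illustration of the percentage update above (hypothetical numbers, GiB).
cluster = {"memory_total": 512, "memory_free": 384}
node = {"memory_total": 128, "memory_free": 120}  # a mostly idle node

# Removing the node subtracts its totals before the percentage is recomputed.
cluster["memory_total"] -= node["memory_total"]   # 384
cluster["memory_free"] -= node["memory_free"]     # 264
memory_free_percent = cluster["memory_free"] / cluster["memory_total"] * 100
print(memory_free_percent)  # 68.75 (down from 75.0 before the removal)
```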
255
proxlb/models/dpm.py
Normal file
255
proxlb/models/dpm.py
Normal file
@@ -0,0 +1,255 @@
|
||||
"""
|
||||
The DPM (Dynamic Power Management) class is responsible for the dynamic management
|
||||
of nodes within a Proxmox cluster, optimizing resource utilization by controlling
|
||||
node power states based on specified schedules and conditions.
|
||||
|
||||
This class provides functionality for:
|
||||
- Tracking and validating schedules for dynamic power management.
|
||||
- Shutting down nodes that are underutilized or not needed.
|
||||
- Starting up nodes using Wake-on-LAN (WOL) based on certain conditions.
|
||||
- Ensuring that nodes are properly flagged for maintenance and startup/shutdown actions.
|
||||
|
||||
The DPM class can operate in different modes, such as static and automatic,
|
||||
to either perform predefined actions or dynamically adjust based on real-time resource usage.
|
||||
"""
|
||||
|
||||
__author__ = "Florian Paul Azim Hoberg <gyptazy>"
|
||||
__copyright__ = "Copyright (C) 2025 Florian Paul Azim Hoberg (@gyptazy)"
|
||||
__license__ = "GPL-3.0"
|
||||
|
||||
|
||||
import proxmoxer
|
||||
from typing import Dict, Any
|
||||
from models.calculations import Calculations
|
||||
from utils.logger import SystemdLogger
|
||||
|
||||
logger = SystemdLogger()
|
||||
|
||||
|
||||
class DPM:
|
||||
"""
|
||||
The DPM (Dynamic Power Management) class is responsible for the dynamic management
|
||||
of nodes within a Proxmox cluster, optimizing resource utilization by controlling
|
||||
node power states based on specified schedules and conditions.
|
||||
|
||||
This class provides functionality for:
|
||||
- Tracking and validating schedules for dynamic power management.
|
||||
- Shutting down nodes that are underutilized or not needed.
|
||||
- Starting up nodes using Wake-on-LAN (WOL) based on certain conditions.
|
||||
- Ensuring that nodes are properly flagged for maintenance and startup/shutdown actions.
|
||||
|
||||
The DPM class can operate in different modes, such as static and automatic,
|
||||
to either perform predefined actions or dynamically adjust based on real-time resource usage.
|
||||
|
||||
Attributes:
|
||||
None directly defined for the class; instead, all actions are based on input data
|
||||
and interactions with the Proxmox API and other helper functions.
|
||||
|
||||
Methods:
|
||||
__init__(proxlb_data: Dict[str, Any]):
|
||||
Initializes the DPM class, checking whether DPM is enabled and operating in the
|
||||
appropriate mode (static or auto).
|
||||
|
||||
dpm_static(proxlb_data: Dict[str, Any]) -> None:
|
||||
Evaluates the cluster's resource availability and performs static power management
|
||||
actions by removing nodes that are not required.
|
||||
|
||||
dpm_shutdown_nodes(proxmox_api, proxlb_data) -> None:
|
||||
Shuts down nodes flagged for DPM shutdown by using the Proxmox API, ensuring
|
||||
that Wake-on-LAN (WOL) is available for proper node recovery.
|
||||
|
||||
dpm_startup_nodes(proxmox_api, proxlb_data) -> None:
|
||||
Powers on nodes that are flagged for startup and are not in maintenance mode,
|
||||
leveraging Wake-on-LAN (WOL) functionality.
|
||||
|
||||
dpm_validate_wol_mac(proxmox_api, node) -> None:
|
||||
Validates and retrieves the Wake-on-LAN (WOL) MAC address for a given node,
|
||||
ensuring that a valid address is set for powering on the node remotely.
|
||||
"""
|
||||
|
||||
def __init__(self, proxlb_data: Dict[str, Any]):
|
||||
"""
|
||||
Initializes the DPM class with the provided ProxLB data.
|
||||
|
||||
Args:
|
||||
proxlb_data (dict): The data required for balancing VMs and CTs.
|
||||
"""
|
||||
logger.debug("Starting: dpm class.")
|
||||
|
||||
if proxlb_data["meta"].get("dpm", {}).get("enable", False):
|
||||
logger.debug("DPM function is enabled.")
|
||||
mode = proxlb_data["meta"].get("dpm", {}).get("mode", None)
|
||||
|
||||
if mode == "static":
|
||||
self.dpm_static(proxlb_data)
|
||||
|
||||
if mode == "auto":
|
||||
self.dpm_auto(proxlb_data)
|
||||
|
||||
else:
|
||||
logger.debug("DPM function is not enabled.")
|
||||
|
||||
logger.debug("Finished: dpm class.")
|
||||
|
||||
def dpm_static(self, proxlb_data: Dict[str, Any]) -> None:
|
||||
"""
|
||||
Evaluates and performs static Distributed Power Management (DPM) actions based on current cluster state.
|
||||
|
||||
This method monitors cluster resource availability and attempts to reduce the number of active nodes
|
||||
when sufficient free resources are available. It ensures a minimum number of nodes remains active
|
||||
and prioritizes shutting down nodes with the least utilized resources to minimize impact. Nodes selected
|
||||
for shutdown are marked for maintenance and flagged for DPM shutdown.
|
||||
|
||||
Parameters:
|
||||
proxlb_data (Dict[str, Any]): A dictionary containing metadata, cluster status, and node-level information
|
||||
including resource utilization, configuration settings, and DPM thresholds.
|
||||
|
||||
Returns:
|
||||
None: Modifies the input dictionary in-place to reflect updated cluster state and node flags.
|
||||
"""
|
||||
logger.debug("Starting: dpm_static.")
|
||||
|
||||
method = proxlb_data["meta"].get("dpm", {}).get("method", "memory")
|
||||
cluster_nodes_overall = proxlb_data["cluster"]["node_count_overall"]
|
||||
cluster_nodes_available = proxlb_data["cluster"]["node_count_available"]
|
||||
cluster_free_resources_percent = int(proxlb_data["cluster"][f"{method}_free_percent"])
|
||||
cluster_free_resources_req_min = proxlb_data["meta"].get("dpm", {}).get("cluster_min_free_resources", 0)
|
||||
cluster_mind_nodes = proxlb_data["meta"].get("dpm", {}).get("cluster_min_nodes", 3)
|
||||
logger.debug(f"DPM: Cluster Nodes: {cluster_nodes_overall} | Nodes available: {cluster_nodes_available} | Nodes offline: {cluster_nodes_overall - cluster_nodes_available}")
|
||||
|
||||
# Only proceed removing nodes if the cluster has enough resources
|
||||
while cluster_free_resources_percent > cluster_free_resources_req_min:
|
||||
logger.debug(f"DPM: More free resources {cluster_free_resources_percent}% available than required: {cluster_free_resources_req_min}%. DPM evaluation starting...")
|
||||
|
||||
# Ensure that we have at least a defined minimum of nodes left
|
||||
if cluster_nodes_available > cluster_mind_nodes:
|
||||
logger.debug(f"DPM: A minimum of {cluster_mind_nodes} nodes is required. {cluster_nodes_available} are available. Proceeding...")
|
||||
|
||||
# Get the node with the fewest used resources to keep migrations low
|
||||
Calculations.get_most_free_node(proxlb_data, False)
|
||||
dpm_node = proxlb_data["meta"]["balancing"]["balance_next_node"]
|
||||
|
||||
# Perform cluster calculation for evaluating how many nodes can safely leave
|
||||
# the cluster. Further object calculations are being processed afterwards by
|
||||
# the calculation class
|
||||
logger.debug(f"DPM: Removing node {dpm_node} from cluster. Node will be turned off later.")
|
||||
Calculations.update_cluster_resources(proxlb_data, dpm_node, "remove")
|
||||
cluster_free_resources_percent = int(proxlb_data["cluster"][f"{method}_free_percent"])
|
||||
logger.debug(f"DPM: Free cluster resources changed to: {int(proxlb_data['cluster'][f'{method}_free_percent'])}%.")
|
||||
|
||||
# Set node to maintenance and DPM shutdown
|
||||
proxlb_data["nodes"][dpm_node]["maintenance"] = True
|
||||
proxlb_data["nodes"][dpm_node]["dpm_shutdown"] = True
|
||||
else:
|
||||
logger.warning(f"DPM: A minimum of {cluster_mind_nodes} nodes is required. {cluster_nodes_available} are available. Cannot proceed!")
|
||||
|
||||
logger.debug(f"DPM: Not enough free resources {cluster_free_resources_percent}% available than required: {cluster_free_resources_req_min}%. DPM evaluation stopped.")
|
||||
logger.debug("Finished: dpm_static.")
|
||||
return proxlb_data
|
||||
|
||||
@staticmethod
|
||||
def dpm_shutdown_nodes(proxmox_api, proxlb_data: Dict[str, Any]) -> None:
|
||||
"""
|
||||
Shuts down cluster nodes that are marked for maintenance and flagged for DPM shutdown.
|
||||
|
||||
This method iterates through the cluster nodes in the provided data and attempts to
|
||||
power off any node that has both the 'maintenance' and 'dpm_shutdown' flags set.
|
||||
It communicates with the Proxmox API to issue shutdown commands and logs any failures.
|
||||
|
||||
Parameters:
|
||||
proxmox_api: An instance of the Proxmox API client used to issue node shutdown commands.
|
||||
proxlb_data: A dictionary containing node status information, including flags for
|
||||
maintenance and DPM shutdown readiness.
|
||||
|
||||
Returns:
|
||||
None: Performs shutdown operations and logs outcomes; modifies no data directly.
|
||||
"""
|
||||
logger.debug("Starting: dpm_shutdown_nodes.")
|
||||
for node, node_info in proxlb_data["nodes"].items():
|
||||
|
||||
if node_info["maintenance"] and node_info["dpm_shutdown"]:
|
||||
logger.debug(f"DPM: Node: {node} is flagged as maintenance mode and to be powered off.")
|
||||
|
||||
# Ensure that the node has a valid WOL MAC defined. If not
|
||||
# we would be unable to power on that system again
|
||||
valid_wol_mac = DPM.dpm_validate_wol_mac(proxmox_api, node)
|
||||
|
||||
if valid_wol_mac:
|
||||
try:
|
||||
logger.debug(f"DPM: Shutting down node: {node}.")
|
||||
job_id = proxmox_api.nodes(node).status.post(command="shutdown")
|
||||
except proxmoxer.core.ResourceException as proxmox_api_error:
|
||||
logger.critical(f"DPM: Error while powering off node {node}. Please check job-id: {job_id}")
|
||||
logger.debug(f"DPM: Error while powering off node {node}. Please check job-id: {job_id}")
|
||||
else:
|
||||
logger.critical(f"DPM: Node {node} cannot be powered off due to missing WOL MAC. Please define a valid WOL MAC for this node.")
|
||||
|
||||
logger.debug("Finished: dpm_shutdown_nodes.")
|
||||
|
||||
@staticmethod
|
||||
def dpm_startup_nodes(proxmox_api, proxlb_data: Dict[str, Any]) -> None:
|
||||
"""
|
||||
Starts uo cluster nodes that are marked for DPM start up.
|
||||
|
||||
This method iterates through the cluster nodes in the provided data and attempts to
|
||||
power on any node that is not flagged as 'maintenance' but flagged as 'dpm_startup'.
|
||||
It communicates with the Proxmox API to issue poweron commands and logs any failures.
|
||||
|
||||
Parameters:
|
||||
proxmox_api: An instance of the Proxmox API client used to issue node startup commands.
|
||||
proxlb_data: A dictionary containing node status information, including flags for
|
||||
maintenance and DPM shutdown readiness.
|
||||
|
||||
Returns:
|
||||
None: Performs poweron operations and logs outcomes; modifies no data directly.
|
||||
"""
|
||||
        logger.debug("Starting: dpm_startup_nodes.")
        for node, node_info in proxlb_data["nodes"].items():

            if not node_info["maintenance"]:
                logger.debug(f"DPM: Node: {node} is not in maintenance mode.")

                if node_info["dpm_startup"]:
                    logger.debug(f"DPM: Node: {node} is flagged as to be started.")

                    try:
                        logger.debug(f"DPM: Powering on node: {node}.")
                        # Important: This requires Proxmox operators to define the
                        # WOL address for each node within the Proxmox web interface.
                        job_id = proxmox_api.nodes(node).wakeonlan.post()
                    except proxmoxer.core.ResourceException as proxmox_api_error:
                        # job_id is never assigned when the API call fails, so log the error itself.
                        logger.critical(f"DPM: Error while powering on node {node}: {proxmox_api_error}")
                        logger.debug(f"DPM: Error while powering on node {node}: {proxmox_api_error}")

        logger.debug("Finished: dpm_startup_nodes.")
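The startup loop above selects wake-up candidates by combining two per-node flags. A minimal standalone sketch of that selection logic (the node names and flag values below are invented for illustration and are not ProxLB data):

```python
# Sketch of the DPM startup candidate selection used above.
# The dictionary shape mirrors proxlb_data["nodes"]; values are hypothetical.
nodes = {
    "node01": {"maintenance": False, "dpm_startup": True},
    "node02": {"maintenance": True,  "dpm_startup": True},   # excluded: in maintenance
    "node03": {"maintenance": False, "dpm_startup": False},  # excluded: not flagged
}

# A node is only woken when it is NOT in maintenance AND flagged for startup.
startup_candidates = [
    name for name, info in nodes.items()
    if not info["maintenance"] and info["dpm_startup"]
]
print(startup_candidates)  # → ['node01']
```

Because maintenance is checked first, a node that is both in maintenance and flagged for startup stays powered off until maintenance is cleared.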
    @staticmethod
    def dpm_validate_wol_mac(proxmox_api, node: str) -> Optional[str]:
        """
        Retrieves and validates the Wake-on-LAN (WOL) MAC address for a specified node.

        This method fetches the MAC address configured for Wake-on-LAN (WOL) from the Proxmox API.
        If the MAC address is found, it is logged. In case of failure to retrieve the address,
        a critical log is generated indicating the absence of a WOL MAC address for the node.

        Parameters:
            proxmox_api: An instance of the Proxmox API client used to query node configurations.
            node: The identifier (name) of the node for which the WOL MAC address is to be validated.

        Returns:
            node_wol_mac_address: The WOL MAC address for the specified node if found, otherwise `None`.
        """
        logger.debug("Starting: dpm_validate_wol_mac.")

        try:
            logger.debug(f"DPM: Getting WOL MAC address for node {node} from API.")
            node_wol_mac_address = proxmox_api.nodes(node).config.get(property="wakeonlan")
            node_wol_mac_address = node_wol_mac_address.get("wakeonlan")
            logger.debug(f"DPM: Node {node} has MAC address: {node_wol_mac_address} for WOL.")
        except proxmoxer.core.ResourceException as proxmox_api_error:
            logger.debug(f"DPM: Failed to get WOL MAC address for node {node} from API: {proxmox_api_error}")
            node_wol_mac_address = None
            logger.critical(f"DPM: Node {node} has no MAC address defined for WOL.")

        logger.debug("Finished: dpm_validate_wol_mac.")
        return node_wol_mac_address
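Note that `dpm_validate_wol_mac` only checks that a WOL MAC is configured at all; it does not verify the address format. A hypothetical, regex-based format check could look like the sketch below (this helper is an illustration, not part of the ProxLB codebase):

```python
import re
from typing import Optional

# Hypothetical helper: verify that a configured WOL MAC has a valid
# colon-separated format. Not part of ProxLB; shown for illustration only.
MAC_PATTERN = re.compile(r"^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$")

def is_valid_mac(mac: Optional[str]) -> bool:
    # A missing (None/empty) MAC is invalid; otherwise match the pattern.
    return bool(mac) and MAC_PATTERN.match(mac) is not None

print(is_valid_mac("aa:bb:cc:dd:ee:ff"))  # → True
print(is_valid_mac(None))                 # → False
print(is_valid_mac("not-a-mac"))          # → False
```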
@@ -54,6 +54,7 @@ class Nodes:
        """
        logger.debug("Starting: get_nodes.")
        nodes = {"nodes": {}}
        cluster = {"cluster": {}}

        for node in proxmox_api.nodes.get():
            # Ignoring a node results in ignoring all guests placed on the ignored node!
@@ -61,6 +62,8 @@ class Nodes:
            nodes["nodes"][node["node"]] = {}
            nodes["nodes"][node["node"]]["name"] = node["node"]
            nodes["nodes"][node["node"]]["maintenance"] = False
            nodes["nodes"][node["node"]]["dpm_shutdown"] = False
            nodes["nodes"][node["node"]]["dpm_startup"] = False
            nodes["nodes"][node["node"]]["cpu_total"] = node["maxcpu"]
            nodes["nodes"][node["node"]]["cpu_assigned"] = 0
            nodes["nodes"][node["node"]]["cpu_used"] = node["cpu"] * node["maxcpu"]
@@ -87,8 +90,35 @@ class Nodes:
                if Nodes.set_node_maintenance(proxlb_config, node["node"]):
                    nodes["nodes"][node["node"]]["maintenance"] = True

                # Generate the initial cluster statistics within the same loop to avoid a second pass.
                logger.debug(f"Updating cluster statistics by online node {node['node']}.")
                cluster["cluster"]["node_count"] = cluster["cluster"].get("node_count", 0) + 1
                cluster["cluster"]["cpu_total"] = cluster["cluster"].get("cpu_total", 0) + nodes["nodes"][node["node"]]["cpu_total"]
                cluster["cluster"]["cpu_used"] = cluster["cluster"].get("cpu_used", 0) + nodes["nodes"][node["node"]]["cpu_used"]
                cluster["cluster"]["cpu_free"] = cluster["cluster"].get("cpu_free", 0) + nodes["nodes"][node["node"]]["cpu_free"]
                cluster["cluster"]["cpu_free_percent"] = cluster["cluster"].get("cpu_free", 0) / cluster["cluster"].get("cpu_total", 0) * 100
                cluster["cluster"]["cpu_used_percent"] = cluster["cluster"].get("cpu_used", 0) / cluster["cluster"].get("cpu_total", 0) * 100
                cluster["cluster"]["memory_total"] = cluster["cluster"].get("memory_total", 0) + nodes["nodes"][node["node"]]["memory_total"]
                cluster["cluster"]["memory_used"] = cluster["cluster"].get("memory_used", 0) + nodes["nodes"][node["node"]]["memory_used"]
                cluster["cluster"]["memory_free"] = cluster["cluster"].get("memory_free", 0) + nodes["nodes"][node["node"]]["memory_free"]
                cluster["cluster"]["memory_free_percent"] = cluster["cluster"].get("memory_free", 0) / cluster["cluster"].get("memory_total", 0) * 100
                cluster["cluster"]["memory_used_percent"] = cluster["cluster"].get("memory_used", 0) / cluster["cluster"].get("memory_total", 0) * 100
                cluster["cluster"]["disk_total"] = cluster["cluster"].get("disk_total", 0) + nodes["nodes"][node["node"]]["disk_total"]
                cluster["cluster"]["disk_used"] = cluster["cluster"].get("disk_used", 0) + nodes["nodes"][node["node"]]["disk_used"]
                cluster["cluster"]["disk_free"] = cluster["cluster"].get("disk_free", 0) + nodes["nodes"][node["node"]]["disk_free"]
                cluster["cluster"]["disk_free_percent"] = cluster["cluster"].get("disk_free", 0) / cluster["cluster"].get("disk_total", 0) * 100
                cluster["cluster"]["disk_used_percent"] = cluster["cluster"].get("disk_used", 0) / cluster["cluster"].get("disk_total", 0) * 100

                cluster["cluster"]["node_count_available"] = cluster["cluster"].get("node_count_available", 0) + 1
                cluster["cluster"]["node_count_overall"] = cluster["cluster"].get("node_count_overall", 0) + 1

            # Update the cluster statistics for offline nodes so the overall node count of the cluster stays correct.
            else:
                logger.debug(f"Updating cluster statistics by offline node {node['node']}.")
                cluster["cluster"]["node_count_overall"] = cluster["cluster"].get("node_count_overall", 0) + 1

        logger.debug("Finished: get_nodes.")
        return nodes
        return nodes, cluster
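The accumulation block above builds the cluster totals with `dict.get(key, 0)`, so no counter needs to be pre-initialised before the loop, and it recomputes the percentage values on every iteration from the running totals. A standalone sketch of that pattern (the node figures below are made up for illustration):

```python
# Standalone sketch of the running-total pattern used in get_nodes.
# Memory figures (in GiB) are invented sample data.
nodes = {
    "node01": {"memory_total": 64, "memory_used": 16},
    "node02": {"memory_total": 32, "memory_used": 16},
}

cluster = {}
for info in nodes.values():
    # dict.get(key, 0) lets the first iteration start from zero
    # without pre-initialising every counter.
    cluster["memory_total"] = cluster.get("memory_total", 0) + info["memory_total"]
    cluster["memory_used"] = cluster.get("memory_used", 0) + info["memory_used"]
    # Percentages are recomputed on each pass from the running totals,
    # so they are always consistent with the nodes seen so far.
    cluster["memory_used_percent"] = cluster["memory_used"] / cluster["memory_total"] * 100

print(cluster["memory_total"])                    # → 96
print(round(cluster["memory_used_percent"], 2))   # → 33.33
```

Recomputing the percentages inside the loop does redundant work on intermediate iterations, but it keeps the statistics correct after any node without a separate finalisation step.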
    @staticmethod
    def set_node_maintenance(proxlb_config: Dict[str, Any], node_name: str) -> bool: