test(robot): add test case Check If Nodes Are Under Memory Pressure After Cluster Restart #2285

Open
wants to merge 1 commit into base: master
12 changes: 12 additions & 0 deletions e2e/keywords/metrics.resource
@@ -0,0 +1,12 @@
*** Settings ***
Documentation    Metrics Keywords

Library    ../libs/keywords/metrics_keywords.py

*** Keywords ***
Check if nodes are under memory pressure
    ${worker_nodes} =    get_worker_nodes
    FOR    ${worker_node}    IN    @{worker_nodes}
        get_node_memory_usage_in_percentage    ${worker_node}
        check_if_node_under_memory_pressure    ${worker_node}
    END
49 changes: 49 additions & 0 deletions e2e/libs/keywords/metrics_keywords.py
@@ -0,0 +1,49 @@
import time

from node import Node
from metrics.metrics import get_node_metrics
from utility.utility import get_retry_count_and_interval
from utility.utility import logging


class metrics_keywords:

    def __init__(self):
        self.node = Node()
        retry_count, retry_interval = get_retry_count_and_interval()

Comment on lines +11 to +14

⚠️ Potential issue

Initialize retry_count and retry_interval as instance attributes in the constructor.

The values returned by get_retry_count_and_interval() are assigned to local variables and then discarded, but check_if_node_under_memory_pressure later reads them as instance attributes (self.retry_count, and self.retry_interval in the suggested fix), so they must be stored on self.

     def __init__(self):
         self.node = Node()
-        retry_count, retry_interval = get_retry_count_and_interval()
+        self.retry_count, self.retry_interval = get_retry_count_and_interval()

    def get_node_total_memory_in_mi(self, node_name):
        total_memory = self.node.get_node_total_memory(node_name)
        if "Ki" in total_memory:
            total_memory = int(total_memory.replace("Ki", ""))
            total_memory = total_memory / 1024
        elif "Mi" in total_memory:
            total_memory = int(total_memory.replace("Mi", ""))
        logging(f'Got node {node_name} total memory: {total_memory} Mi')
        return total_memory

    def get_node_memory_usage_in_mi(self, node_name):
        memory_usage = get_node_metrics(node_name, 'memory')
        if "Ki" in memory_usage:
            memory_usage = int(memory_usage.replace("Ki", ""))
            memory_usage = memory_usage / 1024
        elif "Mi" in memory_usage:
            memory_usage = int(memory_usage.replace("Mi", ""))
        logging(f'Got node {node_name} memory usage: {memory_usage} Mi')
        return memory_usage

    def get_node_memory_usage_in_percentage(self, node_name):
        memory_usage_in_mi = self.get_node_memory_usage_in_mi(node_name)
        total_memory_in_mi = self.get_node_total_memory_in_mi(node_name)
        memory_usage_in_percentage = memory_usage_in_mi / total_memory_in_mi * 100
        logging(f'Got node {node_name} memory usage: {memory_usage_in_percentage} %')
        return memory_usage_in_percentage

Comment on lines +15 to +41

🛠️ Refactor suggestion

Refactor memory conversion logic to reduce code duplication.

The memory conversion logic is duplicated in both get_node_total_memory_in_mi and get_node_memory_usage_in_mi. Extract this into a helper method.

+    def _convert_to_mi(self, memory_value: str) -> float:
+        """Convert memory value to MiB.
+        
+        Args:
+            memory_value: Memory value with unit (e.g., "1024Ki", "512Mi")
+            
+        Returns:
+            float: Memory value in MiB
+        """
+        if "Ki" in memory_value:
+            return int(memory_value.replace("Ki", "")) / 1024
+        elif "Mi" in memory_value:
+            return int(memory_value.replace("Mi", ""))
+        return float(memory_value)  # Assume MiB if no unit

     def get_node_total_memory_in_mi(self, node_name):
         total_memory = self.node.get_node_total_memory(node_name)
-        if "Ki" in total_memory:
-            total_memory = int(total_memory.replace("Ki", ""))
-            total_memory = total_memory / 1024
-        elif "Mi" in total_memory:
-            total_memory = int(total_memory.replace("Mi", ""))
+        total_memory = self._convert_to_mi(total_memory)
         logging(f'Got node {node_name} total memory: {total_memory} Mi')
         return total_memory

     def get_node_memory_usage_in_mi(self, node_name):
         memory_usage = get_node_metrics(node_name, 'memory')
-        if "Ki" in memory_usage:
-            memory_usage = int(memory_usage.replace("Ki", ""))
-            memory_usage = memory_usage / 1024
-        elif "Mi" in memory_usage:
-            memory_usage = int(memory_usage.replace("Mi", ""))
+        memory_usage = self._convert_to_mi(memory_usage)
         logging(f'Got node {node_name} memory usage: {memory_usage} Mi')
         return memory_usage
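
As a quick illustration (not part of the PR), a minimal sanity check for the proposed helper could look like the following, assuming it is lifted to a module-level function; the input strings are illustrative values in the format the Kubernetes API reports:

    # Minimal sketch of the suggested conversion helper plus a few sanity checks.
    # Input values are hypothetical examples such as "2048Ki" or "512Mi".
    def convert_to_mi(memory_value: str) -> float:
        if "Ki" in memory_value:
            return int(memory_value.replace("Ki", "")) / 1024
        elif "Mi" in memory_value:
            return int(memory_value.replace("Mi", ""))
        return float(memory_value)  # assume MiB if no unit suffix

    assert convert_to_mi("2048Ki") == 2.0
    assert convert_to_mi("512Mi") == 512
    assert convert_to_mi("128") == 128.0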

    def check_if_node_under_memory_pressure(self, node_name):
        logging(f"Checking if node {node_name} is under memory pressure")
        condition_status = self.node.get_node_condition(node_name, "MemoryPressure")
        if condition_status == "True":
            logging(f"Node {node_name} is under memory pressure")
            time.sleep(self.retry_count)
            assert False, f"Node {node_name} is under memory pressure"

Comment on lines +42 to +49

🛠️ Refactor suggestion

Improve error handling in check_if_node_under_memory_pressure.

Replace assert False with a proper exception and add retry logging.

     def check_if_node_under_memory_pressure(self, node_name):
         logging(f"Checking if node {node_name} is under memory pressure")
         condition_status = self.node.get_node_condition(node_name, "MemoryPressure")
         if condition_status == "True":
             logging(f"Node {node_name} is under memory pressure")
-            time.sleep(self.retry_count)
-            assert False, f"Node {node_name} is under memory pressure"
+            logging(f"Waiting {self.retry_interval}s before failing...")
+            time.sleep(self.retry_interval)
+            raise AssertionError(f"Node {node_name} is under memory pressure")

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

48-48: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)
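
For reference, a hedged sketch of what a retrying variant could look like, assuming self.retry_count and self.retry_interval are initialized in __init__ as suggested earlier; this is illustrative, not the PR's implementation:

    def check_if_node_under_memory_pressure(self, node_name):
        # Re-check the MemoryPressure condition up to retry_count times before failing,
        # so that a transient pressure spike right after the restart does not fail the test.
        for _ in range(self.retry_count):
            condition_status = self.node.get_node_condition(node_name, "MemoryPressure")
            if condition_status != "True":
                logging(f"Node {node_name} is not under memory pressure")
                return
            logging(f"Node {node_name} is under memory pressure, retrying in {self.retry_interval}s")
            time.sleep(self.retry_interval)
        raise AssertionError(f"Node {node_name} is still under memory pressure after {self.retry_count} retries")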

22 changes: 22 additions & 0 deletions e2e/libs/metrics/metrics.py
@@ -0,0 +1,22 @@
import time

from kubernetes import client
from kubernetes.client.rest import ApiException

from utility.utility import get_retry_count_and_interval
from utility.utility import logging

def get_node_metrics(node_name, metrics_name):
    retry_count, retry_interval = get_retry_count_and_interval()
    for i in range(retry_count):
        api = client.CustomObjectsApi()
        try:
            node_metrics = api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
            for node in node_metrics['items']:
                if node_name == node['metadata']['name']:
                    logging(f"Got node {node_name} metrics {metrics_name} = {node['usage'][metrics_name]}")
                    return node['usage'][metrics_name]
        except ApiException as e:
            logging(f"Failed to get node {node_name} metrics {metrics_name}: {e}")
        time.sleep(retry_interval)
    assert False, f"Failed to get node {node_name} metrics {metrics_name}"
Comment on lines +9 to +22

🛠️ Refactor suggestion

Add type hints, docstring and improve error handling.

The function implementation could be improved in several ways:

  1. Add type hints for parameters and return value
  2. Add docstring explaining the function's purpose, parameters, and return value
  3. Replace assert False with raise AssertionError
  4. Validate the metrics_name parameter
-def get_node_metrics(node_name, metrics_name):
+def get_node_metrics(node_name: str, metrics_name: str) -> str:
+    """Retrieve metrics for a specified Kubernetes node.
+    
+    Args:
+        node_name: Name of the node to retrieve metrics for
+        metrics_name: Type of metrics to retrieve (e.g., 'memory', 'cpu')
+    
+    Returns:
+        str: The metrics value for the specified node and metrics type
+        
+    Raises:
+        AssertionError: If unable to retrieve metrics after retries
+    """
+    if metrics_name not in ['memory', 'cpu']:
+        raise ValueError(f"Unsupported metrics type: {metrics_name}")
     retry_count, retry_interval = get_retry_count_and_interval()
-    for i in range(retry_count):
+    for _i in range(retry_count):
         api = client.CustomObjectsApi()
         try:
             node_metrics = api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
             for node in node_metrics['items']:
                 if node_name == node['metadata']['name']:
                     logging(f"Got node {node_name} metrics {metrics_name} = {node['usage'][metrics_name]}")
                     return node['usage'][metrics_name]
         except ApiException as e:
             logging(f"Failed to get node {node_name} metrics {metrics_name}: {e}")
         time.sleep(retry_interval)
-    assert False, f"Failed to get node {node_name} metrics {metrics_name}"
+    raise AssertionError(f"Failed to get node {node_name} metrics {metrics_name}")
🧰 Tools
🪛 Ruff (0.8.2)

11-11: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


22-22: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)
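
As an illustration only (not in the PR), calling the helper after the Kubernetes client configuration has been loaded might look like this; the node name and returned values are hypothetical:

    from kubernetes import config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    memory_usage = get_node_metrics("worker-node-1", "memory")  # e.g. "1234567Ki"
    cpu_usage = get_node_metrics("worker-node-1", "cpu")        # e.g. "250m"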

12 changes: 12 additions & 0 deletions e2e/libs/node/node.py
@@ -120,6 +120,18 @@ def get_node_cpu_cores(self, node_name):
        node = self.get_node_by_name(node_name)
        return node.status.capacity['cpu']

    def get_node_total_memory(self, node_name):
        node = self.get_node_by_name(node_name)
        return node.status.capacity['memory']

    def get_node_condition(self, node_name, condition_type):
        node = self.get_node_by_name(node_name)
        for condition in node.status.conditions:
            if condition.type == condition_type:
                logging(f"Got node {node_name} condition {condition_type}: {condition}")
                return condition.status
        assert False, f"Failed to get node {node_name} condition {condition_type}: {node}"

Comment on lines +127 to +134

🛠️ Refactor suggestion

Improve error handling and add documentation.

  1. Replace assert False with proper exception handling as it can be removed by Python's -O flag.
  2. Add docstring and type hints for better code documentation.
-    def get_node_condition(self, node_name, condition_type):
+    def get_node_condition(self, node_name: str, condition_type: str) -> str:
+        """Get the status of a specific condition type for a node.
+
+        Args:
+            node_name: Name of the node to check.
+            condition_type: Type of condition to retrieve (e.g., 'Ready', 'MemoryPressure').
+
+        Returns:
+            str: The status of the condition ('True', 'False', or 'Unknown').
+
+        Raises:
+            RuntimeError: If the condition type is not found for the node.
+        """
         node = self.get_node_by_name(node_name)
         for condition in node.status.conditions:
             if condition.type == condition_type:
                 logging(f"Got node {node_name} condition {condition_type}: {condition}")
                 return condition.status
-        assert False, f"Failed to get node {node_name} condition {condition_type}: {node}"
+        raise RuntimeError(f"Failed to get node {node_name} condition {condition_type}: {node}")
🧰 Tools
🪛 Ruff (0.8.2)

133-133: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)
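
For illustration only (not part of this diff), the new helper can be used to gate on any node condition; Node() is constructed without arguments as in metrics_keywords, and the node name below is hypothetical:

    node = Node()
    status = node.get_node_condition("worker-node-1", "Ready")  # returns "True", "False" or "Unknown"
    if status != "True":
        logging("worker-node-1 is not Ready")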

    def list_node_names_by_volumes(self, volume_names):
        volume_keywords = BuiltIn().get_library_instance('volume_keywords')
        volume_nodes = {}
45 changes: 45 additions & 0 deletions e2e/tests/negative/cluster_restart.robot
@@ -12,7 +12,9 @@ Resource    ../keywords/storageclass.resource
Resource    ../keywords/persistentvolumeclaim.resource
Resource    ../keywords/statefulset.resource
Resource    ../keywords/workload.resource
Resource    ../keywords/backup.resource
Resource    ../keywords/setting.resource
Resource    ../keywords/metrics.resource

Test Setup    Set test environment
Test Teardown    Cleanup test resources
@@ -77,3 +79,46 @@ Restart Cluster While Workload Heavy Writing
        And Check statefulset 4 works
        And Check statefulset 5 works
    END

Check If Nodes Are Under Memory Pressure After Cluster Restart
    [Tags]    cluster
    Given Create storageclass longhorn-test with dataEngine=${DATA_ENGINE}
    And Create storageclass strict-local with numberOfReplicas=1 dataLocality=strict-local dataEngine=${DATA_ENGINE}
    And Create storageclass nfs-4-2 with nfsOptions=vers=4.2,noresvport,timeo=450,retrans=8 dataEngine=${DATA_ENGINE}
    And Create storageclass nfs-hard-mount with nfsOptions=hard,timeo=50,retrans=1 dataEngine=${DATA_ENGINE}
    And Create storageclass nfs-soft-mount with nfsOptions=soft,timeo=250,retrans=5 dataEngine=${DATA_ENGINE}
    And Create statefulset 0 using RWO volume with longhorn-test storageclass
    And Create statefulset 1 using RWX volume with longhorn-test storageclass
    And Create statefulset 2 using RWO volume with strict-local storageclass
    And Create statefulset 3 using RWX volume with nfs-4-2 storageclass
    And Create statefulset 4 using RWX volume with nfs-hard-mount storageclass
    And Create statefulset 5 using RWX volume with nfs-soft-mount storageclass
    And Write 1024 MB data to file data.bin in statefulset 0
    And Write 1024 MB data to file data.bin in statefulset 1
    And Write 1024 MB data to file data.bin in statefulset 2
    And Write 1024 MB data to file data.bin in statefulset 3
    And Write 1024 MB data to file data.bin in statefulset 4
    And Write 1024 MB data to file data.bin in statefulset 5

    FOR    ${i}    IN RANGE    ${LOOP_COUNT}

        And Create backup ${i} for statefulset 0 volume
        And Create backup ${i} for statefulset 1 volume
        And Create backup ${i} for statefulset 2 volume
        And Create backup ${i} for statefulset 3 volume
        And Create backup ${i} for statefulset 4 volume
        And Create backup ${i} for statefulset 5 volume

        When Restart cluster
        And Wait for longhorn ready
        And Wait for workloads pods stable
        ...    statefulset 0    statefulset 1    statefulset 2    statefulset 3    statefulset 4    statefulset 5

        Then Check statefulset 0 works
        And Check statefulset 1 works
        And Check statefulset 2 works
        And Check statefulset 3 works
        And Check statefulset 4 works
        And Check statefulset 5 works
        And Check if nodes are under memory pressure
    END