test(robot): add test case Check If Nodes Are Under Memory Pressure After Cluster Restart #2285
base: master
Conversation
…fter Cluster Restart
Signed-off-by: Yang Chiu <[email protected]>
Walkthrough
This pull request introduces a comprehensive set of changes focused on monitoring node memory metrics and pressure in a Kubernetes cluster. A new resource file (e2e/keywords/metrics.resource) is introduced, along with new metrics library and keyword modules, node helper methods, and an additional cluster restart test case.

Changes
Assessment against linked issues
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
Actionable comments posted: 5
🧹 Nitpick comments (4)
e2e/keywords/metrics.resource (1)
Lines 7-12: Add documentation for the keyword.

Add documentation to explain the purpose and behavior of the keyword:

*** Keywords ***
Check if nodes are under memory pressure
+    [Documentation]    Checks if any worker nodes in the cluster are under memory pressure.
+    ...
+    ...    This keyword:
+    ...    1. Retrieves all worker nodes
+    ...    2. For each node:
+    ...       - Gets memory usage percentage
+    ...       - Checks if node is under memory pressure
+    ...
+    ...    Fails if any node is under memory pressure.
    ${worker_nodes} =    get_worker_nodes
    FOR    ${worker_node}    IN    @{worker_nodes}
        get_node_memory_usage_in_percentage    ${worker_node}
        check_if_node_under_memory_pressure    ${worker_node}
    END

e2e/tests/negative/cluster_restart.robot (2)
Lines 83-124: Add documentation and consider test optimization.
- Add documentation to explain the test's purpose and expectations
- Consider parameterizing the data size (1024 MB) for flexibility
Check If Nodes Are Under Memory Pressure After Cluster Restart
+    [Documentation]    Verifies that nodes do not experience memory pressure after cluster restart.
+    ...
+    ...    Test Steps:
+    ...    1. Creates multiple storage classes with different configurations
+    ...    2. Creates stateful sets using these storage classes
+    ...    3. Writes ${DATA_SIZE} MB of data to each stateful set
+    ...    4. In a loop:
+    ...       - Creates backups for each volume
+    ...       - Restarts the cluster
+    ...       - Verifies workload stability
+    ...       - Checks for memory pressure
+    [Variables]    ${DATA_SIZE}=1024
    [Tags]    cluster
    Given Create storageclass longhorn-test with    dataEngine=${DATA_ENGINE}
    And Create storageclass strict-local with    numberOfReplicas=1    dataLocality=strict-local    dataEngine=${DATA_ENGINE}
    And Create storageclass nfs-4-2 with    nfsOptions=vers=4.2,noresvport,timeo=450,retrans=8    dataEngine=${DATA_ENGINE}
    And Create storageclass nfs-hard-mount with    nfsOptions=hard,timeo=50,retrans=1    dataEngine=${DATA_ENGINE}
    And Create storageclass nfs-soft-mount with    nfsOptions=soft,timeo=250,retrans=5    dataEngine=${DATA_ENGINE}
    And Create statefulset 0 using RWO volume with longhorn-test storageclass
    And Create statefulset 1 using RWX volume with longhorn-test storageclass
    And Create statefulset 2 using RWO volume with strict-local storageclass
    And Create statefulset 3 using RWX volume with nfs-4-2 storageclass
    And Create statefulset 4 using RWX volume with nfs-hard-mount storageclass
    And Create statefulset 5 using RWX volume with nfs-soft-mount storageclass
-    And Write 1024 MB data to file data.bin in statefulset 0
-    And Write 1024 MB data to file data.bin in statefulset 1
-    And Write 1024 MB data to file data.bin in statefulset 2
-    And Write 1024 MB data to file data.bin in statefulset 3
-    And Write 1024 MB data to file data.bin in statefulset 4
-    And Write 1024 MB data to file data.bin in statefulset 5
+    And Write ${DATA_SIZE} MB data to file data.bin in statefulset 0
+    And Write ${DATA_SIZE} MB data to file data.bin in statefulset 1
+    And Write ${DATA_SIZE} MB data to file data.bin in statefulset 2
+    And Write ${DATA_SIZE} MB data to file data.bin in statefulset 3
+    And Write ${DATA_SIZE} MB data to file data.bin in statefulset 4
+    And Write ${DATA_SIZE} MB data to file data.bin in statefulset 5
Lines 103-111: Consider parallelizing backup creation.

The sequential creation of backups might extend test duration unnecessarily. Consider implementing parallel backup creation if the test framework supports it; a rough sketch follows.
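For illustration only, and not something this PR needs to adopt: Robot Framework does not execute keywords concurrently on its own, so the fan-out would have to live in a Python keyword. A minimal sketch, assuming a callable backup helper is available (create_backup_for_volume and the volume names are hypothetical stand-ins, not part of this PR):

from concurrent.futures import ThreadPoolExecutor, as_completed

def create_backups_in_parallel(volume_names, create_backup_for_volume, max_workers=4):
    # Submit one backup job per volume and collect per-volume failures
    # instead of stopping at the first error.
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(create_backup_for_volume, name): name for name in volume_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
            except Exception as exc:
                failures[name] = exc
    if failures:
        raise AssertionError(f"Backup creation failed for: {failures}")

Whether this is worthwhile depends on where the time is spent; if all backups are bottlenecked on the same backup target, parallel submission may not shorten the test much.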
e2e/libs/node/node.py (1)
Lines 123-125: Add error handling for non-existent nodes.

The method should handle cases where the node doesn't exist or the memory capacity information is unavailable.

 def get_node_total_memory(self, node_name):
-    node = self.get_node_by_name(node_name)
-    return node.status.capacity['memory']
+    try:
+        node = self.get_node_by_name(node_name)
+        return node.status.capacity['memory']
+    except Exception as e:
+        raise RuntimeError(f"Failed to get memory capacity for node {node_name}: {str(e)}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
e2e/keywords/metrics.resource (1 hunks)
e2e/libs/keywords/metrics_keywords.py (1 hunks)
e2e/libs/metrics/metrics.py (1 hunks)
e2e/libs/node/node.py (1 hunks)
e2e/tests/negative/cluster_restart.robot (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
e2e/libs/node/node.py
133-133: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False. (B011)
e2e/libs/metrics/metrics.py
11-11: Loop control variable i not used within loop body. Rename unused i to _i. (B007)

22-22: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False. (B011)
e2e/libs/keywords/metrics_keywords.py
48-48: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False. (B011)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Build images
- GitHub Check: Summary
🔇 Additional comments (1)
e2e/libs/node/node.py (1)
Lines 123-134: LGTM! Changes align well with PR objectives.

The added methods provide the necessary functionality to support testing node memory pressure after cluster restart. The implementation is clean and follows existing patterns in the codebase.
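For reviewers who want to spot-check the same signals the new helpers read, a minimal standalone query against the Kubernetes API (this mirrors, but does not use, the PR's Node methods; the node name "worker-1" is an assumption and a reachable kubeconfig is required):

from kubernetes import client, config

config.load_kube_config()                      # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()
node = v1.read_node("worker-1")                # hypothetical node name
capacity = node.status.capacity["memory"]      # what get_node_total_memory returns, e.g. "8148536Ki"
pressure = next(c.status for c in node.status.conditions if c.type == "MemoryPressure")
print(f"capacity={capacity}, MemoryPressure={pressure}")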
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False. (B011)
def get_node_metrics(node_name, metrics_name):
    retry_count, retry_interval = get_retry_count_and_interval()
    for i in range(retry_count):
        api = client.CustomObjectsApi()
        try:
            node_metrics = api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
            for node in node_metrics['items']:
                if node_name == node['metadata']['name']:
                    logging(f"Got node {node_name} metrics {metrics_name} = {node['usage'][metrics_name]}")
                    return node['usage'][metrics_name]
        except ApiException as e:
            logging(f"Failed to get node {node_name} metrics {metrics_name}: {e}")
        time.sleep(retry_interval)
    assert False, f"Failed to get node {node_name} metrics {metrics_name}"
🛠️ Refactor suggestion
Add type hints, docstring and improve error handling.
The function implementation could be improved in several ways:
- Add type hints for parameters and return value
- Add docstring explaining the function's purpose, parameters, and return value
- Replace assert False with raise AssertionError
- Validate the metrics_name parameter
-def get_node_metrics(node_name, metrics_name):
+def get_node_metrics(node_name: str, metrics_name: str) -> str:
+    """Retrieve metrics for a specified Kubernetes node.
+
+    Args:
+        node_name: Name of the node to retrieve metrics for
+        metrics_name: Type of metrics to retrieve (e.g., 'memory', 'cpu')
+
+    Returns:
+        str: The metrics value for the specified node and metrics type
+
+    Raises:
+        AssertionError: If unable to retrieve metrics after retries
+    """
+    if metrics_name not in ['memory', 'cpu']:
+        raise ValueError(f"Unsupported metrics type: {metrics_name}")
     retry_count, retry_interval = get_retry_count_and_interval()
-    for i in range(retry_count):
+    for _i in range(retry_count):
         api = client.CustomObjectsApi()
         try:
             node_metrics = api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
             for node in node_metrics['items']:
                 if node_name == node['metadata']['name']:
                     logging(f"Got node {node_name} metrics {metrics_name} = {node['usage'][metrics_name]}")
                     return node['usage'][metrics_name]
         except ApiException as e:
             logging(f"Failed to get node {node_name} metrics {metrics_name}: {e}")
         time.sleep(retry_interval)
-    assert False, f"Failed to get node {node_name} metrics {metrics_name}"
+    raise AssertionError(f"Failed to get node {node_name} metrics {metrics_name}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_node_metrics(node_name: str, metrics_name: str) -> str:
    """Retrieve metrics for a specified Kubernetes node.

    Args:
        node_name: Name of the node to retrieve metrics for
        metrics_name: Type of metrics to retrieve (e.g., 'memory', 'cpu')

    Returns:
        str: The metrics value for the specified node and metrics type

    Raises:
        AssertionError: If unable to retrieve metrics after retries
    """
    if metrics_name not in ['memory', 'cpu']:
        raise ValueError(f"Unsupported metrics type: {metrics_name}")
    retry_count, retry_interval = get_retry_count_and_interval()
    for _i in range(retry_count):
        api = client.CustomObjectsApi()
        try:
            node_metrics = api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
            for node in node_metrics['items']:
                if node_name == node['metadata']['name']:
                    logging(f"Got node {node_name} metrics {metrics_name} = {node['usage'][metrics_name]}")
                    return node['usage'][metrics_name]
        except ApiException as e:
            logging(f"Failed to get node {node_name} metrics {metrics_name}: {e}")
        time.sleep(retry_interval)
    raise AssertionError(f"Failed to get node {node_name} metrics {metrics_name}")
🧰 Tools
🪛 Ruff (0.8.2)
11-11: Loop control variable i not used within loop body. Rename unused i to _i. (B007)

22-22: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False. (B011)
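For context on the metrics_name validation suggested above, the metrics.k8s.io node listing reports usage as a small per-node map, so 'cpu' and 'memory' are the only meaningful keys. An illustrative item shape (values and node name are made up, not captured from a real cluster):

# Illustrative shape of one item from list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
node_metrics_item = {
    "metadata": {"name": "worker-1"},
    "timestamp": "2025-01-01T00:00:00Z",
    "window": "10.25s",
    "usage": {"cpu": "137m", "memory": "1368128Ki"},
}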
def __init__(self):
    self.node = Node()
    retry_count, retry_interval = get_retry_count_and_interval()
Initialize retry_interval in constructor.
The retry_interval value is used in check_if_node_under_memory_pressure but is not initialized in the constructor; the values returned by get_retry_count_and_interval() are assigned to local variables and discarded.
 def __init__(self):
     self.node = Node()
-    retry_count, retry_interval = get_retry_count_and_interval()
+    self.retry_count, self.retry_interval = get_retry_count_and_interval()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def __init__(self):
    self.node = Node()
    self.retry_count, self.retry_interval = get_retry_count_and_interval()
def check_if_node_under_memory_pressure(self, node_name):
    logging(f"Checking if node {node_name} is under memory pressure")
    condition_status = self.node.get_node_condition(node_name, "MemoryPressure")
    if condition_status == "True":
        logging(f"Node {node_name} is under memory pressure")
        time.sleep(self.retry_count)
        assert False, f"Node {node_name} is under memory pressure"
🛠️ Refactor suggestion
Improve error handling in check_if_node_under_memory_pressure.
Replace assert False with a proper exception and add retry logging.
 def check_if_node_under_memory_pressure(self, node_name):
     logging(f"Checking if node {node_name} is under memory pressure")
     condition_status = self.node.get_node_condition(node_name, "MemoryPressure")
     if condition_status == "True":
         logging(f"Node {node_name} is under memory pressure")
-        time.sleep(self.retry_count)
-        assert False, f"Node {node_name} is under memory pressure"
+        logging(f"Waiting {self.retry_interval}s before failing...")
+        time.sleep(self.retry_interval)
+        raise AssertionError(f"Node {node_name} is under memory pressure")
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.8.2)
48-48: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False. (B011)
def get_node_total_memory_in_mi(self, node_name):
    total_memory = self.node.get_node_total_memory(node_name)
    if "Ki" in total_memory:
        total_memory = int(total_memory.replace("Ki", ""))
        total_memory = total_memory / 1024
    elif "Mi" in total_memory:
        total_memory = int(total_memory.replace("Mi", ""))
    logging(f'Got node {node_name} total memory: {total_memory} Mi')
    return total_memory

def get_node_memory_usage_in_mi(self, node_name):
    memory_usage = get_node_metrics(node_name, 'memory')
    if "Ki" in memory_usage:
        memory_usage = int(memory_usage.replace("Ki", ""))
        memory_usage = memory_usage / 1024
    elif "Mi" in memory_usage:
        memory_usage = int(memory_usage.replace("Mi", ""))
    logging(f'Got node {node_name} memory usage: {memory_usage} Mi')
    return memory_usage

def get_node_memory_usage_in_percentage(self, node_name):
    memory_usage_in_mi = self.get_node_memory_usage_in_mi(node_name)
    total_memory_in_mi = self.get_node_total_memory_in_mi(node_name)
    memory_usage_in_percentage = memory_usage_in_mi / total_memory_in_mi * 100
    logging(f'Got node {node_name} memory usage: {memory_usage_in_percentage} %')
    return memory_usage_in_percentage
🛠️ Refactor suggestion
Refactor memory conversion logic to reduce code duplication.
The memory conversion logic is duplicated in both get_node_total_memory_in_mi and get_node_memory_usage_in_mi. Extract this into a helper method.
+def _convert_to_mi(self, memory_value: str) -> float:
+    """Convert memory value to MiB.
+
+    Args:
+        memory_value: Memory value with unit (e.g., "1024Ki", "512Mi")
+
+    Returns:
+        float: Memory value in MiB
+    """
+    if "Ki" in memory_value:
+        return int(memory_value.replace("Ki", "")) / 1024
+    elif "Mi" in memory_value:
+        return int(memory_value.replace("Mi", ""))
+    return float(memory_value)  # Assume MiB if no unit
+
 def get_node_total_memory_in_mi(self, node_name):
     total_memory = self.node.get_node_total_memory(node_name)
-    if "Ki" in total_memory:
-        total_memory = int(total_memory.replace("Ki", ""))
-        total_memory = total_memory / 1024
-    elif "Mi" in total_memory:
-        total_memory = int(total_memory.replace("Mi", ""))
+    total_memory = self._convert_to_mi(total_memory)
     logging(f'Got node {node_name} total memory: {total_memory} Mi')
     return total_memory

 def get_node_memory_usage_in_mi(self, node_name):
     memory_usage = get_node_metrics(node_name, 'memory')
-    if "Ki" in memory_usage:
-        memory_usage = int(memory_usage.replace("Ki", ""))
-        memory_usage = memory_usage / 1024
-    elif "Mi" in memory_usage:
-        memory_usage = int(memory_usage.replace("Mi", ""))
+    memory_usage = self._convert_to_mi(memory_usage)
     logging(f'Got node {node_name} memory usage: {memory_usage} Mi')
     return memory_usage
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _convert_to_mi(self, memory_value: str) -> float:
    """Convert memory value to MiB.

    Args:
        memory_value: Memory value with unit (e.g., "1024Ki", "512Mi")

    Returns:
        float: Memory value in MiB
    """
    if "Ki" in memory_value:
        return int(memory_value.replace("Ki", "")) / 1024
    elif "Mi" in memory_value:
        return int(memory_value.replace("Mi", ""))
    return float(memory_value)  # Assume MiB if no unit

def get_node_total_memory_in_mi(self, node_name):
    total_memory = self.node.get_node_total_memory(node_name)
    total_memory = self._convert_to_mi(total_memory)
    logging(f'Got node {node_name} total memory: {total_memory} Mi')
    return total_memory

def get_node_memory_usage_in_mi(self, node_name):
    memory_usage = get_node_metrics(node_name, 'memory')
    memory_usage = self._convert_to_mi(memory_usage)
    logging(f'Got node {node_name} memory usage: {memory_usage} Mi')
    return memory_usage

def get_node_memory_usage_in_percentage(self, node_name):
    memory_usage_in_mi = self.get_node_memory_usage_in_mi(node_name)
    total_memory_in_mi = self.get_node_total_memory_in_mi(node_name)
    memory_usage_in_percentage = memory_usage_in_mi / total_memory_in_mi * 100
    logging(f'Got node {node_name} memory usage: {memory_usage_in_percentage} %')
    return memory_usage_in_percentage
def get_node_condition(self, node_name, condition_type):
    node = self.get_node_by_name(node_name)
    for condition in node.status.conditions:
        if condition.type == condition_type:
            logging(f"Got node {node_name} condition {condition_type}: {condition}")
            return condition.status
    assert False, f"Failed to get node {node_name} condition {condition_type}: {node}"
🛠️ Refactor suggestion
Improve error handling and add documentation.
- Replace assert False with proper exception handling, as it can be removed by Python's -O flag.
- Add docstring and type hints for better code documentation.
-def get_node_condition(self, node_name, condition_type):
+def get_node_condition(self, node_name: str, condition_type: str) -> str:
+    """Get the status of a specific condition type for a node.
+
+    Args:
+        node_name: Name of the node to check.
+        condition_type: Type of condition to retrieve (e.g., 'Ready', 'MemoryPressure').
+
+    Returns:
+        str: The status of the condition ('True', 'False', or 'Unknown').
+
+    Raises:
+        RuntimeError: If the condition type is not found for the node.
+    """
     node = self.get_node_by_name(node_name)
     for condition in node.status.conditions:
         if condition.type == condition_type:
             logging(f"Got node {node_name} condition {condition_type}: {condition}")
             return condition.status
-    assert False, f"Failed to get node {node_name} condition {condition_type}: {node}"
+    raise RuntimeError(f"Failed to get node {node_name} condition {condition_type}: {node}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_node_condition(self, node_name: str, condition_type: str) -> str:
    """Get the status of a specific condition type for a node.

    Args:
        node_name: Name of the node to check.
        condition_type: Type of condition to retrieve (e.g., 'Ready', 'MemoryPressure').

    Returns:
        str: The status of the condition ('True', 'False', or 'Unknown').

    Raises:
        RuntimeError: If the condition type is not found for the node.
    """
    node = self.get_node_by_name(node_name)
    for condition in node.status.conditions:
        if condition.type == condition_type:
            logging(f"Got node {node_name} condition {condition_type}: {condition}")
            return condition.status
    raise RuntimeError(f"Failed to get node {node_name} condition {condition_type}: {node}")
🧰 Tools
🪛 Ruff (0.8.2)
133-133: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False. (B011)
Which issue(s) this PR fixes:
Issue longhorn/longhorn#10203
What this PR does / why we need it:
add test case Check If Nodes Are Under Memory Pressure After Cluster Restart
Special notes for your reviewer:
Additional documentation or context
Summary by CodeRabbit
New Features
Bug Fixes
Tests