
[Bug]: Broken LBU impacts current prod #380

Open
pierreozoux opened this issue Oct 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@pierreozoux
Contributor

pierreozoux commented Oct 18, 2024

What happened

It seems to me that the LBU is somehow broken, and our TAM Ilane confirmed this to me: it is currently a known bug on your side, but I don't know the bug number in your tracker.

Basically, the security group on an LBU doesn't work for a new node until we attach a public IP.

Steps to reproduce

So it currently impacts us in two ways:

Node reboot

The first time it happened was in April 2024. A control-plane node rebooted and came back to life, but couldn't reach the kubeAPI LBU.
To investigate, I attached a public IP to debug the node, and suddenly it worked again.
To me everything seems right in terms of security groups, so I don't understand why attaching a public IP would solve the issue. But somehow it did.

Node upgrade

Now I want to upgrade from 1.27 to 1.28, but when the new control-plane node appears, it can't reach the LBU; and if I attach a public IP, the same thing happens: suddenly it can.

Expected to happen

I'd like to be able to reboot a node or upgrade my cluster.

Add anything

The internal tickets on your side, in your support system, are:
374117, 378531

The fact that I can't upgrade is already worrying, but the fact that my cluster will go down again if a node reboots is what really worries me.

cluster-api output

NA

Environment

- Kubernetes version: (use `kubectl version`): 1.27.9
- OS (e.g. from `/etc/os-release`): NA
- Kernel (e.g. `uname -a`): NA
- cluster-api-provider-outscale version: v0.3.1
- cluster-api version: v1.5.5
- Install tools: NA
- Kubernetes Distribution: NA
- Kubernetes Distribution version: NA
@pierreozoux pierreozoux added the bug Something isn't working label Oct 18, 2024
@outscale-hmi
Contributor

Thank you for providing details on the connectivity issues with the load balancer after a node reboot or upgrade.
Based on our review, here are some likely causes:

  • If there are NAT gateway misconfigurations or DNS resolution issues for private IPs, this could cause connectivity interruptions to the load balancer after a reboot.
  • If you used the option to delete the outbound security group, this might be preventing the node from re-establishing a connection with the load balancer after a reboot.

=> Attaching a public IP triggers a reconfiguration or refresh of the network settings on the node and potentially on the LBU. This refresh makes the necessary routes or security group rules effective, enabling the node to communicate through the LBU as expected.
In other words, adding a public IP forces the LBU or the network layer to re-evaluate the connection, which temporarily resolves the issue.

To solve the issue permanently, we can implement reconcileNetworkAttachment logic, which would periodically check the network attachment status of the load balancer.
It would ensure that all necessary network attachments (e.g., routes, security group rules) are in place and consistent for each node, even as nodes are created, rebooted, or updated:

  • Automatic Re-Attachment: If the load balancer is found to be disconnected from the network after a node reboot, reconcileNetworkAttachment can automatically reattach it. This ensures the load balancer maintains its connection to the network without requiring manual intervention.
  • Self-Healing Mechanism: This function acts as a self-healing mechanism, continuously monitoring the load balancer’s connectivity. If connectivity is disrupted after reboots due to NAT, DNS, or security group issues, reconcileNetworkAttachment can restore the connection by reattaching the load balancer to the network.
  • Status Tracking and Visibility: With this logic in place, we can track the load balancer’s attachment status and log updates whenever an attachment is missing or restored. This also provides helpful visibility for diagnosing connectivity issues more effectively in the future.
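To make the idea concrete, here is a rough sketch of what such a reconcile step could look like. All type and function names below are illustrative only, not the provider's actual API; in the real controller the re-attachment would go through the Outscale API rather than a local map:

```go
// Hypothetical sketch of a reconcileNetworkAttachment-style step.
// Names and types are illustrative, not the provider's actual API.
package main

import "fmt"

// VM models a cluster node by its provider ID.
type VM struct {
	ID string
}

// LoadBalancer tracks which backend VMs are currently registered on the LBU.
type LoadBalancer struct {
	Name       string
	Registered map[string]bool
}

// ReconcileNetworkAttachment compares the desired set of nodes against the
// VMs currently registered on the load balancer and re-attaches any that are
// missing (e.g. after a reboot or a rolling upgrade). It returns the list of
// VM IDs it re-registered, so the caller can log what was healed.
func ReconcileNetworkAttachment(lb *LoadBalancer, desired []VM) []string {
	var reattached []string
	for _, vm := range desired {
		if !lb.Registered[vm.ID] {
			// A real implementation would call the cloud API here;
			// this sketch only updates local state.
			lb.Registered[vm.ID] = true
			reattached = append(reattached, vm.ID)
		}
	}
	return reattached
}

func main() {
	lb := &LoadBalancer{
		Name:       "kubeapi-lbu",
		Registered: map[string]bool{"i-aaa": true},
	}
	// i-bbb was just rebooted or recreated and lost its attachment.
	nodes := []VM{{ID: "i-aaa"}, {ID: "i-bbb"}}
	fmt.Println(ReconcileNetworkAttachment(lb, nodes)) // prints [i-bbb]
}
```

Run periodically (or on node events), this would cover the self-healing behavior described above without manual intervention.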

@pierreozoux
Contributor Author

@outscale-hmi thanks for your answer. I'm not sure I understand everything you said; I'd love to spend 30-60 minutes with you to discuss this topic.

Do you have a matrix account somewhere? Or an email?

It seems to me you are acknowledging the LBU bug, but aren't considering fixing the bug there. I think it was working at some point before April 2024, so I think it is a regression that should be possible to fix, and the best place to fix it is the LBU. But I don't have access to that; you do.

But this bug plus #383 means we are stuck with this Cluster API on Outscale.

@pierreozoux
Contributor Author

Actually, I think it is more the network infra/security group/VPC around the LBU than the LBU itself.

@outscale-hmi
Contributor

Hello @pierreozoux,
I started working on improving the reconciliation of the LBU, as it's missing some checks such as CheckLoadBalancerRegisterVm.
For me, if the reconcile is optimized, it can help detect anything missing or misconfigured in the infra/SG/VPC, and reconcile it.
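As a rough sketch of what a CheckLoadBalancerRegisterVm-style check could compute (the function name comes from the comment above, but this signature and logic are purely illustrative assumptions): compare the node IDs the cluster expects with the IDs actually registered on the LBU, and report both the missing and the stale registrations so the reconcile loop knows what to register or deregister.

```go
// Hypothetical sketch of a registration drift check for the LBU.
// The signature and logic are illustrative, not the provider's real code.
package main

import "fmt"

// CheckLoadBalancerRegisterVm compares the expected backend VM IDs with
// the IDs currently registered on the load balancer. It returns the
// missing registrations (expected but absent) and the stale ones
// (registered but no longer expected).
func CheckLoadBalancerRegisterVm(expected, registered []string) (missing, stale []string) {
	reg := make(map[string]bool, len(registered))
	for _, id := range registered {
		reg[id] = true
	}
	exp := make(map[string]bool, len(expected))
	for _, id := range expected {
		exp[id] = true
		if !reg[id] {
			missing = append(missing, id)
		}
	}
	for _, id := range registered {
		if !exp[id] {
			stale = append(stale, id)
		}
	}
	return missing, stale
}

func main() {
	// i-new was just created by an upgrade; i-old was replaced.
	missing, stale := CheckLoadBalancerRegisterVm(
		[]string{"i-new"}, []string{"i-old"})
	fmt.Println(missing, stale) // prints [i-new] [i-old]
}
```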

PS: yes, we can have a 1-hour meeting to discuss all of this.

@pierreozoux
Contributor Author

I think this #383 will solve our upgrade issue.

I'll try and let you know, but first I need to recreate a staging cluster and migrate staging workload.

@pierreozoux
Contributor Author

So I think it is fixed in main; what is the ETA for the next release?

Thanks 😃
