SDN-5297,OCPBUGS-46527: DownStream Merge Sync from 4.19 [12-19-2024] #2403
base: release-4.18
Conversation
Signed-off-by: Surya Seetharaman <[email protected]>
The gateway init is only ever invoked for primary UDNs if the feature gate is enabled, so we don't need to re-check this in every `if` inside `gatewayInit`. Signed-off-by: Surya Seetharaman <[email protected]>
This reverts commit 8b94231. Signed-off-by: Surya Seetharaman <[email protected]>
- LocalNet support removed, as it's not supported at the moment.
- Introduce ClusterUserDefinedNetwork CRD and types.

The changes in this commit are the result of running:

    crd-ref-docs --source-path ./go-controller/pkg/crd/userdefinednetwork --config=crd-docs-config.yaml --renderer=markdown --output-path=./docs/api-reference/userdefinednetwork-api-spec.md

Signed-off-by: Or Mergi <[email protected]>
By using maxUnavailable, we can make the redeploy faster. Signed-off-by: Or Shoval <[email protected]>
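For context, a minimal sketch of what raising maxUnavailable on a DaemonSet looks like in Go (the helper name and the 100% value are assumptions, not taken from this PR):

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// tuneDaemonSetRollout raises maxUnavailable so a redeploy rolls many pods
// in parallel instead of one node at a time (hypothetical helper).
func tuneDaemonSetRollout(ds *appsv1.DaemonSet) {
	maxUnavailable := intstr.FromString("100%") // assumed value; pick what the test tolerates
	ds.Spec.UpdateStrategy = appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```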
Signed-off-by: Or Mergi <[email protected]>
Routes via mp0 were being deleted on every ovnkube-node restart:

    [root@ovn-worker ~]# ip monitor route
    Deleted 192.72.3.0/24 dev ovn-k8s-mp0 proto kernel scope link src 192.72.3.2
    Deleted broadcast 192.72.3.255 dev ovn-k8s-mp0 table local proto kernel scope link src 192.72.3.2
    Deleted local 192.72.3.2 dev ovn-k8s-mp0 table local proto kernel scope host src 192.72.3.2
    local 192.72.3.2 dev ovn-k8s-mp0 table local proto kernel scope host src 192.72.3.2
    broadcast 192.72.3.255 dev ovn-k8s-mp0 table local proto kernel scope link src 192.72.3.2

This causes a traffic outage during upgrade, as well as other unwanted side effects when pod-destined traffic is routed via the default gateway route in the host. This is especially disruptive in local gateway mode. This patch removes the teardown and makes the synchronization of addresses and routes more robust, so that we can safely handle changes to the MTU or mp0 addresses.

Signed-off-by: Tim Rozet <[email protected]>
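The "more robust" sync could, for example, replace delete-then-add with an idempotent replace. A minimal sketch using the vishvananda/netlink package (names are illustrative; this is not the actual patch):

```go
package sketch

import (
	"net"

	"github.com/vishvananda/netlink"
)

// syncMp0Route ensures the pod-subnet route via ovn-k8s-mp0 exists without
// deleting it first; RouteReplace is a no-op when the route already matches.
func syncMp0Route(subnet *net.IPNet, src net.IP) error {
	link, err := netlink.LinkByName("ovn-k8s-mp0")
	if err != nil {
		return err
	}
	return netlink.RouteReplace(&netlink.Route{
		LinkIndex: link.Attrs().Index,
		Dst:       subnet,
		Src:       src,
	})
}
```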
Fixes unexpected mp0 route removal during startup
In L2 UDNs, multiple networks are set on the rtos-/stor- ports, so when `lb_force_snat_ip` is set to `router-ip`, OVN picks the lexicographically first subnet IP to perform the SNAT, which breaks services on L2. With this fix we start using the joinIP to explicitly SNAT traffic to.

This is what we used to do before, except we had to switch to `router-ip` to deal with the following case: ext->ovn-workerIP:NodePort backed by an ovn-worker2 host-networked pod: ext->ovn-worker->br-ex->GR(DNAT to ovn-worker2)->GR(SNAT to joinIP)->send it back to br-ex->underlay->ovn-worker2. Here traffic goes out with the joinIP, which breaks this flow. But UDNs don't support host-networked pods yet, much less host-networked-pod backends for services. So when that support gets added, we must make sure to add flows to br-ex, similar to the per-network masqueradeIP flows we already have, but for the joinIP, to do another SNAT to the nodeIP before sending traffic out.

Signed-off-by: Surya Seetharaman <[email protected]>
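A minimal sketch of the idea (illustrative names, not the exact patch): pin the gateway router's `lb_force_snat_ip` option to the network's concrete join IP instead of the `router_ip` keyword.

```go
package main

import "fmt"

// buildGatewayRouterOptions returns the options for a gateway router.
// Passing the concrete join IP avoids OVN picking the lexicographically
// first subnet IP when several networks share the router port.
func buildGatewayRouterOptions(joinIP string) map[string]string {
	return map[string]string{
		// previously: "lb_force_snat_ip": "router_ip"
		"lb_force_snat_ip": joinIP,
	}
}

func main() {
	fmt.Println(buildGatewayRouterOptions("100.65.0.4")) // hypothetical join IP
}
```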
There is a bug in OVN where LB templates do not work when lb_force_snat is not specified as router_ip. Additionally, services on UDNs are only supported in single node IC right now. Temporarily set LB templates to false until OVN is fixed. OVN Bug: https://issues.redhat.com/browse/FDP-988 Signed-off-by: Tim Rozet <[email protected]>
In a scenario where a ClusterUserDefinedNetwork object exists with role=primary, a primary NAD may exist in the subject namespace, similar to the UserDefinedNetwork object case. When looking for the active network in the subject namespace, check whether a primary CUDN exists and refers to the subject namespace. Signed-off-by: Or Mergi <[email protected]>
In case the ovn control plane pod is restarted (for example, on node restart, upgrade, or any other pod restart reason), the ipam claim Sync function created an array of all the ipams belonging to a specific network and used the named allocator's AllocateIPs method to allocate them, so that the allocator would reflect the current cluster state. The problem is that AllocateIPs allows allocating only one IP per given subnet, so once it is used for all the IPs in the network at once, it fails, causing the control plane to crash loop. Fix it by calling AllocateIPs for each claim on its own. The claims were already created correctly, so calling AllocateIPs per claim is safe. Add a unit test to show the relevant assertion. Signed-off-by: Or Shoval <[email protected]>
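A minimal sketch of the fix under assumed types (the real code lives in the ipam claim sync; `Allocator` here only mirrors the relevant method):

```go
package sketch

import (
	"fmt"
	"net"
)

// Allocator mirrors the one relevant method of the named allocator:
// AllocateIPs accepts at most one IP per subnet it manages.
type Allocator interface {
	AllocateIPs(ips []*net.IPNet) error
}

// syncClaims calls AllocateIPs once per claim. Each claim holds at most one
// IP per subnet, which is exactly what AllocateIPs supports, so a per-claim
// call cannot trip the one-IP-per-subnet restriction.
func syncClaims(alloc Allocator, claims [][]*net.IPNet) error {
	for i, claimIPs := range claims {
		if err := alloc.AllocateIPs(claimIPs); err != nil {
			return fmt.Errorf("claim %d: %w", i, err)
		}
	}
	return nil
}
```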
…rks-case UDN: L2: Fix router ip on multiple networks set on LRP case
Signed-off-by: Or Shoval <[email protected]>
Since the hostname is now populated via DHCP, we can always use the VMI name in the console expecter. Signed-off-by: Or Shoval <[email protected]>
Signed-off-by: Dan Winship <[email protected]>
Signed-off-by: Dan Winship <[email protected]>
Signed-off-by: Dan Winship <[email protected]>
ipam: Fix init flow in case there are sticky IPs in the system
UDN: Patch Kubevirt CR to support managedTap binding
port management-port-related iptables code to nftables
The `AllocateIPs` method name is misleading; it receives a slice of IPs as a parameter, which leads the user to think that it can allocate the IPs in the slice within a single subnet - i.e. for subnet 10.0.0.0/24, it could allocate IPs 10.0.0.2 and 10.0.0.3. That is not the case: the function allocates a **single** IP address in each of the subnets the IP allocator for said network manages. For instance, on a dual-stack network, it could handle the following subnets: 10.0.0.0/24 and fd10::/64. The function could then be invoked with ips=10.0.0.2,fd10::2 to allocate **one** IP in each of the two subnets managed by the IP allocator. Signed-off-by: Miguel Duarte Barroso <[email protected]>
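To make that contract concrete, a small standalone illustration (the allocator itself is not shown; the comments mark which inputs it accepts):

```go
package main

import (
	"fmt"
	"net"
)

func mustParse(cidr string) *net.IPNet {
	ip, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	ipnet.IP = ip
	return ipnet
}

func main() {
	// Accepted on a dual-stack allocator managing 10.0.0.0/24 and fd10::/64:
	// exactly one IP per managed subnet.
	valid := []*net.IPNet{mustParse("10.0.0.2/24"), mustParse("fd10::2/64")}

	// Rejected: two IPs in the same subnet. This is why the claim-sync fix
	// above calls AllocateIPs once per claim.
	invalid := []*net.IPNet{mustParse("10.0.0.2/24"), mustParse("10.0.0.3/24")}

	fmt.Println(valid, invalid)
}
```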
NAD validations: Fix typo in error message
CreateOrReplaceLogicalRouterStaticRouteWithPredicate already does duplicate detection on static routes based on the predicate it gets as input, but the code in CreateDefaultRouteToExternal calls it with a predicate that is too restrictive: it checks against next hop, IP prefix and policy (src/dst). In corner cases where the gateway IP has changed (https://issues.redhat.com/browse/OCPBUGS-32754, for unknown reason), CreateDefaultRouteToExternal will add its static route to the gateway IP; the existing route, pointing to the incorrect gateway router IP, will stay in nbdb.

Simply removing the check against the next hop in the predicate is not enough, since we would wipe out the existing route for the node subnet via mp0. We should also take into account that a cluster subnet can be expanded, however rare that is, by choosing a lower mask value, as long as the network address stays the same and the host prefix length is also unchanged.

The new predicate skips the routes it identifies as node subnet routes for the local node and matches against any existing cluster subnet route, including one that has been expanded. Unit tests are added to capture these scenarios.

Signed-off-by: Riccardo Ravaioli <[email protected]>
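The shape of the new predicate, roughly (field and function names are illustrative, not the exact patch):

```go
package sketch

import "strings"

// LogicalRouterStaticRoute mirrors just the nbdb fields this sketch needs.
type LogicalRouterStaticRoute struct {
	IPPrefix string
	Nexthop  string
}

// sameNetworkAddress reports whether two CIDRs share a network address,
// which is the part that stays constant when a cluster subnet is expanded
// to a lower mask value.
func sameNetworkAddress(a, b string) bool {
	return strings.Split(a, "/")[0] == strings.Split(b, "/")[0]
}

// staleRoutePredicate matches existing cluster-subnet routes to replace,
// even expanded ones, while skipping the local node-subnet route via mp0.
func staleRoutePredicate(nodeSubnet, clusterSubnet string) func(*LogicalRouterStaticRoute) bool {
	return func(r *LogicalRouterStaticRoute) bool {
		if sameNetworkAddress(r.IPPrefix, nodeSubnet) {
			return false // never touch the route for the local node subnet
		}
		return sameNetworkAddress(r.IPPrefix, clusterSubnet)
	}
}
```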
Signed-off-by: fossabot <[email protected]> Signed-off-by: Surya Seetharaman <[email protected]>
Add unit tests to document the behavior for localnet networks (with or without subnets defined) on IC/non-IC deployments. Signed-off-by: Miguel Duarte Barroso <[email protected]>
Add unit tests to assert the syncAll operation tolerates the `ErrNetworkControllerTopologyNotManaged` error, which was crash-looping the cluster manager when processing localnet networks without subnets defined. Signed-off-by: Miguel Duarte Barroso <[email protected]>
…it-tests cluster manager, tests: ignore topology not managed errors
The previous condition was incorrect and was ignoring host-networked pods only when `addPort` was false. This means that the controller tried to configure the logical port for new host-networked pods. Fix the condition by exiting when the pod is host networked or `addPort` is false. Signed-off-by: Patryk Diak <[email protected]>
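Rendered literally as code, the corrected guard looks like this (a sketch; the surrounding function is hypothetical):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// shouldConfigureLogicalPort is false for host-networked pods regardless of
// addPort, matching the corrected condition described above.
func shouldConfigureLogicalPort(pod *corev1.Pod, addPort bool) bool {
	return !pod.Spec.HostNetwork && addPort
}
```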
Signed-off-by: Dmitry Porokh <[email protected]>
Today ovnkube-controller relies on ovnkube-node to annotate the MAC address so that it can properly program the OVN mgmt port. In practice with UDN, this means cluster manager creates the node and allocates some subnet info to the node, both the node and the ovnkube-controller side get updates, but ovnkube-controller fails to program the first time because it is waiting for the node side to configure the MAC address. This patch changes the behavior of ovn-kubernetes to calculate the MAC address for the management port from the mgmt port IP address of the first subnet for the network. For backwards compatibility, it will first attempt to read the node annotation, and if no mgmt MAC exists there, it will derive it from the mgmt IP of the subnet. Signed-off-by: Tim Rozet <[email protected]>
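For IPv4, deriving a MAC from the mgmt IP can look like the sketch below; the `0a:58` prefix is the convention OVN uses for IPv4-derived MACs (IPv6 derivation differs and is omitted; this is a sketch, not the exact ovn-kubernetes helper):

```go
package main

import (
	"fmt"
	"net"
)

// ipv4ToMAC derives a deterministic MAC from an IPv4 address by appending
// its four octets to the 0a:58 prefix.
func ipv4ToMAC(ip net.IP) (net.HardwareAddr, error) {
	v4 := ip.To4()
	if v4 == nil {
		return nil, fmt.Errorf("%s is not an IPv4 address", ip)
	}
	return net.HardwareAddr{0x0a, 0x58, v4[0], v4[1], v4[2], v4[3]}, nil
}

func main() {
	mac, _ := ipv4ToMAC(net.ParseIP("10.244.0.2")) // hypothetical mgmt port IP
	fmt.Println(mac)                               // 0a:58:0a:f4:00:02
}
```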
With the change to no longer depend on the mgmt port MAC address annotation on the node, the egress firewall test now unexpectedly syncs the management port correctly during the test. This causes addAllowACLFromNode to execute and add an ACL for network policy to the node switch, which in turn makes the egress firewall test fail, because it sees more than 0 ACLs on the switch. This commit fixes it by properly scoping the lookup in the egress firewall test to ACLs only applicable to egress firewall. Signed-off-by: Tim Rozet <[email protected]>
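Conceptually, the scoped lookup filters the switch's ACLs down to those owned by egress firewall, e.g. via an owner external-ID (the key and value below are illustrative, not the exact ones used in the test):

```go
package sketch

// ACL mirrors just the nbdb field this sketch needs.
type ACL struct {
	ExternalIDs map[string]string
}

// egressFirewallACLs keeps only ACLs owned by egress firewall, so unrelated
// ACLs (such as the allow-from-node one) no longer fail the assertion.
func egressFirewallACLs(acls []*ACL) []*ACL {
	var out []*ACL
	for _, acl := range acls {
		if acl.ExternalIDs["owner-type"] == "EgressFirewall" { // illustrative key
			out = append(out, acl)
		}
	}
	return out
}
```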
Signed-off-by: Tim Rozet <[email protected]>
L3 UDN: EgressIP hosted by primary interface (`breth0`)
This change may add a 5ms delay on each informer start (on controller start). It shouldn't be an issue because this delay is negligible compared to the 100ms cache status polling interval. Signed-off-by: Dmitry Porokh <[email protected]>
Instead of failing, ignore EndpointSlices without a service label set. These custom, user-created slices should be ignored. Signed-off-by: Patryk Diak <[email protected]>
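The check amounts to skipping slices that lack the standard `kubernetes.io/service-name` label; a minimal sketch (the helper name is assumed; `discoveryv1.LabelServiceName` is the real constant for that label):

```go
package sketch

import discoveryv1 "k8s.io/api/discovery/v1"

// serviceForSlice returns the owning service name and whether the slice is
// managed by a service at all; custom, user-created slices return false and
// should simply be skipped rather than treated as an error.
func serviceForSlice(slice *discoveryv1.EndpointSlice) (string, bool) {
	name, ok := slice.Labels[discoveryv1.LabelServiceName]
	return name, ok
}
```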
Calculates mgmt port MAC rather than storing
Improves pod deletion with user-defined networks
UDN Gateway: Ignore EndpointSlices without a service label
This is a downstream-only fix; it isn't applied upstream because the feature needs a kernel fix that the upstream test runners don't yet have. Signed-off-by: Jamo Luhrsen <[email protected]>
OCPBUGS-41156, SDN-4930, OCPBUGS-44794, OCPBUGS-43354, OCPBUGS-43519, OCPBUGS-32754: Downstream Merge [12-04-2024]
@jluhrsen: This pull request references SDN-5297, which is a valid Jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jluhrsen

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@jluhrsen: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload 4.18 ci blocking
@jluhrsen: trigger 4 job(s) of type blocking for the ci release of OCP 4.18
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c5399a0-be68-11ef-80a2-9e6442786538-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.18

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c5399a0-be68-11ef-80a2-9e6442786538-1
/retitle SDN-5297,OCPBUGS-46527: DownStream Merge Sync from 4.19 [12-19-2024]
@jluhrsen: This pull request references SDN-5297, which is a valid Jira issue. This pull request references Jira Issue OCPBUGS-46527, which is valid. 7 validation(s) were run on this bug.

Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Description
Fixes #
Additional Information for reviewers
Checks
How to verify it