Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDN-4930: Downstream Merge [12-19-2024] #2402

Merged
merged 43 commits into from
Jan 8, 2025

Conversation

jluhrsen
Copy link
Contributor

πŸ“‘ Description

Fixes #

Additional Information for reviewers

βœ… Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

jcaamano and others added 30 commits December 13, 2024 17:17
Hoping to bring improvement to the naming and interfacing of the nad
controller package to reduce confusion. No functional changes but,
obviously, broad code impact. On to the details...

Rename package network-attach-def-controller to networkmanager. The
controller struct itself has been unexported and instead two interfaces
are provided: Controller to start and stop, and NetworkManager to
provide (for now) read-only information about networks. A default
implementation is provided that assumes the default network is the only
ever existing network, to be used when multi-network capabilities are
disabled or for testing. Other unexported structs have also been renamed
accordingly as well as other types that followed the former naming scheme.

Rename package network-controller-manager to controllemanager,
understood as a package that manages other controllers that can be of any
nature: network specific, network aware or network unaware. The
interfaces that controller managers need to implement to interface with
network manager have also been renamed accordingly as well as other
types that followed the former naming scheme.

For reasons beyond my comprehension, it befell to me coming accross, and
fixing, a number of unrelated issues, like flaking unit tests, missing
nftables stubbing to be able to run the tests on the testing container,
etc... that given the broad impact of this commit, I decided to fix
here instead of bothering with new commits. While I was at it, fixed all
go vet warnings on impacted files because it bothers me to have my IDE
painted yellow. We should add that to our linter. Moving on...

Signed-off-by: Jaime CaamaΓ±o Ruiz <[email protected]>
Consolidate all getters in NetInfo becoming the single interface to
obtain network information intended to be used broadly as read-only.

Consolidate all setters in new MutableNetInfo interface. A
MutableNetInfo can only be built from a copy of NetInfo. Add setters for
route advertisements related information. This is intended to be used by
a network manager to aggregate consolidated information about the
network from multiple sources that can change over time.

Add a ReconcilableNetInfo interface acting as a receiver by which
changes done in a MutableNetInfo can be incorporated into an existing
NetInfo. A ReconcilableNetInfo can only be built from a copy of NetInfo.
It is  intended to be used by top level network controllers to reconcile
changes to network information when instructed by a network manager.

New free methods IsNetworkCompatible, DoesNetworkNeedReconciliation, and
Reconcile facilitate the coordination of the reconciliation.

The overall idea is to have a tighter control of NetInfo changes and
making sure that controllers become aware of those changes in a way that
allows the controllers to reconcile them.

Signed-off-by: Jaime CaamaΓ±o Ruiz <[email protected]>
Add the ability for network controllers to reconcile some network
information changes. Currently just changes of the VRFs the network is
leaking/advertising to. Support for reconciling NAD changes is not
included in this commit.

Currently reconciles:
- for zone network controllers to configure or not the pod IP to node IP
  SNAT on the GR for a nodes local to the zone
- for node network controllers to configure or not br-ex flows to
  redirect pod IP ingress traffic to the OVN network (except ingress to
  the management ip address). Only done for the default network, UDNs
  will be handled on a later commit.

This should be enough to provide direct ingress capabilities for the
default network in SGW mode.

Note that non-primary network controllers don't reconcile anything as
route advertising is not supported on them. Also cluster manager network
controllers don't reconcile much as they don't have the need.

Signed-off-by: Jaime CaamaΓ±o Ruiz <[email protected]>
Network manager will, upon ensuring a network, fetch relevant
information from RouteAdvertisements (such as VRFs being advertised on
and selected nodes to advertise EIPs) applying to the network and set
it on the corresponding NetInfo used to create the network controller,
or triggering a reconciliation if it was already running.

This relies on an annotation that will be set by cluster manager on NADs
pointing to the RouteAdvertisements that apply to the network (future
commit)

This will also be done for the default network. This commit assumes
cluster manager will create a dummy NAD for the default network in ovn-k
namespace were the annotaiton will be set. The NAD controller will
process the NAD and ensure the network on the network controller, just
like with any other network. The network controller gains access to the
default network controller, however, it treats that controller
differently: it's not started or stopped, obviously, just reconciled.

Signed-off-by: Jaime CaamaΓ±o Ruiz <[email protected]>
Fix kind export logs to address the following warning message:

  $ kind export logs --name ${KIND_CLUSTER_NAME} --loglevel=debug /tmp/kind/logs
  WARNING: --loglevel is deprecated, please switch to -v and -q!
  please adjust the command so it is equivalent and works on latest kind release

Signed-off-by: Flavio Fernandes <[email protected]>
This commit integrates ocp-traffic-flow-tests into the existing
GitHub E2E workflows. The control tests can now be executed as
a standalone lane.

Currently, the traffic flow tests use a specific commit from the
repository https://github.com/wizhaoredhat/ocp-traffic-flow-tests.git.
This is a temporary solution until the repository becomes part
of the ovn-kubernetes GitHub organization.

Next steps to look into, after this commit / pr:
- External and secondary network tests
- Add bandwidth thresholds that would be reasonable for Github
- Move repo to ovn-kubernetes org

Fixes: ovn-kubernetes/ovn-kubernetes#4756
Signed-off-by: Flavio Fernandes <[email protected]>
Some time echoserver in VM could take more than 20s to start up, so increase
timeout to 60 seconds to avoid flakiness.

Signed-off-by: Lei Huang <[email protected]>
Direct ingress for default network in SGW mode with route advertisements
'ubuntu-latest' now references ubuntu 24.04.
Previously, it referenced 22.04.

Signed-off-by: Martin Kennelly <[email protected]>
Statically set GH runner image to ubuntu 22.04 for tests
This reverts commit bae7288.

Signed-off-by: Jaime CaamaΓ±o Ruiz <[email protected]>
Looks like this path was missed and causes egress IP with UDNs on local
gateway mode to not function correctly. Reply traffic was being sent to
default cluster network patch port rather than the correct UDN patch
port.

Signed-off-by: Tim Rozet <[email protected]>
Previous to this change in dualstack, we are not
attempting to match the mgmt port IP family with
the EgressIP family.

Single stack clusters performed as expected.

Signed-off-by: Martin Kennelly <[email protected]>
Follow up work is required to run IPv6 CDN tests
in dualstack env because currently only IPv4 tests
are executed for dual stack clusters.
Upstream CI executes IPv6 tests for single stack IPv6
clusters only.

Also, skip test 'replies to egress IP packets that require fragmentation'
because it is broken. Tracked by downstream bug OCPBUGS-46476.

Signed-off-by: Martin Kennelly <[email protected]>
Fix for unexistent package msbuild in latest ubuntu images
Add EgressIP E2Es to Net Seg lanes
Latest metallb repo support dualstack at dev-env let's pin to it.

Signed-off-by: Enrique Llorente <[email protected]>
Add testing for:
- External clients to LB, NodePort
- Podify clients to LB

Signed-off-by: Enrique Llorente <[email protected]>
…-e2e

udn, e2e: Add external client to nodeport and loadblanacer services tests
ovs-doca requires mtu_request to be set on ports metadata for fragmented packets

Signed-off-by: Alin Gabriel Serdean <[email protected]>
PR #4907 used ginkgo v1 to utilize DescribeTable
but we already moved away from that as seen in
PR #4749

Also, executed 'go mod tidy && go mod vendor'
as go mod file was stale.

Follow up work is required to detect this scenario
in CI.

Signed-off-by: Martin Kennelly <[email protected]>
Since it will be part of overall network reconciliation when its
advertising configuration changes.

Signed-off-by: Jaime CaamaΓ±o Ruiz <[email protected]>
Used on secondary nic flows, not needed if pod network is advertised as
traffic is supposed to reach the node hosting the pod directly.

However, we add an SNAT rule for local IPT traffic flows. Otherwise the
node masquerade address is used as it is set as the source address in
the services route. Note that in this case the linux source selection
algorithm just looks at the default routing table and does not look at
non default, at least when the ip rule is based on packet marks set
through conntrack.

Signed-off-by: Jaime CaamaΓ±o Ruiz <[email protected]>
@jluhrsen
Copy link
Contributor Author

/test e2e-metal-ipi-ovn-ipv6-techpreview
/test e2e-aws-ovn-hypershift-conformance-techpreview
/test e2e-azure-ovn-techpreview
/test e2e-metal-ipi-ovn-dualstack-techpreview
/test e2e-vsphere-ovn-techpreview
/test e2e-aws-ovn-techpreview
/test e2e-gcp-ovn-techpreview
/test e2e-metal-ipi-ovn-techpreview
/test openshift-e2e-gcp-ovn-techpreview-upgrade

@jluhrsen
Copy link
Contributor Author

/hold
broken. most techpreview jobs failing. copying notes from slack thread here:

  • most (except 3) techpreview jobs failed
    -- see a lot of errors in test output about Error while dialing: dial unix /var/run/crio/crio.sock: connect: connection refused (example) and seen in the "pods should successfully create sandboxes)
    -- see a lot of messages about ingress operator going for a toss (example)
    -- nodes going notReady and/or unreachable (example)
    -- saw some UDN tests failing, but this one was odd in that it looked like it's comparing a v4 address to a v6 address
    -- this job had the most trouble (56 failures)

no clue where to start, but maybe understanding what's going on with the /var/run/crio/crio.sock message. Seems pretty low level

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 20, 2024
@jcaamano
Copy link
Contributor

/retest

@jcaamano
Copy link
Contributor

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview 3
/payload-aggregateperiodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial 3
/retest

Copy link
Contributor

openshift-ci bot commented Dec 27, 2024

@jcaamano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/718ec360-c43f-11ef-9c45-1b7a290aebcf-0

@jcaamano
Copy link
Contributor

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial 3

Copy link
Contributor

openshift-ci bot commented Dec 27, 2024

@jcaamano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/89858f30-c43f-11ef-80a1-4257b0c5355a-0

@jluhrsen
Copy link
Contributor Author

jluhrsen commented Jan 2, 2025

/payload 4.19 ci blocking
/payload 4.19 nightly blocking

Copy link
Contributor

openshift-ci bot commented Jan 2, 2025

@jluhrsen: trigger 4 job(s) of type blocking for the ci release of OCP 4.19

  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7add6520-c949-11ef-8d80-ae7a55f29e16-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.19

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.19-fips-payload-scan
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-microshift-release-4.19-periodics-e2e-aws-ovn-ocp-conformance
  • periodic-ci-openshift-microshift-release-4.19-periodics-e2e-aws-ovn-ocp-conformance-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7add6520-c949-11ef-8d80-ae7a55f29e16-1

@jluhrsen
Copy link
Contributor Author

jluhrsen commented Jan 4, 2025

update:

there are 3 optional jobs failing:

  1. aws-ovn-kubevirt is a permafail job, I can't find a single job that's passed since it started in July. Here is a PR to make it always_run: true
  2. metal-ipi-ovn-ipv6-techpreview is failing because of new tests in 4.19 that are not ipv6 ready. this commit is what brought in the tests that are failing in 4.19 that are not there in 4.18 which explains why 4.18 is passing. It's not a required job so we could merge with that understanding. However, there is another set of tests failing around openshift-insights that also looks related to v6 issues, but doesn't seem related to us.
  3. e2e-metal-ipi-ovn-techpreview looks like it may be defaulting to v6 or dualstack. I noticed the periodic job is passing and one difference is that it's explicitly passing v4 for v4 only jobs to the DEVSCRIPTS_CONFIG. Here is a PR to do that for ovnk pre-submits. Let's get that in and re-run these jobs.

payload blocking jobs

  1. the ci jobs had some trouble and three of them had some sort of "pod pending timeout". re-running those now. The hyperhift-aws-ovn job for this group did pass though, FWIW

/payload 4.19 ci blocking

  1. the nightly jobs seem to be in decent shape with only 4 that failed (of the 13)

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node 10

I think overall if the ci blocking payload jobs come back clean as well as the e2e-aws-upgrade-ovn-single-node aggregate we may be good to go with merging this.

Copy link
Contributor

openshift-ci bot commented Jan 4, 2025

@jluhrsen: trigger 4 job(s) of type blocking for the ci release of OCP 4.19

  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bee03e30-ca31-11ef-8b95-fdd856e0f060-0

trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bee03e30-ca31-11ef-8b95-fdd856e0f060-1

@jluhrsen
Copy link
Contributor Author

jluhrsen commented Jan 7, 2025

/test e2e-metal-ipi-ovn-techpreview
/test e2e-metal-ipi-ovn-ipv6-techpreview

Copy link
Contributor

openshift-ci bot commented Jan 7, 2025

@jluhrsen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-windows 2e03a16 link true /test e2e-aws-ovn-windows
ci/prow/okd-scos-e2e-aws-ovn 2e03a16 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-kubevirt 2e03a16 link false /test e2e-aws-ovn-kubevirt
ci/prow/security 2e03a16 link false /test security
ci/prow/e2e-metal-ipi-ovn-ipv6-techpreview 2e03a16 link false /test e2e-metal-ipi-ovn-ipv6-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jluhrsen
Copy link
Contributor Author

jluhrsen commented Jan 7, 2025

all the jobs in the payload 4.19 ci blocking for gcp-ovn-upgrade failed with some docker build issue getting the kmod package. let's try again. hoping that was some random package/infra issue:

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Jan 7, 2025

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4ec2b940-cd40-11ef-8406-ffdccb236d01-0

@jluhrsen
Copy link
Contributor Author

jluhrsen commented Jan 7, 2025

I think we should ignore the e2e-aws-upgrade-ovn-single-node aggregate job. the latest batch of 10 had 3 pass, 7 fail but even the periodic job tracked in sippy is passing under 50% of the time. I checked all 7 failures and none are UDN and none are consistent (different failures in each job).

@jluhrsen
Copy link
Contributor Author

jluhrsen commented Jan 7, 2025

the e2e-metal-ipi-ovn-ipv6-techpreview job is failing because the tests are not v6 ready and I think we are ok with that for now knowing the fixes for those tests are coming.

otherwise if the aggregate e2e-gcp-ovn-upgrade comes back clean I think we are good to get this in.

@jluhrsen
Copy link
Contributor Author

jluhrsen commented Jan 8, 2025

the e2e-metal-ipi-ovn-ipv6-techpreview job is failing because the tests are not v6 ready and I think we are ok with that for now knowing the fixes for those tests are coming.

otherwise if the aggregate e2e-gcp-ovn-upgrade comes back clean I think we are good to get this in.

@trozet @tssurya @jcaamano , I think we are good here. the aggregate results were 9 pass, 1 fail

can we get this in and I'll start on the next one.

@jcaamano
Copy link
Contributor

jcaamano commented Jan 8, 2025

/lgtm
/approve
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2025
Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcaamano, jluhrsen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit d8e16ec into openshift:master Jan 8, 2025
36 of 39 checks passed
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ovn-kubernetes-base
This PR has been included in build ose-ovn-kubernetes-base-container-v4.19.0-202501081241.p0.gd8e16ec.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ovn-kubernetes-microshift
This PR has been included in build ovn-kubernetes-microshift-container-v4.19.0-202501081241.p0.gd8e16ec.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-ovn-kubernetes
This PR has been included in build ose-ovn-kubernetes-container-v4.19.0-202501081241.p0.gd8e16ec.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.