OCPBUGS-47701:[release-4.16] cmd: PPC: support tolerating heterogeneous core IDs #1268

Open: 1 commit to merge into base branch release-4.16

shajmakh
Contributor

manual cherrypick of #1252

…penshift#1252)

* OCPBUGS-44372: PPC: skip comparing ProcessorCore.Index between NUMA cores  (openshift#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. Cores may be listed in a different order on different nodes
and still be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.
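
As an illustration of the adjusted check, here is a minimal sketch and not the code from this PR. The `processorCore` type is a local stand-in that only mirrors the fields named in this message, and `coresEquivalent` and `equalInts` are hypothetical helper names. It compares two cores on ID, thread count and logical processors while deliberately ignoring `Index`:

```go
package main

import "fmt"

// processorCore is a hypothetical local stand-in for the fields discussed
// above; it is not the ghw type itself.
type processorCore struct {
	ID                int
	Index             int // position in the Cores slice; ignored when comparing
	NumThreads        uint32
	LogicalProcessors []int
}

// equalInts reports whether two int slices hold the same values in the same order.
func equalInts(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// coresEquivalent compares two cores field by field and skips Index,
// because the listing order of cores may differ between nodes.
func coresEquivalent(a, b processorCore) bool {
	return a.ID == b.ID &&
		a.NumThreads == b.NumThreads &&
		equalInts(a.LogicalProcessors, b.LogicalProcessors)
}

func main() {
	host1 := processorCore{ID: 20, Index: 3, NumThreads: 2, LogicalProcessors: []int{2, 66}}
	host2 := processorCore{ID: 20, Index: 17, NumThreads: 2, LogicalProcessors: []int{2, 66}}
	fmt.Println(coresEquivalent(host1, host2)) // prints true
}
```

A struct-level `==` or deep-equality comparison would also compare `Index`, which is consistent with the error above, where both sides print identically yet are reported as different.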

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them
while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)

* PPC: fix error message (openshift#1223)

Copy-paste mistake in reporting the difference in logical processors
list; replace `.NumThreads` with `.LogicalProcessors`.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 7323f74)

* OCPBUGS-44372: cmd: PPC: support tolerating heterogeneous core IDs (openshift#1236)

* cmd: PPC: support tolerating heterogeneous core IDs

Problem: On a system with a given number of cores per socket, the host
assigns each core a number. Until now, PPC generated a performance
profile only after verifying that all nodes in the specified node pool
have the same hardware and topology. That is because the performance
profile applies the same configuration to all these nodes, so it is
essential that they have the same structure, the same CPU distribution
across NUMA cells, and the same CPU availability. By default, this
includes having the same core ID for each CPU in the same NUMA cell on
all the compute nodes. For instance:

Worker-0 has NUMA-0, on which the CPU siblings [2,66] are coupled in
one core numbered 18.
The same must hold for all other workers; otherwise the
tool fails.

It was observed (for example on Intel Xeon Gold 6438N with CPUs 0-127
online, distributed across 2 sockets with 32 cores per socket and 2 threads per core)
that core numbering can follow different schemes even on systems from the same vendor,
which causes the tool to fail to generate a profile.

Suggested solution:
Further investigation showed that the numbering pattern depends on the settings
of the hardware, the software, and the firmware (BIOS). While core IDs may vary,
nodes can still be considered to have the same NUMA topology, given
that a core's scope is a single NUMA cell. However, core IDs can matter when
optimizing the system's performance and managing task isolation: if the
performance of worker-0 is not comparable to that of worker-1 before
the performance profile is applied to both, improved
performance on both after applying the PP is not guaranteed.

In this PR, we loosen the hard requirement of having the same core numbering
on the same NUMA cells across different systems into a warning that is logged
and also added as a comment to the generated output. As long as the NUMA cells
have the same logical processor count and IDs and the same number of threads,
core ID equality is treated as best effort. That is because when scheduling workloads,
we care about the logical processor IDs and their location on the NUMA cells.
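
The relaxed comparison described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the PR's implementation: the `core` and `numaCell` types and the `coreIDsBySiblings` and `compareCells` helpers are hypothetical names. Two NUMA cells are treated as equivalent when they contain the same groups of logical processors, and differing core IDs only produce warnings:

```go
package main

import (
	"fmt"
	"sort"
)

// core and numaCell are hypothetical, simplified stand-ins for the topology data.
type core struct {
	ID                int
	LogicalProcessors []int
}

type numaCell struct {
	ID    int
	Cores []core
}

// coreIDsBySiblings maps each sorted sibling group (rendered as a string such
// as "[2 66]") to the ID of the core that holds it.
func coreIDsBySiblings(cell numaCell) map[string]int {
	m := make(map[string]int, len(cell.Cores))
	for _, c := range cell.Cores {
		lps := append([]int(nil), c.LogicalProcessors...)
		sort.Ints(lps)
		m[fmt.Sprint(lps)] = c.ID
	}
	return m
}

// compareCells returns whether the cells hold the same sibling groups, plus
// one warning per group whose core ID differs between the two cells.
func compareCells(a, b numaCell) (bool, []string) {
	ma, mb := coreIDsBySiblings(a), coreIDsBySiblings(b)
	if len(ma) != len(mb) {
		return false, nil
	}
	var warnings []string
	for group, idA := range ma {
		idB, ok := mb[group]
		if !ok {
			return false, nil // logical processors differ: a hard mismatch
		}
		if idA != idB {
			warnings = append(warnings, fmt.Sprintf(
				"NUMA cell %d: logical processors %s are on core %d vs core %d",
				a.ID, group, idA, idB))
		}
	}
	sort.Strings(warnings) // map iteration order is random; keep output stable
	return true, warnings
}

func main() {
	worker0 := numaCell{ID: 0, Cores: []core{{ID: 18, LogicalProcessors: []int{2, 66}}}}
	worker1 := numaCell{ID: 0, Cores: []core{{ID: 20, LogicalProcessors: []int{2, 66}}}}
	ok, warnings := compareCells(worker0, worker1)
	fmt.Println(ok, len(warnings))
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

Because each sibling group is compared as a whole, matching groups also implies a matching thread count per core, so the sketch does not check the thread count separately.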

Disclaimer: We support this option to unblock business needs, and we recommend
using the generated PP with caution.

Signed-off-by: Shereen Haj <[email protected]>

* PPC tests: ensure no intersection between calculated CPUsets siblings

We want to verify automatically that the generated CPU sets also
include the siblings. This is mainly critical when calculating the
isolated vs. reserved CPU sets on a node: when hyperthreading is enabled,
CPU siblings must belong to the same set, either reserved or isolated.

Note that this only introduces a function to calculate the siblings
and compare the result with the CPU sets generated by the PPC; the same
property is already covered by the assertions on the exact expected CPU lists.
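
A sibling check of this kind could look roughly like the sketch below. It is illustrative only and not the test helper added in this PR: `siblingsOf` and `siblingsKeptTogether` are hypothetical names, and the 8-CPU topology in `main` is made up for the example.

```go
package main

import "fmt"

// siblingsOf maps every logical CPU to the full thread group of its core
// (including the CPU itself). Each inner slice lists the threads of one core.
func siblingsOf(cores [][]int) map[int][]int {
	m := make(map[int][]int)
	for _, threads := range cores {
		for _, cpu := range threads {
			m[cpu] = threads
		}
	}
	return m
}

// siblingsKeptTogether returns an error for the first CPU in set that has a
// sibling outside the set, and nil when every sibling group is kept whole.
func siblingsKeptTogether(set []int, siblings map[int][]int) error {
	members := make(map[int]bool, len(set))
	for _, cpu := range set {
		members[cpu] = true
	}
	for _, cpu := range set {
		for _, sib := range siblings[cpu] {
			if !members[sib] {
				return fmt.Errorf("CPU %d is in the set but its sibling %d is not", cpu, sib)
			}
		}
	}
	return nil
}

func main() {
	// Made-up topology: 4 cores with 2 threads each.
	siblings := siblingsOf([][]int{{0, 4}, {1, 5}, {2, 6}, {3, 7}})

	reserved := []int{0, 4}
	isolated := []int{1, 2, 3, 5, 6, 7}
	fmt.Println(siblingsKeptTogether(reserved, siblings)) // prints <nil>
	fmt.Println(siblingsKeptTogether(isolated, siblings)) // prints <nil>

	// A split sibling pair: CPU 0 is present but CPU 4 is missing.
	fmt.Println(siblingsKeptTogether([]int{0, 1}, siblings))
}
```

If both the reserved and the isolated set pass this check and together cover all online CPUs, no sibling pair is split across the two sets.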

Signed-off-by: Shereen Haj <[email protected]>

* PPC-tests: ensure CPU sets are equal for nodes that differ only in core IDs

Add a new system topology containing the NUMA and CPU topologies of a
new node that is identical in NUMA and CPU counts and numbering
but differs in core IDs. Calculating the CPU sets (isolated, reserved,
and offlined) for both nodes should result in equal sets.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit da3ed75)
(cherry picked from commit aa2b326)
@openshift-ci openshift-ci bot requested review from rbaturov and Tal-or December 31, 2024 00:01
Contributor

openshift-ci bot commented Dec 31, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shajmakh
Once this PR has been reviewed and has the lgtm label, please assign jmencak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shajmakh shajmakh changed the title [release-4.16] cmd: PPC: support tolerating heterogeneous core IDs OCPBUGS-47701:[release-4.16] cmd: PPC: support tolerating heterogeneous core IDs Dec 31, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Dec 31, 2024
@openshift-ci-robot
Contributor

@shajmakh: This pull request references Jira Issue OCPBUGS-47701, which is invalid:

  • expected the bug to target either version "4.16." or "openshift-4.16.", but it targets "4.17.z" instead
  • expected Jira Issue OCPBUGS-47701 to depend on a bug targeting a version in 4.17.0, 4.17.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

manual cherrypick of #1252

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Dec 31, 2024
@shajmakh
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

@shajmakh: This pull request references Jira Issue OCPBUGS-47701, which is invalid:

  • expected Jira Issue OCPBUGS-47701 to depend on a bug targeting a version in 4.17.0, 4.17.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh


@shajmakh
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Dec 31, 2024
@openshift-ci-robot
Contributor

@shajmakh: This pull request references Jira Issue OCPBUGS-47701, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.z) matches configured target version for branch (4.16.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-44644 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-44644 targets the "4.17.z" version, which is one of the valid target versions: 4.17.0, 4.17.z
  • bug has dependents

Requesting review from QA contact:
/cc @mrniranjan

In response to this:

/jira refresh


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Dec 31, 2024
@openshift-ci openshift-ci bot requested a review from mrniranjan December 31, 2024 00:06
Contributor

@swatisehgal swatisehgal left a comment


/lgtm

/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2025
Contributor

openshift-ci bot commented Jan 3, 2025

@shajmakh: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

3 participants