OCPBUGS-47701:[release-4.16] cmd: PPC: support tolerating heterogeneous core IDs #1268
base: release-4.16
…penshift#1252)

* OCPBUGS-44372: PPC: skip comparing ProcessorCore.Index between NUMA cores (openshift#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the Cores slice. While cores might be listed in a different order, they can still be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core openshift#20 (2 threads), logical processors [2 66] vs processor core openshift#20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)

* PPC: fix error message (openshift#1223)

Copy-paste mistake in reporting the difference in logical processors list; replace `.NumThreads` with `.LogicalProcessors`.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 7323f74)

* OCPBUGS-44372: cmd: PPC: support tolerating heterogeneous core IDs (openshift#1236)

* cmd: PPC: support tolerating heterogeneous core IDs

Problem: In a system with a specific number of cores per socket, the host gives each core a number. So far, PPC would generate a performance profile only after verifying that the nodes in the specified node pool all have the same hardware and topology. That is because the performance profile applies the same configuration to all these nodes, so it is most important that they have the same structure, the same distribution of CPUs across NUMA cells, and the same CPU availability.
By default, this includes having the same core IDs for each CPU in the same NUMA cell on all the compute nodes. For instance: worker-0 has NUMA-0, on which the CPU siblings [2,66] are coupled in one core numbered 18. The same must hold for all other workers, otherwise the tool would fail. It was observed (for example on Intel Xeon Gold 6438N with 0-127 online CPUs distributed across 2 sockets, 32 cores per socket and 2 threads per core) that core numbering can follow different schemes even on systems from the same vendor, which causes the tool to fail to generate a profile.

Suggested solution: Further investigation showed that the numbering pattern depends on the settings of the hardware, the software and the firmware (BIOS). While core IDs may vary, nodes can still be considered as having the same NUMA topology, taking into account that a core's scope is a single NUMA cell. However, core IDs can be important in optimizing the system's performance and managing isolation of tasks, meaning that if the performance of worker-0 is not comparable to that of worker-1 before the performance profile is applied to both, improved performance on both after applying the PP is not guaranteed.

In this PR, we loosen the hard requirement of having the same core numbering on the same NUMA cells on different systems into a warning, which is logged and also added as a comment on the generated output. So as long as the NUMA cells have the same logical processor count and IDs and the same number of threads, core ID equality is treated as best effort. That is because when scheduling workloads, we care about the logical processor IDs and their location on the NUMA cells.

Disclaimer: We support this option to unblock business matters; it is recommended to use the generated PP with caution.

Signed-off-by: Shereen Haj <[email protected]>

* PPC tests: ensure no intersection between calculated CPUsets siblings

We want to verify automatically that the generated CPU sets also include the siblings.
This is mainly critical when calculating the isolated vs. reserved CPU sets on the node: when hyperthreading is enabled, CPU siblings must belong to the same set, either reserved or isolated. Note that this only introduces a function to calculate the siblings and compare the result with the CPU sets generated by the PPC. This is already covered by the assertions on the exact expected CPU lists.

Signed-off-by: Shereen Haj <[email protected]>

* PPC-tests: ensure CPU sets are equal for nodes that differ only in core IDs

Add a new system topology containing the NUMA and CPU topologies of a new node that is identical in terms of NUMA and CPU counts and numbers but different in core IDs. Calculating the CPU sets (isolated, reserved, and offlined) for both nodes should result in equal sets.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit da3ed75)
(cherry picked from commit aa2b326)
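The two checks described in the commit message above (tolerating different core IDs when the sibling groups match, and keeping siblings in the same CPU set) can be illustrated with a minimal Go sketch. The `Core` type and the `CompareNUMACores` and `SplitSiblings` functions are hypothetical names made up for this illustration; they are neither the PPC's actual code nor the ghw API, and the sketch assumes each logical processor belongs to exactly one core.

```go
package main

import (
	"fmt"
	"sort"
)

// Core is a hypothetical, simplified stand-in for a processor core record.
type Core struct {
	ID                int   // host-assigned core ID; may differ between nodes
	LogicalProcessors []int // CPU IDs of the hardware threads (siblings)
}

// coreKey builds an order-independent key from a core's logical processors.
func coreKey(c Core) string {
	lps := append([]int(nil), c.LogicalProcessors...)
	sort.Ints(lps)
	return fmt.Sprint(lps)
}

// CompareNUMACores matches the cores of two NUMA cells by their sibling groups,
// ignoring slice order. same is true when both cells hold identical sibling
// groups; idsDiffer is true when a matched group carries a different core ID,
// which is the case the PR downgrades from an error to a warning.
func CompareNUMACores(a, b []Core) (same, idsDiffer bool) {
	if len(a) != len(b) {
		return false, false
	}
	ids := make(map[string]int, len(a))
	for _, c := range a {
		ids[coreKey(c)] = c.ID
	}
	for _, c := range b {
		key := coreKey(c)
		id, ok := ids[key]
		if !ok {
			return false, false
		}
		delete(ids, key) // each sibling group may be matched only once
		if id != c.ID {
			idsDiffer = true
		}
	}
	return true, idsDiffer
}

// SplitSiblings returns the cores whose threads are divided between the
// reserved and isolated sets; an empty result means siblings stay together.
func SplitSiblings(cores []Core, reserved, isolated map[int]bool) []Core {
	var split []Core
	for _, c := range cores {
		inReserved, inIsolated := false, false
		for _, lp := range c.LogicalProcessors {
			inReserved = inReserved || reserved[lp]
			inIsolated = inIsolated || isolated[lp]
		}
		if inReserved && inIsolated {
			split = append(split, c)
		}
	}
	return split
}

func main() {
	// Same sibling groups on both workers; only the core ID of [2 66] differs.
	worker0 := []Core{{ID: 0, LogicalProcessors: []int{0, 64}}, {ID: 18, LogicalProcessors: []int{2, 66}}}
	worker1 := []Core{{ID: 0, LogicalProcessors: []int{0, 64}}, {ID: 20, LogicalProcessors: []int{66, 2}}}

	same, idsDiffer := CompareNUMACores(worker0, worker1)
	fmt.Println(same, idsDiffer)

	reserved := map[int]bool{0: true, 64: true}
	isolated := map[int]bool{2: true, 66: true}
	fmt.Println(len(SplitSiblings(worker0, reserved, isolated)))
}
```

Keying each core by its sorted logical processors makes the comparison independent of both the core ID and the position in the slice, which mirrors the intent stated in the commit message: the logical processor IDs and their NUMA placement are what matter for scheduling.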
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shajmakh
@shajmakh: This pull request references Jira Issue OCPBUGS-47701, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@shajmakh: This pull request references Jira Issue OCPBUGS-47701, which is invalid:
/jira refresh
@shajmakh: This pull request references Jira Issue OCPBUGS-47701, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
/lgtm
/retest-required
@shajmakh: all tests passed!
manual cherrypick of #1252