OCPBUGS-44372: PPC: skip comparing ProcessorCore.Index between NUMA cores #1213
ProcessorCore.Index indicates the zero-based index of the core in the Cores slice. While cores might be listed in a different order, they can still be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>
@shajmakh: This pull request references Jira Issue OCPBUGS-44372, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/cherry-pick release-4.17 release-4.16 release-4.15 release-4.14 release-4.13 release-4.12
@shajmakh: once the present PR merges, I will cherry-pick it on top of In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/approve
LGTM once inline comments are addressed
topology2.Nodes[0].Cores[0].Index = 1
topology2.Nodes[0].Cores[1].Index = 0
Please deepcopy/clone the topology before changing it, to avoid polluting the global state for other test cases.
topology2 is reset before each test, by design of this group of tests. I adjusted the variable names and added clarifying comments here:
74b81bd
OK, great. If this is guaranteed, we don't need deepcopy.
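For readers weighing the two options in this thread, the following sketch shows what cloning a shared fixture before mutating it could look like. The `Info`, `Node` and `Core` types and the `cloneInfo` helper are simplified, hypothetical stand-ins for illustration; they are not the ghw types or code from this repository.

```go
package main

import "fmt"

// Core, Node and Info are simplified stand-ins for a topology fixture.
type Core struct {
	ID                int
	Index             int
	LogicalProcessors []int
}

type Node struct {
	ID    int
	Cores []*Core
}

type Info struct {
	Nodes []*Node
}

// cloneInfo (hypothetical helper) returns a deep copy, so a test can mutate
// the result without touching the shared fixture.
func cloneInfo(in *Info) *Info {
	out := &Info{Nodes: make([]*Node, 0, len(in.Nodes))}
	for _, n := range in.Nodes {
		nodeCopy := &Node{ID: n.ID, Cores: make([]*Core, 0, len(n.Cores))}
		for _, c := range n.Cores {
			nodeCopy.Cores = append(nodeCopy.Cores, &Core{
				ID:    c.ID,
				Index: c.Index,
				// Copy the slice so both values do not share a backing array.
				LogicalProcessors: append([]int(nil), c.LogicalProcessors...),
			})
		}
		out.Nodes = append(out.Nodes, nodeCopy)
	}
	return out
}

func main() {
	shared := &Info{Nodes: []*Node{{ID: 0, Cores: []*Core{{ID: 0, Index: 0}, {ID: 1, Index: 1}}}}}

	local := cloneInfo(shared)
	local.Nodes[0].Cores[0].Index = 1
	local.Nodes[0].Cores[1].Index = 0

	// The shared fixture keeps its original Index values.
	fmt.Println(shared.Nodes[0].Cores[0].Index, local.Nodes[0].Cores[0].Index) // 0 1
}
```

Resetting the fixture before each test, as the author describes, reaches the same isolation without the copy.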
"github.com/jaypipes/ghw/pkg/topology"
)

func TestSortTopology(t *testing.T) {
Why do we need to extract the sortTopology helper (and test a private function :\ )? Can we test the public SortedTopology() function?
The intention is to verify that sorting is done properly, which is the part that was extracted into a private function. To test SortedTopology() we need to initialize a handler object to be able to call it, and that function ultimately fetches the nodes' topologies and performs the actual sorting on them. Thus I don't see the value of testing SortedTopology(). I agree, however, that this is better moved to the existing test file, which tests the other private functions (like ensureSameTopology).
Note, however, that this is an additional commit unrelated to the bug fix, so I believe it is better split into another PR.
follow-up PR: #1217
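As a rough illustration of the point made in this thread, a pure sorting helper can be exercised directly on in-memory data, without a handler object. The types and the `sortNodes` function below are hypothetical stand-ins; the ordering rules (nodes by ID, cores by ID, logical processors ascending) are an assumption and may not match what sortTopology actually does.

```go
package main

import (
	"fmt"
	"sort"
)

// Core and Node are simplified stand-ins for the topology types.
type Core struct {
	ID                int
	LogicalProcessors []int
}

type Node struct {
	ID    int
	Cores []*Core
}

// sortNodes (hypothetical) sorts nodes by ID, cores within each node by ID,
// and the logical processors of each core in ascending order.
func sortNodes(nodes []*Node) {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].ID < nodes[j].ID })
	for _, n := range nodes {
		cores := n.Cores
		for _, c := range cores {
			sort.Ints(c.LogicalProcessors)
		}
		sort.Slice(cores, func(i, j int) bool { return cores[i].ID < cores[j].ID })
	}
}

func main() {
	nodes := []*Node{
		{ID: 1, Cores: []*Core{{ID: 3, LogicalProcessors: []int{67, 3}}}},
		{ID: 0, Cores: []*Core{
			{ID: 20, LogicalProcessors: []int{66, 2}},
			{ID: 4, LogicalProcessors: []int{64, 0}},
		}},
	}
	sortNodes(nodes)
	for _, n := range nodes {
		for _, c := range n.Cores {
			fmt.Println("node", n.ID, "core", c.ID, "cpus", c.LogicalProcessors)
		}
	}
}
```

Because the helper takes plain slices, a table-driven unit test only needs to build input values and compare the resulting order.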
/jira refresh
@shajmakh: This pull request references Jira Issue OCPBUGS-44372, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Rename test variables and add clarifying comments to avoid misusing them while writing tests. Signed-off-by: Shereen Haj <[email protected]>
a9abb69 to 74b81bd
@shajmakh: all tests passed!
/approve
/lgtm
@@ -730,8 +732,16 @@ func ensureSameTopology(topology1, topology2 *topology.Info) error {
}

for j, core1 := range cores1 {
if !reflect.DeepEqual(core1, cores2[j]) {
return fmt.Errorf("the CPU corres differ: %v vs %v", core1, cores2[j])
// skip comparing index because it's fine if they deffer; see https://github.com/jaypipes/ghw/issues/345#issuecomment-1620274077
typo: differ
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani, shajmakh
@shajmakh: Jira Issue OCPBUGS-44372: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-44372 has been moved to the MODIFIED state. In response to this:
@shajmakh: new pull request created: #1220 In response to this:
…ores (openshift#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the Cores slice. While cores might be listed in a different order, they can still be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)
[ART PR BUILD NOTIFIER] Distgit: cluster-node-tuning-operator
…ous core IDs (#1252)

* OCPBUGS-44372: PPC: skip comparing ProcessorCore.Index between NUMA cores (#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the Cores slice. While cores might be listed in a different order, they can still be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)

* PPC: fix error message (#1223)

Copy-paste mistake in reporting the difference in the logical processors list; replace `.NumThreads` with `.LogicalProcessors`.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 7323f74)

* OCPBUGS-44372: cmd: PPC: support tolerating heterogeneous core IDs (#1236)

* cmd: PPC: support tolerating heterogeneous core IDs

Problem:
In a system with a specific number of cores per socket, the host gives each core a number. So far PPC would generate a performance profile only after verifying that the nodes pointed to by the specified node pool all have the same hardware and topology. That is because the performance profile applies the same configuration to all these nodes, and it is most important that they have the same structure, CPU distribution across NUMAs and CPU availability. By default this includes having the same core IDs for each CPU in the same NUMA cells on all the compute nodes.

For instance: Worker-0 has NUMA-0, on which the CPU siblings [2,66] are coupled in one core numbered 18. The same must hold for all other workers, otherwise the tool would fail.

It was observed (for example on Intel Xeon Gold 6438N with 0-127 online CPUs distributed across 2 sockets, 32 cores per socket and 2 threads per core) that core numbering can follow different schemes even on systems from the same vendor, which causes the tool to fail to generate a profile.

Suggested solution:
Further investigation showed that the numbering pattern depends on the settings of the hardware, the software and the firmware (BIOS). While core IDs may vary, nodes can still be considered to have the same NUMA topology, taking into account that the scope of a core is a single NUMA. However, core IDs can be important in optimizing the system's performance and managing isolation of tasks, meaning that if the performance of worker-0 is not comparable to that of worker-1 before the performance profile is applied on both, improved performance on both after applying the PP is not guaranteed.

In this PR, we loosen the hard requirement of having the same core numbering on the same NUMA cells across different systems into a warning that is logged and also added as a comment to the generated output. So as long as the NUMA cells have the same logical processor count and IDs and the same number of threads, core ID equality is treated as best effort. That is because when scheduling workloads, we care about the logical processor IDs and their location on the NUMAs.

Disclaimer:
We support this option to unblock business matters, and it is recommended to use the generated PP with caution.

Signed-off-by: Shereen Haj <[email protected]>

* PPC tests: ensure no intersection between calculated CPUsets siblings

We want to verify automatically that the generated CPU sets also include the siblings. This is mainly critical when calculating the isolated vs reserved CPU sets on the node: when hyperthreading is enabled, CPU siblings must belong to the same set, either reserved or isolated. Note that this only introduces a function to calculate the siblings and compare the result with the CPU sets generated by the PPC; this is already covered by the assertions on the exact expected CPU lists.

Signed-off-by: Shereen Haj <[email protected]>

* PPC-tests: ensure CPU sets are equal for nodes that differ only in core IDs

Add a new system topology containing the NUMA and CPU topologies of a new node that is identical in terms of NUMA and CPU counts and numbers but differs in core IDs. Calculating the CPU sets (isolated, reserved, & offlined) for both nodes should result in equal sets.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit da3ed75)
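The relaxed check described in this commit message (hard failure on thread count or logical processor mismatch, best-effort warning on core ID mismatch) can be sketched as follows. The `Core` type, the `compareCores` name and the message wording are assumptions made for illustration, and the sketch assumes both slices are already sorted in the same order; it is not the code from #1236.

```go
package main

import (
	"fmt"
	"reflect"
)

// Core is a simplified stand-in for a processor core on one NUMA cell.
type Core struct {
	ID                int
	NumThreads        uint32
	LogicalProcessors []int
}

// compareCores (hypothetical) returns an error when thread counts or logical
// processors differ, and only collects warnings when core IDs differ.
// Both slices are assumed to be sorted in the same order.
func compareCores(cores1, cores2 []Core) ([]string, error) {
	if len(cores1) != len(cores2) {
		return nil, fmt.Errorf("number of cores differ: %d vs %d", len(cores1), len(cores2))
	}
	var warnings []string
	for i := range cores1 {
		c1, c2 := cores1[i], cores2[i]
		if c1.NumThreads != c2.NumThreads {
			return nil, fmt.Errorf("number of threads differ: %d vs %d", c1.NumThreads, c2.NumThreads)
		}
		if !reflect.DeepEqual(c1.LogicalProcessors, c2.LogicalProcessors) {
			return nil, fmt.Errorf("logical processors differ: %v vs %v", c1.LogicalProcessors, c2.LogicalProcessors)
		}
		if c1.ID != c2.ID {
			// Best effort: report the difference without failing.
			warnings = append(warnings, fmt.Sprintf(
				"core IDs differ for logical processors %v: %d vs %d",
				c1.LogicalProcessors, c1.ID, c2.ID))
		}
	}
	return warnings, nil
}

func main() {
	worker0 := []Core{{ID: 18, NumThreads: 2, LogicalProcessors: []int{2, 66}}}
	worker1 := []Core{{ID: 20, NumThreads: 2, LogicalProcessors: []int{2, 66}}}

	warnings, err := compareCores(worker0, worker1)
	fmt.Println("error:", err)
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

Returning warnings as values leaves it to the caller to both log them and embed them as comments in the generated output.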
…penshift#1252) Same squashed commit message as the preceding commit. (cherry picked from commit da3ed75) (cherry picked from commit aa2b326)
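The sibling rule mentioned in the test commit (with hyperthreading enabled, all siblings of a core belong to the same set, either reserved or isolated) can be checked with a small function like the one below. `splitSiblings` and its signature are hypothetical and are not the helper added to the PPC tests.

```go
package main

import "fmt"

// splitSiblings (hypothetical) returns the sibling groups that have CPUs in
// both the reserved and the isolated set, which violates the sibling rule.
func splitSiblings(siblingGroups [][]int, reserved, isolated map[int]bool) [][]int {
	var split [][]int
	for _, group := range siblingGroups {
		inReserved, inIsolated := false, false
		for _, cpu := range group {
			inReserved = inReserved || reserved[cpu]
			inIsolated = inIsolated || isolated[cpu]
		}
		if inReserved && inIsolated {
			split = append(split, group)
		}
	}
	return split
}

func main() {
	groups := [][]int{{0, 64}, {2, 66}}

	// Siblings kept together: no group is split.
	fmt.Println(splitSiblings(groups,
		map[int]bool{0: true, 64: true},
		map[int]bool{2: true, 66: true})) // []

	// CPU 0 reserved while its sibling 64 is isolated: the group is reported.
	fmt.Println(splitSiblings(groups,
		map[int]bool{0: true},
		map[int]bool{64: true, 2: true, 66: true})) // [[0 64]]
}
```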
ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While cores might be listed in a different order, they can still
be equivalent. See: jaypipes/ghw#346.
Adjust the equality check to skip this field to fix this:
And add a unit test to cover this scenario and the sortTopology function.
Signed-off-by: Shereen Haj [email protected]