Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-44372: PPC: skip comparing ProcessorCore.Index between NUMA cores #1213

Merged
merged 2 commits into from
Nov 18, 2024

Conversation

shajmakh
Copy link
Contributor

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

And add a unit test to cover this scenario and the sortTopology function.

Signed-off-by: Shereen Haj [email protected]

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core openshift#20 (2 threads), logical processors [2 66] vs processor core openshift#20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Nov 11, 2024
@openshift-ci-robot
Copy link
Contributor

@shajmakh: This pull request references Jira Issue OCPBUGS-44372, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

 Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

And add a unit test to cover this scenario and the sortTopology function.

Signed-off-by: Shereen Haj [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 11, 2024
@openshift-ci openshift-ci bot requested review from jmencak and yanirq November 11, 2024 12:58
@shajmakh
Copy link
Contributor Author

/cherry-pick release-4.17 release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

@openshift-cherrypick-robot

@shajmakh: once the present PR merges, I will cherry-pick it on top of release-4.17 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.17 release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

LGTM once inline comments are addressed

Comment on lines 1478 to 1479
topology2.Nodes[0].Cores[0].Index = 1
topology2.Nodes[0].Cores[1].Index = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please deepcopy/clone the topology before to change it to avoid polluting the global state to other test cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before each test the topology2 is reset by design of this group of tests. I adjusted the variable names and added clarifying comments here:
74b81bd

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, great. If this is guartanteed we don't need deepcopy.

"github.com/jaypipes/ghw/pkg/topology"
)

func TestSortTopology(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to extract the sortTopology helper (and test a private function :\ )? Can we test the public SortedTopology() public function?

Copy link
Contributor Author

@shajmakh shajmakh Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intention is to verify sorting is done properly which is the part that was exported as a private function. To test SOrtedTopology() wee need to initialize a handler object to be able to call it, and in this function ultimately fetch the nodes' topologies and perform the actual sorting on that. Thus I don't see the value of testing the SortedTopology. I agree however this was better moved to the existing test file as testing the other private functions (like ensureSameTopology).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note however this is an additional commit that doesn't relate to the bug fix so I believe it is better split to another PR

Copy link
Contributor Author

@shajmakh shajmakh Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up PR: #1217

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2024
@shajmakh
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 13, 2024
@openshift-ci-robot
Copy link
Contributor

@shajmakh: This pull request references Jira Issue OCPBUGS-44372, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mrniranjan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from mrniranjan November 13, 2024 11:05
Rename test variables and add clarifying comments to avoid misusing them
while writing tests.

Signed-off-by: Shereen Haj <[email protected]>
Copy link
Contributor

openshift-ci bot commented Nov 14, 2024

@shajmakh: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

@@ -730,8 +732,16 @@ func ensureSameTopology(topology1, topology2 *topology.Info) error {
}

for j, core1 := range cores1 {
if !reflect.DeepEqual(core1, cores2[j]) {
return fmt.Errorf("the CPU corres differ: %v vs %v", core1, cores2[j])
// skip comparing index because it's fine if they deffer; see https://github.com/jaypipes/ghw/issues/345#issuecomment-1620274077
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: differ

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 18, 2024
Copy link
Contributor

openshift-ci bot commented Nov 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit ad4a69f into openshift:master Nov 18, 2024
17 checks passed
@openshift-ci-robot
Copy link
Contributor

@shajmakh: Jira Issue OCPBUGS-44372: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-44372 has been moved to the MODIFIED state.

In response to this:

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

 Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

And add a unit test to cover this scenario and the sortTopology function.

Signed-off-by: Shereen Haj [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@shajmakh: new pull request created: #1220

In response to this:

/cherry-pick release-4.17 release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

shajmakh added a commit to shajmakh/cluster-node-tuning-operator that referenced this pull request Nov 18, 2024
…ores (openshift#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core openshift#20 (2 threads), logical processors [2 66] vs processor core openshift#20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them
while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-node-tuning-operator
This PR has been included in build cluster-node-tuning-operator-container-v4.19.0-202411181037.p0.gad4a69f.assembly.stream.el9.
All builds following this will include this PR.

shajmakh added a commit to shajmakh/cluster-node-tuning-operator that referenced this pull request Dec 12, 2024
…ores (openshift#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core openshift#20 (2 threads), logical processors [2 66] vs processor core openshift#20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them
while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)
openshift-merge-bot bot pushed a commit that referenced this pull request Dec 13, 2024
…ous core IDs (#1252)

* OCPBUGS-44372: PPC: skip comparing ProcessorCore.Index between NUMA cores  (#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them
while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)

* PPC: fix error message (#1223)

Copy-paste mistake in reporting the difference in logical processors
list; replace `.NumThreads` with `.LogicalProcessors`.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 7323f74)

* OCPBUGS-44372: cmd: PPC: support tolerating heterogeneous core IDs (#1236)

* cmd: PPC: support tolerating heterogeneous core IDs

Problem: In a system with a specific number of cores per socket
the host gives each core a number. So far PPC would generate
performance profile only after verifying that the nodes pointed to by
the specified node pool are all of the same hardware and topology. That
is because the performance profile will apply the same configuration on all
these nodes and it is most important that they have same structure, CPUs
distribution across NUMAs and CPUs availability. This by default
includes having same core IDs for each CPU in the same NUMA cells for
all the compute nodes. for instance:

Worker-0 has NUMA-0 on which the CPUs siblings [2,66] is coupled in
one core numbered 18`.
All other workers should have this info true for them, otherwise the
tool would fail.

It was observed (for example on Intel Xeon GoldGold 6438N with 0-127
online CPUs distributed across 2 sockets, 32 cores per socket and 2 threads per core)
that core numbering can have different schemes even with a system from the same vendor,
which causes the tool to fail to generate a profile.

Suggested solution:
With further investigation, The numbering the pattern depends on the settings
of the hardware, the software and the firmware (BIOS).While core IDs may vary
nodes can still be considered having same NUMA topology taking into account
that core scope is on the single NUMA. However, core IDs can be important in
optimizing the system's performance and managing isolation of tasks, meaning if the
performance of worker-0 is not comparable to that of worker-1 before
having performance-profile applied on both, having an improved
performance on both after applying PP is not guaranteed.

In this PR, we loosen the hard requirment of having same core numbering
on same NUMA cells on different systems into a warning will be logged
out as well as a comment on the generated output. So as long as the NUMA cells
have same logical processors' count and IDs and same threads' number,
core ID equality is treated as best effort. That is because when scheduling workloads,
we care about the logical processors ids and their location on the NUMAs.

Disclaimer: We support this option to unblock business matters and is recommended
to use the generated PP with caution.

Signed-off-by: Shereen Haj <[email protected]>

* PPC tests: ensure no intersection between calculated CPUsets siblings

We want to verify automatically that the generated CPUs sets are
including also the siblings. This mainly critical in calculating the
isolated vs CPU sets on the node where when hyperthreading is enabled
CPUs siblings must belong to the same set either reserved or isolated.

Note that this is only introduing a function to calculate the siblings
and compare the result with the generated CPU sets by the PPC. But this
is already covered by the assertions of the exact expected CPUs lists.

Signed-off-by: Shereen Haj <[email protected]>

* PPC-tests: ensure CPU sets are equal for nodes that differ only in core IDs

Add a new system topology containing the NUMA and CPUs topologies of a
new node which is identical in terms of NUMAs and CPUs count and numbers
but different in core IDs. Calculating the CPU sets (isolated, reserved,
& offlined) for both nodes should result in equal sets.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit da3ed75)
shajmakh added a commit to shajmakh/cluster-node-tuning-operator that referenced this pull request Dec 30, 2024
…penshift#1252)

* OCPBUGS-44372: PPC: skip comparing ProcessorCore.Index between NUMA cores  (openshift#1213)

* PPC: skip comparing ProcessorCore.Index between NUMA cores

ProcessorCore.Index indicates the zero-based index of the core in the
Cores slice. While core might be shown in a different order, they can still
be equivalent. See: jaypipes/ghw#346.

Adjust the equality check to skip this field to fix this:

```
  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU cores differ: processor core openshift#20 (2 threads), logical processors [2 66] vs processor core openshift#20 (2 threads), logical processors [2 66]
```

And add a unit test to cover this scenario.

Signed-off-by: Shereen Haj <[email protected]>

* PPC: unit: rename test variables to avoid misusing

Rename test variables and add clarifying comments to avoid misusing them
while writing tests.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit ad4a69f)

* PPC: fix error message (openshift#1223)

Copy-paste mistake in reporting the difference in logical processors
list; replace `.NumThreads` with `.LogicalProcessors`.

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit 7323f74)

* OCPBUGS-44372: cmd: PPC: support tolerating heterogeneous core IDs (openshift#1236)

* cmd: PPC: support tolerating heterogeneous core IDs

Problem: In a system with a specific number of cores per socket
the host gives each core a number. So far PPC would generate
performance profile only after verifying that the nodes pointed to by
the specified node pool are all of the same hardware and topology. That
is because the performance profile will apply the same configuration on all
these nodes and it is most important that they have same structure, CPUs
distribution across NUMAs and CPUs availability. This by default
includes having same core IDs for each CPU in the same NUMA cells for
all the compute nodes. for instance:

Worker-0 has NUMA-0 on which the CPUs siblings [2,66] is coupled in
one core numbered 18`.
All other workers should have this info true for them, otherwise the
tool would fail.

It was observed (for example on Intel Xeon GoldGold 6438N with 0-127
online CPUs distributed across 2 sockets, 32 cores per socket and 2 threads per core)
that core numbering can have different schemes even with a system from the same vendor,
which causes the tool to fail to generate a profile.

Suggested solution:
With further investigation, The numbering the pattern depends on the settings
of the hardware, the software and the firmware (BIOS).While core IDs may vary
nodes can still be considered having same NUMA topology taking into account
that core scope is on the single NUMA. However, core IDs can be important in
optimizing the system's performance and managing isolation of tasks, meaning if the
performance of worker-0 is not comparable to that of worker-1 before
having performance-profile applied on both, having an improved
performance on both after applying PP is not guaranteed.

In this PR, we loosen the hard requirment of having same core numbering
on same NUMA cells on different systems into a warning will be logged
out as well as a comment on the generated output. So as long as the NUMA cells
have same logical processors' count and IDs and same threads' number,
core ID equality is treated as best effort. That is because when scheduling workloads,
we care about the logical processors ids and their location on the NUMAs.

Disclaimer: We support this option to unblock business matters and is recommended
to use the generated PP with caution.

Signed-off-by: Shereen Haj <[email protected]>

* PPC tests: ensure no intersection between calculated CPUsets siblings

We want to verify automatically that the generated CPUs sets are
including also the siblings. This mainly critical in calculating the
isolated vs CPU sets on the node where when hyperthreading is enabled
CPUs siblings must belong to the same set either reserved or isolated.

Note that this is only introduing a function to calculate the siblings
and compare the result with the generated CPU sets by the PPC. But this
is already covered by the assertions of the exact expected CPUs lists.

Signed-off-by: Shereen Haj <[email protected]>

* PPC-tests: ensure CPU sets are equal for nodes that differ only in core IDs

Add a new system topology containing the NUMA and CPUs topologies of a
new node which is identical in terms of NUMAs and CPUs count and numbers
but different in core IDs. Calculating the CPU sets (isolated, reserved,
& offlined) for both nodes should result in equal sets.

Signed-off-by: Shereen Haj <[email protected]>

---------

Signed-off-by: Shereen Haj <[email protected]>
(cherry picked from commit da3ed75)
(cherry picked from commit aa2b326)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants