Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) #1281

Merged
merged 1 commit into from
Feb 11, 2025

Conversation

benyamin-codez
Copy link
Contributor

  1. Refactored to only return TRUE when add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS.

  2. Pro-actively refactors calls to InterlockedIncrement() back to the
    add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS path.

Impacts: RhelDoFlush(), RhelDoReadWrite(), RhelDoUnMap() and RhelGetSerialNumber().

Regression introduced with PR #1174 (379f291).
Will likely resolve issue #1204 (RHEL-70446)
Unlikely to resolve issue #907.

My bad... 8^d

1. Refactored to only return TRUE when add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS.

2. Pro-actively refactors calls to InterlockedIncrement() back to the add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS path.

Impacts: RhelDoFlush(), RhelDoReadWrite(), RhelDoUnMap() and RhelGetSerialNumber().

Regression introduced with PR virtio-win#1174 (379f291). Will likely resolve issue virtio-win#1204 (RHEL-70446), unlikely to resolve issue virtio-win#907.

My bad... 8^d

Signed-off-by: benyamin-codez <[email protected]>
benyamin-codez referenced this pull request Jan 31, 2025
Backports fixes and improvements from vioscsi PRs #1150 and #1162

virtqueue struct vq was also removed in favour of adaptExt->vq[QueueNumber],
which results in a minor performance increase.

Signed-off-by: benyamin-codez <[email protected]>
@kostyanf14 kostyanf14 changed the title [viostor] Fix returning SUCCESS when really -ENOSPC (regression) RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) Jan 31, 2025
@kostyanf14
Copy link
Member

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

@benyamin-codez
Copy link
Contributor Author

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

Thanks @kostyanf14.

Just wondering if we should update contributing.md, per:

6. If you are a Red Hat contributor, you must include [Jira](https://issues.redhat.com/) key in the commit message
7. Prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: NetKVM: implementing dynamic NDIS version support".

... to include the the Jira key even if you aren't a Red Hat contributor, but know of the issue, e.g.:

6. If you are a Red Hat contributor, you MUST include the [Jira](https://issues.redhat.com/) key in the commit message
7. If you are a NOT a Red Hat contributor, but know of a relevant Jira key, please prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: [NetKVM] implementing dynamic NDIS version support".

Let me know. I'm happy to raise a suitable PR if a change is warranted. 8^d

Best regards,
Ben

@kostyanf14
Copy link
Member

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

Thanks @kostyanf14.

Just wondering if we should update contributing.md, per:

6. If you are a Red Hat contributor, you must include [Jira](https://issues.redhat.com/) key in the commit message
7. Prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: NetKVM: implementing dynamic NDIS version support".

... to include the the Jira key even if you aren't a Red Hat contributor, but know of the issue, e.g.:

6. If you are a Red Hat contributor, you MUST include the [Jira](https://issues.redhat.com/) key in the commit message
7. If you are a NOT a Red Hat contributor, but know of a relevant Jira key, please prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: [NetKVM] implementing dynamic NDIS version support".

Let me know. I'm happy to raise a suitable PR if a change is warranted. 8^d

Best regards, Ben

@YanVugenfirer

@YanVugenfirer
Copy link
Collaborator

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

Thanks @kostyanf14.

Just wondering if we should update contributing.md, per:

6. If you are a Red Hat contributor, you must include [Jira](https://issues.redhat.com/) key in the commit message
7. Prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: NetKVM: implementing dynamic NDIS version support".

... to include the the Jira key even if you aren't a Red Hat contributor, but know of the issue, e.g.:

6. If you are a Red Hat contributor, you MUST include the [Jira](https://issues.redhat.com/) key in the commit message
7. If you are a NOT a Red Hat contributor, but know of a relevant Jira key, please prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: [NetKVM] implementing dynamic NDIS version support".

Let me know. I'm happy to raise a suitable PR if a change is warranted. 8^d

Best regards, Ben

Sound like a good idea

@benyamin-codez
Copy link
Contributor Author

Just wondering if we should update contributing.md....

Sound like a good idea

May I present PR #1288. 8^d

@kostyanf14
Copy link
Member

@YanVugenfirer
HCK-CI/viostor-Win11_24H2x64_host_uefi_viommu - finished. Flush test - hangs on, so I kill CI

@benyamin-codez
Copy link
Contributor Author

HCK-CI/viostor-Win11_24H2x64_host_uefi_viommu - finished. Flush test - hangs on, so I kill CI

@YanVugenfirer & @kostyanf14, just to be clear, this is an issue with the HCK and not the driver, yes?

In my previous investigation of the logs of similar failures, the problem seemed to be mostly 0x80070002 errors
It seemed to me the root cause(s) could be a sudden power loss, BSOD or maybe a disk stuck in offline mode.

A disk stuck in OFFLINE mode could be due to inadequate BTL reset handling. More commentary in this dropdown.

I note my comments re CompleteRequestWithStatus().
Is the following order of operations correct?

CompletePendingRequests(DeviceExtension);
    ....
    CompleteRequestWithStatus(DeviceExtension, (PSRB_TYPE)currSrb, SRB_STATUS_BUS_RESET);
    ....
CompleteRequestWithStatus(DeviceExtension, (PSRB_TYPE)Srb, SRB_STATUS_SUCCESS);

It seems vioscsi is working. The equivalent code path appears to be:

CompletePendingRequestsOnReset(DeviceExtension);
    ....
    CompleteRequest(DeviceExtension, (PSRB_TYPE)currSrb);

Backporting from vioscsi might reveal the source of any request completion / reset failure...
It appears CompleteRequest() in vioscsi is a more primitive form of CompleteRequestWithStatus() in viostor.
So when the offending code is removed, we might have found our suspect.

It's perhaps also worth pointing out that:

  1. The Flush Test states "Test runs about 21 scenarios initiating sudden power loss each time.". Do we simulate this?
  2. Disk Verification (LOGO) has a fixed runtime of 72 hours...
  3. A failed Disk Stress (LOGO) test might cause a hung Flush Test if DiskIo.exe causes a disk to stay OFFLINE.

Copy link
Collaborator

@vrozenfe vrozenfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YanVugenfirer YanVugenfirer merged commit 2bcf04f into virtio-win:master Feb 11, 2025
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants