RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) #1281

benyamin-codez · 2025-01-31T14:56:59Z

Refactored to only return TRUE when add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS.
Pro-actively refactors calls to InterlockedIncrement() back to the
add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS path.

Impacts: RhelDoFlush(), RhelDoReadWrite(), RhelDoUnMap() and RhelGetSerialNumber().

Regression introduced with PR #1174 (379f291).
Will likely resolve issue #1204 (RHEL-70446)
Unlikely to resolve issue #907.

My bad... 8^d

1. Refactored to only return TRUE when add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS. 2. Pro-actively refactors calls to InterlockedIncrement() back to the add_buffer_req_status == VQ_ADD_BUFFER_SUCCESS path. Impacts: RhelDoFlush(), RhelDoReadWrite(), RhelDoUnMap() and RhelGetSerialNumber(). Regression introduced with PR virtio-win#1174 (379f291). Will likely resolve issue virtio-win#1204 (RHEL-70446), unlikely to resolve issue virtio-win#907. My bad... 8^d Signed-off-by: benyamin-codez <[email protected]>

Backports fixes and improvements from vioscsi PRs #1150 and #1162 virtqueue struct vq was also removed in favour of adaptExt->vq[QueueNumber], which results in a minor performance increase. Signed-off-by: benyamin-codez <[email protected]>

kostyanf14 · 2025-01-31T15:23:37Z

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

benyamin-codez · 2025-02-03T10:33:33Z

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

Thanks @kostyanf14.

Just wondering if we should update contributing.md, per:

6. If you are a Red Hat contributor, you must include [Jira](https://issues.redhat.com/) key in the commit message
7. Prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: NetKVM: implementing dynamic NDIS version support".

... to include the the Jira key even if you aren't a Red Hat contributor, but know of the issue, e.g.:

6. If you are a Red Hat contributor, you MUST include the [Jira](https://issues.redhat.com/) key in the commit message
7. If you are a NOT a Red Hat contributor, but know of a relevant Jira key, please prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: [NetKVM] implementing dynamic NDIS version support".

Let me know. I'm happy to raise a suitable PR if a change is warranted. 8^d

Best regards,
Ben

kostyanf14 · 2025-02-03T12:50:18Z

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

Thanks @kostyanf14.

Just wondering if we should update contributing.md, per:
6. If you are a Red Hat contributor, you must include [Jira](https://issues.redhat.com/) key in the commit message
7. Prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: NetKVM: implementing dynamic NDIS version support".
... to include the the Jira key even if you aren't a Red Hat contributor, but know of the issue, e.g.:
6. If you are a Red Hat contributor, you MUST include the [Jira](https://issues.redhat.com/) key in the commit message
7. If you are a NOT a Red Hat contributor, but know of a relevant Jira key, please prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: [NetKVM] implementing dynamic NDIS version support".
Let me know. I'm happy to raise a suitable PR if a change is warranted. 8^d

Best regards, Ben

@YanVugenfirer

YanVugenfirer · 2025-02-03T13:22:25Z

Added RHEL-70446 to PR title for tracking on Jira side as this PR can be related

Thanks @kostyanf14.

Just wondering if we should update contributing.md, per:
6. If you are a Red Hat contributor, you must include [Jira](https://issues.redhat.com/) key in the commit message
7. Prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: NetKVM: implementing dynamic NDIS version support".
... to include the the Jira key even if you aren't a Red Hat contributor, but know of the issue, e.g.:
6. If you are a Red Hat contributor, you MUST include the [Jira](https://issues.redhat.com/) key in the commit message
7. If you are a NOT a Red Hat contributor, but know of a relevant Jira key, please prefix commit messages with the Jira key first, and then the affected component. For example: "RHELMISC-8923: [NetKVM] implementing dynamic NDIS version support".
Let me know. I'm happy to raise a suitable PR if a change is warranted. 8^d

Best regards, Ben

Sound like a good idea

benyamin-codez · 2025-02-03T15:27:46Z

Just wondering if we should update contributing.md....

Sound like a good idea

May I present PR #1288. 8^d

kostyanf14 · 2025-02-04T08:42:12Z

@YanVugenfirer
HCK-CI/viostor-Win11_24H2x64_host_uefi_viommu - finished. Flush test - hangs on, so I kill CI

benyamin-codez · 2025-02-04T15:40:35Z

HCK-CI/viostor-Win11_24H2x64_host_uefi_viommu - finished. Flush test - hangs on, so I kill CI

@YanVugenfirer & @kostyanf14, just to be clear, this is an issue with the HCK and not the driver, yes?

In my previous investigation of the logs of similar failures, the problem seemed to be mostly 0x80070002 errors
It seemed to me the root cause(s) could be a sudden power loss, BSOD or maybe a disk stuck in offline mode.

A disk stuck in OFFLINE mode could be due to inadequate BTL reset handling. More commentary in this dropdown.

I note my comments re CompleteRequestWithStatus().
Is the following order of operations correct?

CompletePendingRequests(DeviceExtension);
    ....
    CompleteRequestWithStatus(DeviceExtension, (PSRB_TYPE)currSrb, SRB_STATUS_BUS_RESET);
    ....
CompleteRequestWithStatus(DeviceExtension, (PSRB_TYPE)Srb, SRB_STATUS_SUCCESS);

It seems vioscsi is working. The equivalent code path appears to be:

CompletePendingRequestsOnReset(DeviceExtension);
    ....
    CompleteRequest(DeviceExtension, (PSRB_TYPE)currSrb);

Backporting from vioscsi might reveal the source of any request completion / reset failure...
It appears CompleteRequest() in vioscsi is a more primitive form of CompleteRequestWithStatus() in viostor.
So when the offending code is removed, we might have found our suspect.

It's perhaps also worth pointing out that:

The Flush Test states "Test runs about 21 scenarios initiating sudden power loss each time.". Do we simulate this?
Disk Verification (LOGO) has a fixed runtime of 72 hours...
A failed Disk Stress (LOGO) test might cause a hung Flush Test if DiskIo.exe causes a disk to stay OFFLINE.

vrozenfe

looks as it passed https://issues.redhat.com/browse/RHEL-78185

benyamin-codez requested review from vrozenfe and YanVugenfirer as code owners January 31, 2025 14:57

This was referenced Jan 31, 2025

viostor: cause qemu virtio_error when an uncompleted request times out #907

Open

[viostor] io hung can not recover #1204

Closed

kostyanf14 changed the title ~~[viostor] Fix returning SUCCESS when really -ENOSPC (regression)~~ RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) Jan 31, 2025

benyamin-codez mentioned this pull request Feb 1, 2025

[viostor] Backport vioscsi SendSRB fixes and improvements #1174

Merged

vrozenfe approved these changes Feb 11, 2025

View reviewed changes

YanVugenfirer approved these changes Feb 11, 2025

View reviewed changes

YanVugenfirer merged commit 2bcf04f into virtio-win:master Feb 11, 2025
4 of 6 checks passed

benyamin-codez mentioned this pull request Feb 20, 2025

[vioscsi] Fix SendSRB regression and refactor for optimum performance [viostor] Backport SendSRB improvements #1135

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) #1281

RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) #1281

benyamin-codez commented Jan 31, 2025

kostyanf14 commented Jan 31, 2025

benyamin-codez commented Feb 3, 2025

kostyanf14 commented Feb 3, 2025

YanVugenfirer commented Feb 3, 2025

benyamin-codez commented Feb 3, 2025

kostyanf14 commented Feb 4, 2025

benyamin-codez commented Feb 4, 2025

vrozenfe left a comment

RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) #1281

RHEL-70446: [viostor] Fix returning SUCCESS when really -ENOSPC (regression) #1281

Conversation

benyamin-codez commented Jan 31, 2025

kostyanf14 commented Jan 31, 2025

benyamin-codez commented Feb 3, 2025

kostyanf14 commented Feb 3, 2025

YanVugenfirer commented Feb 3, 2025

benyamin-codez commented Feb 3, 2025

kostyanf14 commented Feb 4, 2025

benyamin-codez commented Feb 4, 2025

vrozenfe left a comment

Choose a reason for hiding this comment