[C#] Regression on 0.5.0 with DML #1071

Open
azchohfi opened this issue Nov 19, 2024 · 15 comments

@azchohfi
Contributor

Describe the bug
C# Version 0.5.0 broke DML models, such as microsoft--Phi-3-mini-4k-instruct-onnx directml-int4-awq-block-128.
The model loads, but the Generator's constructor throws an Access violation exception.

To Reproduce
Steps to reproduce the behavior:

  1. Run the Phi3 sample with DML.
  2. The exception is thrown at line 98.

Expected behavior
Works just as it did in 0.4.0.

Desktop (please complete the following information):

  • OS: Windows 11 (24H2)

@elephantpanda

elephantpanda commented Nov 19, 2024

Well, DML didn't really work before in 0.4.0 either. I mean it works up to a point, then breaks.
I was just about to update to 0.5 myself. Thanks for the warning. 🥲

I took a look at the closed pull requests and didn't see anything relating to DML fixes, which is disappointing.

@elephantpanda

elephantpanda commented Nov 19, 2024

Just updated my code to 0.5.1 to try it out with C# and DirectML, using the same model as the OP.
After loading the model it crashes. (It didn't crash with 0.4.0.)

Quadro P5000 GPU

Same line:
generator = new Generator(model, generatorParams);

=================================================================
	Native Crash Reporting
=================================================================
Got a UNKNOWN while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
	Managed Stacktrace:
=================================================================
	  at <unknown> <0xffffffff>
	  at Microsoft.ML.OnnxRuntimeGenAI.NativeMethods:OgaCreateGenerator <0x00097>
	  at Microsoft.ML.OnnxRuntimeGenAI.Generator:.ctor <0x0004a>
	  at Main:StartGeneration <0x00612>
	  at <Start>d__10:MoveNext <0x002ca>
	  at MoveNextRunner:InvokeMoveNext <0x00091>
	  at System.Threading.ExecutionContext:RunInternal <0x001b5>
	  at System.Threading.ExecutionContext:Run <0x0002a>
	  at MoveNextRunner:Run <0x000ca>
	  at <>c:<.cctor>b__7_0 <0x00039>
	  at WorkRequest:Invoke <0x00023>
	  at UnityEngine.UnitySynchronizationContext:Exec <0x0018a>
	  at UnityEngine.UnitySynchronizationContext:ExecuteTasks <0x0007a>
	  at System.Object:runtime_invoke_void <0x0007c>
=================================================================
Received signal SIGSEGV
Crash!!!
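
For readers who want to see where that constructor call sits, here is a minimal sketch of the usual load-and-generate sequence. It assumes the 0.5.x C# API surface (later releases changed some of these calls); `DmlRepro`, `modelPath`, `prompt`, and the `max_length` value of 2048 are illustrative placeholders, not taken from the reporters' code. A native access violation like the one above normally terminates the process rather than surfacing as a catchable managed exception, so the sketch logs before and after the call instead of wrapping it in try/catch.

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// Hypothetical repro helper; names and values are placeholders.
public static class DmlRepro
{
    public static void Run(string modelPath, string prompt)
    {
        // Loading the model is reported to succeed in this thread.
        using var model = new Model(modelPath);
        using var tokenizer = new Tokenizer(model);
        using var sequences = tokenizer.Encode(prompt);

        using var generatorParams = new GeneratorParams(model);
        generatorParams.SetSearchOption("max_length", 2048); // assumed value
        generatorParams.SetInputSequences(sequences);

        Console.WriteLine("Creating generator...");
        // The crash in the stack trace above happens inside this constructor.
        using var generator = new Generator(model, generatorParams);
        Console.WriteLine("Generator created.");

        // Token-by-token loop as used with the 0.5.x API.
        while (!generator.IsDone())
        {
            generator.ComputeLogits();
            generator.GenerateNextToken();
        }
    }
}
```

If "Creating generator..." is the last line printed, the failure is in the native `OgaCreateGenerator` call rather than in model loading or tokenization.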

@skyline75489
Contributor

@RyanUnderhill This is the one we caught with the validation pipeline. I thought it was the same error, but it turns out it wasn't. This crash is the reason why no log message is printed. I can reproduce this locally.

@jiaxuwu2021

jiaxuwu2021 commented Nov 19, 2024

We have been getting a very similar error/exception since v0.4.0 of the Microsoft.ML.OnnxRuntimeGenAI.DirectML NuGet package:
2024-11-19T13:58:49.3749334+08:00 An unhandled exception has occurred while executing the request. error: [Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(353)\onnxruntime.dll!00007FFCC8BA254C: (caller: 00007FFCC8B9B7CF) Exception(1) tid(240c) 80070057 The parameter is incorrect. , at Microsoft.ML.OnnxRuntimeGenAI.Generator.ComputeLogits()

It also fails to run with v0.5.0 or v0.5.1, but everything works if we downgrade Microsoft.ML.OnnxRuntimeGenAI.DirectML to v0.3.0.


EDIT: I finally found a workaround. Add this property to the csproj:

<PropertyGroup>
  <SelfContained>true</SelfContained>
</PropertyGroup>

or add this one to the csproj:

<PropertyGroup>
    <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>

and then clean and rebuild the whole project.
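
Both properties change how the app is built and deployed, so one quick check after rebuilding is to log the architecture the process actually runs as. The following is a minimal diagnostic sketch, not something from the thread: `ProcessArchReport` is a hypothetical helper name, and the thread does not establish why the workaround helps, so treat the output only as one data point.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical diagnostic helper; not part of Microsoft.ML.OnnxRuntimeGenAI.
public static class ProcessArchReport
{
    // Formats the given values into a one-line summary (kept pure so it is easy to test).
    public static string Describe(Architecture arch, bool is64BitProcess, string framework)
    {
        return $"arch={arch}; 64-bit process={is64BitProcess}; runtime={framework}";
    }

    // Reads the values for the currently running process.
    public static string DescribeCurrent()
    {
        return Describe(
            RuntimeInformation.ProcessArchitecture,
            Environment.Is64BitProcess,
            RuntimeInformation.FrameworkDescription);
    }
}
```

Calling `Console.WriteLine(ProcessArchReport.DescribeCurrent());` at startup, before the model is loaded, makes it easy to compare the value with and without the csproj change.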

@skyline75489
Contributor

@elephantpanda Is the crash in 0.5.1 or 0.5.0? We've made some progress, but we might need more to fix it.

@azchohfi
Contributor Author

For our scenario, downgrading from 0.5.1 to 0.5.0 fixed the issue, so @elephantpanda is probably having a separate issue.

@elephantpanda

elephantpanda commented Nov 19, 2024

> @elephantpanda Is the crash in 0.5.1 or 0.5.0? We had some progress but we might need more to fix.

I am using 0.5.1 (I have never tried 0.5.0)

Presumably it's the same issue as it's the same line it crashes on.

BTW, I just tried this in CPU mode and it works fine, so it only crashes in DML mode.

@skyline75489
Contributor

@elephantpanda You could try 0.5.0 first. We're preparing a 0.5.2 patch release that should fix the crash.

@elephantpanda

elephantpanda commented Nov 22, 2024

I installed 0.5.0 of OnnxRuntimeGenAI.DirectML and OnnxRuntimeGenAI.Managed, keeping the other libraries the same.

It now doesn't crash. It just outputs the first token then fails on the second token:

OnnxRuntimeGenAIException: Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\framework\execution_frame.cc:173 onnxruntime::IExecutionFrame::GetOrCreateNodeOutputMLValue shape && tensor.Shape() == *shape was false. OrtValue shape verification failed. Current shape:{1,32,12,96} Requested shape:{1,32,2048,96}

I'll just wait for the patch I think.

@jiaxuwu2021

Unfortunately, v0.5.2 is still not working for me:
Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(353)\onnxruntime.dll!00007FF825C04F3C: (caller: 00007FF825BFE1CF) Exception(3) tid(e25c) 80070057 The parameter is incorrect.
at Microsoft.ML.OnnxRuntimeGenAI.Generator.ComputeLogits()

I found a workaround. Add this property to the csproj:

<PropertyGroup>
  <SelfContained>true</SelfContained>
</PropertyGroup>

or add this one to the csproj:

<PropertyGroup>
    <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>

and then clean and rebuild the whole project.

@skyline75489
Contributor

@jiaxuwu2021 This looks similar to #833

@elephantpanda

> @jiaxuwu2021 This looks similar to #833

I can't see the similarity. #833 is to do with certain prompts failing.

@ambroser53

> Unfortunately, v0.5.2 is still not working for me:
> Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(353)\onnxruntime.dll!00007FF825C04F3C: (caller: 00007FF825BFE1CF) Exception(3) tid(e25c) 80070057 The parameter is incorrect.
> at Microsoft.ML.OnnxRuntimeGenAI.Generator.ComputeLogits()
>
> I found a workaround. Add this property to the csproj:
>
> <PropertyGroup>
>   <SelfContained>true</SelfContained>
> </PropertyGroup>
>
> or add this one to the csproj:
>
> <PropertyGroup>
>     <PlatformTarget>x64</PlatformTarget>
> </PropertyGroup>
>
> and then clean and rebuild the whole project.

This workaround currently fixes some of my models on main (e.g. Qwen2.5-1.5B in fp16) but not others. For example, with Qwen2.5-7B in int4 I get this other (similar) error:

2024-12-16 11:09:00.3839541 [E:onnxruntime:onnxruntime-genai, sequential_executor.cc:505 onnxruntime::ExecuteKernel] Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime.dll!00007FFC98A186C6: (caller: 00007FFC98A4D69D) Exception(3) tid(3ba0) 80070057 The parameter is incorrect.

Notice that the error occurs at AbiCustomRegistry.cpp(519) instead. Is there any known fix for this? I can send the models in question plus code and/or DLLs if anyone thinks they can investigate a fix.
