-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure predicate cache is reset when control flow leaves block #4274
base: main
Are you sure you want to change the base?
Conversation
To be fair, this is not my favourite fix because it risks missing places where the cache should be reset but isn't and it will be a pain to debug cases that fail if you don't keep this issue in the back of your mind. But the predicate cache is only valid during IR generation so we cannot use any of the backend stuff like Spill functions. |
38e7ebb
to
6045a0d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting way to fix it, but I'm fine with it.
@@ -164,6 +168,9 @@ void OpDispatchBuilder::FADD(OpcodeArgs, IR::OpSize Width, bool Integer, OpDispa | |||
if (Op->Src[0].IsNone()) { // Implicit argument case | |||
auto Offset = Op->OP & 7; | |||
auto St0 = 0; | |||
if (!ReducedPrecisionMode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand why this check is here but I don't quite get it, won't this code only run for full precision anyway? I would have thought reduced would take
void OpDispatchBuilder::FADDF64(OpcodeArgs, IR::OpSize Width, bool Integer, OpDispatchBuilder::OpResult ResInST0) { |
Also just thinking through perhaps it would be cleaner to do this in FlushRegisterCache (!SRAOnly case) and to save the predicate when spilling/filling regs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - that's a great point. My intention was to merge (FADD
and FADDF64
) those and I didn't but thought I did.
I did attempt to do this through FlushRegisterCache
but this one is called at _Break()
. This seems to be called by INTOp
and we don't want to reset the cache in these cases - when I attempted this, it was basically resetting the cache too often, which is why the commit calls ResetInitPredicateCache()
alongside FlushRegisterCache
mostly everywhere but not on Break()
.
I am happy to hear suggestions on how to improve this if you have any ideas. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you mention saving the predicate reg, but if we do managed to reset the predicate register in FlushRegisterCache
then there was no need to save the predicate in spilling/filling regs. The reason for not spilling/filling the predicate register is because afaiu doing this for predicate registers takes a few instructions and would most likely void the optimization that we are attempting to do, so it's better to simply reset the cache and then regenerate the predicate with a single ptrue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you mention saving the predicate reg, but if we do managed to reset the predicate register in
FlushRegisterCache
then there was no need to save the predicate in spilling/filling regs.
Hehe, yeah this is true :)
The reason for not spilling/filling the predicate register is because afaiu doing this for predicate registers takes a few instructions and would most likely void the optimization that we are attempting to do, so it's better to simply reset the cache and then regenerate the predicate with a single
ptrue
.
Right yeah, I missed that we'd have to save all the predicate registers since there's no way to know which have changed - though looking through only one predicate type is ever cached, so the cache could be replaced with statically allocating a single extra predicate register. Then that is always spilled, even if a few such registers are needed in the future I think the cost would be negligible.
I did have some ideas of having the predicate register set to spill be a required argument for Spill/FillSRA and to pass that in as an argument to IROps that end up calling them, but this is definitely beyond what is necessary here
I did attempt to do this through FlushRegisterCache but this one is called at _Break(). This seems to be called by INTOp and we don't want to reset the cache in these cases - when I attempted this, it was basically resetting the cache too often, which is why the commit calls ResetInitPredicateCache() alongside FlushRegisterCache mostly everywhere but not on Break().
I'm quite surprised break/INT is being called that often in x87 code, what ends up causing this? If that can be figured then clearing the predicate cache for all !JITDispatch ops in the IR json (I think this is what you are trying to achieve by manually annotating x87 ops? Correct me if I'm wrong) could work, there's also cases like cpuid, ludiv etc that I think would need to clear the cache but don't currently. I personally prefer the former solution I mentioned since it's nice to have the guarantee that spilling and filling registers in a handler won't cause issues but will leave the choice up to you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was looking at what's causing INTOp to be called and I am a little bit mistified myself about this. Take the test:
%ifdef CONFIG
{
"RegData": {
"MM7": ["0x8000000000000000", "0x4000"]
}
}
%endif
lea rdx, [rel data]
fld tword [rdx + 8 * 0]
lea rdx, [rel data2]
lea rax, [rdx + 8 * 0]
fstp tword [rax]
fld tword [rdx + 8 * 0]
hlt
align 8
data:
dt 2.0
dq 0
data2:
dt 0.0
dq 0
This calls fld twice (with fstp in-between), you should only need to reset the predicate cache at the beginning of the block. However, here if you call the reset function from FlushRegisterCache, you end up resetting it a couple of times due to INTOp calling it. I have not yet figured out why it's being called but it's related to threading. It comes from Core.cpp:668
:
std::invoke(Fn, Thread->OpDispatcher, DecodedInfo);
The backtrace from ResetInitPredicateCache shows:
Breakpoint 1, FEXCore::IR::IREmitter::ResetInitPredicateCache (this=0x7ff6ef7000)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/IR/IREmitter.h:77
77 InitPredicateCache.clear();
(gdb) bt
#0 FEXCore::IR::IREmitter::ResetInitPredicateCache (this=0x7ff6ef7000)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/IR/IREmitter.h:77
#1 0x0000005555723324 in FEXCore::IR::OpDispatchBuilder::FlushRegisterCache (this=0x7ff6ef7000, SRAOnly=false)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/OpcodeDispatcher.h:1179
#2 0x0000005555794a34 in FEXCore::IR::OpDispatchBuilder::INTOp (this=0x7ff6ef7000, Op=0x7fe2e00300)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/OpcodeDispatcher.cpp:4693
#3 0x000000555573eb6c in std::__invoke_impl<void, void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
__f=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>, __t=...,
__args=@0x7fffffdc20: 0x7fe2e00300) at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/bits/invoke.h:74
#4 0x000000555573eacc in std::__invoke<void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
__fn=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>,
__args=@0x7fffffdc20: 0x7fe2e00300, __args=@0x7fffffdc20: 0x7fe2e00300)
at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/bits/invoke.h:96
#5 0x0000005555723eb8 in std::invoke<void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
__fn=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>,
__args=@0x7fffffdc20: 0x7fe2e00300, __args=@0x7fffffdc20: 0x7fe2e00300)
at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/functional:120
#6 0x000000555572247c in FEXCore::Context::ContextImpl::GenerateIR (this=0x7ff6e57000, Thread=0x7ff6ee3000, GuestRIP=65536,
ExtendedDebugInfo=false, MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:668
#7 0x00000055557248f4 in FEXCore::Context::ContextImpl::CompileCode (this=0x7ff6e57000, Thread=0x7ff6ee3000, GuestRIP=65536,
MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:801
#8 0x0000005555724e84 in FEXCore::Context::ContextImpl::CompileBlock (this=0x7ff6e57000, Frame=0x7ff6ee3090, GuestRIP=65536,
MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:846
Do you understand what's happening? I will keep trying to understand this in any case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hummm... maybe this INTOp is not actually happening in between the instructions that are relevant so it is probably a red-herring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, actually leaning on FlushRegisterCache works once I moved all the predicate cache resets onto IR generation time. I was initially doing some of the resets during IR generation time and others inside the x87 stack opt. pass and this didn't work. I think it's much better now. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good - it's worth double checking this works for cpuid/ludiv, I'm not entirely sure if flushregistercache ends up being called in such cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also fallback inst handlers for x87 ops), I might be misunderstanding something here though
Marking this as draft to avoid merging by accident- need to think about what @bylaws said. |
6045a0d
to
6d46c75
Compare
Whenever the control float leaves the block, it might clobber the predicate register so we reset the cache whenever that happens. The difficulty here is that the cache is valid only during IR generation so we need to make sure we catch all the cases during this pass where the execution might leave the block. Fixes FEX-Emu#4264
6d46c75
to
5e0886e
Compare
Whenever the control float leaves the block, it might clobber the
predicate register so we reset the cache whenever that happens.
The difficulty here is that the cache is valid only during IR generation
so we need to make sure we catch all the cases during this pass where
the execution might leave the block.
Fixes #4264