-
Notifications
You must be signed in to change notification settings - Fork 131
[GR-67626] Optimize host inlining of attribute lookup in GraalPy #524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
graalvmbot
wants to merge
34
commits into
master
Choose a base branch
from
bd/host-inlining-getattr
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
* GetAttributeNode was just an extra unnecessary wrapper around GetFixedAttributeNode.
* Seems host inlining was somehow already removing those as the node cost does not change before and after, when looking at e.g. com.oracle.graal.python.nodes.object.GetClassNodeGen.execute com.oracle.graal.python.nodes.object.GetClassNodeGen$Inlined.execute com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory$ClassNodeFactory$ClassNodeGen.execute
* The generated GetClassNodeGen.fallbackGuard_ had a node cost of 316. * Much better for host inlining: com.oracle.graal.python.nodes.object.GetClassNodeGen#execute 688 -> 556 com.oracle.graal.python.nodes.object.GetClassNodeGen.Inlined#execute 1366 -> 1230 com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory.ClassNodeFactory.ClassNodeGen#execute 579 -> 485 Truffle host inlining completed after 2 rounds. Graph cost changed from 439 to 688 after inlining: Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute] INLINE com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen.executeImpl(Node, PythonAbstractObject) cost 213, subTreeCost 201, treeInvokes 2, reason within budget] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getPythonObject(Node, PythonObject, HiddenAttr$ReadNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getNativeObject(...) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] INLINE com.oracle.graal.python.nodes.object.GetClassNodeGen.fallbackGuard_(int, Node, Object) cost 316, subTreeCost 316, treeInvokes 0, reason within budget] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] -> Truffle host inlining completed after 2 rounds. Graph cost changed from 426 to 556 after inlining: Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute] INLINE com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen.executeImpl(Node, PythonAbstractObject) cost 213, subTreeCost 201, treeInvokes 2, reason within budget] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getPythonObject(Node, PythonObject, HiddenAttr$ReadNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getNativeObject(...) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] * Using `@Cached IsForeignObjectNode isForeignObjectNode` + `isForeignObjectNode.execute(inliningTarget, object)` is a lot worse: com.oracle.graal.python.nodes.object.GetClassNodeGen#execute 688 -> 698 com.oracle.graal.python.nodes.object.GetClassNodeGen.Inlined#execute 1366 -> 1637 com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory.ClassNodeFactory.ClassNodeGen#execute 579 -> 622 Truffle host inlining completed after 2 rounds. Graph cost changed from 461 to 698 after inlining: Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute] INLINE com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen.executeImpl(Node, PythonAbstractObject) cost 213, subTreeCost 201, treeInvokes 2, reason within budget] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getPythonObject(Node, PythonObject, HiddenAttr$ReadNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getNativeObject(...) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] INLINE com.oracle.graal.python.nodes.object.IsForeignObjectNodeGen$Inlined.execute(Node, Object) cost 253, subTreeCost 223, treeInvokes 0, reason within budget] INLINE com.oracle.graal.python.nodes.object.IsForeignObjectNodeGen$Inlined.fallbackGuard_(int, Node, Object) cost 140, subTreeCost -, treeInvokes -, reason null] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Node, Object, IsForeignObjectNode, GetRegisteredClassNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff]
* It makes no difference for graalpy --experimental-options --engine.Compilation=false graalpython/com.oracle.graal.python.benchmarks/python/harness.py graalpython/com.oracle.graal.python.benchmarks/python/micro/attribute-access.py --live-results -i 10 50
… to TypeBuiltins.GetattributeNode * Faster on graalpy --experimental-options --engine.Compilation=false graalpython/com.oracle.graal.python.benchmarks/python/harness.py graalpython/com.oracle.graal.python.benchmarks/python/micro/attribute-access.py --live-results -i 10 50 AVG (all runs): 1.237s -> 1.218s
…avoid extra Object[] allocations in interpreter
* CastToTruffleStringChecked1Node is little overhead.
* Much better for host inlining as CastToTruffleStringChecked*Node has a 1 @InliningCutoff but CastToTruffleStringNode has 3 @InliningCutoff.
…ingCutoff * And use it in LookupAttributeInMRONode.Dynamic. * Much better for host inlining, TruffleString.EqualNode is huge for host inlining.
…t-inline it * GetPythonObjectClassNode needs 3 references, so it might be best to not DSL-inline to save footprint. OTOH, this case is frequent and not DSL-inlining it also means not host inlining it due to GR-66471. So we DSL-inline it in order to host-inline it. * Actually the DSL is smart and creates a Specialization data class here, so there is no footprint overhead and this is win-win.
* It generates more than 300 method calls with host inlining and blows up the budget. * The common case by far is to not call this node at all, as the attribute is typically available directly on the module, and there is no need to call __getattr__.
* These methods all lead to unwind, better to cutoff early and spend the host inlining budget on the non-exception cases. * Host inlining already CUTOFF some method because of "leads to unwind" but it does not realize all of this logic leads to unwind.
* Some usages used the uncached DynamicObjectLibrary which has a @TruffleBoundary on getShapeFlags(). Actually it seems to not have the boundary due to having a copy in the code, but that's brittle and wouldn't work anymore with a @GenerateUncached node instead. * The library is used on various classes which might have different shapes, so the caching on the Shape seems not very useful.
…ject#pythonClass * This simplifies GetClassNode a lot, and it still folds in single-context mode. * In multi-context mode it would basically never fold as the object wouldn't be constant. * This saves 2 reads (getShape().getFlags() & CLASS_CHANGED_FLAG) in multi-context mode. * This saves a DynamicObject read (1-2 reads: getShape() + Location#get potentially in extension array) when the class has been changed. * Use Shape#getDynamicType() instead of a constant property and flags: just a field read from the Shape instead of a property map lookup + Location#get. Also easier to keep in sync as everything is in a single field. * Do not make the cached reference to the class weak in GetClassNode, the Shape through the constant property already referenced it strongly and the Shape still does with dynamicType. * Much better for host inlining of GetClassNode and notably 2 @InliningCutoff vs 3 before: com.oracle.graal.python.nodes.object.GetClassNodeGen#execute 598 -> 482 com.oracle.graal.python.nodes.object.GetClassNodeGen.Inlined#execute 2151 -> 2035 com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory.ClassNodeFactory.ClassNodeGen#execute 657 -> 554 Truffle host inlining completed after 2 rounds. Graph cost changed from 426 to 598 after inlining: Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute] INLINE com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen$Inlined.executeImpl(Node, PythonAbstractObject) cost 376, subTreeCost 243, treeInvokes 2, reason within budget] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.doReadHiddenAttrNode(Node, PythonObject, HiddenAttr$ReadNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.doNativeObject(...) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] -> Truffle host inlining completed after 2 rounds. Graph cost changed from 363 to 482 after inlining: Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute] INLINE com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen$Inlined.executeImpl(Node, PythonAbstractObject) cost 319, subTreeCost 186, treeInvokes 1, reason within budget] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.doNativeObject(...) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff] CUTOFF com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode) cost -, subTreeCost -, treeInvokes -, reason method annotated with @InliningCutoff]
…cType() * DynamicObjectLibrary#setDynamicType() does not allow null, and also has a non-null default (ObjectType.DEFAULT).
…ngle @InliningCutoff
…nMRONode in a different node * Better for host inlining of GetFixedAttributeNodeGen.execute: Before: Root[com.oracle.graal.python.nodes.attributes.GetFixedAttributeNodeGen.execute] INLINE com.oracle.graal.python.nodes.attributes.GetFixedAttributeNode.doIt(...) cost 59, subTreeCost 6333, treeInvokes 42, reason within budget] After: Root[com.oracle.graal.python.nodes.attributes.GetFixedAttributeNodeGen.execute] INLINE com.oracle.graal.python.nodes.attributes.GetFixedAttributeNode.doIt(...) cost 59, subTreeCost 4355, treeInvokes 24, reason within budget]
* Avoiding a LookupAttributeInMRONode.Dynamic is not worth this duplication.
* Instead of having to do this optimization in callers of GetDictIfExistsNode.
…ObjectNode * So the correct semantics and the native type optimization is always used, reviewing usages revealed several incorrect usages. * Subclasses make host inlining unable to see through any `@Cached ReadAttributeFromObjectNode`. * Add ReadAttributeFromModuleNode for the simple case of reading from a PythonModule. * Remove extra specialization for PythonModule in ReadAttributeFromObjectNode: * The specialization below readObjectAttribute() does exactly the same optimization, so it is redundant. * Also checked by running micro/function-call-sized.py on jvm-ce-libgraal with and without --engine.Compilation=false that the performance is the same. * Add execute() overloads for ReadAttributeFromObjectNode, more efficent and also clearer which kind of input is passed.
* Using the correct subclass is prone to errors. * Subclasses prevent host inlining.
7671234
to
8b10695
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.