Skip to content

[GR-67626] Optimize host inlining of attribute lookup in GraalPy #524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: master
Choose a base branch
from

Conversation

graalvmbot
Copy link
Collaborator

No description provided.

eregon added 22 commits July 25, 2025 14:01
* GetAttributeNode was just an extra unnecessary wrapper around GetFixedAttributeNode.
* Seems host inlining was somehow already removing those as the node cost does not change before and after,
  when looking at e.g.
  com.oracle.graal.python.nodes.object.GetClassNodeGen.execute
  com.oracle.graal.python.nodes.object.GetClassNodeGen$Inlined.execute
  com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory$ClassNodeFactory$ClassNodeGen.execute
* The generated GetClassNodeGen.fallbackGuard_ had a node cost of 316.
* Much better for host inlining:
  com.oracle.graal.python.nodes.object.GetClassNodeGen#execute 688 -> 556
  com.oracle.graal.python.nodes.object.GetClassNodeGen.Inlined#execute 1366 -> 1230
  com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory.ClassNodeFactory.ClassNodeGen#execute 579 -> 485

  Truffle host inlining completed after 2 rounds. Graph cost changed from 439 to 688 after inlining:
  Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute]
    INLINE   com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen.executeImpl(Node, PythonAbstractObject)              cost  213, subTreeCost  201, treeInvokes    2, reason within budget]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getPythonObject(Node, PythonObject, HiddenAttr$ReadNode) cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getNativeObject(...)                                     cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
    INLINE   com.oracle.graal.python.nodes.object.GetClassNodeGen.fallbackGuard_(int, Node, Object)                                                cost  316, subTreeCost  316, treeInvokes    0, reason within budget]
    CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode)                                          cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
  ->
  Truffle host inlining completed after 2 rounds. Graph cost changed from 426 to 556 after inlining:
  Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute]
    INLINE   com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen.executeImpl(Node, PythonAbstractObject)              cost  213, subTreeCost  201, treeInvokes    2, reason within budget]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getPythonObject(Node, PythonObject, HiddenAttr$ReadNode) cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getNativeObject(...)                                     cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
    CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode)                                          cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]

* Using `@Cached IsForeignObjectNode isForeignObjectNode` + `isForeignObjectNode.execute(inliningTarget, object)` is a lot worse:
  com.oracle.graal.python.nodes.object.GetClassNodeGen#execute 688 -> 698
  com.oracle.graal.python.nodes.object.GetClassNodeGen.Inlined#execute 1366 -> 1637
  com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory.ClassNodeFactory.ClassNodeGen#execute 579 -> 622

  Truffle host inlining completed after 2 rounds. Graph cost changed from 461 to 698 after inlining:
  Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute]
    INLINE   com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen.executeImpl(Node, PythonAbstractObject)              cost  213, subTreeCost  201, treeInvokes    2, reason within budget]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getPythonObject(Node, PythonObject, HiddenAttr$ReadNode) cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.getNativeObject(...)                                     cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
    INLINE   com.oracle.graal.python.nodes.object.IsForeignObjectNodeGen$Inlined.execute(Node, Object)                                             cost  253, subTreeCost  223, treeInvokes    0, reason within budget]
      INLINE   com.oracle.graal.python.nodes.object.IsForeignObjectNodeGen$Inlined.fallbackGuard_(int, Node, Object)                               cost  140, subTreeCost    -, treeInvokes    -, reason null]
    CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Node, Object, IsForeignObjectNode, GetRegisteredClassNode)               cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
* It makes no difference for
  graalpy --experimental-options --engine.Compilation=false graalpython/com.oracle.graal.python.benchmarks/python/harness.py graalpython/com.oracle.graal.python.benchmarks/python/micro/attribute-access.py --live-results -i 10 50
… to TypeBuiltins.GetattributeNode

* Faster on graalpy --experimental-options --engine.Compilation=false graalpython/com.oracle.graal.python.benchmarks/python/harness.py graalpython/com.oracle.graal.python.benchmarks/python/micro/attribute-access.py --live-results -i 10 50
  AVG (all runs): 1.237s -> 1.218s
…avoid extra Object[] allocations in interpreter
* CastToTruffleStringChecked1Node is little overhead.
* Much better for host inlining as CastToTruffleStringChecked*Node has a 1 @InliningCutoff
  but CastToTruffleStringNode has 3 @InliningCutoff.
…ingCutoff

* And use it in LookupAttributeInMRONode.Dynamic.
* Much better for host inlining, TruffleString.EqualNode is huge for host inlining.
…t-inline it

* GetPythonObjectClassNode needs 3 references, so it might be best to not DSL-inline  to save footprint.
  OTOH, this case is frequent and not DSL-inlining it also means not host inlining it due to GR-66471.
  So we DSL-inline it in order to host-inline it.
* Actually the DSL is smart and creates a Specialization data class here, so there is no footprint overhead and this is win-win.
* It generates more than 300 method calls with host inlining and blows up the budget.
* The common case by far is to not call this node at all,
  as the attribute is typically available directly on the module,
  and there is no need to call __getattr__.
* These methods all lead to unwind, better to cutoff early
  and spend the host inlining budget on the non-exception cases.
* Host inlining already CUTOFF some method because of "leads to unwind"
  but it does not realize all of this logic leads to unwind.
* Some usages used the uncached DynamicObjectLibrary which has a @TruffleBoundary on getShapeFlags().
  Actually it seems to not have the boundary due to having a copy in the code,
  but that's brittle and wouldn't work anymore with a @GenerateUncached node instead.
* The library is used on various classes which might have different shapes,
  so the caching on the Shape seems not very useful.
…ject#pythonClass

* This simplifies GetClassNode a lot, and it still folds in single-context mode.
* In multi-context mode it would basically never fold as the object wouldn't be constant.
* This saves 2 reads (getShape().getFlags() & CLASS_CHANGED_FLAG) in multi-context mode.
* This saves a DynamicObject read (1-2 reads: getShape() + Location#get potentially in extension array) when the class has been changed.
* Use Shape#getDynamicType() instead of a constant property and flags: just a field read from the Shape instead of a property map lookup + Location#get.
  Also easier to keep in sync as everything is in a single field.
* Do not make the cached reference to the class weak in GetClassNode,
  the Shape through the constant property already referenced it strongly and the Shape still does with dynamicType.
* Much better for host inlining of GetClassNode and notably 2 @InliningCutoff vs 3 before:
  com.oracle.graal.python.nodes.object.GetClassNodeGen#execute 598 -> 482
  com.oracle.graal.python.nodes.object.GetClassNodeGen.Inlined#execute 2151 -> 2035
  com.oracle.graal.python.builtins.objects.object.ObjectBuiltinsFactory.ClassNodeFactory.ClassNodeGen#execute 657 -> 554

  Truffle host inlining completed after 2 rounds. Graph cost changed from 426 to 598 after inlining:
  Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute]
    INLINE   com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen$Inlined.executeImpl(Node, PythonAbstractObject)           cost  376, subTreeCost  243, treeInvokes    2, reason within budget]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.doReadHiddenAttrNode(Node, PythonObject, HiddenAttr$ReadNode) cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.doNativeObject(...)                                           cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
    CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode)                                               cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
  ->
  Truffle host inlining completed after 2 rounds. Graph cost changed from 363 to 482 after inlining:
  Root[com.oracle.graal.python.nodes.object.GetClassNodeGen.execute]
    INLINE   com.oracle.graal.python.nodes.object.GetClassNodeGen$GetPythonObjectClassNodeGen$Inlined.executeImpl(Node, PythonAbstractObject) cost  319, subTreeCost  186, treeInvokes    1, reason within budget]
      CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode$GetPythonObjectClassNode.doNativeObject(...)                                 cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
    CUTOFF   com.oracle.graal.python.nodes.object.GetClassNode.getForeign(Object, GetRegisteredClassNode)                                     cost    -, subTreeCost    -, treeInvokes    -, reason method annotated with @InliningCutoff]
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jul 25, 2025
eregon added 7 commits July 25, 2025 17:50
…cType()

* DynamicObjectLibrary#setDynamicType() does not allow null, and also has a non-null default (ObjectType.DEFAULT).
…nMRONode in a different node

* Better for host inlining of GetFixedAttributeNodeGen.execute:
  Before:
  Root[com.oracle.graal.python.nodes.attributes.GetFixedAttributeNodeGen.execute]
    INLINE   com.oracle.graal.python.nodes.attributes.GetFixedAttributeNode.doIt(...) cost   59, subTreeCost 6333, treeInvokes   42, reason within budget]
  After:
  Root[com.oracle.graal.python.nodes.attributes.GetFixedAttributeNodeGen.execute]
    INLINE   com.oracle.graal.python.nodes.attributes.GetFixedAttributeNode.doIt(...) cost   59, subTreeCost 4355, treeInvokes   24, reason within budget]
* Avoiding a LookupAttributeInMRONode.Dynamic is not worth this duplication.
eregon added 5 commits July 25, 2025 17:50
* Instead of having to do this optimization in callers of GetDictIfExistsNode.
…ObjectNode

* So the correct semantics and the native type optimization is always used,
  reviewing usages revealed several incorrect usages.
* Subclasses make host inlining unable to see through any `@Cached ReadAttributeFromObjectNode`.
* Add ReadAttributeFromModuleNode for the simple case of reading from a PythonModule.
* Remove extra specialization for PythonModule in ReadAttributeFromObjectNode:
  * The specialization below readObjectAttribute() does exactly the same optimization,
    so it is redundant.
  * Also checked by running micro/function-call-sized.py on jvm-ce-libgraal
    with and without --engine.Compilation=false that the performance is the same.
* Add execute() overloads for ReadAttributeFromObjectNode, more efficent and also clearer which kind of input is passed.
* Using the correct subclass is prone to errors.
* Subclasses prevent host inlining.
@graalvmbot graalvmbot force-pushed the bd/host-inlining-getattr branch from 7671234 to 8b10695 Compare July 25, 2025 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants