Improve torch_xla.compile documentation #9194


Open

sdasgup3 wants to merge 1 commit into master from gh-8859
Conversation

sdasgup3 (Collaborator):

fixes #8859

@sdasgup3 sdasgup3 force-pushed the gh-8859 branch 2 times, most recently from 69266ed to 890336c Compare May 19, 2025 19:08
@qihqi qihqi requested a review from mikegre-google May 20, 2025 18:12

## Basic Usage
- **Recompilation Overhead**: Non-core operations (e.g., data preprocessing)
could leak into the graph, triggering expensive recompilations.
Collaborator:

... can leak ....

- `torch_xla.compile` is optimized for PyTorch/XLA training workflows. Designed
to work efficiently with the XLA backend for iterative training, it's the
recommended API for compiling training loops due to its observed performance
advantages. Best practice dictates enclosing the complete training step—forward
Collaborator:

The best practice is to enclose the complete training step-forward pass, ...

Collaborator (Author):

PTAL.
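
For reference, here is a minimal sketch of the recommended pattern: the complete training step (forward pass, loss computation, backward pass, and optimizer update) is enclosed in `torch_xla.compile`. `MyModel` and `loader` are placeholder names, and the calls assume a recent PyTorch/XLA release.

```python
import torch
import torch_xla

device = torch_xla.device()
model = MyModel().to(device)                     # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(batch, labels):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch), labels)
    loss.backward()
    optimizer.step()
    return loss

# Compile the whole step so the forward pass, backward pass, and optimizer
# update are captured and optimized as a single XLA graph.
train_step = torch_xla.compile(train_step)

for batch, labels in loader:                     # hypothetical data loader
    loss = train_step(batch.to(device), labels.to(device))
```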


The benchmark results unequivocally demonstrate that eager mode combined with
`torch_xla.compile` achieves performance parity with the traditional LazyTensor
tracing mode, both yielding `147` tokens/s. This empirically validates the claim
Collaborator:

"... This shows that the new API provides a better user experience without a performance penalty. " I'm trying to keep sentence length down to ease readibility. I don't think the last phrase ("no-regret ...") is needed.

Collaborator (Author):

I did some wordsmithing. PTAL.


## Basic Usage
- **Recompilation Overhead**: Non-core operations (e.g., data preprocessing) can
Collaborator:

I think it could be clearer to explain that these non-core operations are all recorded. Maybe this phrasing:

- **Recompilation Overhead**: Whenever any part of the captured graph changes, `torch_xla.sync()` will
recompile the whole graph. Changes in non-core operations (e.g., data preprocessing) thus trigger expensive
recompilations.
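
To make the suggested phrasing concrete, here is a rough sketch of how a shape-changing preprocessing step ends up in the captured graph under lazy tracing; `loader`, `model`, `optimizer`, and `pad_to_batch_max` are hypothetical names.

```python
import torch
import torch_xla

device = torch_xla.device()

for batch, labels in loader:                     # hypothetical data loader
    batch = batch.to(device)
    # Hypothetical preprocessing whose output shape differs per batch,
    # e.g. padding to this batch's longest sequence; under lazy tracing it
    # is recorded into the same graph as the training computation.
    batch = pad_to_batch_max(batch)              # hypothetical helper
    loss = torch.nn.functional.cross_entropy(model(batch), labels.to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Because part of the captured graph changed, torch_xla.sync()
    # recompiles the whole graph instead of reusing the previous one.
    torch_xla.sync()
```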


To address these issues, PyTorch/XLA introduces an experimental eager mode
(enabled via `torch_xla.experimental.eager_mode(True)`) and the
`torch_xla.compile` API. **This shift aligns PyTorch/XLA more closely with
Collaborator:

Did you mean to emphasize this sentence? Seems a bit jarring to me but just a minor opinion.

performance**. Eager mode is likely to become the default in future releases.
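
As an illustration of that combination, here is a sketch (placeholder `MyModel`, `preprocess`, and `loader`; assumes a recent PyTorch/XLA release) in which eager mode is enabled globally and only the core computation is handed to `torch_xla.compile`:

```python
import torch
import torch_xla

# Operations outside compiled functions now execute immediately on the
# XLA device instead of being traced into a lazy graph.
torch_xla.experimental.eager_mode(True)

device = torch_xla.device()
model = MyModel().to(device)                     # hypothetical model

def forward(batch):
    return model(batch)

# Only this function is traced and compiled by XLA.
forward = torch_xla.compile(forward)

for batch in loader:                             # hypothetical data loader
    batch = preprocess(batch).to(device)         # hypothetical; runs eagerly
    logits = forward(batch)
```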

- **Eager Mode**: Executes operations immediately, enhancing flexibility and
debugging but at a performance cost for core tasks.
Collaborator:

I think "core tasks" is vague. Could simply remove: "[...] but at a performance cost."

This varying overhead means pure eager mode is not intended for main training or
inference loops. Its utility lies in non-core tasks like data preprocessing,
random number generation, custom utilities, or debugging, where immediate
execution is prioritized over throughput.
Collaborator:

We also need to call out that torch_xla.compile is independently useful, even when not in eager mode. That's why torchprime does this: https://github.com/AI-Hypercomputer/torchprime/blob/31c450e82c6273f50f9815351f6fbebb42903f58/torchprime/torch_xla_models/train.py#L330

Wrapping the training loop in torch_xla.compile provides a few benefits:

  • There's no mark_step anywhere
  • Dataloading operations don't leak into the training loop graph. This benefit is similar to if eager mode was turned on. The only difference is that the dataloading operations are captured into a separate graph as opposed to running eagerly.
  • torch_xla.compile(full_graph=True) will catch accidental graph breaks
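
A rough sketch of that pattern (placeholder names, not taken from the linked file): the training step is wrapped once with `full_graph=True`, dataloading stays outside the compiled function, and no `mark_step`/`sync` call appears in the loop.

```python
import torch
import torch_xla

device = torch_xla.device()
model = MyModel().to(device)                     # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(batch, labels):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch), labels)
    loss.backward()
    optimizer.step()
    return loss

# full_graph=True makes torch_xla.compile report an error if the step
# cannot be captured as a single graph, surfacing accidental graph breaks.
train_step = torch_xla.compile(train_step, full_graph=True)

for batch, labels in loader:                     # hypothetical data loader
    # Host-side dataloading is not part of the training-step graph, and no
    # explicit mark_step/sync is needed inside the loop.
    loss = train_step(batch.to(device), labels.to(device))
```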

Collaborator:

Since torch_xla.compile isn't really tied to eager mode, I think it could be clearer to rename the document heading as such. Alternatively, we could move the contents about torch_xla.compile to a separate compile.md, and then talk about its interaction with eager mode in this markdown.
