
perf: Avoid unnecessary URL processing while parsing links #13132

Open · wants to merge 1 commit into main
Conversation

@ichard26 (Member) commented Dec 27, 2024

There are three optimizations in this commit, in descending order of impact (a combined sketch follows the footnotes):

  • If the file URL in the "project detail" response is already absolute, then avoid calling urljoin() as it's expensive (mostly because it calls urlparse() on both of its URL arguments) and does nothing. While it'd be more correct to check whether the file URL has a scheme, we'd need to parse the URL which is what we're trying to avoid in the first place. Anyway, by simply checking if the URL starts with http[s]://, we can avoid slow urljoin() calls for PyPI responses.

  • Replacing urllib.parse.urlparse() with urllib.parse.urlsplit() in _ensure_quoted_url(). The two parsing functions are equivalent for our needs[^1]. However, urlsplit() is faster, and we get better utilization of its internal cache by calling it directly[^2].

  • Calculating the Link.path property in advance as it's very hot.

Footnotes

[^1]: we don't care about URL parameters AFAIK (which are different from the query component!)

[^2]: urlparse() calls urlsplit() internally, but it passes the authority parameter (unlike any of our calls) so it bypasses the cache.
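To make the above concrete, here is a rough, self-contained sketch of all three ideas together. The names and structure are illustrative only, not pip's actual implementation:

```python
from urllib.parse import urljoin, urlsplit


def _absolutize_file_url(file_url: str, base_url: str) -> str:
    # Optimization 1: PyPI's "project detail" responses already contain
    # absolute URLs, so urljoin() would return them unchanged. A simple
    # prefix check is far cheaper than parsing both URLs.
    if file_url.startswith(("https://", "http://")):
        return file_url
    return urljoin(base_url, file_url)


class Link:
    def __init__(self, url: str) -> None:
        self.url = url
        # Optimizations 2 & 3: parse with urlsplit() instead of
        # urlparse() (it's faster, and calling it directly makes better
        # use of its internal cache), and compute the hot path value
        # once at construction time rather than on every access.
        self.path = urlsplit(url).path
```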

@ichard26 added the type: performance (Commands take too long to run) label on Dec 27, 2024
@ichard26 (Member Author) commented Dec 27, 2024

As an example of where this matters, with #13128 already applied, this saves about 1600 ms while collecting and resolving a list of homeassistant dependencies:

Command: `python -m cProfile -o profile2.pstats -m pip install -r temp/homeassistant/requirements.txt --dry-run`

*(Profile screenshots, before and after, omitted.)*

| Method | Before | After |
| --- | --- | --- |
| `Link.from_json` | 3310 ms | 1730 ms |
| `LinkEvaluator.evaluate_link` | 1040 ms | 990 ms |

And while this depends on network performance (so please look at the elapsed time, not the percentages, which may be off), the entire command takes ~16-18 seconds, so the savings are significant.
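For anyone reproducing this: the standard-library pstats module is enough to inspect the profile output. A minimal sketch (not part of the PR):

```python
import pstats

# Load the cProfile output written by `-o profile2.pstats` and print
# the 20 entries with the highest cumulative time.
stats = pstats.Stats("profile2.pstats")
stats.sort_stats("cumulative").print_stats(20)
```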

```python
# correct to parse the file URL and check if it has a scheme, but the
# slow URL parsing urljoin() does is what we're trying to avoid in the
# first place, so we only check for the http[s]:// prefix.)
if file_url.startswith(("https://", "http://")):
```
Member:
Are we sure we don't need to support other schemes here? What about file://?

Member Author (@ichard26):

This entire function is meant to be an optimization. We used to call urljoin() for every single file URL, but that was pointless for PyPI simple pages as their file URLs are already absolute (and thus urljoin() would return said URL unchanged). So it's not a matter of support, but rather whether it's worth optimizing for file://. How common are file:// URLs in a simple index response? They'd only make sense with a local dev server (maybe devpi?)

I did just check, and apparently relative file:// URLs are not a thing, so it'd be totally fine to also check for that scheme, but I'm curious whether you know where they'd show up.
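For illustration, supporting it would only mean extending the prefix tuple (a sketch, not part of this PR):

```python
# Hypothetical extension of the fast path to also cover file:// URLs.
if file_url.startswith(("https://", "http://", "file://")):
    return file_url
```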

Member:

Not optimising file:// is fine by me, but we should make this clear in the function name (e.g. call it _absolute_http_url).

Member:

Doesn't this function still return an absolute URL regardless? I thought this optimization was an implementation detail?

@ichard26 added this to the 25.0 milestone on Dec 31, 2024
@ichard26 requested a review from uranusjr on January 1, 2025