Refactor sort tiebreaker implementation #2781

Open
evansd wants to merge 10 commits into main from evansd/sorting

@evansd evansd commented Apr 30, 2026

This takes a completely different, and overall simpler, approach to implementing our "sort tiebreaking" logic.

The previous code worked by rewriting the supplied query model to inject additional sort operations. It has been the source of various bugs, attempted bug fixes, and new bugs caused by those attempted fixes. It currently has a bug we don't know how to fix (although hopefully one that is very unlikely to directly impact users at the moment).

The root cause of the complexity here is that the kind of graph transform this involves is not "referentially transparent" (almost certainly the wrong term here, but the one that came to mind). That is, it doesn't just involve replacing references to node X with node Y; it involves replacing X with Y1 when it occurs as a child of Z and Y2 when it doesn't.
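To make the distinction concrete, here is a toy sketch. The `Node` class and both helper functions are hypothetical and are not ehrQL's query model; they only illustrate why a single old-to-new mapping is enough for the first kind of rewrite but not for the second, where each occurrence's parent has to be known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical stand-in for a query model node
    name: str
    children: tuple["Node", ...] = ()


def replace_everywhere(node, mapping):
    """Context-free rewrite: every occurrence of X becomes mapping[X]."""
    if node in mapping:
        return mapping[node]
    return Node(node.name, tuple(replace_everywhere(c, mapping) for c in node.children))


def replace_by_context(node, target, under_z, elsewhere, parent=None):
    """Context-dependent rewrite: the replacement depends on the parent node."""
    if node == target:
        is_child_of_z = parent is not None and parent.name == "Z"
        return under_z if is_child_of_z else elsewhere
    return Node(
        node.name,
        tuple(replace_by_context(c, target, under_z, elsewhere, node) for c in node.children),
    )


# The same node X is shared by two parents, Z and W
x = Node("X")
query = Node("Root", (Node("Z", (x,)), Node("W", (x,))))

simple = replace_everywhere(query, {x: Node("Y")})
contextual = replace_by_context(query, x, Node("Y1"), Node("Y2"))
```

In the second function the traversal has to carry the parent along, so a shared node can no longer be rewritten once and reused wherever it is referenced.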

This is a much harder problem than the other kinds of transformation we do. Obviously it's not intractable, but neither is it straightforward; and if we can avoid it altogether then our code will be simpler and less buggy.

An alternative solution is, rather than trying to implement the behaviour we want by rewriting the query, to make it the responsibility of the query engines to do the right thing. This has some major advantages:

  • It is much simpler.
  • To the extent that it is not entirely simple, the complexity is contained in the in-memory query engine.
  • Previously all the really nasty code was shared between the SQL and in-memory query engines meaning that the generative tests weren't really exercising it: if there was a bug then it would generally affect all the engines in the same way so it wouldn't be exposed. The SQL and in-memory engines now have quite divergent implementations of the same intended behaviour so it's much easier for the generative tests to expose any issues.
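The last point can be sketched with a small, self-contained example. This is not the ehrQL implementation or its actual tiebreaking rule: the column names, both sort functions, and the hand-rolled random check (standing in for Hypothesis) are assumptions made purely for illustration, and null handling is ignored. The idea is that two independently written implementations of the same ordering can be checked against each other on generated inputs.

```python
import random


def sort_with_composite_key(rows, sort_col, tiebreak_cols):
    # One pass: order by the main column, then by each tiebreak column in turn
    return sorted(rows, key=lambda r: (r[sort_col], *(r[c] for c in tiebreak_cols)))


def sort_with_stable_passes(rows, sort_col, tiebreak_cols):
    # Several passes relying on sort stability, least significant column first
    result = list(rows)
    for col in reversed([sort_col, *tiebreak_cols]):
        result.sort(key=lambda r: r[col])
    return result


def check_agreement(examples=200, seed=0):
    # Generate small random tables and require both implementations to agree
    rng = random.Random(seed)
    for _ in range(examples):
        rows = [
            {"date": rng.randint(0, 3), "code": rng.randint(0, 3), "value": rng.randint(0, 3)}
            for _ in range(rng.randint(0, 8))
        ]
        a = sort_with_composite_key(rows, "date", ["code", "value"])
        b = sort_with_stable_passes(rows, "date", ["code", "value"])
        assert a == b, (rows, a, b)
    return True
```

If both functions shared one helper for the ordering logic, a bug in that helper would make them agree on the wrong answer, which is the situation described for the previous shared rewriting code.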

I have thrown Hypothesis at this locally and it hasn't found anything so far:

GENTEST_MAX_DEPTH=25 GENTEST_EXAMPLES=10000 \
  pytest tests/generative/test_query_model.py::test_query_model

I've also tried to make sure that this tiebreaking behaviour, and the reasoning behind it, is adequately documented in the codebase itself rather than just in issues/PRs.

evansd added 4 commits April 30, 2026 18:40
We exercise this a bit in the spec tests, but the constraints of the
spec tests (in particular the single result column) make it hard to do
so fully.

At present we're forced to do the right thing here by the query model
validation rules because we handle the tiebreaking logic by rewriting
the query itself. But we may not always do that and so we want an
integration test to enforce that this works.

This is captured in discussion on issues, and explained a bit in
comments elsewhere, but I don't think this was explained properly in one
place in the source code before.

cloudflare-workers-and-pages Bot commented May 1, 2026

Deploying databuilder-docs with Cloudflare Pages

Latest commit: e916917
Status: ✅  Deploy successful!
Preview URL: https://871c0072.databuilder.pages.dev
Branch Preview URL: https://evansd-sorting.databuilder.pages.dev

such operation with the set of columns that are selected from it elsewhere in the
query. It would be nicer not to have to do this, but given the above constraints I
think it's the best practical solution.
"""

Thank you for this comment! I think I'd figured all of this out reviewing the previous PR, but it was still confusing and it was nice to read it explicitly explained.
It might be worth commenting that the "pre-processing of the query graph and annotate each such operation with the set of columns that are selected from it elsewhere in the query" is done in the query engines' sort methods (those reference this function, but not vice versa). If I was just reading this comment in isolation, I think I'd expect rewrite_sorts to be doing more than it is now.
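
For readers without the diff to hand, the annotation step quoted above might look roughly like the following. This is only a sketch under assumptions: the `Sort` and `SelectColumn` classes and the `columns_selected_from_sorts` function are hypothetical, it is not the code in this PR or the real `rewrite_sorts`, and it takes a flat list of nodes rather than walking a real query graph.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sort:
    # Hypothetical sort operation over a named table
    table: str
    sort_by: str


@dataclass(frozen=True)
class SelectColumn:
    # Hypothetical "select column `name` from `source`" node
    source: Sort
    name: str


def columns_selected_from_sorts(query_nodes):
    """Map each Sort node to the set of column names selected from it."""
    selected = defaultdict(set)
    for node in query_nodes:
        if isinstance(node, SelectColumn) and isinstance(node.source, Sort):
            selected[node.source].add(node.name)
    return dict(selected)


by_date = Sort(table="events", sort_by="date")
query_nodes = [
    SelectColumn(by_date, "code"),
    SelectColumn(by_date, "value"),
    SelectColumn(by_date, "code"),  # a repeated selection is recorded once
]
annotations = columns_selected_from_sorts(query_nodes)
```

Each sort then carries the set of columns that are selected from it elsewhere in the query, which is the information the quoted docstring says is attached to each such operation.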
