Refactor sort tiebreaker implementation#2781
Open
evansd wants to merge 10 commits into
Open
Conversation
We exercise this a bit in the spec tests, but the constraints of the spec tests (in particular the single result column) makes it hard to do so fully.
At present we're forced to do the right thing here by the query model validation rules because we handle the tiebreaking logic by rewriting the query itself. But we may not always do that and so we want an integration test to enforce that this works.
This is captured in discussion on issues, and explained a bit in comments elsewhere, but I don't think this was explained properly in one place in the source code before.
Deploying databuilder-docs with
|
| Latest commit: |
e916917
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://871c0072.databuilder.pages.dev |
| Branch Preview URL: | https://evansd-sorting.databuilder.pages.dev |
This will allow us to resolve ties after other sorts have already been applied which means we can do it without having to rewrite the query.
rebkwok
approved these changes
May 11, 2026
| such operation with the set of columns that are selected from it elsewhere in the | ||
| query. It would be nicer not to have to do this, but given the above constraints I | ||
| think it's the best practical solution. | ||
| """ |
Contributor
There was a problem hiding this comment.
Thank you for this comment! I think I'd figured all of this out reviewing the previous PR, but it was still confusing and it was nice to read it explicitly explained.
It might be worth commenting that the "pre-processing of the query graph and annotate each such operation with the set of columns that are selected from it elsewhere in the query" is done in the query enqines' sort methods (those reference this function, but not vice versa). If I was just reading this comment in isolation, I think I'd expect for rewrite_sorts to be doing more than it is now.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This takes a completely different, and overall simpler, approach to implementing our "sort tiebreaking" logic.
The previous code worked by rewriting the supplied query model to inject additional sort operations. It has been the source of various bugs, attempted bug fixes, and new bugs caused by those attempted fixes. It currently has a bug we don't know how to fix (although hopefully one that is very unlikely to directly impact users at the moment).
The root cause of the complexity here is that the kind of graph transform this involves is not "referentially transparent" (almost certain the wrong term here, but the one that came to mind). That is, it doesn't just involve replacing references to node X with node Y; it involves replacing X with Y1 when it occurs as a child of Z and Y2 when it doesn't.
This is much harder problem than the other kinds of transformation we do. Obviously it's not intractable, but neither is it straightforward; and if we can avoid it altogether than our code will be simpler and less buggy.
An alternative solution is, rather than trying to implement the behaviour we want by rewriting the query, to make it the responsibility of the query engines to do the right thing. This has some major advantages:
I have thrown Hypothesis at this locally and it hasn't found anything so far:
I've also tried to make sure that this tiebreaking behaviour, and the reasoning behind it, is adequately documented in the codebase itself rather than just in issues/PRs.