feat: Deduplicating recursive CTE implementation #18254
Rely on aggregate GroupValues abstraction to build a hash table of the emitted rows that is used to deduplicate.
We might make things a bit more efficient by rewriting a hash table wrapper just for deduplication, but this implementation should give a fair baseline.
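To make the idea concrete, here is a minimal sketch of deduplicating batches against a table of already emitted rows. It is not the code in this PR: it assumes a plain `std::collections::HashSet` in place of DataFusion's `GroupValues`, models a row as a `Vec<i64>`, and the names `SeenRows` and `retain_new` are hypothetical.

```rust
use std::collections::HashSet;

/// A row is modelled as a vector of integers purely for illustration.
type Row = Vec<i64>;

/// Hypothetical stand-in for the hash table of already emitted rows.
#[derive(Default)]
struct SeenRows {
    seen: HashSet<Row>,
}

impl SeenRows {
    /// Returns the rows of `batch` that were never emitted before, in their
    /// original order, and records them as seen.
    fn retain_new(&mut self, batch: Vec<Row>) -> Vec<Row> {
        batch
            .into_iter()
            // `insert` returns true only when the row was not present yet,
            // so duplicates inside the same batch are dropped as well.
            .filter(|row| self.seen.insert(row.clone()))
            .collect()
    }
}
```

Calling `retain_new` on every batch coming out of the static and recursive terms would leave only unseen rows to emit and to feed into the next recursion step.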
From my perspective this is a very nice and concise solution to the problem.
Furthermore, as I understand it, this should also terminate the recursion correctly: each unique row is pushed into the WorkTable only once, so at some point (as can be seen in the closure example) the query reaches a fixed point.
I am also thinking about test coverage. My gut feeling says there should be some test cases in the SQLite test suite that cover distinct recursion. Would this cause the extended test suite to fail? Ideally, this solution passes all these test cases now! 🥳 However, I am a bit unsure how this is set up currently.
Thank you!
CAVEAT: I am by no means a DataFusion (or recursive query) expert, so take my comments with a grain of salt.
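The termination argument above can be illustrated with a small sketch. It assumes a reachability query over a cyclic edge list (chosen here for illustration, not taken from the PR's tests) and uses a plain `HashSet` rather than DataFusion's operators; the function name `reachable` is hypothetical. Without the `seen` check the loop would cycle forever on this kind of input.

```rust
use std::collections::{BTreeSet, HashSet};

/// Computes the nodes reachable from `start`, mimicking a
/// `WITH RECURSIVE ... UNION ...` query over an edge list.
/// Returns the reachable set and the number of recursive steps executed.
fn reachable(edges: &[(u32, u32)], start: u32) -> (BTreeSet<u32>, usize) {
    let mut seen: HashSet<u32> = HashSet::new();
    // Static term: the start node is emitted once and seeds the work table.
    seen.insert(start);
    let mut work_table = vec![start];
    let mut steps = 0;
    // Recursive term: repeat until a step contributes no unseen row.
    while !work_table.is_empty() {
        steps += 1;
        let mut next = Vec::new();
        for &node in &work_table {
            for &(from, to) in edges {
                // Only rows that were never emitted enter the next work table.
                if from == node && seen.insert(to) {
                    next.push(to);
                }
            }
        }
        work_table = next;
    }
    (seen.into_iter().collect(), steps)
}
```

Since the set of possible rows is finite and each row enters the work table at most once, the work table must eventually become empty, which is the fixed point.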
Which issue does this PR close?
UNION in recursive CTE #18140.
Rationale for this change
Implements deduplicating recursive CTE (i.e. UNION inside of WITH RECURSIVE) using a hash table. I reuse the one from aggregates to avoid rebuilding a full wrapper and specialization for types. Each time a batch is returned by the static or the recursive term of the CTE, the hash table is used to remove already seen rows before emitting the rows and keeping them in memory for the next recursion step.
What changes are included in this PR?
Reusing GroupValues trait implementations inside of RecursiveQueryExec to get deduplication working.
Are these changes tested?
Yes, some sqllogictests have been added, including ones that would lead to infinite recursion if deduplication were disabled.
Are there any user-facing changes?
No