Rewrite datafusion-sqlancer in Rust #14535
Comments
I bet we are not the only project that would like to have SQLancer-type support in Rust. Perhaps we can join forces (e.g. with RisingWave). P.S. @Xuanwo, here is another example of "rewrite it in Rust!" |
That's really cool! Databend will definitely be interested in building this project. So will iceberg-rust. |
Hello, I am interested in applying to work on this project for GSoC. After reading through #11030, it looks like the three testing oracles that have been implemented from SQLancer are NoREC, TLP, and PQS. Were those chosen because they were the easiest to implement, or was there something about how they test DataFusion specifically? |
👋🏼 They're implemented first because
To make fuzzing more specific to DataFusion, I think the most needed is configuration fuzzing or data source fuzzing. To make |
Thank you very much for your quick and thorough response. I'll keep digging |
Hi @Xuanwo 😄 , how's it going? |
Is your feature request related to a problem or challenge?
This is a project idea for GSoC 2025 (#14478).
datafusion-sqlancer is a SQL-level fuzz testing implementation for DataFusion (#11030).
Current implementation status
datafusion-sqlancer covers a subset of SQL features and data types, and implements 3 relatively simple testing oracles [1]. With occasional manual runs, around 50 bugs have been found.
The implementation is in Java, and it's a fork of the original SQLancer.
Why rewrite in Rust
SQLancer was originally implemented in Java for very good reasons: it has to test the effectiveness of several testing oracles on many major databases, and JDBC is a common interface to them.
DataFusion's SQLancer implementation currently extends the SQLancer framework, which has saved us some effort on CLI parsing, result comparison, etc.
There are several reasons I think it's a good idea to rewrite in Rust at this point:
sqllogictests
datafusion-sqlancer consists of two modules: random query generation, and property validation for test oracles. Those properties can also be applied to enhance existing SQL tests. If we have those properties implemented in Rust, enhancing existing sqllogictests would be easier.
Currently only 3 simple test oracles have been implemented, and I believe around 10 novel SQL testing algorithms have been proposed. One example is Equivalent Expression Transformation (EET, https://www.usenix.org/conference/osdi24/presentation/jiang), which I think is very suitable for enhancing existing SQL tests.
Overall, I think it's a good time to switch to a native Rust implementation before implementing more complex testing algorithms.
One simplification is that we no longer have to use JDBC to connect the testing framework to DataFusion core. Configuration fuzzing can be easier, and there might be some existing code we can reuse.
The DataFusion ecosystem is mainly in Rust. IMO it would be easier to find people to help if the testing framework is written in Rust instead of Java.
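To make the "two modules" split above concrete, here is a minimal Rust sketch of what a random predicate generator plus a TLP-style property check could look like. Everything in it is an assumption for illustration: the names `Lcg`, `random_predicate`, `tlp_partitions`, and `check_tlp` are hypothetical and do not come from datafusion-sqlancer or DataFusion, the engine is abstracted as a closure that returns a row count for a SQL string, and comparing row counts is a simplification of TLP, which compares full result sets.

```rust
/// Hypothetical sketch; none of these names exist in datafusion-sqlancer.

/// Tiny deterministic pseudo-random generator, so the sketch needs no crates.
pub struct Lcg(pub u64);

impl Lcg {
    pub fn next_u64(&mut self) -> u64 {
        // Constants from a common 64-bit linear congruential generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Module 1: random query generation (here, only a simple comparison predicate).
/// Panics if `columns` is empty.
pub fn random_predicate(rng: &mut Lcg, columns: &[&str]) -> String {
    let ops = ["=", "<", ">", "<>"];
    let col = columns[(rng.next_u64() % columns.len() as u64) as usize];
    let op = ops[(rng.next_u64() % ops.len() as u64) as usize];
    let lit = rng.next_u64() % 100;
    format!("{} {} {}", col, op, lit)
}

/// The three partitioning queries of Ternary Logic Partitioning (TLP):
/// rows where the predicate is TRUE, FALSE, and NULL respectively.
pub fn tlp_partitions(table: &str, predicate: &str) -> [String; 3] {
    [
        format!("SELECT * FROM {} WHERE ({})", table, predicate),
        format!("SELECT * FROM {} WHERE NOT ({})", table, predicate),
        format!("SELECT * FROM {} WHERE ({}) IS NULL", table, predicate),
    ]
}

/// Module 2: property validation. `row_count` stands in for the engine under
/// test; it runs a SQL string and returns how many rows came back.
pub fn check_tlp<F>(table: &str, predicate: &str, mut row_count: F) -> Result<(), String>
where
    F: FnMut(&str) -> Result<usize, String>,
{
    let total = row_count(&format!("SELECT * FROM {}", table))?;
    let mut partitioned = 0;
    for query in tlp_partitions(table, predicate) {
        partitioned += row_count(&query)?;
    }
    if total == partitioned {
        Ok(())
    } else {
        Err(format!(
            "TLP violation for predicate `{}`: {} rows unpartitioned, {} rows across partitions",
            predicate, total, partitioned
        ))
    }
}
```

Because the property check only depends on a closure, the same function could in principle be driven by a fuzzer-generated predicate or by a predicate taken from an existing SQL test, which is the reuse the paragraph above is hoping for.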
Describe the solution you'd like
See #11030 for the background
Statement
) sqllogictest framework
Describe alternatives you've considered
I believe the project idea proposed above is advanced in terms of difficulty.
A medium-level project could extend the existing implementation with more SQL and type support and more test oracles, along with better CI integration.
I'm also open to a fully LLM-based alternative; however, I don't have a very good idea so far. Reference: https://fuzz4all.github.io/
Additional context
No response
Footnotes
[1] https://github.com/apache/datafusion/issues/11030 has a minimal example of the NoREC testing oracle.