[SPARK-52545][SQL] Standardize double-quote escaping to follow SQL specification#51242
[SPARK-52545][SQL] Standardize double-quote escaping to follow SQL specification#51242dengziming wants to merge 7 commits intoapache:masterfrom
Conversation
|
Can we check ANSI standard, and if they clarify this syntax? Then I think we can cover this change under ANSI flag. |
"" as " in double-quoted literal
Hello @HyukjinKwon . I made a quick check, it seems ANSI standard supports using single-quote-quote as literals and double-quote-quote as identifiers, for example, I will make a more careful check, but how can we support 2 different dialects using one antlr lexer config, could you give a demo PR which implemented similar functions, and I can copy. |
|
If it complies ANSI, let's cover this under ANSI configuration, |
|
Hi @HyukjinKwon For string literal: For identifier: I don't think I need to cover this under ANSI configuration, because #38147 already did it. Our behaviors will be as follows (dqi= doubleQuotedIdentifiers): I only changed the way we parse string concatenation, both in ANSI-mode and hive-mode, we should always treat 'O''Reilly' as O'Reilly, the previous way of treating 'O''Reilly' as concat(O, Reilly) can be regarded as a bug. ping @gengliangwang to take a look as you are the author of #38147 |
|
ping @gengliangwang , also cc @cloud-fan and @srielau since you participated in #38022 |
I like it in general. My worry is that the lexer solution does not permit control via a config. So we have a potential to break someone. A solution that hooks into the implicit concat code would allow you control. |
|
@srielau Thank you for your review, If I understand correctly, you mean that we may have some users write So we want to make these SQL bug-compatible by adding a config before |
|
|
||
|
|
||
| -- !query | ||
| select 1 from "not_""exists" |
There was a problem hiding this comment.
do we have similar behavior for the backtick quoting?
I'm not worried about the use of double quote for identifiers. I'm worried about thr use of double and single quotes for strings. |
I think you are considering situations like |
|
|
@srielau Thank you for your clarification, and I got your idea, we have 2 choices to implement this.
I tried solution 2, you can take a look at my latest commit, I will submit some test cases after we aggre on the solution. |
| .internal() | ||
| .doc("When true, consecutive quotes in string literals (e.g. \"\" or '') are treated as " + | ||
| "literal characters rather than escape sequences. This preserves pre-Spark 3.0 behavior " + | ||
| "where \"a\"\"b\" would be parsed as 'a\"b' instead of 'ab'.") |
There was a problem hiding this comment.
to make it more readable, can we use single quote string literal here?
There was a problem hiding this comment.
Yes, the original doc also contained some error and I refined them.
|
|
||
| def escapedStringLiterals: Boolean = getConf(ESCAPED_STRING_LITERALS) | ||
|
|
||
| def legacyConsecutiveStringLiterals: Boolean = getConf(LEGACY_CONSECUTIVE_STRING_LITERALS) |
There was a problem hiding this comment.
it's only used once, we can inline it.
| assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa") | ||
|
|
||
| // Double-quote escaping ("") | ||
| assert(unescapeSQLString("\"\"\"aa\"\"\"") == "\"aa\"") |
There was a problem hiding this comment.
can we write the test with scala three quotes string literal """xxx"""?
There was a problem hiding this comment.
That can make it more readable, done.
|
|
||
|
|
||
| -- !query | ||
| select 1 from "not_""exists" |
There was a problem hiding this comment.
This PR updates the string literal parsing. We don't need to test so many different SQL statements that contain strig literal, but simply SELECT string literal and focus more on the string literal itself
SELECT 'a''b'
SELECT 'a' 'b'
...
There was a problem hiding this comment.
It's indeed overkill to test these double-quoted identifiers here so I removed most of the unrelated tests, but I still remained select 1 from "not_""exists" since this PR also changed the behavior when handling double-quoted identifiers.
dengziming
left a comment
There was a problem hiding this comment.
@cloud-fan Thank you for your review, I resolved your comments, PTAL again.
| .internal() | ||
| .doc("When true, consecutive quotes in string literals (e.g. \"\" or '') are treated as " + | ||
| "literal characters rather than escape sequences. This preserves pre-Spark 3.0 behavior " + | ||
| "where \"a\"\"b\" would be parsed as 'a\"b' instead of 'ab'.") |
There was a problem hiding this comment.
Yes, the original doc also contained some error and I refined them.
|
|
||
| def escapedStringLiterals: Boolean = getConf(ESCAPED_STRING_LITERALS) | ||
|
|
||
| def legacyConsecutiveStringLiterals: Boolean = getConf(LEGACY_CONSECUTIVE_STRING_LITERALS) |
| assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa") | ||
|
|
||
| // Double-quote escaping ("") | ||
| assert(unescapeSQLString("\"\"\"aa\"\"\"") == "\"aa\"") |
There was a problem hiding this comment.
That can make it more readable, done.
|
|
||
|
|
||
| -- !query | ||
| select 1 from "not_""exists" |
There was a problem hiding this comment.
It's indeed overkill to test these double-quoted identifiers here so I removed most of the unrelated tests, but I still remained select 1 from "not_""exists" since this PR also changed the behavior when handling double-quoted identifiers.
| assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa") | ||
|
|
||
| // Double-quote escaping ("", '') | ||
| assert(unescapeSQLString(""""a""a"""") == "a\"a") |
There was a problem hiding this comment.
| assert(unescapeSQLString(""""a""a"""") == "a\"a") | |
| assert(unescapeSQLString(""" "a""a" """) == """a"a""") |
There was a problem hiding this comment.
Done, also improved the lines below.
|
|
||
|
|
||
| -- !query | ||
| SELECT "S""par""k", "S\"par\"k", 'S""par""k' |
There was a problem hiding this comment.
can we test the cases when there are spaces in the middle?
There was a problem hiding this comment.
That's a good idea, I added some cases with various usage of " and ' and ' '.
| assert(unescapeSQLString("\"abc\\uAAAXa\"") == "abcuAAAXa") | ||
|
|
||
| // Double-quote escaping ("", '') | ||
| assert(unescapeSQLString(""""a""a"""") == """a"a""") |
There was a problem hiding this comment.
super nit
| assert(unescapeSQLString(""""a""a"""") == """a"a""") | |
| assert(unescapeSQLString(""" "a""a" """.trim) == """ a"a """.trim) |
then it's cleaner what is the input and output
|
|
||
|
|
||
| -- !query | ||
| SELECT "S""par""k", "S\"par\"k", 'S""par""k' |
There was a problem hiding this comment.
super nit
| SELECT "S""par""k", "S\"par\"k", 'S""par""k' | |
| SELECT "S""par""k" AS c1, "S\"par\"k" AS c2, 'S""par""k' AS c3 |
to make it easier to match the result with the schema.
There was a problem hiding this comment.
That's nitpicky and reasonable. I fixed them all.
|
The failed tests are unrelated, thanks, merging to master! |

What changes were proposed in this pull request?
This PR standardizes Spark's string literal parsing behavior to align with other database systems (MySQL, Hive, etc.) by implementing proper quote escaping rules:
Previous Behavior (Inconsistent):
New Behavior (Standard-Compliant):
Why are the changes needed?
Current behavior incorrectly treats consecutive quotes as string concatenation rather than escaping, now aligns with major databases:
MySQL: Reference
Hive: Hplsql.g4 implementation
Does this PR introduce any user-facing change?
Yes.
SET spark.sql.legacy.consecutiveStringLiterals.enabled=true;How was this patch tested?
Added some SQL tests based on the existing test file.
Was this patch authored or co-authored using generative AI tooling?
No.