This repository was archived by the owner on Jun 2, 2025. It is now read-only.
Fix identifier (un)escaping #47
Merged
7 commits
d9fa2b2 Improve MySQL string unquoting and move it to the tokenizer level (JanJakes)
7e972f1 Fix default value formatting in SHOW CREATE TABLE, improve tests (JanJakes)
240bbf8 Support table, column, and index comments, and test encoding (JanJakes)
931c82b Improve escaping clarity and docs (JanJakes)
6e5a8f5 Implement support for NO_BACKSLASH_ESCAPES SQL mode (JanJakes)
5196a05 Add a test for quote_mysql_utf8_string_literal() (JanJakes)
20f82be Improve invalid UTF-8 test cases and their docs (JanJakes)
@@ -7,6 +7,33 @@
 * and consumed by WP_MySQL_Parser during the parsing process.
 */
class WP_MySQL_Token extends WP_Parser_Token {
	/**
	 * Whether the NO_BACKSLASH_ESCAPES SQL mode is enabled.
	 *
	 * @var bool
	 */
	private $sql_mode_no_backslash_escapes_enabled;

	/**
	 * Constructor.
	 *
	 * @param int $id Token type.
	 * @param int $start Byte offset in the input where the token begins.
	 * @param int $length Byte length of the token in the input.
	 * @param string $input Input bytes from which the token was parsed.
	 * @param bool $sql_mode_no_backslash_escapes_enabled Whether the NO_BACKSLASH_ESCAPES SQL mode is enabled.
	 */
	public function __construct(
		int $id,
		int $start,
		int $length,
		string $input,
		bool $sql_mode_no_backslash_escapes_enabled
	) {
		parent::__construct( $id, $start, $length, $input );
		$this->sql_mode_no_backslash_escapes_enabled = $sql_mode_no_backslash_escapes_enabled;
	}

	/**
	 * Get the name of the token.
	 *
@@ -24,6 +51,123 @@ public function get_name(): string {
		return $name;
	}

	/**
	 * Get the real unquoted value of the token.
	 *
	 * @return string The token value.
	 */
	public function get_value(): string {
		$value = $this->get_bytes();
		if (
			WP_MySQL_Lexer::SINGLE_QUOTED_TEXT === $this->id
			|| WP_MySQL_Lexer::DOUBLE_QUOTED_TEXT === $this->id
			|| WP_MySQL_Lexer::BACK_TICK_QUOTED_ID === $this->id
		) {
			// Remove bounding quotes.
			$quote = $value[0];
			$value = substr( $value, 1, -1 );

			/*
			 * When the NO_BACKSLASH_ESCAPES SQL mode is enabled, we only need to
			 * handle escaped bounding quotes, as the other characters preserve
			 * their literal values.
			 */
			if ( $this->sql_mode_no_backslash_escapes_enabled ) {
				return str_replace( $quote . $quote, $quote, $value );
			}

			/**
			 * Unescape MySQL escape sequences.
			 *
			 * MySQL string literals use backslash as an escape character, and
			 * the string bounding quotes can also be escaped by being doubled.
			 *
			 * The escaping is done according to the following rules:
			 *
			 * 1. Some special character escape sequences are recognized.
			 *    For example, "\n" is a newline character, "\0" is ASCII NULL.
			 * 2. A specific treatment is applied to "\%" and "\_" sequences.
			 *    This is due to their special meaning for pattern matching.
			 * 3. Other backslash-prefixed characters resolve to their literal
			 *    values. For example, "\x" represents "x", "\\" represents "\".
			 *
			 * Despite looking similar, these rules are different from the C-style
			 * string escaping, so we cannot use "strip(c)slashes()" in this case.
			 *
			 * See: https://dev.mysql.com/doc/refman/8.4/en/string-literals.html
			 */
			$backslash = chr( 92 );
			$replacements = array(
				/*
				 * MySQL special character escape sequences.
				 */
				( $backslash . '0' ) => chr( 0 ), // An ASCII NULL character (\0).
				( $backslash . "'" ) => chr( 39 ), // A single quote character (').
				( $backslash . '"' ) => chr( 34 ), // A double quote character (").
				( $backslash . 'b' ) => chr( 8 ), // A backspace character.
				( $backslash . 'n' ) => chr( 10 ), // A newline (linefeed) character (\n).
				( $backslash . 'r' ) => chr( 13 ), // A carriage return character (\r).
				( $backslash . 't' ) => chr( 9 ), // A tab character (\t).
				( $backslash . 'Z' ) => chr( 26 ), // An ASCII 26 (Control+Z) character.

				/*
				 * Normalize escaping of "%" and "_" characters.
				 *
				 * MySQL has unusual handling for "\%" and "\_" in all string literals.
				 * While other sequences follow the C-style escaping ("\?" is "?", etc.),
				 * "\%" resolves to "\%" and "\_" resolves to "\_" (unlike in C strings).
				 *
				 * This means that "\%" behaves like "\\%", and "\_" behaves like "\\_".
				 * To preserve this behavior, we need to add a second backslash here.
				 *
				 * From https://dev.mysql.com/doc/refman/8.4/en/string-literals.html:
				 * > The \% and \_ sequences are used to search for literal instances
				 * > of % and _ in pattern-matching contexts where they would otherwise
				 * > be interpreted as wildcard characters. If you use \% or \_ outside
				 * > of pattern-matching contexts, they evaluate to the strings \% and
				 * > \_, not to % and _.
				 */
				( $backslash . '%' ) => $backslash . $backslash . '%',
				( $backslash . '_' ) => $backslash . $backslash . '_',

				/*
				 * Preserve a double backslash as-is, so that the trailing backslash
				 * is not consumed as the beginning of an escape sequence like "\n".
				 *
				 * Resolving "\\" to "\" will be handled in the next step, where all
				 * other backslash-prefixed characters resolve to their literal values.
				 */
				( $backslash . $backslash )
					=> $backslash . $backslash,

				/*
				 * The bounding quotes can also be escaped by being doubled.
				 */
				( $quote . $quote ) => $quote,
			);

			/*
			 * Apply the replacements.
			 *
			 * It is important to use "strtr()" and not "str_replace()", because
			 * "str_replace()" applies replacements one after another, modifying
			 * intermediate changes rather than just the original string:
			 *
			 * - str_replace( [ 'a', 'b' ], [ 'b', 'c' ], 'ab' ); // 'cc' (bad)
			 * - strtr( 'ab', [ 'a' => 'b', 'b' => 'c' ] ); // 'bc' (good)
			 */
			$value = strtr( $value, $replacements );

			/*
			 * A backslash with any other character represents the character itself.
			 * That is, \x evaluates to x, \\ evaluates to \, and \🙂 evaluates to 🙂.
			 */
			$preg_quoted_backslash = preg_quote( $backslash );
			$value = preg_replace( "/$preg_quoted_backslash(.)/u", '$1', $value );
		}
		return $value;
	}

	/**
	 * Get the token representation as a string.
	 *

Review comment on the "strtr()" note: Such a brilliant find ❤️
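
To make the two-step unescaping in this hunk easier to check by hand, here is a minimal standalone sketch. The function name `sketch_unquote_mysql_literal()` is hypothetical and not part of the PR; it only mirrors the steps of `get_value()` for one quoted literal, and it assumes the input is valid UTF-8 with matching bounding quotes.

```php
<?php
/**
 * Hypothetical helper (not in the PR): mirrors the unquoting steps of
 * get_value() for a single quoted literal such as 'it''s'.
 */
function sketch_unquote_mysql_literal( string $literal, bool $no_backslash_escapes = false ): string {
	// Remove bounding quotes.
	$quote = $literal[0];
	$value = substr( $literal, 1, -1 );

	// With NO_BACKSLASH_ESCAPES, only doubled bounding quotes are special.
	if ( $no_backslash_escapes ) {
		return str_replace( $quote . $quote, $quote, $value );
	}

	$backslash    = chr( 92 );
	$replacements = array(
		$backslash . '0'        => chr( 0 ),
		$backslash . "'"        => chr( 39 ),
		$backslash . '"'        => chr( 34 ),
		$backslash . 'b'        => chr( 8 ),
		$backslash . 'n'        => chr( 10 ),
		$backslash . 'r'        => chr( 13 ),
		$backslash . 't'        => chr( 9 ),
		$backslash . 'Z'        => chr( 26 ),
		// "\%" and "\_" keep their backslash, so add one for step 2 to consume.
		$backslash . '%'        => $backslash . $backslash . '%',
		$backslash . '_'        => $backslash . $backslash . '_',
		// Keep "\\" intact so its second backslash cannot start "\n" etc.
		$backslash . $backslash => $backslash . $backslash,
		$quote . $quote         => $quote,
	);

	// Step 1: single pass over the original string, no rescanning.
	$value = strtr( $value, $replacements );

	// Step 2: any remaining backslash-prefixed character becomes itself.
	return preg_replace( '/' . preg_quote( $backslash, '/' ) . '(.)/u', '$1', $value );
}

$bs = chr( 92 );
var_dump( sketch_unquote_mysql_literal( "'it''s'" ) === "it's" );                       // bool(true)
var_dump( sketch_unquote_mysql_literal( "'a{$bs}{$bs}nb'" ) === 'a' . $bs . 'nb' );     // bool(true)
var_dump( sketch_unquote_mysql_literal( "'50{$bs}%'" ) === '50' . $bs . '%' );          // bool(true)
var_dump( sketch_unquote_mysql_literal( "'a{$bs}nb'", true ) === 'a' . $bs . 'nb' );    // bool(true)
```

The second case is the one the preserved `\\` entry protects: the input `a\\nb` must yield a backslash followed by `n`, not a newline.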
While I like that using `get_value()` is lazy and can generally nicely work for any token type where we need to interpret or normalize any values, I'm wondering how to solve the `NO_BACKSLASH_ESCAPES` SQL mode.

It's a very simple IF, but in the token instance, we just know nothing about SQL modes 🤔 The tokenizer knows it, so it could pass in a flag, or use a different token instance, but that makes it a bit less elegant.

Could it be a constructor argument? The mode is already determined when the token is created. If that was a boolean flag baked into the Token instance, we could still keep the `get_value()` method argument-less.

👍 Done in 6e5a8f5.
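
For readers following the thread, the constructor-flag idea can be sketched roughly as below. `Sketch_Token` and `Sketch_Tokenizer` are hypothetical stand-ins, not the real `WP_MySQL_Token` or lexer classes, and the escape table is deliberately abbreviated; the point is only that the tokenizer knows the SQL mode and bakes it into each token, so `get_value()` needs no arguments.

```php
<?php
// Hypothetical stand-in for a token; the real class carries more fields.
final class Sketch_Token {
	private $bytes;
	private $no_backslash_escapes;

	public function __construct( string $bytes, bool $no_backslash_escapes ) {
		$this->bytes                = $bytes;
		$this->no_backslash_escapes = $no_backslash_escapes;
	}

	// Argument-less: the SQL mode was captured when the token was created.
	public function get_value(): string {
		$quote = $this->bytes[0];
		$value = substr( $this->bytes, 1, -1 );
		if ( $this->no_backslash_escapes ) {
			return str_replace( $quote . $quote, $quote, $value );
		}
		// Abbreviated table for the sketch: only "\\", "\n", and doubled quotes.
		$backslash = chr( 92 );
		return strtr(
			$value,
			array(
				$backslash . $backslash => $backslash,
				$backslash . 'n'        => chr( 10 ),
				$quote . $quote         => $quote,
			)
		);
	}
}

// Hypothetical stand-in for a tokenizer that knows the active SQL modes.
final class Sketch_Tokenizer {
	private $no_backslash_escapes;

	public function __construct( array $sql_modes ) {
		$this->no_backslash_escapes = in_array( 'NO_BACKSLASH_ESCAPES', $sql_modes, true );
	}

	public function make_token( string $bytes ): Sketch_Token {
		return new Sketch_Token( $bytes, $this->no_backslash_escapes );
	}
}

$literal  = "'a" . chr( 92 ) . "nb'";
$default  = ( new Sketch_Tokenizer( array() ) )->make_token( $literal );
$no_slash = ( new Sketch_Tokenizer( array( 'NO_BACKSLASH_ESCAPES' ) ) )->make_token( $literal );

var_dump( $default->get_value() === "a\nb" );                   // bool(true)
var_dump( $no_slash->get_value() === 'a' . chr( 92 ) . 'nb' );  // bool(true)
```

The trade-off raised in the thread still applies: every token carries one extra boolean, in exchange for keeping value interpretation lazy and local to the token.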