CAST(VARCHAR as JSON) should escape Unicode characters #7118

@mbasmanova

Bug description

Velox currently doesn't escape Unicode characters when casting to JSON.

presto> select U&'\+01F64F';
 _col0
-------
 🙏
(1 row)

presto> select cast(U&'\+01F64F' as json);
     _col0
----------------
 "\uD83D\uDE4F"
(1 row)

Velox:

testCastToJson<StringView>(VARCHAR(), {"\U0001F64F"}, {"\"\\ud83d\\ude4f\""});

at 0: expected "\ud83d\ude4f", but got "🙏"

Velox uses folly::json::escapeString to cast strings to JSON. This function accepts configuration options, including:

  // If true, non-ASCII utf8 characters would be encoded as \uXXXX:
  // - if the code point is in [U+0000..U+FFFF] => encode as a single \uXXXX
  // - if the code point is > U+FFFF => encode as 2 UTF-16 surrogate pairs.
  bool encode_non_ascii{false};
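
The encoding rule described in that comment can be sketched in a few lines of standalone C++. This is only an illustration of the rule, not folly's or Velox's implementation: `appendEscapedCodePoint` and `escapeNonAscii` are hypothetical names, the input is assumed to be well-formed UTF-8, and ASCII bytes are copied through unchanged (a real JSON escaper would also handle quotes, backslashes, and control characters).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: append one code point as \uXXXX (lowercase hex).
// Code points above U+FFFF are written as a UTF-16 surrogate pair.
void appendEscapedCodePoint(uint32_t cp, std::string& out) {
  char buf[13]; // "\uXXXX\uXXXX" plus the terminating NUL
  if (cp <= 0xFFFF) {
    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(cp));
  } else {
    const uint32_t v = cp - 0x10000;
    const unsigned high = 0xD800 + (v >> 10);  // top 10 bits
    const unsigned low = 0xDC00 + (v & 0x3FF); // bottom 10 bits
    std::snprintf(buf, sizeof(buf), "\\u%04x\\u%04x", high, low);
  }
  out += buf;
}

// Hypothetical helper: decode UTF-8 and escape every non-ASCII code point.
// Assumes well-formed input; a truncated trailing sequence is dropped.
std::string escapeNonAscii(const std::string& utf8) {
  std::string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char c = static_cast<unsigned char>(utf8[i]);
    if (c < 0x80) { // ASCII: copy as-is
      out += static_cast<char>(c);
      ++i;
      continue;
    }
    // Number of continuation bytes implied by the lead byte.
    const int extra = c >= 0xF0 ? 3 : (c >= 0xE0 ? 2 : 1);
    if (i + extra >= utf8.size()) {
      break; // truncated sequence
    }
    // Payload bits of the lead byte: 5, 4, or 3 bits.
    uint32_t cp = c & (0x3F >> extra);
    for (int k = 1; k <= extra; ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
    }
    appendEscapedCodePoint(cp, out);
    i += extra + 1;
  }
  return out;
}
```

For U+1F64F (UTF-8 bytes F0 9F 99 8F), subtracting 0x10000 leaves 0xF64F; its top 10 bits (0x3D) and bottom 10 bits (0x24F) give the surrogates D83D and DE4F shown in the Presto output.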

With this option enabled, the only remaining difference from Presto is that folly::json::escapeString emits lowercase hex digits, while Presto emits uppercase.
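
One possible way to close the hex-case gap is to post-process the escaped string. The sketch below is an assumption about how that could look, not an existing folly or Velox function: `uppercaseUnicodeEscapes` is a hypothetical name, and the input is assumed to be an already-escaped JSON string in which every backslash starts a two-character escape or a `\uXXXX` escape.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical post-processing helper (not part of folly or Velox):
// uppercases the four hex digits of every \uXXXX escape in an
// already-escaped JSON string and leaves everything else untouched.
std::string uppercaseUnicodeEscapes(std::string json) {
  std::size_t i = 0;
  while (i < json.size()) {
    if (json[i] != '\\' || i + 1 >= json.size()) {
      ++i; // ordinary character, or a dangling trailing backslash
      continue;
    }
    if (json[i + 1] == 'u' && i + 5 < json.size()) {
      for (std::size_t k = i + 2; k < i + 6; ++k) {
        json[k] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(json[k])));
      }
      i += 6; // skip the whole \uXXXX escape
    } else {
      i += 2; // skip other escapes such as \\ or \" as a unit
    }
  }
  return json;
}
```

Skipping two-character escapes as a unit keeps an escaped backslash followed by literal text such as `u00e9` from being mistaken for a Unicode escape. Changing the digit case inside the escaper itself would avoid the second pass; this version only illustrates the transformation.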

CC: @zacw7 @aditi-pandit @amitkdutta @kagamiori @kevinwilfong

Related: FasterXML/jackson-core#717

System information

n/a

Relevant logs

No response

Labels

bug, triage
