Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Control characters in text field breaking usql output #509

Open
trantor opened this issue Dec 30, 2024 · 6 comments
Open

Control characters in text field breaking usql output #509

trantor opened this issue Dec 30, 2024 · 6 comments

Comments

@trantor
Copy link

trantor commented Dec 30, 2024

Hello @kenshaw .
I've been dealing with control characters that are present in a text field of a database I need to access.
As I can see in this code https://github.com/xo/tblfmt/blame/1af8a162785fd2d26eddb90fbd8ad9d407b3408d/fmt.go#L389 instead of being outputted literally and then, for instance, properly encoded in JSON output, they are rendered as, for instance \x1c for the U+001C character.
Apart from behaving differently from every other tool I've used on the database in question (they all output the literal character), it then completely breaks the JSON output, by putting in it the illegal \x sequence.
The other options in the switch block of the aforementioned code do not seem much better. Is there a way to just get the raw data in the output? I can't find one in the documentation.
There is also the fact that the various output format, JSON for instance, will encode characters in different ways (e.g. the surrogate pairs JSON uses for characters outside the BMP).
Also, silently modifying the contents of the data in an arbitrary way without the user being aware of it does seem an approach prone to nasty surprises for the user.

@kenshaw
Copy link
Member

kenshaw commented Jan 1, 2025

The formatter can be fixed, fairly easily, to accommodate this for JSON output. It would be helpful if you could share an example of the row(s) with the bad JSON data, and what other tools do. For the most part, it should be simply changing \x to \u escape codes.

@kenshaw
Copy link
Member

kenshaw commented Jan 2, 2025

I've pushed a change to tblfmt that should fix this issue for you. Can you please locally update tblfmt in the usql repo, and test against your data set:

$ git clone https://github.com/xo/usql.git && cd usql
$ go get github.com/xo/[email protected]
$ ./build.sh -b && ./usql

Let me know if this fixes all the issues for you. I'll tag a change with this shortly.

@trantor
Copy link
Author

trantor commented Jan 2, 2025

@kenshaw I had done much the same as a fix to have a working output. I will try your change as soon as I can manage. However I am wondering about the approach that doesn't differentiate between JSON output and others and does the encoding before choosing the output format.. I've also noticed that characters outside BMP are rendered as-is and there is no option to force ASCII-only output for JSON.

@trantor
Copy link
Author

trantor commented Jan 6, 2025

@kenshaw I've checked out the JSON output and it seems to be correct now.
What I don't understand is how one can get the output text as-is, i.e. raw without usql tampering with it.
Also, it's unclear to me what the rationale is to display \x, \u or \U escapes for character according specific unicode ranges, especially with no mention whatsoever of this choice anywhere in the documentation. And what happens with other machine-readable outputs that are not JSON?

@kenshaw
Copy link
Member

kenshaw commented Jan 6, 2025

JSON is a specific encoding standard -- IIRC, it's quite old (predating UTF-8), not friendly to non ASCII characters, and has some non-intuitive encoding requirements.

If you want "raw" character codes from your database, it would probably be better to use the database's actual client, and other tools to encode it to JSON. Alternately, you might want to try the CSV output.

@trantor
Copy link
Author

trantor commented Jan 6, 2025

@kenshaw I think I might not have expressed myself properly.
When I talked about raw output, I wasn't talking about JSON output.
I was talking about the text output. I noticed how the CSV output prints out the data as they actually are, which is good news.
However I don't understand the choice of those 3 different \x, \u or \U escapes in the text (and html) output (in what situation are those three used/useful?).
Also, I think html output should have character entities at least for non-printable characters.
As an aside, it would be nice to have JSON output with a choice whether to encode all non-ASCII characters through \u escapes or not. It's fine as it is now, mind you, since one can always process it with a tool like jq downstream, but it could be useful to have the choice in some situations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants