Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "stream did not contain valid UTF-8" using String::from_utf8_lossy #380

Merged
merged 1 commit into from
Sep 26, 2024

Conversation

krakow10
Copy link
Contributor

@krakow10 krakow10 commented Jan 6, 2024

I am using rbx_binary::from_reader(input) to load files in my rust program, and some of the files make the function spit out the error "stream did not contain valid UTF-8".

I decided that I'm proficient enough at Rust to find out how to fix this pesky issue myself, and here's what I've come up with. I found two places which would produce this error and have patched them using the String::from_utf8_lossy function.

Here is an image of an offending section of a Script.Source property after it was mercilessly butchered into valid UTF-8:
image

@krakow10 krakow10 requested a review from Dekkonot as a code owner January 6, 2024 07:32
@krakow10 krakow10 changed the title Fix "stream did not contain valid UTF-8" using String::from_utf8_lossy WIP: Fix "stream did not contain valid UTF-8" using String::from_utf8_lossy Jan 6, 2024
@krakow10
Copy link
Contributor Author

krakow10 commented Jan 6, 2024

I'm testing this and it didn't entirely fix the issue, only one case. I will keep investigating.

@krakow10 krakow10 marked this pull request as draft January 6, 2024 07:45
@Dekkonot
Copy link
Member

Dekkonot commented Jan 6, 2024

Hi, what problem are you running into and trying to fix? Right now I have no idea beyond that you're evidently having trouble with non-UTF8 names for Instances?

@krakow10
Copy link
Contributor Author

krakow10 commented Jan 6, 2024

^ Added String::from_utf8_lossy to Name property

I have .rbxl files which, when I decode with

rbx_binary::from_reader(input)

Throw this UTF-8 error.

@Dekkonot
Copy link
Member

Dekkonot commented Jan 6, 2024

Just to clarify, is there a reason why you want/need invalid UTF-8 for the name property on Instances?

We've had discussions about this before internally and it's unclear if it's actually beneficial to support names like that. If we were going to, we'd probably want to swap to using a string replacement like BStr over lossy conversions though.

@krakow10
Copy link
Contributor Author

krakow10 commented Jan 6, 2024

So far I have found the issue with both Name and Source property. My current project is that I'm converting the 1006 .rbxl files I downloaded off my place version history into git version history, but I've also run into this issue with map files for my actual game which I wanted to scan for scripts. So it's both reading files I created a very long time ago, and reading files that I didn't create.

@krakow10 krakow10 changed the title WIP: Fix "stream did not contain valid UTF-8" using String::from_utf8_lossy Fix "stream did not contain valid UTF-8" using String::from_utf8_lossy Jan 6, 2024
@krakow10
Copy link
Contributor Author

krakow10 commented Jan 6, 2024

All 1006 files have decoded with these changes! I now have a git history all the way back to 2015!

@krakow10 krakow10 marked this pull request as ready for review January 6, 2024 09:14
@krakow10
Copy link
Contributor Author

krakow10 commented Jan 6, 2024

I updated the original message with some contextual information, sorry about the initial lack of clarity.

@Dekkonot
Copy link
Member

Going to tag @kennethloeffler in so that he can share his thoughts on this. I don't want to leave you hanging on this and between us we should be able to make a final call on how we want to solve this problem.

@kennethloeffler
Copy link
Member

We've had a bit of discussion about this aspect of rbx-dom already: #313.

A lossy conversion is okay, but is... well, lossy. Roblox's implementation doesn't do any conversions whatsoever on String property contents, and diverging from that doesn't feel great to me. It could lead to busted results - for example, when Lua code expects a certain non UTF-8 instance name, when binary data is stored in a StringValue, etc.

I think I'm okay with doing a lossy conversion so rbx-dom can successfully-ish decode more files, as long as we document it and print a warning when we have to do it so that there are no murky surprises. But we should really consider bringing rbx-dom's String properties up to parity with Roblox's by representing them as Vec<u8> or similar. That is a breaking change, so we'll have to think about how it fits into the next major version.

Copy link
Member

@kennethloeffler kennethloeffler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import std::borrow::Cow at the top of rbx_binary/src/deserializer/state.rs and then apply the following suggestions. Also make sure to run rustfmt before you push - that's why the CI failed!

@kennethloeffler
Copy link
Member

Do we need to do anything to rbx_xml for this? @Dekkonot

@krakow10
Copy link
Contributor Author

How about making lossy string conversion an optional decoder setting for now? There are lots of maps on my game that simply don't work because of this, so I'm using this patched version. I can agree that effectively destroying strings by default can be seen as a bug that should not be relied on.

@kennethloeffler
Copy link
Member

How about making lossy string conversion an optional decoder setting for now? There are lots of maps on my game that simply don't work because of this, so I'm using this patched version. I can agree that effectively destroying strings by default can be seen as a bug that should not be relied on.

Meh, it's better than the behavior we have now (which is just to fall over and exit without producing a result). I think it should be fine until we properly support non UTF-8 String properties. Most of the time, these sorts of strings are present only because of "virus" instances, which no one wants or really cares about anyway.

Make sure to import std::borrow::Cow too, right now this branch will not compile

@Dekkonot
Copy link
Member

This needs an entry in the changelog but otherwise looks good to me.


Do we need to do anything to rbx_xml for this?

Unicode in general doesn't currently work in rbx_xml so no, but only because we're in a bad spot anyway. @kennethloeffler

@krakow10
Copy link
Contributor Author

krakow10 commented Feb 25, 2024

Squashed and rebased

@krakow10 krakow10 force-pushed the utf8_lossy branch 2 times, most recently from 061e32e to 3e4be0d Compare April 24, 2024 08:51
@krakow10
Copy link
Contributor Author

Added changelog

@krakow10
Copy link
Contributor Author

Does Roblox use UTF-16? I know there's a crate for that, maybe that's desirable instead of BStr.

@Dekkonot
Copy link
Member

No, Roblox uses UTF-8 encoding. It just also allows binary data to be put into strings. What we really need to do is not pretend that Roblox strings are always UTF-8. That's outside of scope for this PR though.

@Dekkonot
Copy link
Member

@kennethloeffler When you have a moment, can you take a look at this again? Turns out it needs your approval, haha.

@krakow10
Copy link
Contributor Author

Rebased onto master to resolve merge conflicts.

@Dekkonot Dekkonot merged commit 0f2edcf into rojo-rbx:master Sep 26, 2024
3 checks passed
@krakow10 krakow10 deleted the utf8_lossy branch September 30, 2024 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants