The Big UTF16 String Rewrite #5339
Here is an example of this:
var ba = new ByteArray();
ba.writeUTFBytes("example");
ba.writeByte(0);
ba.writeUTFBytes("test");
var s = ba.toString();
trace(s);
trace(s.length);
Traces:
Shows that the strings are null terminated
Oh, that's clever ^^
Well, if the length is still 12 it means that the string itself doesn't care about interior nulls, but that some other APIs do. If we want to reproduce this, we should check all the ways strings can be used, and look where
Yeah, you're right, I got that a bit wrong: they store the length and the underlying buffer separately, so it's not null terminated. I'm looking through AVMPlus, and it looks more like this. When we call toString, it makes a call to this function to create the string. The null-terminating part seems to only happen during the call to
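To make the distinction above concrete, here is a minimal Rust sketch of a string that stores its length alongside its buffer, next to a C-style view that stops at the first zero byte. `VmString` and `as_c_style` are hypothetical names for illustration only; this is not AVMPlus or Ruffle code.

```rust
/// Hypothetical stand-in for a VM string: the length comes from the buffer
/// itself, so interior zero bytes are ordinary content.
pub struct VmString {
    bytes: Vec<u8>,
}

impl VmString {
    pub fn new(bytes: &[u8]) -> Self {
        VmString { bytes: bytes.to_vec() }
    }

    /// Length as reported by the string itself: counts every byte.
    pub fn len(&self) -> usize {
        self.bytes.len()
    }

    /// What a consumer that treats the buffer as a C string would see:
    /// everything before the first zero byte (or the whole buffer if none).
    pub fn as_c_style(&self) -> &[u8] {
        let end = self
            .bytes
            .iter()
            .position(|&b| b == 0)
            .unwrap_or(self.bytes.len());
        &self.bytes[..end]
    }
}
```

With the bytes `example\0test`, `len()` reports 12 while `as_c_style()` yields only `example`, which matches the observation that the string keeps its full length and only some consumers truncate.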
Okay, this is ready for review, at last! I've renamed the custom string types to There are still improvements that could be made, but this PR is already big enough and I believe extra improvements are better made in future PRs:
Thanks for your massive effort on this! It's like we have a real VM :-)
My only high-level worry is the ergonomic hit from avoiding DSTs. It'd be a lot of churn re-adding all of the `&`s if things improve in the future, if we learn for certain that we can soundly use DST `wstr`s.
So I'm curious what your feeling is, or the Rust community's feeling in general. Is it expected to satisfy stacked borrows now for some potential future where it's finalized; or is it okay to do the unsafe-fu now for some potential future where it can be made sound?
Otherwise, it's mainly a bunch of style nits and some pre-existing issues.
- To simplify `if avm_str == WStr::from_units(b"literal")`, could we, say, `impl<'gc, const N: usize> PartialEq<&[u8; N]> for AvmString<'gc>`? Then hopefully all of those become `if avm_str == b"literal"`.
- Should `WStr` indices, lengths etc. be `u32` instead of `usize`? Not sure if this makes it more or less annoying to work with, but recently I've wanted to be more strict with matching Flash's types.
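As a rough illustration of the first suggestion, the sketch below implements `PartialEq` against byte-string literals for a stand-in type. `MyStr` is hypothetical and simply owns UTF-16 code units; reading the literal's bytes as Latin-1 is an assumption of this sketch, not something the PR states.

```rust
/// Hypothetical stand-in for `AvmString`; here it just owns UTF-16 code units.
pub struct MyStr {
    units: Vec<u16>,
}

impl MyStr {
    pub fn from_utf8(s: &str) -> Self {
        MyStr { units: s.encode_utf16().collect() }
    }
}

/// Lets `my_str == b"literal"` compile: each byte of the literal is widened
/// to a code unit before comparing (so bytes are read as Latin-1, an
/// assumption of this sketch).
impl<'a, const N: usize> PartialEq<&'a [u8; N]> for MyStr {
    fn eq(&self, other: &&'a [u8; N]) -> bool {
        self.units.len() == N
            && self
                .units
                .iter()
                .zip(other.iter())
                .all(|(&unit, &byte)| unit == u16::from(byte))
    }
}
```

Because `b"literal"` has type `&'static [u8; 7]`, the const generic `N` is inferred at each comparison site and no explicit conversion call is needed.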
Future thoughts you may already be considering:
- I'm sure there'll be more to pack into `WStr` eventually, so packing the wide bit into len on 32-bit may be a little overkill -- we'll probably want more bytes to work with anyway.
- Following from the above, it seems like there's a path to merging `WStr` and `AvmString` eventually. Static vs. GCd vs. owned are some more bits to pack in there.
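For readers unfamiliar with the idea of packing the wide bit into the length, here is a minimal sketch of one way it could look. The `PackedLen` type and its layout (flag in bit 31, length in the low 31 bits) are assumptions for illustration and are not taken from the PR's code.

```rust
/// Hypothetical packed header: bit 31 is the "wide" flag,
/// bits 0..=30 hold the length in code units.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PackedLen(u32);

const WIDE_BIT: u32 = 1 << 31;
/// Largest representable length: 2^31 - 1.
pub const MAX_LEN: u32 = WIDE_BIT - 1;

impl PackedLen {
    /// Returns `None` if `len` does not fit in 31 bits.
    pub fn new(len: u32, is_wide: bool) -> Option<Self> {
        if len > MAX_LEN {
            return None;
        }
        let flag = if is_wide { WIDE_BIT } else { 0 };
        Some(PackedLen(len | flag))
    }

    pub fn len(self) -> u32 {
        self.0 & MAX_LEN
    }

    pub fn is_wide(self) -> bool {
        (self.0 & WIDE_BIT) != 0
    }
}
```

Any further flags (static vs. GC'd vs. owned, as suggested above) would each take another bit from the length, which is the trade-off the comment is pointing at.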
`String` can contain null bytes, Flash strings cannot (not entirely sure? there's no way to create a `\0` from the AVM, but maybe loading a `\0` from the constant pool works?)
In SWF and AVM1 bytecode, the strings are null-terminated, so you can't inject a zero byte in the middle of a string. (DefineFont tag font name is the exception, as I've seen these with zero bytes in the wild).
In AVM2 bytecode, the strings are all length-prefixed, so you can have a zero byte in them. For example, you can do `var s = "AB\u0000CD";`. For display purposes such as trace, this will stop at the zero byte, but doing `"AB\0CD" == "AB"` will return false.
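The difference between the two encodings can be sketched in a few lines of Rust. Both functions are hypothetical helpers, and the one-byte length prefix is a simplification chosen for this example rather than the actual AVM2 constant-pool encoding.

```rust
/// Reads a null-terminated string: stops at the first zero byte,
/// or at the end of the input if there is none.
pub fn read_null_terminated(data: &[u8]) -> &[u8] {
    let end = data.iter().position(|&b| b == 0).unwrap_or(data.len());
    &data[..end]
}

/// Reads a length-prefixed string. Simplifying assumption: the prefix is a
/// single byte. Returns `None` if the input is shorter than the prefix says.
pub fn read_length_prefixed(data: &[u8]) -> Option<&[u8]> {
    let (&len, rest) = data.split_first()?;
    rest.get(..usize::from(len))
}
```

A null-terminated reader can never return a string containing a zero byte, whereas the length-prefixed reader returns whatever bytes fall inside the declared length, zero bytes included.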
Sorry for the length of the review, been working on it all day! Thanks for the patience.
@Herschel I addressed most of the review's comments, and left some explanations in response. I'll answer your more general remarks and questions in a later message ;).
Here's my take on this subject:
Personally, I think the unsoundness is "sufficiently-theoretic", and the ergonomic gains big enough that making
Good idea; should I rebase the PR to add this, or wait for another PR?
Personally, I see the 2³¹-length limitation as a sort of implementation detail, so I think keeping consistency with the rest of
My thinking here is that (Actually, there is one optimization possible: on 64-bit targets, there are 32 unused bits in
Let's go ahead and use the DST, and keep a close eye on how the situation goes with stacked borrows. I feel like it's more likely that we'll be able to guarantee soundness in the long term, and by being another project in the wild relying on this behavior, it adds some gentle pressure for the unsafe WG to provide a way to ensure soundness. Worst case, we can just switch over when it becomes necessary. If this seems like it'd be very tedious to change, I'd be okay with merging without the DST, and possibly changing later. I'll let you decide on this point.
This significantly improves readability, so let's go for it now. And cross fingers that custom literal suffixes come along soon. 😄
I see the argument about For comparison, avmplus bakes all of this into their one Thanks again for your work and patience. All of the changes LGTM!
Doing the
Well, grepping the codebase for
I've rebased on
I've made the change from
This allows `Str::{find, rfind, split}` to accept multiple types
This is a little tricky, because we have to map the utf8 indices returned by the regex engine to utf16 indices usable by Ruffle. To limit the impact on performance, the regex, the string we're currently matching on, and the last known (utf8, utf16) positions are cached, avoiding extra utf8 conversions in common use cases where a single string is repeatedly searched with increasing `lastIndex`.
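The index mapping described above can be sketched as follows. This is a simplified illustration, not the PR's implementation: `IndexCache` is a hypothetical type, it only caches the last position pair (not the regex or the string), and it assumes the caller always passes the same string and a byte offset that lies on a char boundary.

```rust
/// Hypothetical cache of the last known (utf8, utf16) position pair.
/// It is only valid for as long as the same string is being queried.
pub struct IndexCache {
    utf8: usize,
    utf16: usize,
}

impl IndexCache {
    pub fn new() -> Self {
        IndexCache { utf8: 0, utf16: 0 }
    }

    /// Converts a byte offset in `s` to a UTF-16 code unit offset.
    /// Panics if `utf8_index` is out of bounds or not on a char boundary.
    pub fn utf8_to_utf16(&mut self, s: &str, utf8_index: usize) -> usize {
        // A query before the cached position cannot reuse it: start over.
        if utf8_index < self.utf8 {
            self.utf8 = 0;
            self.utf16 = 0;
        }
        // Only the text between the cached position and the query is scanned.
        let added: usize = s[self.utf8..utf8_index]
            .chars()
            .map(char::len_utf16)
            .sum();
        self.utf8 = utf8_index;
        self.utf16 += added;
        self.utf16
    }
}
```

When a search advances monotonically through one string, as with repeated matches at increasing `lastIndex`, each call scans only the newly covered text instead of re-walking the string from the start.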
This has the nice side-effect of reducing string cloning, because we can just pass AvmStrings around instead.
Also remove some useless back-and-forth conversions between AvmString and String
This avoids converting the string to UTF8 if it can't possibly be a float
Note: This PR is based on #4285.
Currently, there's a mismatch between the type used by Ruffle to represent Flash strings (Rust's `String`) and the actual semantics of Flash strings.
The differences:
1. `String` is UTF8, but Flash strings are UTF16, with unpaired surrogates allowed
2. `String` can contain null bytes, Flash strings cannot (not entirely sure? there's no way to create a `\0` from the AVM, but maybe loading a `\0` from the constant pool works?)
3. `String` is unlimited in length (in practice), but Flash strings can't have more than 2³¹-1 chars.

This PR aims to fix 1 and 3 (but not 2) by introducing a new UTF16 string type, `ruffle_core::string::Str`, and by replacing all usages of `String` in AVM code with it.
This string type comes in several flavors:
- `Str<'a>` for immutable slices,
- `StrMut<'a>` for mutable slices,
- `StrBuf` and `BoxedStr` for owned strings, and
- `AvmString` for GC'd strings in AVM1/2.

These types store internally either a `[u8]` or a `[u16]`, depending on the string contents, and manage conversions between these two forms mostly transparently. This choice was made instead of only storing `[u16]` for two reasons:
- `[u16]` would be a waste of memory;
- `AvmString`s without copying the buffer contents. (Another option would be to add a proc-macro to embed UTF16 strings directly, e.g. `wstr!("foobar")` instead of `"foobar"`);

However, this comes with some drawbacks:
- `usize`s.
- `Str<'a>` and `StrMut<'a>`, instead of using a custom DST and having `&'a Str` and `&'a mut Str`.
- `Deref` or `DerefMut`, forcing us to duplicate string functions on all string types with a macro.
- `Str` DST, and avoid these two issues, but this requires some black magic that may be unsound, depending on the exact rules of the (future) Rust memory model regarding provenance (see https://github.com/moulins/ruffle/tree/ruffle-string-unsound for a proof-of-concept).

This PR isn't finished yet: the new string types are implemented (modulo some missing APIs), and are being used in some modules of `ruffle_core`, but many places still use `String`.
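The two-representation storage described above might look roughly like the following sketch. `Units` is a hypothetical type, not the PR's `Str`, and the rule used to pick the narrow form (every code point fits in one byte, i.e. U+0000 to U+00FF) is an assumption of this example.

```rust
/// Hypothetical two-representation buffer, loosely modelled on the
/// description above.
#[derive(Debug, PartialEq)]
pub enum Units {
    /// One byte per code unit (assumed here: code points U+0000..=U+00FF).
    Bytes(Vec<u8>),
    /// Full UTF-16 code units.
    Wide(Vec<u16>),
}

impl Units {
    /// Picks the narrow form when every char fits in a byte,
    /// and falls back to UTF-16 otherwise.
    pub fn from_utf8(s: &str) -> Self {
        if s.chars().all(|c| (c as u32) <= 0xFF) {
            // The check above guarantees the cast does not truncate.
            Units::Bytes(s.chars().map(|c| c as u8).collect())
        } else {
            Units::Wide(s.encode_utf16().collect())
        }
    }

    /// Length in code units, whichever form is stored.
    pub fn len(&self) -> usize {
        match self {
            Units::Bytes(b) => b.len(),
            Units::Wide(w) => w.len(),
        }
    }

    /// Code unit at index `i`, widened to `u16` so callers see one type.
    pub fn unit_at(&self, i: usize) -> Option<u16> {
        match self {
            Units::Bytes(b) => b.get(i).map(|&x| u16::from(x)),
            Units::Wide(w) => w.get(i).copied(),
        }
    }
}
```

Accessors such as `unit_at` hide which form is stored, which is one way to read "mostly transparently" in the description; the cost is a branch on every access.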