Skip to content

Baked data: use VarULE to store data when specified #6133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 49 commits into from
Feb 25, 2025

Conversation

sffc
Copy link
Member

@sffc sffc commented Feb 16, 2025

#5230

I did it by creating a trait, MaybeAsVarULE, which is implemented on data structs, similar to how Bake and Serialize are implemented on data structs.

The trait has an associated type that I can leverage in the baked exporter without having to do dyn Any downcasting.

I tried to keep my understanding of @robertbastian's position in mind while writing this PR (don't add anything hyper-specific to bake in the provider crate), and I think I achieved this, since the thing I added is on the same level as the existing format-specific traits.

See the hello world data for an example. I didn't apply it to the other data markers yet.

@sffc sffc requested review from Manishearth, robertbastian and a team as code owners February 16, 2025 00:55
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this design of sticking it on the export traits is good: it avoids cluttering core data marker infra whilst still providing extra metadata for export.

@sffc
Copy link
Member Author

sffc commented Feb 16, 2025

Created #6135 to reduce the diff

robertbastian pushed a commit that referenced this pull request Feb 17, 2025
Toward #5230

Reduces the diff in #6133

Not intended to be controversial.

Autosubmit = true
@sffc sffc requested a review from robertbastian February 17, 2025 18:02
@sffc sffc requested a review from Manishearth February 17, 2025 19:59
@sffc
Copy link
Member Author

sffc commented Feb 17, 2025

This PR is passing CI and ready from my perspective. I am unclear whether the open comments are bikeshed issues or fundamental design issues, and how you hope that they be resolved.

@Manishearth
Copy link
Member

@sffc From my side, the trait naming is a bikeshed that can be deferred, and the macro opt-in/opt-out is a concern I'd prefer solved here but is not a major design issue.

The overall design looks good.

Manishearth
Manishearth previously approved these changes Feb 18, 2025
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to land from my end, would prefer the trait be renamed, but can happen later.

@sffc sffc removed the request for review from echeran February 23, 2025 17:23
Comment on lines 12 to 14
/// Some data structs can be represented compactly as a single [`VarULE`],
/// such as `str` or a packed pattern. This trait allows for data providers
/// to use storage optimizations for such types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: you frame this as "storage optimizations", but that's not why we're doing this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this is a compile time improvement, at the expense of runtime performance

I have not seen any data that this reduces storage size

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a storage optimization that reduces compile times and binary size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the word "storage". The point of this remark is that it can be used for optimizations at the discretion of the data provider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed in the WG meeting, I previously shared what I consider sufficient preliminary data in the main issue #5230, and more data can be gathered once this lands and is integrated into places such as datetime.

@sffc sffc requested a review from robertbastian February 24, 2025 14:51
@robertbastian robertbastian dismissed their stale review February 24, 2025 15:15

I still think this is incomplete

@sffc
Copy link
Member Author

sffc commented Feb 25, 2025

There have been over 100 comments on this PR, all of which have been resolved, except possibly for the open question about data collection. Anything else blocking this from landing?

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine to merge for now, but I consider benchmarking this 2.0 blocking

@robertbastian
Copy link
Member

can you post the semver diffs that this locks us in to?

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see this landing!

@Manishearth Manishearth merged commit dabb344 into unicode-org:main Feb 25, 2025
28 checks passed
@sffc sffc deleted the maybe-as-varule branch February 26, 2025 14:39
@sffc
Copy link
Member Author

sffc commented Feb 26, 2025

can you post the semver diffs that this locks us in to?

Changed APIs:

  1. impl<M> UpcastDataPayload<M> for ExportMarker gains a bound on MaybeEncodeAsVarULE

New APIs:

  1. DataPayload<ExportMarker> gains a pub fn tokenize_encoded_seq
  2. impl<'a> ZeroFrom<'a, str> for HelloWorld<'a>
  3. New traits icu_provider::ule::MaybeAsVarULE, icu_provider::ule::MaybeEncodeAsVarULE
  4. New struct icu_provider_baked::zerotrie::DataForVarULEs
  5. New macro icu_provider::data_struct_new (I think we should have this macro regardless)
  6. unsafe impl ULE for () (something we can/should have regardless)

sffc added a commit that referenced this pull request Feb 26, 2025
@sffc sffc mentioned this pull request Feb 26, 2025
Manishearth pushed a commit that referenced this pull request Feb 26, 2025
@Manishearth
Copy link
Member

unsafe impl ULE for () (something we can/should have regardless)

I actually disagree with this one and have a mild preference for using a custom ZST here, but didn't think it that big a deal. ZSTs are weird and there are multiple ways one might expect a () ZST to behave here.

Happy to make that change if people agree.

@robertbastian
Copy link
Member

impl UpcastDataPayload for ExportMarker gains a bound on MaybeEncodeAsVarULE

This has given me some confusing errors while migrating this. The place where this will fail if the macro hasn't been called is in impl_dynamic_data_provider inside a registry callback in icu_provider_source. If we add the bound on DataMarker::DataStruct, it should give much clearer error messages, and it's the natural place for this, if we say data_struct! needs to be called on each data struct.

@sffc
Copy link
Member Author

sffc commented Feb 28, 2025

unsafe impl ULE for () (something we can/should have regardless)

I actually disagree with this one and have a mild preference for using a custom ZST here, but didn't think it that big a deal. ZSTs are weird and there are multiple ways one might expect a () ZST to behave here.

Happy to make that change if people agree.

What are the multiple ways one might expect impl ULE for () to behave?

The one I implemented was "Ok if the bytes are empty, Err if they are nonempty". I guess one could also implement "Always Err"?

@Manishearth
Copy link
Member

Yes, effectively. It's the difference between whether people plan to actually use it or just have it be a sentinel.

@sffc
Copy link
Member Author

sffc commented Mar 1, 2025

My intuition is that since &() is a valid thing to have, then we should allow it, so the ULE should allow empty bytes to represent ().

@Manishearth
Copy link
Member

Even &() is mostly "valid" in the sense that it's a thing that makes sense in some generic contexts: it would be silly to exclude () from generic contexts just because its references are strange. But the ULEverse already excludes lots of types, its driven by explicit inclusion.

I don't really see a purpose for (): ULE beyond as a placeholder in some generic contexts, and there I see two ways for it to work, as a "never" type or as a stand in similar to how Vec<()> is basically just a usize. For that, I'd prefer having an explicit type that makes it clear what's going on. When I see (): ULE I don't have much clarity at all.

We ended up releasing this in zerovec anyway (I didn't want to further block landing this) so there's not much to do about undoing it without a breaking release.

But basically: I definitely do not think this impl is "something we can/should have regardless", and in the medium term I'd like to deprecate it.

sffc added a commit that referenced this pull request Mar 4, 2025
Follow-up from #6133

Originally discussed in #1180

We already say this in the `VarULE` docs. I am adding additional docs
about the EncodeAsVarULE/ZeroFrom parallelism in the zerovec universe on
the docs of those specific traits, too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants