Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for outputting integers as hexadecimal #812

Open
MatthewMorozov opened this issue Nov 15, 2024 · 10 comments
Open

Add support for outputting integers as hexadecimal #812

MatthewMorozov opened this issue Nov 15, 2024 · 10 comments
Labels
A-edit Area: TOML editing API A-output Area: Outputting TOML C-enhancement Category: Raise on the bar on expectations

Comments

@MatthewMorozov
Copy link

When using serde to serialize a data structure containing integers (u32, i16, etc) to TOML format, I would like the fields to be written in the generated TOML file in hexadecimal format. TOML format supports hexadecimal numbers, and this feature would make it easier to read fields such as 32-bit addresses. This may could be an option such as #[serde_with_format = "{:0X}"] above integer-type fields.

@epage epage added C-enhancement Category: Raise on the bar on expectations A-output Area: Outputting TOML A-edit Area: TOML editing API labels Nov 15, 2024
@epage
Copy link
Member

epage commented Nov 15, 2024

Some prior discussion at #781.

I don't see us supporting customization of formatting through serde. Serde doesn't natively support it. To support this, we'd need a newtype that serializes to a struct that toml recognizes as an integer with formatting. We'd support deserializing from either the struct or a number. This has some odd cases in serde, particularly that its not compatible with any other serialization format. Not just that the formatting gets ignored but that the number won't be sererialized correctly.

We can support this in toml_edit and people could deserialize to a DocumentMut and then apply a custom formatter. We'd just need to work out the API for customizing an integer's repr.

@epage
Copy link
Member

epage commented Mar 3, 2025

In #781, I mention the following API ideas

  • Separate APIs for each. When applying botth, we'd either need to store the format style (would rather not do this) or parse the repr to see what was done and re-generate
  • A single format call for integers, taking in a non-exhaustive IntegerFormat
  • A single format call for any scalar, taking in a non-exhaustive ScalarFormat. This could then store all the settings for string formatting (see Add a way to customize the literal-ness (single/double quotes) of a string #590), etc

#837 was posted with Formatted<i64>::new_hex_lower, etc. This has the benefit of communicating whether the formatting was applied. It has the downsides of

  • Requires creating a new formatted
  • No Intos for Formatted<i64>
  • No indication of how we would handle digit separator

Should we also allow reflecting on what the current style is?

@pastelmind
Copy link

pastelmind commented Mar 3, 2025

Hi. I submitted #837 and would like to offer my opinions on the matter:

  • Should we let the user directly generate the string representation of an integer?
    • Probably not. If we did, we would have to manually verify whether the repr is valid and matches the inner i64 value. Alternatively, we concede that the user-provided repr may be invalid. Neither seems desirable.
  • Should we allow choosing between uppercase/lowercase for hex numbers?
    • Yes. Emitting all-uppercase or all-lowercase is a common requirement. It would likely be the most requested feature for hexadecimal numbers, and is fairly trivial to implement, making it a low-hanging fruit.
  • Should we allow arbitrary case? (e.g. 0xDEADbeef)
    • I don't see value in this. Demand for this feature would be low, and implementing it would be more difficult.
  • Should we allow customization of digit separators?
    • I don't know. I anticipate moderate demand for this feature. However, I don't believe there exists a set of separator styles that covers the majority of use cases. Do users want 1234_5678, 12_34_56_78, or 12_345_678?
    • Implementing this would require effort, too. std::fmt does not support this, so we'd have to use something like num-format or build something ourselves.
    • Note that digit separators are technically tangential to choosing a numeral system (hex/oct/bin), as all of them can have digit separators. We may also have to consider allowing digit separators in floating-point numbers.

As for the API design itself, I have little to offer. In #837 I chose to add Formatted::new_hex_upper(), new_oct(), etc. because it caused less friction than redesigning the toml_edit AST. An IntegerFormat or StringFormat object--involving a builder, perhaps?--sounds good, too. A single ScalarFormat that covers multiple Value types feels ambitious.

Personally, I would love to see that hexadecimal/octal/binary numbers are supported first, and then try to come up with a solution for the more complex formatting decisions (digit separators, single/double-quote strings, etc.). Getting a perfect solution should not preclude a workable one.

@epage
Copy link
Member

epage commented Mar 3, 2025

Should we allow customization of digit separators?

  • I don't know. I anticipate significant demand for this feature. However, I don't believe there exists a set of separator styles that covers the majority of use cases. Do users want 1234_5678, 12_34_56_78, or 12_345_678?

In your examples, they are always grouped by a standard convention. For example, hex is usually grouped by 4s while decimal is usually grouped by 3s, starting from the right.

This could easily be handled by just having a "group by" option. This can trivially be implemented by post-processing a string (starting from the back, insert a _ after every N digits).

Note that digit separators are technically tangential to choosing a numeral system (hex/oct/bin), as all of them can have digit separators. We may also have to consider allowing digit separators in floating-point numbers.

...

Personally, I would love to see that hexadecimal/octal/binary numbers are supported first, and then try to come up with a solution for the more complex formatting decisions (digit separators, single/double-quote strings, etc.). Getting a perfect solution should not preclude a workable one.

Yes but how do you select more than one? If I say "group in 4's" and then say "format as hex", the latter will overwrite the first. If we are ok with being order dependent, then we could re-parse the repr, check the base, and then handle it.

This isn't about being perfect but whether the design scales.

@pastelmind
Copy link

In your examples, they are always grouped by a standard convention. For example, hex is usually grouped by 4s while decimal is usually grouped by 3s, starting from the right.

This could easily be handled by just having a "group by" option. This can trivially be implemented by post-processing a string (starting from the back, insert a _ after every N digits).

The Indian numbering system (mentioned in TOML 1.0 spec!) groups the rightmost three digits, then each two digits afterwards. Likewise, when representing USD in number of cents, one may wish to group the two rightmost digits, then each three digits afterwards.

12_34_56_789  # Indian
1_234_567_89  # USD in cents

The point is, coming up with a solution that satisfies enough users is difficult.

Yes but how do you select more than one? If I say "group in 4's" and then say "format as hex", the latter will overwrite the first. If we are ok with being order dependent, then we could re-parse the repr, check the base, and then handle it.

This isn't about being perfect but whether the design scales.

I don't understand what you mean by "overwrite"? I was under the impression that Value and Formatted<T> are generally immutable (not accounting for decor) for users of the toml_edit crate. If I choose "group in 4's" to generate a Value, how can I later overwrite it with "format as hex"?


Continuing the NumberFormat idea: I suppose we could create opaque builders like IntegerFormat, StringFormat, etc. These might be used like this:

let formatter = IntegerFormat::new()
    .hex_upper()               // or hex_lower(), oct(), bin()
    .separator( /* ... */ )
    .other_options_we_may_add_later();
let v: Value = formatter.try_format(123i64)?;

let formatter = StringFormat::new()
    .basic()                   // or basic_multiline(), literal(), literal_multiline()
    .other_options_we_may_add_later();
let v: Value = formatter.try_format("Hello,\nworld!")?;

@epage
Copy link
Member

epage commented Mar 3, 2025

I don't understand what you mean by "overwrite"? I was under the impression that Value and Formatted are generally immutable (not accounting for decor) for users of the toml_edit crate. If I choose "group in 4's" to generate a Value, how can I later overwrite it with "format as hex"?

Decor is mutable and the Repr can be as well, to a limited degree. The main constraint is keeping the Repr and the actual value in-sync. Any mutation that supports that is fine to do.

Even if you treat it as "visibly immutable", that leads to builder methods like Formatted::new(5).to_hex_lower().to_digits(3) but internally, it will be mutating things and each stage will re-do the formatting of the previous stage.

@pastelmind
Copy link

pastelmind commented Mar 3, 2025

Decor is mutable and the Repr can be as well, to a limited degree. The main constraint is keeping the Repr and the actual value in-sync. Any mutation that supports that is fine to do.

Even if you treat it as "visibly immutable", that leads to builder methods like Formatted::new(5).to_hex_lower().to_digits(3) but internally, it will be mutating things and each stage will re-do the formatting of the previous stage.

I see. In which case a builder would still work and avoid internal mutation of the repr. To clarify, by "builder" I meant a builder for a IntegerFormat, not a builder for a Formatted (or a builder for a Value).

// Note: I am not calling methods on Value itself,
//       but rather the IntegerFormat type.
let formatter = IntegerFormat::new()
    .hex_upper()
    .separator( /* ... */ )
    .other_options_we_may_add_later();

// Generate value and repr in single step.
// no need to worry about order of operations.
let v: Value = formatter.try_format(123i64)?; // formatted as 0x7B

@epage
Copy link
Member

epage commented Mar 4, 2025

I don't see the formatter being a factory for Values. That feels out of place and incongruous with how people will be using all of this.

The most likely API I see is Formatted<i64> having a &mut self method for applying formatting. This would work well with VisitMut.

I can also see having a chained self method on Formatted<i64> for easy creation from scratch. I can see this include From impls to make easy creation of a Value or an Item and to work naturally with our APIs that effectively accept impl Into<Item> or impl Into<Value>.

One question is how do we report that a format didn't apply.

  • For &mut self, it might be a must_use bool?
  • For self, likely Result<Self, Self>

@pastelmind
Copy link

I don't see the formatter being a factory for Values. That feels out of place and incongruous with how people will be using all of this.

It might be better to incrementally build a separate IntegerFormatOptions object and pass it to a method that performs the formatting in one step:

let mut f = Formatted::new(123i64);

let opts = IntegerFormatOptions::new()
    .to_hex_lower()
    .separator(/* ... */);
f.format(opts)?;

One question is how do we report that a format didn't apply.

  • For &mut self, it might be a must_use bool?
  • For self, likely Result<Self, Self>

Is there precedent for a &mut self method that reports success/failure as bool? It is ultimately your call to make, of course 🤷

IMO the only scenarios I can think of where formatting would fail are:

  • Attempting to format a negative number as hex/oct/decimal

Regarding your earlier question:

Should we also allow reflecting on what the current style is?

I feel we could add those separately from the formatting API. For now, the user can directly examine the current repr with display_repr().

@epage
Copy link
Member

epage commented Mar 4, 2025

It might be better to incrementally build a separate IntegerFormatOptions object and pass it to a method that performs the formatting in one step:

I'd just call that IntegerFormat.

Is there precedent for a &mut self method that reports success/failure as bool? It is ultimately your call to make, of course 🤷

"failure" is relative. People might be fine with a best effort.

This wouldn't be too different than HashSet::insert.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-edit Area: TOML editing API A-output Area: Outputting TOML C-enhancement Category: Raise on the bar on expectations
Projects
None yet
Development

No branches or pull requests

3 participants