-
-
Notifications
You must be signed in to change notification settings - Fork 18
Protobuf Formatter #85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| private const FULL32 = 0xffffffff; | ||
| private const INT32_MAX = '4294967296'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private const FULL32 = 0xffffffff; | |
| private const INT32_MAX = '4294967296'; | |
| private const Full32 = 0xffffffff; | |
| private const Int32Max = '4294967296'; |
(And wherever those are used.)
Yes that's not standard, but I find PascalCase constants far easier to work with.
| } | ||
| } | ||
|
|
||
| throw new \RuntimeException("Unexpected end of data while parsing varint"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be more specific, custom exceptions, and use a create() method that saves the context data to properties. See the other exceptions in Serde for examples.
|
|
||
| namespace Crell\Serde\Formatter\Protobuf; | ||
|
|
||
| class RunningValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably have a somewhat more descriptive name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense for this to be a streaming formatter? It's binary and writing linearly. Or have one of each, with shared traits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense for this to be a streaming formatter?
That's a pretty brilliant idea. It can certainly be one.
| return new RunningValue(); | ||
| } | ||
|
|
||
| public function serializeFinalize(mixed $runningValue, ClassSettings $classDef): string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can include a more precise type for $runningValue in a docblock. Please do so. (See the other formatters for examples.)
| switch ($typeField) { | ||
| case null: | ||
| if(!mb_check_encoding($next, 'UTF-8')) { | ||
| throw new \RuntimeException('Cannot encode non-utf8 string as protobuf'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment on exceptions. (And elsewhere.)
| if (extension_loaded('bcmath')) { | ||
| $bytes = bcmod($next, self::INT32_MAX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's new in 8.4, but would it make sense to use Number here instead? (I'm OK with the protobuf formatter specifically requiring PHP 8.4.)
| return $runningValue; | ||
| } | ||
|
|
||
| $typeField = $field->typeField === null ? null : get_class($field->typeField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like there has got to be a better way of doing this. I'm not sure what, but this feels so fugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random thought, possibly wrong: TypeFields are really for the Handler layer. If we have format-specific extended information, then perhaps it makes sense to have a FormatField that's parallel to it, as noted in the previous issue. Then there can be a Protobuf-specific field, one of whose properties is type, which is an enum of all of the new attributes. That would make this check simpler, and not be confusing as we wouldn't have a dozen type fields that only do something if using Protobuf, and therefore preclude using the typefield for something else. (Every field can have only one typefield attribute.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh; the original does something like this so that each attribute is actually an enum and it is handled as a match. That is probably why it looks kinda weird. It requires special handling, though... the common implementation of Serde won’t work with it (calling "newInstance" on a constant is an Error). I’m not sure if you would be interested in that approach, but it does clean up the code quite nicely (especially when emulating tagged unions). Looking at the original code, it needs some support in the attribute-utils package.
Now that I think about it, I might propose using enums in attributes as a PHP RFC. Then it won't feel so hacky in a few years.
If we have format-specific extended information, then perhaps it makes sense to have a FormatField that's parallel to it
This makes a ton of sense. It would also probably be useful for other rich formats like xml/soap.
a dozen type fields that only do something if using Protobuf,
TypeFields are really for the Handler layer
TBH, they are implemented in the handler layer of the larger implementation because we use two different serializers -- one for protobuf and another for json. I'm reimplementing them in the formatter because it makes using SerdeCommon possible for protobufs. Otherwise, you have to create a custom Serde with a whole slew of handlers. One solution might be to move this back to the handlers, then have SerdeCommon be for text-based formats, and a SerdeBinary wired up to be used for binary formats. For other binary formats (eg., msgpack and friends), you'll probably need most of these types as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh; the original does something like this so that each attribute is actually an enum and it is handled as a match.
Oh sweet lord, I am both impressed and horrified. 😅 Handling this natively somehow sounds much better than trying to do it in user-space. (Though it would probably be possible to bake it into AttributeUtils, I don't know that it's wise to do so.)
TBH, they are implemented in the handler layer of the larger implementation because we use two different serializers -- one for protobuf and another for json. I'm reimplementing them in the formatter because it makes using SerdeCommon possible for protobufs. Otherwise, you have to create a custom Serde with a whole slew of handlers. One solution might be to move this back to the handlers, then have SerdeCommon be for text-based formats, and a SerdeBinary wired up to be used for binary formats. For other binary formats (eg., msgpack and friends), you'll probably need most of these types as well.
Well, the intent for Serde is deliberately that you can easily assemble whatever combination of (de)formatters and [im|ex]porters you want, as long as they obey the interfaces. SerdeCommon is just that: My estimation of what most typical use cases will cover. But there's absolutely nothing preventing you from wiring up your own that uses all custom handlers and formatters (using Serde) or making your own subclass of Serde that does so. That's an explicit feature. I've actually been pondering doing a "high performance" set of imp/exp handlers that assume an all-at-once-array and only simple transformations and so can skip about 40 layers of nested function calls. So a SerdeBinary is entirely reasonable, whether it's first-party provided or not.
There's 2 catches, though:
- The Formatter interface assumes a JSON-esque set of export types. That's sufficient for most any text-based serialization format, but possibly not binary, it seems. However, that list wasn't designed to be extensible. While technically you could add an extra interface, like
BinaryFormatter, that adds more format methods, theSerializerclass wouldn't know about it, so it would still only be typed asFormatter. Changing that puts us squarely into 2.0 territory. (And I would want to do a 2.0 of AttributeUtils that adds some more flexibility using PHP 8.4 before I did that, and I'm not sure exactly all of what I want to do there.) - As mentioned,
TypeFieldis singular per field. That means specifying, say,#[UInt8]would be incompatible with also specifying#[PositiveInt]or#[DoctrineId](something I know another Serde user is doing). That effectively makes an class using such attributes format-dependent, whereas the intent is for it to always be format independent. Hence why I've been consideringFormatFieldas a per-format multi-value alternative.
So it sounds like doing it "right" would involve going all-in on Serde as a framework. I'm not entirely sure what all that entails, though, and there's enough users that backward compatibility becomes a factor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've actually been pondering doing a "high performance" set of imp/exp handlers that assume an all-at-once-array and only simple transformations and so can skip about 40 layers of nested function calls.
If you think about it, a JSON/array output is basically just text ... or if you squint at it, can be represented as a template. And PHP is really good at templates. 😈
The hard part is sanitization and serialization to the template which this library has all the plumbing for; now that I think about it.
The Formatter interface assumes a JSON-esque set of export types. That's sufficient for most any text-based serialization format, but possibly not binary, it seems. However, that list wasn't designed to be extensible.
I’m not even sure it could be considered complete even if you tried. MsgPack has (u)int8/16/32 etc. MD5 is a uint128, and sha256 is a uint256. Some can be represented in PHP without any information loss; while others have to be stored as a binary string or can be partially represented (i.e. a uint32 can’t be fully represented on a 32-bit machine, but it can be fully stored — it is just that the second half of the range is represented as a negative number; but it can be fully represented on a 64-bit machine).
So, you would probably need some kind of interface to represent a type conversion: TypeConversion that can be specified by a formatter. So, JSON would use IntToString to capture the integer in a string representation, which would then get passed to an appropriate template handler which would handle sanitization/encoding
{"id": $id, "name": "$name", "balance": $float}where in protobuf it would use different type conversions but look somewhat similar in a template form:
$idTag$id$nameTag$nameLen$name$balanceTag$floatIn some ways it is kind of unfortunate that we have json_encode — it is too darn powerful.
Using json_encode, you would probably use an "Identity" TypeConversion because json_encode will handle the entire pipeline.
Sorry — I’m totally off topic here, but I feel like I spent so long writing it down that I should send it anyway.
As mentioned, TypeField is singular per field. That means specifying, say, #[UInt8] would be incompatible with also specifying #[PositiveInt] or #[DoctrineId]
I did notice that; but hadn’t thought through the implications. That is not ideal...
I suspect that isn’t an easy-ish fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making TypeField multi-value would be... invasive. It would break a lot of assumptions, and ask a lot of questions along the lines of "so which one do I use right now?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be enough to pass on the raw ReflectionProperty so tooling can access the raw attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReflectionProperty is not serializable. It's exposed to the Property attribute, which can pull out the pieces it wants. (This is AttributeUtils functionality.) But once that's cached, Reflection is not called again.
| return $runningValue; | ||
| } | ||
|
|
||
| public function serializeString(mixed $runningValue, Field $field, ?string $next): mixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's enough going on here that the switch cases should be split up to their own private methods. (Which would probably enable using a match() instead.)
|
And of course the whole thing needs gobs of tests. |
Description
This contains and merges with #84.
This adds the formatter for protobuf and provides serialization for all the different types introduced in #84. Each type has slightly different encoding which needs to be correct for other systems to communicate with PHP. The only issue is that PHP doesn’t support unsigned integers. This means that any negative number in PHP going to the unsigned integer types is not actually negative (see: twos-complement), but the "sign-bit" is used as part of the number. Further, this support also needs to function on 32-bit systems; thus it allows using "large numbers" in a string — assuming the developer is using
bcmathorgmpfor large integers.Motivation and context
Protobuffers are a binary format for storing structured data, and Serde is well suited for this.
How has this been tested?
This is from a larger work and I’m migrating it out. It’s battletested, but formal tests will come when I finish porting the deformatter.
Screenshots (if appropriate)
Types of changes
What types of changes does your code introduce? Put an
xin all the boxes that apply:Checklist:
Go over all the following points, and put an
xin all the boxes that apply.Please, please, please, don't send your pull request until all of the boxes are ticked. Once your pull request is created, it will trigger a build on our continuous integration server to make sure your tests and code style pass.
If you're unsure about any of these, don't hesitate to ask. We're here to help!