Protobuf Formatter #85

withinboredom · 2025-05-30T17:36:21Z

Description

This contains and merges with #84.

This adds the formatter for protobuf and provides serialization for all the different types introduced in #84. Each type has slightly different encoding which needs to be correct for other systems to communicate with PHP. The only issue is that PHP doesn’t support unsigned integers. This means that any negative number in PHP going to the unsigned integer types is not actually negative (see: twos-complement), but the "sign-bit" is used as part of the number. Further, this support also needs to function on 32-bit systems; thus it allows using "large numbers" in a string — assuming the developer is using bcmath or gmp for large integers.

Motivation and context

Protobuffers are a binary format for storing structured data, and Serde is well suited for this.

How has this been tested?

This is from a larger work and I’m migrating it out. It’s battletested, but formal tests will come when I finish porting the deformatter.

Screenshots (if appropriate)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

Go over all the following points, and put an x in all the boxes that apply.

Please, please, please, don't send your pull request until all of the boxes are ticked. Once your pull request is created, it will trigger a build on our continuous integration server to make sure your tests and code style pass.

I have read the CONTRIBUTING document.
My pull request addresses exactly one patch/feature.
I have created a branch for this patch/feature.
Each individual commit in the pull request is meaningful.
I have added tests to cover my changes.
If my change requires a change to the documentation, I have updated it accordingly.

If you're unsure about any of these, don't hesitate to ask. We're here to help!

Crell · 2025-05-30T20:43:29Z

src/Formatter/ProtobufFormatter.php

+    private const FULL32 = 0xffffffff;
+    private const INT32_MAX = '4294967296';


Suggested change

private const FULL32 = 0xffffffff;

private const INT32_MAX = '4294967296';

private const Full32 = 0xffffffff;

private const Int32Max = '4294967296';

(And wherever those are used.)

Yes that's not standard, but I find PascalCase constants far easier to work with.

Crell · 2025-05-30T20:46:57Z

src/Formatter/Protobuf/RunningValue.php

+            }
+        }
+
+        throw new \RuntimeException("Unexpected end of data while parsing varint");


These should be more specific, custom exceptions, and use a create() method that saves the context data to properties. See the other exceptions in Serde for examples.

Crell · 2025-05-30T20:47:29Z

src/Formatter/Protobuf/RunningValue.php

+
+namespace Crell\Serde\Formatter\Protobuf;
+
+class RunningValue


This should probably have a somewhat more descriptive name.

Would it make sense for this to be a streaming formatter? It's binary and writing linearly. Or have one of each, with shared traits?

Would it make sense for this to be a streaming formatter?

That's a pretty brilliant idea. It can certainly be one.

Crell · 2025-05-30T20:48:42Z

src/Formatter/ProtobufFormatter.php

+        return new RunningValue();
+    }
+
+    public function serializeFinalize(mixed $runningValue, ClassSettings $classDef): string


You can include a more precise type for $runningValue in a docblock. Please do so. (See the other formatters for examples.)

Crell · 2025-05-30T20:51:45Z

src/Formatter/ProtobufFormatter.php

+        switch ($typeField) {
+            case null:
+                if(!mb_check_encoding($next, 'UTF-8')) {
+                    throw new \RuntimeException('Cannot encode non-utf8 string as protobuf');


Same comment on exceptions. (And elsewhere.)

Crell · 2025-05-30T20:52:38Z

src/Formatter/ProtobufFormatter.php

+                if (extension_loaded('bcmath')) {
+                    $bytes = bcmod($next, self::INT32_MAX);


I know it's new in 8.4, but would it make sense to use Number here instead? (I'm OK with the protobuf formatter specifically requiring PHP 8.4.)

Crell · 2025-05-30T20:55:59Z

src/Formatter/ProtobufFormatter.php

+            return $runningValue;
+        }
+
+        $typeField = $field->typeField === null ? null : get_class($field->typeField);


I feel like there has got to be a better way of doing this. I'm not sure what, but this feels so fugly.

Random thought, possibly wrong: TypeFields are really for the Handler layer. If we have format-specific extended information, then perhaps it makes sense to have a FormatField that's parallel to it, as noted in the previous issue. Then there can be a Protobuf-specific field, one of whose properties is type, which is an enum of all of the new attributes. That would make this check simpler, and not be confusing as we wouldn't have a dozen type fields that only do something if using Protobuf, and therefore preclude using the typefield for something else. (Every field can have only one typefield attribute.)

Heh; the original does something like this so that each attribute is actually an enum and it is handled as a match. That is probably why it looks kinda weird. It requires special handling, though... the common implementation of Serde won’t work with it (calling "newInstance" on a constant is an Error). I’m not sure if you would be interested in that approach, but it does clean up the code quite nicely (especially when emulating tagged unions). Looking at the original code, it needs some support in the attribute-utils package.

Now that I think about it, I might propose using enums in attributes as a PHP RFC. Then it won't feel so hacky in a few years.

If we have format-specific extended information, then perhaps it makes sense to have a FormatField that's parallel to it

This makes a ton of sense. It would also probably be useful for other rich formats like xml/soap.

a dozen type fields that only do something if using Protobuf,

TypeFields are really for the Handler layer

TBH, they are implemented in the handler layer of the larger implementation because we use two different serializers -- one for protobuf and another for json. I'm reimplementing them in the formatter because it makes using SerdeCommon possible for protobufs. Otherwise, you have to create a custom Serde with a whole slew of handlers. One solution might be to move this back to the handlers, then have SerdeCommon be for text-based formats, and a SerdeBinary wired up to be used for binary formats. For other binary formats (eg., msgpack and friends), you'll probably need most of these types as well.

Heh; the original does something like this so that each attribute is actually an enum and it is handled as a match.

Oh sweet lord, I am both impressed and horrified. 😅 Handling this natively somehow sounds much better than trying to do it in user-space. (Though it would probably be possible to bake it into AttributeUtils, I don't know that it's wise to do so.)

TBH, they are implemented in the handler layer of the larger implementation because we use two different serializers -- one for protobuf and another for json. I'm reimplementing them in the formatter because it makes using SerdeCommon possible for protobufs. Otherwise, you have to create a custom Serde with a whole slew of handlers. One solution might be to move this back to the handlers, then have SerdeCommon be for text-based formats, and a SerdeBinary wired up to be used for binary formats. For other binary formats (eg., msgpack and friends), you'll probably need most of these types as well.

Well, the intent for Serde is deliberately that you can easily assemble whatever combination of (de)formatters and [im|ex]porters you want, as long as they obey the interfaces. SerdeCommon is just that: My estimation of what most typical use cases will cover. But there's absolutely nothing preventing you from wiring up your own that uses all custom handlers and formatters (using Serde) or making your own subclass of Serde that does so. That's an explicit feature. I've actually been pondering doing a "high performance" set of imp/exp handlers that assume an all-at-once-array and only simple transformations and so can skip about 40 layers of nested function calls. So a SerdeBinary is entirely reasonable, whether it's first-party provided or not.

There's 2 catches, though:

The Formatter interface assumes a JSON-esque set of export types. That's sufficient for most any text-based serialization format, but possibly not binary, it seems. However, that list wasn't designed to be extensible. While technically you could add an extra interface, like BinaryFormatter, that adds more format methods, the Serializer class wouldn't know about it, so it would still only be typed as Formatter. Changing that puts us squarely into 2.0 territory. (And I would want to do a 2.0 of AttributeUtils that adds some more flexibility using PHP 8.4 before I did that, and I'm not sure exactly all of what I want to do there.)

As mentioned, TypeField is singular per field. That means specifying, say, #[UInt8] would be incompatible with also specifying #[PositiveInt] or #[DoctrineId] (something I know another Serde user is doing). That effectively makes an class using such attributes format-dependent, whereas the intent is for it to always be format independent. Hence why I've been considering FormatField as a per-format multi-value alternative.

So it sounds like doing it "right" would involve going all-in on Serde as a framework. I'm not entirely sure what all that entails, though, and there's enough users that backward compatibility becomes a factor.

I've actually been pondering doing a "high performance" set of imp/exp handlers that assume an all-at-once-array and only simple transformations and so can skip about 40 layers of nested function calls.

If you think about it, a JSON/array output is basically just text ... or if you squint at it, can be represented as a template. And PHP is really good at templates. 😈

The hard part is sanitization and serialization to the template which this library has all the plumbing for; now that I think about it.

The Formatter interface assumes a JSON-esque set of export types. That's sufficient for most any text-based serialization format, but possibly not binary, it seems. However, that list wasn't designed to be extensible.

I’m not even sure it could be considered complete even if you tried. MsgPack has (u)int8/16/32 etc. MD5 is a uint128, and sha256 is a uint256. Some can be represented in PHP without any information loss; while others have to be stored as a binary string or can be partially represented (i.e. a uint32 can’t be fully represented on a 32-bit machine, but it can be fully stored — it is just that the second half of the range is represented as a negative number; but it can be fully represented on a 64-bit machine).

So, you would probably need some kind of interface to represent a type conversion: TypeConversion that can be specified by a formatter. So, JSON would use IntToString to capture the integer in a string representation, which would then get passed to an appropriate template handler which would handle sanitization/encoding

{"id": $id, "name": "$name", "balance": $float}

where in protobuf it would use different type conversions but look somewhat similar in a template form:

$idTag$id$nameTag$nameLen$name$balanceTag$float

In some ways it is kind of unfortunate that we have json_encode — it is too darn powerful.

Using json_encode, you would probably use an "Identity" TypeConversion because json_encode will handle the entire pipeline.

Sorry — I’m totally off topic here, but I feel like I spent so long writing it down that I should send it anyway.

As mentioned, TypeField is singular per field. That means specifying, say, #[UInt8] would be incompatible with also specifying #[PositiveInt] or #[DoctrineId]

I did notice that; but hadn’t thought through the implications. That is not ideal...

I suspect that isn’t an easy-ish fix?

Making TypeField multi-value would be... invasive. It would break a lot of assumptions, and ask a lot of questions along the lines of "so which one do I use right now?"

It may be enough to pass on the raw ReflectionProperty so tooling can access the raw attributes?

ReflectionProperty is not serializable. It's exposed to the Property attribute, which can pull out the pieces it wants. (This is AttributeUtils functionality.) But once that's cached, Reflection is not called again.

Crell · 2025-05-30T20:56:36Z

src/Formatter/ProtobufFormatter.php

+        return $runningValue;
+    }
+
+    public function serializeString(mixed $runningValue, Field $field, ?string $next): mixed


There's enough going on here that the switch cases should be split up to their own private methods. (Which would probably enable using a match() instead.)

Crell · 2025-05-30T21:01:38Z

And of course the whole thing needs gobs of tests.

withinboredom added 7 commits May 30, 2025 09:08

add more type tags

3c7f7e1

sfixed64 can also be string

513a8e2

sint64 can also be a string

b3ab15a

add field number to field

655f64f

all floats in php are doubles 🤦

93e1b53

add type for keeping track of the running value and cursor location

a906d84

add formatter for protobuf

2ea0d47

Crell reviewed May 30, 2025

View reviewed changes

withinboredom mentioned this pull request Sep 12, 2025

Add modular threads php/frankenphp#1795

Merged

		private const FULL32 = 0xffffffff;
		private const INT32_MAX = '4294967296';

		if (extension_loaded('bcmath')) {
		$bytes = bcmod($next, self::INT32_MAX);

Uh oh!

Protobuf Formatter #85

Are you sure you want to change the base?

Protobuf Formatter #85

Uh oh!

Conversation

withinboredom commented May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Checklist:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

withinboredom May 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Crell commented May 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

withinboredom commented May 30, 2025 •

edited

Loading

withinboredom May 31, 2025 •

edited

Loading