Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full serialization using iwoplaza/typed-binary package #25

Merged
merged 19 commits into from
Jul 31, 2024

Conversation

UlrichEckhardt
Copy link
Contributor

Just as an example, here's one of the box classes implementing reading and writing. Notes:

  • getSchema() needs to become part of the baseclass/interface.
  • The separate static typeCode could be merged with the schema.
  • In order to read a yet unknown type of box, you first need to read the header with length and type. Afterwards, you rewind to the beginning and read using the according schema. This isn't pretty, but probably unavoidable.

@cyraxx
Copy link
Contributor

cyraxx commented Jul 18, 2024

  • In order to read a yet unknown type of box, you first need to read the header with length and type. Afterwards, you rewind to the beginning and read using the according schema.

I think that's actually exactly what c2pa-rs does as well.

@UlrichEckhardt UlrichEckhardt force-pushed the feature/typed-binary-integration branch 7 times, most recently from f438af4 to e0e4fc4 Compare July 22, 2024 10:34
@UlrichEckhardt
Copy link
Contributor Author

This branch now experimentally enables the typed-binary serialization using a switch in SuperBox.fromBuffer. From what I can tell, processing JPEGs seems to work. If it works, the existing code, which is still present, needs to be removed.

Some question marks remain:

  • One thing I don't like is that quite some data is copied between buffers unnecessarily. My gut feeling says that this won't perform as well as it could.
  • The other thing is that construction of a JUMBF structure in memory and serializing it works, but using it as-is won't, because that requires access to the rawContent field in some boxes. I wonder if we couldn't rewrite code to use proper data structures instead of fiddling with the internals of some other object, breaking encapsulation.
  • Some of the boxes have kind-of optional data, like the uuid field in DescriptionBox and UUIDBox. I wonder if those are really optional or if they could have a default. This could simplify some code.

@UlrichEckhardt
Copy link
Contributor Author

Short notes after call:

  • UUID etc.: Default value
  • TS properties for rawContent: maybe later
  • next milestone: JUMBF to Manifest and back again
  • CBOR tag: check cbor-js for tagging support
  • Schemata -> schemata, CamelCase file names only for classes
  • configure linter/TS to only allow correct JS-flavour features
  • ensure unknown JUMBF box types can be deserialized

@UlrichEckhardt UlrichEckhardt force-pushed the feature/typed-binary-integration branch 5 times, most recently from 72e92ec to cf533d2 Compare July 28, 2024 10:55
@UlrichEckhardt
Copy link
Contributor Author

This should now be ready for review and merge, there shouldn't be any regressions from the existing code.

Implementing this, one question came up though, concerning the uri field in super boxes. This is only set if the nested description box contains a label, otherwise it is left undefined. This undefined URI is then used to construct the URI of nested super boxes, too, which is suspicious to me. I haven't checked the spec yet, but would it make sense to use an implicitly empty label instead of undefined one for the description boxes? This is currently implemented as it was before.

@UlrichEckhardt UlrichEckhardt requested a review from cyraxx July 28, 2024 10:55
@UlrichEckhardt UlrichEckhardt changed the title Full serialization using iwoplaza/typed-binary package (WORK IN PROGRESS) Full serialization using iwoplaza/typed-binary package Jul 28, 2024
@cyraxx
Copy link
Contributor

cyraxx commented Jul 28, 2024

According to the spec, every box in a C2PA manifest needs to have a label:

All JUMBF Description boxes (ISO 19566-5:2023, A.3) used in a C2PA Manifest require a label, the Label Present toggle (xxxxxx1x) shall be set. In addition, because JUMBF URIs are used to refer to boxes throughout the system (e.g., listing assertions, references to ingredients, etc.), the Requestable toggle (xxxxxx11) shall be set.

Either way, you're right that using undefined to build further URIs doesn't make sense – if a box doesn't have a label, it can't have a URI and neither can any child boxes.

Copy link
Contributor

@cyraxx cyraxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall!

I've aded some inline comments.

General remarks:

  • Mocha didn't seem to run the full asset decoding test suite here, shouldn't it?
  • The only thing I'm not really happy about is all the individual readByte()/writeByte() calls. Is there no way to read/write full byte arrays using typed binary?
  • We should also add typed binary to the list of acknowledgments in the README
  • About the URIs see my previous comment

measure(value: C2PASaltBox, measurer: bin.IMeasurer = new bin.Measurer()): bin.IMeasurer {
return measurer.add(
4 + // length
4 + // type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this common header length be moved to the generic box schema instead?

Actually, the same goes for reading and writing the common header (length, type):

  • Is it possible to have the generic schema handle the header and have the individual schemas only contain the specific box type's content?
  • If not, have common methods for reading/writing/measuring the header that are called by each schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we can absolutely do is this:

return measurer.add(
    this.length.measure(0).size +
        this.type.measure(value.typeCode).size +
        (value.salt ? value.salt.length : 0),
);

I find that a bit less readable and the performance will probably suffer, too.

Concerning the repeated code, let's consider this:

class BoxSchema extends bin.Schema<Box> {
    read(input: bin.ISerialInput): Box {
        const length = this.length.read(input);
        const type = this.type.read(input);
        return this.read_internal(input, length, type);
    }
    ... // ditto for write() and measure()
}

My problem here is that this returns and consumes Box objects, not e.g. C2PASaltBox. Doing that with correct typing on top could possibly be done using mixins. I'd have to do some research first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could possibly be made to work with generics – but how about this as a less complicated approach:

    write(output: bin.ISerialOutput, value: C2PASaltBox): void {
        this.writeHeader();
        value.salt?.forEach(byte => output.writeByte(byte));
    }

    measure(value: C2PASaltBox, measurer: bin.IMeasurer = new bin.Measurer()): bin.IMeasurer {
        return this.measureHeader(measurer).add(this.value?.length ?? 0);
    }

And then have writeHeader() and measureHeader() in the base class? I might be missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just extracted the length/type handling into an abstract base template for the concrete schemata. This was medium complicated, mainly to ensure correct typing, but I think it's a good compromise.

readonly id = bin.u32;
readonly hash = bin.arrayOf(bin.byte, 32);
// Note: This doesn't work due to a circular import.
// readonly privateBoxes = new GenericBoxSchema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually circular imports can be prevented by introducing a common base interface. I haven't checked the exact issue here though and whether that might help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GenericBoxSchema class (the name sucks and I'd happily welcome a better one) needs to know all box classes and their schemata while some box classes' schemata (super, description) need to access to a generic schema for the nested boxes. I honestly don't fully understand why it seems to work in some cases but not others, so I basically gave up.

Writing this, I was wondering if the GenericBoxSchema couldn't remain ignorant of any but the fallback schema. All other schemata would instead actively register themselves. I'm not sure if this is worth the effort at this point or if there are more urgent things to do.

@cyraxx
Copy link
Contributor

cyraxx commented Jul 29, 2024

Answering my own question:

  • The only thing I'm not really happy about is all the individual readByte()/writeByte() calls. Is there no way to read/write full byte arrays using typed binary?

This seems to be WIP iwoplaza/typed-binary#11

@UlrichEckhardt UlrichEckhardt force-pushed the feature/typed-binary-integration branch 2 times, most recently from 2fa4e67 to f05f75f Compare July 29, 2024 20:06
@UlrichEckhardt
Copy link
Contributor Author

  • Mocha didn't seem to run the full asset decoding test suite here, shouldn't it?

Yes, bug in the GitHub Actions setup, already fixed in main. Run echo tests/**/*test.ts in main and feature/typed-binary-integration for comparison. This behaves properly in fish, but fails with bash and sh.

  • We should also add typed binary to the list of acknowledgments in the README

Done. I also pushed ack's for cbor-x and mocha.

  • About the URIs see my previous comment

I'll look into that this week. I wonder if this should be handled during deserialization or during manifest creation/validation. This is similar to a JSON/CBOR field that doesn't properly decode.

@cyraxx
Copy link
Contributor

cyraxx commented Jul 29, 2024

  • About the URIs see my previous comment

I'll look into that this week. I wonder if this should be handled during deserialization or during manifest creation/validation. This is similar to a JSON/CBOR field that doesn't properly decode.

I wouldn't spend too much time on it – nobody would write a C2PA manifest that doesn't have each SuperBox labeled as it simply would be completely invalid manifest. So it's fine to just not apply the URI if there's no label and then have validation fail later when it tries to resolve URIs.

Speaking of, is there a reason why applyURI() is static?

@UlrichEckhardt
Copy link
Contributor Author

Speaking of, is there a reason why applyURI() is static?

It doesn't need any this instance, simply. Since it's called from a static factory method, too, there isn't a this either.

@cyraxx
Copy link
Contributor

cyraxx commented Jul 30, 2024

Speaking of, is there a reason why applyURI() is static?

It doesn't need any this instance, simply. Since it's called from a static factory method, too, there isn't a this either.

Well, it does take a box as its first parameter so I thought it could simply be turned into an instance method. Not a big deal either way 😉

@UlrichEckhardt UlrichEckhardt force-pushed the feature/typed-binary-integration branch from f05f75f to adf62dc Compare July 30, 2024 10:45
These are used by multiple box-specific schemata. This implements:

- length: Just a simple 32-bit integer.
- type: Four letter word, representing the type of a box.
- uuid: UUID made of 16 raw bytes.
- fallback: fallback schema for unknown box types. This may be
  temporary, but for now it avoids having too many loose ends
  in the code.
The schema is required to be given as parameter to the constructor,
along with the type string. Typically, they are provided by one of
the derived classes.
This type template eases writing box schemata. It contains an abstract
base class that manages the length and type header parts of the schema.
It then delegates the other parts to the derived class.
This schema looks at either the input stream or the box and the
dispatches to the according schema. Note that this presents some
issues with circular dependencies, because this class needs to
know about box classes that nest others and vice versa.
- Delete `parse()` functions from all box classes.
- Re-implement `BoxReader` based on the typed-binary code.
  Maybe this class could even be removed entirely, it is only
  used by some tests currently.
@UlrichEckhardt UlrichEckhardt force-pushed the feature/typed-binary-integration branch from adf62dc to 7d3d7ab Compare July 31, 2024 08:05
@cyraxx cyraxx merged commit 1c45f93 into main Jul 31, 2024
3 checks passed
@cyraxx cyraxx deleted the feature/typed-binary-integration branch July 31, 2024 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants