Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should gitsign attest _just_ store the signed attestation? #623

Open
TomHennen opened this issue Jan 22, 2025 · 6 comments
Open

Should gitsign attest _just_ store the signed attestation? #623

TomHennen opened this issue Jan 22, 2025 · 6 comments

Comments

@TomHennen
Copy link
Contributor

It currently stores both the raw predicate and the signed attestation (which also contains the predicate).

This seems redundant (users can get the predicate from the signed attestation) and potentially dangerous (users can get the predicate data without verifying the signature on the attestation).

So, perhaps it would make sense to only store the signed attestation, and provide an 'easy' way to verify the signature attestation and get the contents?

@wlynch
Copy link
Member

wlynch commented Jan 22, 2025

I'm not more worried about users fetching predicate data without checking the signature - it's just as easy to cat <file> | jq .payload | base64 -d that I don't think storing just the attestation is any more of a deterrent.

IIRC I think the reason why I set this up this way originally was:

  1. to make it easier to allow multiple signers attest to the same predicate without needing to duplicate the predicate content in each attestations - blobs will already be deduped in git storage, so the attestations can be wrappers over the blob SHAs. This is especially true if the predicate is checked into the main branch already - the blob also appearing in the attestation branch is almost negligible because its storage is deduped at the blob level. The content duplication in the attestation was mostly a side effect of reusing the cosign attestation libraries, we could alternatively change this to be a blob reference.

  2. To make it easier to reason about diffs with existing git tooling, since base64 DSSE blobs are effectively unusable using standard diff tools.

  3. To preserve the original predicate blob - In cases where the predicate was itself JSON, you may lose data when wrapping the raw predicate format into in-toto due to json serialization. https://github.com/wlynch/attest-demo has an example of this, but a blob like

    {
            "name": "vuln scan",
            "status": "success"
    }

    will end up being encoded as

    {"_type":"https://in-toto.io/Statement/v0.1","predicateType":"","subject":[{"name":"","digest":{"sha256":"021f858127457b5e3a416ed601f194bada463933"}}],"predicate":{"name":"vuln scan","status":"success"}}

    Even though the data is the same, it's not the same blob that gets stored. This was of interest because part of the goal here was to make it easy for CI tools to do things like gitsign attest -f test-output.json and have those blobs be made a part of the repository.

    What we may want to do instead is include information around the original file digest and reference that instead of including the whole blob in the payload. It's possible that the answer here might also be use a different in-toto predicate type though - e.g. maybe reference the file by digest instead and have tooling stitch it together? 🤔

That said, 100% agree that we should include tooling to make it easy to do the right thing for accessing the data + verification.

@TomHennen
Copy link
Contributor Author

Hmm, I'm not sure that would work for my use case. One of the things I need to be able to do is take the signed attestation, in a portable format (e.g. signed DSSEs), and send them elsewhere, where they'll get verified by other tools that may not know anything about gitsign.

@TomHennen
Copy link
Contributor Author

TomHennen commented Jan 24, 2025

Here's some more detailed thoughts (sorry I didn't have time for this earlier)

  1. to make it easier to allow multiple signers attest to the same predicate without needing to duplicate the predicate content in each attestations - blobs will already be deduped in git storage, so the attestations can be wrappers over the blob SHAs. This is especially true if the predicate is checked into the main branch already - the blob also appearing in the attestation branch is almost negligible because its storage is deduped at the blob level. The content duplication in the attestation was mostly a side effect of reusing the cosign attestation libraries, we could alternatively change this to be a blob reference.

FWIW the DSSE envelope allows multiple signatures without duplicating the content so, if you standardized on that format then this problem could be solved in that way.

  1. To make it easier to reason about diffs with existing git tooling, since base64 DSSE blobs are effectively unusable using standard diff tools.

At the very least in my use cases I don't intend to update old data to existing attestations. Instead I'd just be adding new things. So diffing just isn't important to me (but it maybe it is to others?).

  1. To preserve the original predicate blob - In cases where the predicate was itself JSON, you may lose data when wrapping the raw predicate format into in-toto due to json serialization. https://github.com/wlynch/attest-demo has an example of this, but a blob like

I can see how this could be an issue for folks that are just trying to sign raw data. For folks that intend to use in-toto statements this shouldn't be an issue though (and definitely wouldn't be an issue if we allowed them to just provide the statement fully formed so gitsign doesn't need to serialize it). For folks that want to sign raw data, it might be better to not stuff it in an in-toto attestation and instead just stick it directly in a DSSE? (As an aside I will say there's something of a problem in preserving the original blob but signing over the mutated blob, how can folks tell if the original blob was tampered with? The signature would never verify.)

But, I should probably focus more on the specific things I'm trying to do, and less on cases I'm not concerned with. So here are the use cases I'm trying to solve:

  1. Being able to easily sign in-toto attestations related to git commits within a GitHub Actions workflow
  2. Having a canonical storage location for those attestations
  3. Having a straightforward way to collect the attestations associated with a commit
  4. Having a straightforward way to verify the signatures on those attestations (though this verification might take place far away from gitsign itself).

With those use cases in mind, I suppose it doesn't matter much to me if there's an extra blob in there. But not having the signed in-toto attestation itself with the content embedded within would be an issue.

Under the current scheme I suppose I can just ignore that extra blob, but I think there is still room for improvement in how the attestations themselves are stored (2), which is likely a prerequisite for addressing 3 (since it will need to know how things are organized).

One of the challenges I see with the existing approach is that storage scheme could perhaps use a bit more definition. Right now it stores the results based on the input filename itself, overwriting any old attestations with the same filename. I think it's quite probable that there will be duplication in filenames and that folks will want to keep both sets of content. At the same time, I don't have any use cases for needing to update previous attestations (which makes diffing support unnecessary for me).

There may be a way for us to both get what we want by applying a bit more structure to the storage. Something like:

<commit sha>/
  in-toto/
    <predicate-type>/  <-- replace / with _
      <statement-digest>.json  <--- contains the signed DSSE, using .json as the extension per the spec: https://github.com/in-toto/attestation/blob/main/spec/v1/envelope.md#file-naming-convention
      <statement-digest>.json  <--- Using the 'statement digest' as the filename prevents collisions from duplicate predicate types
  raw/
    <user-specified type>/
      <orig-filename>-<content digest>  <-- or maybe don't worry about duplicates and drop the `-<content digest>` part
      <orig-filename>-<content digest>.sig  <-- sig stored in whatever form makes the most sense?

Example

17a3331ce9dec14a10e18cc1fb0237e6c65dac53/
  in-toto/
    https:__slsa.dev_verification_summary_v1/
      abc123.json
      def456.json

This setup would allow for command lines like

# Just get intoto attestations of a specific predicate type
$ gitsign get-attest abc123 --intoto-pred "https://slsa.dev/verification_summary/v1" --output-intoto-bundle vsa.intoto.jsonl
# Get all intoto attestations
$ gitsign get-attest abc123 --all-intoto --output-intoto-bundle abc123.intoto.jsonl
# Get all the attestations, including the 'raw' ones
$ gitsign get-attest abc123 --all --output-intoto-bundle abc123.intoto.jsonl
# Get all the attestations, but don't bundle them, just output them into a directory indivdiually
$ gitsign get-attest abc123 --all --output-folder abc123-attest/

Sorry, that was a lot...

Thoughts?

@TomHennen
Copy link
Contributor Author

FWIW as I was getting coffee I realized the problem with my 'raw' suggestion above (just sticking it in the DSSE) is that it doesn't allow the signature to cover the binding to the relevant git commit... 🤷

@wlynch
Copy link
Member

wlynch commented Jan 29, 2025

FWIW I'm not strongly tied to the existing storage setup, mainly wanted to provide some context as to why those choices were originally made. I'm open to storing the whole predicates, and I think the structured vs raw examples you gave could be a good trade-off here!

+1 on providing more structure on the storage.

I agree with moving to a hash-based filename, though I'm not 100% sold on encoding the predicate type into the path -

  1. I don't want to deal with collisions (I believe _ is valid in a predicate type string)
  2. I wonder if there's value in following cosign's signature spec for consistency and to have a place for other unauthenticated signature metadata.

My guess is we may want a metadata file along side each file to allow cheap-ish filtering without needing to load the entire predicate. (this could possibly be a single manifest? this would be a bottleneck for writes, but we already have a similar bottleneck for ref updates)

e.g.

abc/
  def.json
  def.metadata.json

where metadata.json looks something like:

{
  "mediaType": "application/vnd.dsse.envelope.v1+json",
  "digest": "sha256:def"
  "annotations": {
    "predicateType": "https://slsa.dev/verification_summary/v1",
    "foo": "bar",
  }
}

This way you can get flexibility of encoding the type so we're not strongly tied to DSSE with optional predicate type sub-data without being constrained by encoding everything into the path, and we'd have similar behavior to cosign (while that it's a strong requirement, is a nice to have).
You'd only need to load the metadata files to do queries for specific types without needing to inspect the payloads themselves.

@TomHennen
Copy link
Contributor Author

That all works for me. :)

I don't want to deal with collisions (I believe _ is valid in a predicate type string)
Agreed, that was why I made the predicate type the folder. FWIW _ is not a valid predicate type (it has to be a URI%2C%20required-,URI,-identifying%20the%20type)), I think you might be thinking of the subject.name field.

Either way, there can certainly be multiple attestations, from the same or different tools, that have the same predicate type. So it's absolutely possible to have collisions. Using the digest of the thing as the name solves that. The solution you outline (having the predicate type in a metadata file instead of in the directory name) works almost as well. The disadvantage is that you can't just "get all the attestations under folder " and instead have to "parse all the metadata files to find matching predicate types". Either way the user can do the same thing though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants