[validator] validating provenance metadata #173

@bclenet

Description

The validator should be able to validate provenance metadata. Here is an incomplete list of rules that could be checked by the validator.

They come from this comment: bids-standard/bids-specification#2099 (comment)

  1. Every id used in all prov files is unique with respect to the dataset.
    When a given file is validated, we build a context for it that has all the information it needs to be validated by the schema. Typically this involves loading a sidecar or a handful of specific associated files. This type of pan-dataset assertion is not possible in the current schema (a rough sketch of what such a check could look like outside the schema follows this list).
  2. *_ent.json "AtLocation": Its value exists in the dataset. Doable in the current schema.
  3. *_ent.json "GeneratedBy": Its values only reference existing Activity ids.
  4. *_act.json "AssociatedWith": Its values only reference existing Software ids.
  5. *_act.json "Used": Its values only reference existing Environment ids or ProvEntity ids.
  6. anyBidsfile.json "GeneratedBy": Its values only reference existing Activity ids.
  7. anyBidsfile.json "SidecarGeneratedBy": Its values only reference existing Activity ids.
    This suffers from the above issue: for any given prov file being validated, we must load arbitrarily many others and check the value at a specific key inside each of their arrays. We could try to come up with new semantics for schema entries that would allow this. Another option would be a new way in the schema to aggregate values from multiple files into a single place, plus a way of running checks on the aggregated data. But even if all the Software objects, for example, were gathered in a single place, we'd have another problem...
    The expression language is incapable of iterating through an array of objects and running a check on a specific key for each element. We could extend the expression language with a function like flatten(array: List[List | dict], key: Optional[str]). If the input is an array of arrays, we use normal flatten semantics: put all elements of all arrays into a single array and return it. If the input is an array of objects, we return an array of each object's value at the specified key (a sketch of this follows the list).
    One thing I like about this proposal is that each JSON file is simple enough to be immediately understood by a human. I was playing around with alternative ways of organizing the data from the examples that might be more amenable to the current expression language, and they were all much more difficult to read at a glance. The UID in the Ids makes me think this was not meant to be produced or consumed by humans, but I'm a sucker for looking at any JSON file that comes across my path.
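As a rough illustration of the pan-dataset check in rule 1, here is a minimal Python sketch of aggregating ids across prov files and flagging duplicates. The file-name patterns, the top-level "Records" list, and the "Id" key are assumptions made for the example, not the actual BEP028 layout.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical provenance file patterns; adjust to the real BEP028 naming.
PROV_PATTERNS = ("*_act.json", "*_ent.json", "*_env.json")

def collect_prov_ids(dataset_root: Path) -> Counter:
    """Count every "Id" declared across all provenance files in the dataset.

    Assumes each prov file holds a top-level "Records" list of objects,
    each carrying an "Id" key -- an assumption for illustration only.
    """
    ids: Counter = Counter()
    for pattern in PROV_PATTERNS:
        for prov_file in dataset_root.rglob(pattern):
            records = json.loads(prov_file.read_text()).get("Records", [])
            ids.update(record["Id"] for record in records if "Id" in record)
    return ids

def duplicated_ids(dataset_root: Path) -> list[str]:
    """Rule 1: every id must be unique with respect to the dataset."""
    counts = collect_prov_ids(dataset_root)
    return [identifier for identifier, count in counts.items() if count > 1]
```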
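And a minimal sketch of the proposed flatten semantics. The name and signature mirror the suggestion above; none of this exists in the current expression language, and the example data at the bottom is made up.

```python
from typing import Optional

def flatten(array: list, key: Optional[str] = None) -> list:
    """Proposed expression-language helper (sketch only).

    - For a list of lists: concatenate all inner elements into one list.
    - For a list of objects (dicts): return each object's value at `key`.
    """
    if all(isinstance(item, list) for item in array):
        return [element for inner in array for element in inner]
    if all(isinstance(item, dict) for item in array):
        return [item.get(key) for item in array]
    raise TypeError("flatten expects a list of lists or a list of objects")

# Made-up aggregated Activity objects; the id format is illustrative only.
activities = [{"Id": "act-0001", "Label": "preprocessing"},
              {"Id": "act-0002", "Label": "quality control"}]
assert flatten(activities, key="Id") == ["act-0001", "act-0002"]
assert flatten([[1, 2], [3]]) == [1, 2, 3]
```

With something like this, a check such as rule 3 could reduce to "every value in *_ent.json 'GeneratedBy' is a member of flatten(all_activities, 'Id')".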
