CF Roadmap: Kicking off work on Provenance and Lineage theme #389
Replies: 11 comments 22 replies
-
Does "provenance and lineage" mean discovery metadata?
-
No, this is focused on being able to inspect data and find out how it was made.
-
Would this possibly be a revival of work on ACDD? I've also come across W3C's PROV a few times.
-
Is it related to (The present discussion so far reads a bit like Twenty questions. 😃)
-
My understanding is that many of the mentioned issues and external resources have some overlap. I think that Daniel @erget captures it well:
At this general level the W3C's PROV that Andrew @DocOtak points at is the comprehensive resource. The abstract of the web page states:
For users requiring this kind of all-embracing machinery, maybe the best thing CF can do is to point at PROV in the conventions document, and by including a suitable string in the
Personally I think that the 2024 CF Workshop presentations by David Huard here, in the Uncertainty session, and by José Manuel Gutiérrez here, in the Statistical Processing session, offer a lot of food for thought. Many users do not need e.g. manual quality judgement, expert group voting, and much more, but still do need something more than current
I think that by enhancing the cell method machinery we can cover several existing use cases, but I also think that we need to come up with something new that complements cell methods to cover even more complex use cases and data manipulation/processing. I have for some time been playing with the idea of creating some kind of "pseudo-language". That is, something similar to what is now used for parametric vertical coordinates but with freedom to also describe the equations. Well, these are still just some wild ideas without much substance behind them ...
-
Hi folks, particularly @sethmcg and @pagecp - the survey has spoken. Unfortunately there's no appointment in the near term that fits all of us, but @sethmcg, we'll fill you in and hopefully you can join later. Anybody who wants to be on the invite, let me know and I'll add you. @pagecp, you're already on it. It'll be 2024-11-27T13:00Z on Teams.
-
Hi folks, we had an appointment last year to roadmap what we want to do with provenance and lineage... In the end it was a veritable echo chamber - I was the only one there! 😱 Do we want to try it again? Mark your availability here by EOB on 12 Feb (next Wednesday) and we'll set a date.
-
We met yesterday (sorry @sethmcg for leaving you out in the dark; that was definitely not intentional and it's my fault we missed you 😥) and had a first meeting to discuss our intents. We are meeting again on 12 March at 16 CET - if you aren't on the invite and would like to be, let me know and I'll add you.
What use cases did we discuss?
We identified the following reasons why somebody using CF data may want to have provenance baked in:
Not all of these use cases need to be fulfilled for things to be useful. The W3C PROV standard provides tools that service all of them.
What do we propose incorporating into the CF Roadmap?
In this order:
Any design principles we want to propose?
We want to re-use PROV rather than adapt it or invent something, because using an existing standard has a lot of advantages. The issue that we see here is how to represent the provenance data in a way that works with CF so we don't have a multitude of implementations.
What are we doing between now and the next meeting?
I made some technical documentation available to the people in the meeting to mull over. It relates to a prototype implementation that we had at EUMETSAT and is meant as inspiration; it's not something that we could just take off the shelf and use. So we're thinking about
In particular, Lars kindly volunteered to attempt to provide an overview of uses of PROV in IPCC and other settings; this may or may not be ready at that point. Looking forward to seeing you all at the next meeting :)
-
We discussed the use case
I'd like to add that this use case could well cover my use case of recording the uncertainty of data used at different parts of the processing chain, assuming that each contributing dataset includes its own uncertainty description (which is not in any way a part of this discussion!).
-
To be better informed for this afternoon's meeting, I've studied PROV a bit. The primer gives examples in three notations suitable for text files, namely XML, Turtle and their own PROV-N notation. As Daniel said, we could easily point to a file containing PROV from a global or variable netCDF attribute. If we wanted to store a PROV description using any of the notations within a CF-netCDF file, we could put it in a string or
The PROV data model is not inconsistent with the CF data model, because the concepts are non-overlapping. Therefore if we wanted to contain the PROV information in the netCDF file in a way which could be manipulated, I think it would be pretty straightforward to store it in netCDF variables with attributes. They would be what we call "container" variables in the CF convention, which have no data (they are scalars with an irrelevant value), because their purpose is simply to contain attributes.
From the XML of the primer's complete set of examples, you can see that there are only two levels in PROV, which would correspond to variable and attribute in netCDF; you never have a thing inside a thing inside a thing. I suppose, for instance, that you could have a container variable for each of the three types and seven relations of Table 2, with a CF attribute indicating what sort of thing it is. They would refer to each other by their variable names contained in attributes indicating the PROV roles. They might want to refer to CF external variables too. Collections and bundles could be done the same way.
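To make the container-variable idea above a little more concrete, here is a minimal sketch in plain Python (no netCDF library needed). It models each PROV type or relation as a named container with attributes and prints a CDL-like fragment. The attribute names `prov_kind`, `prov_subject` and `prov_object`, and all variable names, are hypothetical and chosen only for illustration; nothing here is an agreed CF convention.

```python
from dataclasses import dataclass, field

# The three PROV types and the seven core relations of PROV.
PROV_TYPES = {"entity", "activity", "agent"}
PROV_RELATIONS = {
    "wasGeneratedBy", "used", "wasInformedBy", "wasDerivedFrom",
    "wasAttributedTo", "wasAssociatedWith", "actedOnBehalfOf",
}


@dataclass
class Container:
    """Stand-in for a scalar netCDF variable that only carries attributes."""
    name: str
    attrs: dict = field(default_factory=dict)


def dangling_references(containers):
    """Return (container, role, target) triples whose target is not defined."""
    names = {c.name for c in containers}
    missing = []
    for c in containers:
        if c.attrs.get("prov_kind") in PROV_RELATIONS:
            # Hypothetical role attributes: relations name other containers.
            for role in ("prov_subject", "prov_object"):
                target = c.attrs.get(role)
                if target not in names:
                    missing.append((c.name, role, target))
    return missing


def to_cdl(containers):
    """Render the containers as a CDL-like 'variables:' fragment."""
    lines = ["variables:"]
    for c in containers:
        kind = c.attrs.get("prov_kind")
        if kind not in PROV_TYPES | PROV_RELATIONS:
            raise ValueError(f"{c.name}: unknown prov_kind {kind!r}")
        lines.append(f"    int {c.name} ;")  # scalar, value is irrelevant
        for key, value in c.attrs.items():
            lines.append(f'        {c.name}:{key} = "{value}" ;')
    return "\n".join(lines)


# Two entities, one activity, and two relations linking them by variable name.
example = [
    Container("raw_obs", {"prov_kind": "entity"}),
    Container("regridded", {"prov_kind": "entity"}),
    Container("regrid", {"prov_kind": "activity"}),
    Container("regrid_used_raw", {
        "prov_kind": "used",
        "prov_subject": "regrid",
        "prov_object": "raw_obs",
    }),
    Container("regridded_by_regrid", {
        "prov_kind": "wasGeneratedBy",
        "prov_subject": "regridded",
        "prov_object": "regrid",
    }),
]

print(to_cdl(example))
```

Because there are only two levels (variable and attribute), a checker such as `dangling_references` stays very simple: it only has to confirm that every name mentioned in a role attribute is itself a container in the same file.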
-
@davidhassell @JonathanGregory @sethmcg @taylor13 and I had a meeting yesterday, and I believe we have everything that we need in order to report back to the roadmapping group. Here's my summary, improved a bit thanks to @JonathanGregory. If anybody sees a way to make this better, or would like to join the conversation after not having been involved thus far, that is welcome.
Summary
Use cases
In the CF roadmap we want to support 3 use cases in order of decreasing priority:
In order to realise this we want to use the W3C standard PROV, which allows us to interlink provenance records and, if combined with e.g. verifiable claims, could be used for integrity checks and nonrepudiation. These features are outside the scope of CF, as explained below.
Scope
We won't worry about defining PROV. Although we haven't decided for sure, we're unlikely to use a container variable to embed the PROV record in the netCDF file - despite the elegance that that would provide, there are also many advantages to having a completely external PROV record. Although provenance is related to uncertainty estimates, we will not try to use PROV as the solution for describing uncertainty. Instead, we want an unequivocal way of referring to PROV records in CF, so that there is one way for data providers and consumers to provide and find that information.
Other notes
We will encourage users to attach PROV to intermediate products rather than having huge history attributes - at a certain point the history attribute just isn't equipped for describing complex data pipelines. We need some careful work to make it possible to attach PROV records to specific variables, groups, etc. - it's not a given that a PROV record would refer to an entire file, although this should be possible.
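As a rough illustration of what "one way to find that information" could look like for a data consumer, the sketch below looks up a reference to an external PROV record at variable, group and file level. Everything specific in it is an assumption made for the example: the attribute name `provenance`, the rule that the most specific scope wins, and the example.org URLs. The group has not decided any of these.

```python
def find_prov_reference(file_attrs, group_attrs=None, var_attrs=None,
                        attr_name="provenance"):
    """Return (scope, reference) for the most specific PROV link, or None.

    Assumed lookup order (hypothetical): variable, then group, then file.
    """
    scopes = (
        ("variable", var_attrs),
        ("group", group_attrs),
        ("file", file_attrs),
    )
    for scope, attrs in scopes:
        if attrs and attr_name in attrs:
            return scope, attrs[attr_name]
    return None


# Attributes as plain dicts, standing in for netCDF global/variable attributes.
file_attrs = {
    "Conventions": "CF-1.11",
    "provenance": "https://example.org/prov/whole_file.provn",
}
tas_attrs = {
    "standard_name": "air_temperature",
    "provenance": "https://example.org/prov/tas.provn",
}
pr_attrs = {"standard_name": "precipitation_flux"}

print(find_prov_reference(file_attrs, var_attrs=tas_attrs))
print(find_prov_reference(file_attrs, var_attrs=pr_attrs))
```

With this lookup order a variable carrying its own reference is resolved at variable scope, while a variable without one falls back to the record attached to the whole file, which matches the note that a PROV record need not refer to an entire file but should be able to.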
-
Topic for discussion
Do you have ideas around data provenance and lineage, and would you like to be involved in ensuring that these themes are represented in a way that's beneficial to the CF Roadmap? This is the discussion for you.
At this year's CF Workshop we divided the roadmap preparations into multiple themes. This is one such theme. As you can see from the table, right now we're working on a green field with a blue sky - nothing much more than two keywords to start with. Behind that are a whole bunch of ideas and opportunities to shape where we go with it.
I'm setting up a call for later this month. If you want to be involved, please indicate your availability in this survey by the end of 14 Nov 2024.
Topics:
From there we can agree how to organise the work moving forward.
If you're not available on those dates but still want to be involved, that's no problem - let me know in the comments or drop me a mail and I'll keep you looped in :)
@pagecp