Description
As a user of the Python library in Jupyter, I'd like to be able to re-run arbitrary cells in my notebook in whatever order I see fit and still have correct run metadata generated, so that I can enjoy the interactivity and explorability of Jupyter notebooks along with the provenance tracking and reproducibility of Dotscience.
Currently, I can't, because of #1. Let's try using a different approach and see if it works.
ACs:
As this is a prototyping effort, all these ACs are to be considered "aspirational"; we'll see what we can achieve in practice then decide, at the end, whether what we have is better than what we ALREADY have.
- I can run cells in any order a reasonable user would do so, and get the results I'd expect in my run metadata.
- No extra user effort is required.
- Unless I do something really/deliberately silly or unlikely, there's no way for two runs to end up merged together in a Dotscience commit.
These ACs should apply for all of these cases:
- Notebooks with one run.
- Notebooks with two or more runs in separate cells, eg multiple calls to ds.publish().
- Notebooks with a loop that generates runs, eg a single call to ds.publish() that's in a loop (eg, trying the same algorithm with a range of input parameters to see what's best).
- A combination of the previous two cases.
Implementation plan:
We have a CUNNING PLAN to break this impasse! It's Luke's suggestion:
- Don't store state in-memory in the Python library, because the history of that in-memory state follows the dynamic flow of execution of Jupyter cells, which may have nothing to do with their order in the notebook, leading to the problems expounded above.
- Instead, every time you call a metadata-registration function like ds.input(), it should output a machine-readable tag at that very point.
- ds.publish() outputs an "end of this run" tag.
- The parser (be it notebook or command-output) reads the tags from top to bottom, building up in-memory state in notebook lexical order, then outputting a run and clearing its in-memory state at the "end of this run" tag.
- Therefore, the assignment of actions to runs is based purely on the lexical structure, not the dynamic structure.
- For extra niceness, in Jupyter mode, we can output the markers inside "Jupyter widgets" that control their display (rather than plain text) so they're less obtrusive and prettier; but we need to transparently not do that when not in Jupyter.
- How does this work with "publish inside a loop"? Unless we come up with a clever trick, we'll only keep the results of the last iteration of the loop. But do users actually call publish inside a loop to try the same algorithm with different input parameters, or do they copy+paste the cell and edit the parameters in each copy?
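To make the tag-and-parse idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the TAG_PREFIX marker, the JSON payload, the "end-of-run" kind, and the emit_tag and parse_runs helpers are hypothetical and are not the real Dotscience tag format or API. It only shows that grouping tags by their position in the notebook gives the same runs regardless of the order in which the cells were executed.

```python
import json

# Hypothetical marker; the real tag format is still to be decided.
TAG_PREFIX = "[[DOTSCIENCE-TAG]]"


def emit_tag(kind, **fields):
    """Build one machine-readable tag line (a real library would print it)."""
    return TAG_PREFIX + json.dumps({"kind": kind, **fields}, sort_keys=True)


def parse_runs(cell_outputs):
    """Group tags into runs, reading cell outputs in notebook (lexical) order.

    `cell_outputs` is a list of output strings, one per cell, ordered by
    the cell's position in the notebook rather than by execution order.
    """
    runs = []
    current = []  # tags seen since the last "end-of-run" tag
    for output in cell_outputs:
        for line in output.splitlines():
            if not line.startswith(TAG_PREFIX):
                continue  # ordinary cell output, not a tag
            tag = json.loads(line[len(TAG_PREFIX):])
            if tag["kind"] == "end-of-run":
                runs.append(current)  # output the run...
                current = []          # ...and clear the parser state
            else:
                current.append(tag)
    return runs


# Two cells, each registering an input and then publishing.
cells = [
    emit_tag("input", path="a.csv") + "\n" + emit_tag("end-of-run"),
    "ordinary output\n" + emit_tag("input", path="b.csv") + "\n" + emit_tag("end-of-run"),
]
runs = parse_runs(cells)
```

Because the parser only ever looks at the saved outputs in notebook order, re-running the second cell before the first does not change the result: each run still contains exactly the tags that appear lexically before its "end-of-run" tag.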