
PROTOTYPE: Use lexical scope to assign metadata to runs in notebooks #6

Open
@alaric-dotmesh

Description

As a user of the Python library in Jupyter, I'd like to be able to re-run arbitrary cells in my notebook in whatever order I see fit and still have correct run metadata generated, so that I can enjoy the interactivity and explorability of Jupyter notebooks along with the provenance tracking and reproducibility of Dotscience.

Currently, I can't, because of #1. Let's try using a different approach and see if it works.

ACs:

As this is a prototyping effort, all these ACs are to be considered "aspirational"; we'll see what we can achieve in practice, then decide at the end whether the prototype is better than what we ALREADY have.

  • I can run cells in any order a reasonable user would do so, and get the results I'd expect in my run metadata.
  • No extra user effort is required.
  • Unless I do something really/deliberately silly or unlikely, there's no way for two runs to end up merged together in a Dotscience commit.

These ACs should apply for all of these cases:

  • Notebooks with one run.
  • Notebooks with two or more runs in separate cells, e.g. multiple calls to ds.publish().
  • Notebooks with a loop that generates runs, e.g. a single call to ds.publish() inside a loop that tries the same algorithm with a range of input parameters to see what's best.
  • A combination of the previous two cases.

Implementation plan:

We have a CUNNING PLAN to break this impasse! It's Luke's suggestion:

  • Don't store state in memory in the Python library. The history of that in-memory state follows the dynamic flow of execution of Jupyter cells, which may have nothing to do with their order in the notebook, leading to the problems described above.
  • Instead, every time you call a metadata-registration function like ds.input(), it should output a machine-readable tag at that very point.
  • ds.publish() outputs an "end of this run" tag.
  • The parser (whether it reads a notebook or command output) reads the tags from top to bottom, building up in-memory state in notebook lexical order. At each "end of this run" tag, it outputs a run and clears its in-memory state.
  • Therefore, the assignment of actions to runs is based purely on the lexical structure, not the dynamic structure.
  • For extra niceness, in Jupyter mode, we can output the markers inside "Jupyter widgets" that control their display (rather than plain text) so they're less obtrusive and prettier; but we need to fall back transparently to plain text when not in Jupyter.
  • How does this work with "publish inside a loop"? Unless we come up with a clever trick, we'll only keep the results of the last iteration of the loop. But do users actually call publish inside a loop to try the same algorithm with different input parameters, or do they copy and paste the cell and edit the parameters in each copy?
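
To make the plan above concrete, here is a minimal sketch of the tag-and-parse idea. It is not the Dotscience implementation: the marker prefix, the JSON tag layout, and the helper names emit_tag and parse_runs are all hypothetical, chosen only for illustration. It assumes each cell's output is available as a string, and that cell outputs are handed to the parser in notebook (lexical) order.

```python
import json

# Hypothetical marker format; the real tag syntax is not specified in this issue.
TAG_PREFIX = "[[DOTSCIENCE-TAG]]"


def emit_tag(kind, **fields):
    """Build the line that a call such as ds.input() would print at that point."""
    return TAG_PREFIX + json.dumps({"kind": kind, **fields}, sort_keys=True)


def parse_runs(cell_outputs):
    """Group tags into runs by reading cell outputs from top to bottom.

    cell_outputs is a list of strings, one per cell, in notebook order.
    The order in which the cells were executed plays no part here.
    """
    runs = []
    current = []  # state for the run currently being assembled
    for output in cell_outputs:
        for line in output.splitlines():
            if not line.startswith(TAG_PREFIX):
                continue  # ordinary program output, not a tag
            tag = json.loads(line[len(TAG_PREFIX):])
            if tag["kind"] == "end-of-run":
                runs.append(current)  # output the run...
                current = []          # ...and clear the in-memory state
            else:
                current.append(tag)
    # Tags after the last "end-of-run" are never published, so they are dropped.
    return runs


# Three cells as they appear in the notebook, whatever order they were run in.
cells = [
    "loading data\n" + emit_tag("input", path="train.csv"),
    emit_tag("input", path="labels.csv") + "\n" + emit_tag("end-of-run"),
    emit_tag("input", path="test.csv") + "\n" + emit_tag("end-of-run"),
]
runs = parse_runs(cells)
```

With these three cells, parse_runs returns two runs: the first holds the train.csv and labels.csv inputs, the second holds only test.csv. Re-running the first cell after the third would replace that cell's output in place, so the grouping would stay the same.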

Metadata

    Labels

    task: New feature or request
