A bespoke template for spinning up a new folder that can be tracked with git and exported as a .zip to Zenodo
- `input_files`: Actual or symlinked files for the tracked analysis.
  - Provenance tracking is implicit for "in-house" generated files, in the sense that the filename and/or checksum can be used to track down the original file on a given filesystem.
  - If using symlinks, they should be relative links to files outside the templated folder. This is a convenience for iterating quickly on a single filesystem where multiple analyses live in separate folders.
  - Ideally, everything in `input_files` should transition to `zenodo_items` once the upstream analyses that produced the files are uploaded to Zenodo and assigned a DOI.
- `derived_files`: Modified forms, or subsets, of the `input_files`. Generated using code within the template, i.e. `run.sh`.
- `zenodo_items`: Actual files, within folders named with a manually modified form of a `doi:` URI, in which the colon `:` and the internal slash `/` of the DOI identifier are each replaced with a double dash `--`.
  - This is used to manually track provenance of files that were sourced from existing Zenodo items.
  - For the example folder `doi--10.5281--zenodo.10569208`, reversing the manual renaming produces `doi:10.5281/zenodo.10569208`, which `doi.org` can then resolve: https://doi.org/10.5281/zenodo.10569208 . The example file `EXAMPLE--KR_seqkit_replace_kv.2.0.tsv` is present in the .zip archive of that linked Zenodo item (as `KR_seqkit_replace_kv.2.0.tsv`).
- `bin`: A place to stage scripts downloaded by `setup_env.sh`, or a place to commit bespoke scripts.
- `containers`: A git-untracked place to stage containers downloaded by `setup_env.sh`.
- `sequenceserver`: A folder intended to store (in the contained `./run.sh`) the slightly finicky parameters for easy Docker or Singularity execution of SequenceServer.
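
The `zenodo_items` folder naming described above can be converted in both directions with a couple of small shell functions. This is a minimal sketch, not part of the template: the function names `doi_to_folder`, `folder_to_doi`, and `folder_to_url` are hypothetical, and it assumes only the first colon and the first slash of the DOI are rewritten, as in the example folder.

```shell
#!/usr/bin/env sh
# Hypothetical helpers (not shipped with the template) for the
# zenodo_items naming scheme. Assumption: only the first ":" and the
# first "/" of the doi: URI are swapped for "--".

# doi:10.5281/zenodo.10569208 -> doi--10.5281--zenodo.10569208
doi_to_folder() {
    printf '%s\n' "$1" | sed -e 's|:|--|' -e 's|/|--|'
}

# doi--10.5281--zenodo.10569208 -> doi:10.5281/zenodo.10569208
# The first "--" becomes the colon, the next "--" becomes the slash.
folder_to_doi() {
    printf '%s\n' "$1" | sed -e 's|--|:|' -e 's|--|/|'
}

# Build the resolvable https://doi.org/ URL from a folder name.
folder_to_url() {
    doi="$(folder_to_doi "$1")"
    printf 'https://doi.org/%s\n' "${doi#doi:}"
}
```

For example, `folder_to_url doi--10.5281--zenodo.10569208` prints `https://doi.org/10.5281/zenodo.10569208`. A DOI whose suffix itself contains `--` or additional slashes would not round-trip cleanly with this sketch.
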
- `git clone` this repository.
- Rename the folder from `git-tracked-analysis-template` to something clear and descriptive, optionally with a date stamp, e.g. `2023-10-06_PKS_domain_to_module_conversion_script`.
- Run `re-init.sh` to clean up and reinitialize the git tracking.
- Run `setup_env.sh` to create a Conda environment that pre-installs the bioinformatics tools I most commonly use, and the dependencies for `archive.sh`.
- Iterate on your analysis code, figures, etc. Use git tracking along the way, and tag major and minor versions (e.g. `git tag -a v1.1`) when significant stopping points for external sharing of the analysis are reached.
- Use `archive.sh` to pack the most recent commit of the analysis into a .zip, for manual uploading to Zenodo. Be sure to set the Zenodo version of the uploaded item to the git tagged version, thereby allowing for clear provenance traceability between the live git-tracked analysis and the snapshot on Zenodo.
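
The tag-then-archive step can be pictured with plain git commands. The sketch below is an assumption about the general idea only: it is not the contents of `archive.sh`, the function name `snapshot_zip` and the output filename pattern are hypothetical, and it archives a named tag rather than whatever the most recent commit happens to be.

```shell
#!/usr/bin/env sh
# Hypothetical helper, NOT the template's archive.sh.
# Packs the tree at a given tag into <folder-name>_<tag>.zip in the
# current directory, using `git archive`.
snapshot_zip() {
    tag="$1"
    # Name the archive after the repository's top-level folder.
    out="$(basename "$(git rev-parse --show-toplevel)")_${tag}.zip"
    git archive --format=zip --output="$out" "$tag" || return 1
    printf '%s\n' "$out"
}

# Usage, run from the top of the analysis folder:
#   git tag -a v1.1 -m "Snapshot for external sharing"
#   snapshot_zip v1.1
```

Naming the archive after the tag keeps the filename, the git tag, and the Zenodo version field aligned, which is the provenance link the last step asks for.
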
- https://www.projecttier.org/tier-protocol/protocol-4-0/
- https://handbook.datalad.org/en/latest/basics/101-127-yoda.html
- https://caltechlibrary.github.io/RDMworkbook/index.html#description
- https://jakefeala.substack.com/i/160652737/folder-structure
- TidyData http://www.jstatsoft.org/v59/i10/
- https://carpentries-incubator.github.io/managing-computational-projects/instructor/09-rdm.html#tidy-spreadsheets
- https://psychoinformatics-de.github.io/rdm-course/
- https://slides.djnavarro.net/project-structure/
- http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf
- Try not to keep any symlinks in this bare template. Cloud filesystems and object stores have divergent symlink support, so take care when working with them. But feel free to use symlinks after the template has been initialized. They will be properly stored in `git` and in the `.zip` from `archive.sh`.
  - Check for symlinks with `find . -type l`.
- The `zenodo_items` DOI renaming scheme is admittedly a bit hackish, but it is a lightweight way to track provenance of files copied from Zenodo items.
- The `setup_env.sh` approach of defining the conda environment on individual lines is admittedly a bit hackish compared to using .yaml definitions of the environment.
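
The symlink check mentioned above can be extended slightly to show where each link points and whether its target still exists. This is a minimal sketch under stated assumptions: the function name `report_symlinks` is hypothetical, and it relies on `readlink` being available, as it is on typical Linux and macOS systems.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the template): list every symlink
# under a folder, with its target and whether the target resolves.
# Output columns are tab-separated: status, link path, link target.
report_symlinks() {
    root="${1:-.}"
    find "$root" -type l | while IFS= read -r link; do
        # [ -e ] follows the link, so it fails for dangling symlinks.
        if [ -e "$link" ]; then
            status="ok"
        else
            status="broken"
        fi
        printf '%s\t%s\t%s\n' "$status" "$link" "$(readlink "$link")"
    done
}

# Usage: report_symlinks .
```

Running it before archiving gives a quick view of which inputs are links rather than real files, and which of those would not resolve on another filesystem.
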