data inference model pipelines

PREFERENCES

Open-source
Hashicorp
Ubuntu (Canonical)

Possible enhancements @abhi18av

research object

https://www.researchobject.org/ro-crate/background

https://github.com/lucmoreau/ProvToolbox

https://github.com/trungdong/prov

https://openprovenance.org

https://github.com/ResearchObject/ro-crate-py

Truly integrate DSO https://boehringer-ingelheim.github.io/dso/tutorials/getting_started.html

https://pypi.org/project/datasette/

https://mlcommons.org/working-groups/data/croissant/

Learn from https://github.com/cjolowicz/cookiecutter-hypermodern-python

prj tool (projectable) https://github.com/dzfrias/projectable

R has comprehensive bioconductor and https://github.com/erikgahner/awesome-ggplot2 + ggbio

https://biomejs.dev/blog/biome-v2-0-beta/

https://shiny.posit.co/py/

git-submodules for various sub templates

https://github.com/MarquezProject/marquez

mlflow/metaflow etc

deon https://deon.drivendata.org/#background-and-perspective

Public datasets

https://github.com/addypy/datagovindia/

https://www.re3data.org/browse/by-country/

https://github.com/awesomedata/awesome-public-datasets

https://github.com/public-apis/public-apis

https://free-apis.github.io/#/

https://github.com/datasets/awesome-data

https://datacatalogs.org

https://dados.gov.br/home

https://ckan.org/features

https://github.com/GetDKAN/dkan

https://queridodiario.ok.org.br

https://magda.io

https://dev.magda.io/search?page=2

REVIEW Create a utility to prune all folders which are empty, from a given list of folders.
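
A minimal sketch of what such a pruning utility could look like, using only the Python standard library. The function name `prune_empty_dirs` is hypothetical, and two behaviours are assumptions rather than decisions recorded in these notes: empty subfolders are removed recursively, and a listed folder is itself removed if it ends up empty.

```python
import os
from pathlib import Path


def prune_empty_dirs(roots):
    """Remove empty directories under each root and return the removed paths.

    Assumption: a root that becomes empty is removed as well.
    """
    removed = []
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            # Skip entries that do not exist or are not directories.
            continue
        # Walk bottom-up so a parent emptied by removing its children is also caught.
        for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
            path = Path(dirpath)
            if any(path.iterdir()):
                continue  # still holds files, symlinks, or non-empty folders
            path.rmdir()
            removed.append(path)
    return removed
```

A call such as `prune_empty_dirs(["data/raw", "data/processed"])` (illustrative paths) would return the list of directories it deleted, which makes the result easy to log from a `just` recipe.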

Automations (via pixi + just) for installing baseline tools (Python + Java + Babashka + binaries eget, dust, duf)

Lineage

https://github.com/OpenLineage/OpenLineage

https://egeria-project.org/education/

https://github.com/grai-io/grai-core

https://www.grai.io

CHECKLISTS

Data sharing

FAIR

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Copier](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/copier-org/copier/master/img/badge/badge-grayscale-inverted-border-orange.json)](https://github.com/copier-org/copier) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

This is a template built with [Copier](https://github.com/copier-org/copier) that generates a data-science-focused Python project.

Get started with the following command:

```shell
copier copy gh:abhi18av/template-analysis-and-writeup path/to/destination
```

## Features

### Core ideas

- Data and Code
- Analysis and Writeup
- Clojure and Quarto
- Emacs and VSCode
- Users and Engineers

### Tools used in this template

- Task runner - `just`
- Data folders
  1. data dictionaries
  2. raw
  3. processed
- Programming languages and libraries
  1. R
  2. Python
  3. Clojure(Script)
  4. babashka/nbb
  5. Java (jshell)
  6. Nushell
  7. Bash
  8. Wolfram
  9. OCaml
- Notebooks
  1. Quarto (R, Python, ObservableJS)
  2. Mathematica
  3. Matlab
- Dashboards
  1. Quarto (R, Python, ObservableJS)
- Pipeline runner - `nextflow`
- Package and environment management
  1. Pixi
  2. Renv
  3. Pip
  4. Clojure-CLI
  5. NPM
- Code and data version management
  1. Git
  2. Fossil
  3. Data Version Control
- Data transfer and backup
  1. Rclone
  2. Restic
  3. ArtiVC
- Writeup management (Manuscript, Report, Presentation)
  1. Quarto
  2. Typst
  3. Org-mode
- Infrastructure management (MinIO)
  1. Terraform
  2. Dagger
  3. Nomad cluster
  4. MicroK8s
  5. Juju
- Project-level bin folder, pbin
- Utilities for editor, env management config
  1. .vscode
  2. .editorconfig
  3. .envrc
  4. pre-commit hooks
- Project management
  1. ORG files (meetings, experiments)

### Project structure

It is assumed that most of the work will be done in Jupyter notebooks. However, the template also includes a Python project in which you can put functions and classes shared across notebooks. The repository is set up to use [Pytest](https://docs.pytest.org/en/stable/) for unit testing this module code.
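
As a rough illustration of that split, the sketch below pairs a small shared helper with a Pytest-style test. The function `normalize_column_name` and its behaviour are hypothetical and not part of the template; in a generated project the helper would live in the package and the test in a separate test file.

```python
import re


# Hypothetical shared helper that several notebooks could import.
def normalize_column_name(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumeric runs into one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Pytest collects functions named test_*; plain assert statements are enough.
def test_normalize_column_name():
    assert normalize_column_name(" Sample ID ") == "sample_id"
    assert normalize_column_name("Read-Count (raw)") == "read_count_raw"
```

Keeping logic like this in the module rather than in notebook cells means it can be exercised by the test suite without running any notebook.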

The template also includes a `data` directory whose contents are ignored by git. Use this folder to store data that you do not want to commit. You may also add a README file there to document the source datasets you use and how to acquire them.

### [just](https://github.com/casey/just)

`just` is a command runner that lets you easily run project-specific commands. In fact, you can use `just` to run all the setup commands listed below:

```shell
just setup
```

### [pre-commit](https://github.com/pre-commit/pre-commit)

pre-commit is a tool that runs checks on your files before you commit them with git, thereby helping ensure code quality. Enable it with the following command:

```shell
pre-commit install --install-hooks
```

The configuration is stored in `.pre-commit-config.yaml`.

### GitHub Actions

You may optionally add a GitHub workflow file which checks the following:

- Uses ruff to check that files are formatted and linted
- Runs unit tests and checks coverage
- Checks that any Markdown files are formatted with [markdownlint-cli2](https://github.com/DavidAnson/markdownlint-cli2)
- Checks that all Jupyter notebooks are clean
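
To make the last check concrete, here is a minimal sketch of what "clean" could mean for a notebook. It assumes, without confirmation from the template, that a clean notebook is one whose code cells carry no outputs and no execution counts; the workflow itself may well rely on a dedicated tool instead. The function name `notebook_is_clean` is hypothetical.

```python
import json
from pathlib import Path


def notebook_is_clean(path):
    """Return True if no code cell in the .ipynb file has outputs or an execution count."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells never carry outputs
        if cell.get("outputs") or cell.get("execution_count") is not None:
            return False
    return True
```

A CI step could run this over every `*.ipynb` file in the repository and fail if any notebook returns `False`.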

### [Typos](https://github.com/crate-ci/typos)

Typos checks for common typos in code, aiming for a low false positive rate. The repository is configured not to use it for Jupyter notebook files, as it tends to find errors in cell outputs.

Test with [Copier](https://github.com/copier-org/copier) and [copier-template-tester](https://github.com/KyleKing/copier-template-tester).
