BayesBio

Bayesian inference for small biological systems. A teaching-oriented Python toolkit that fits classical biological models — Michaelis-Menten enzyme kinetics, Hill dose-response, and a logistic classifier — to noisy experimental data using Markov Chain Monte Carlo (MCMC), with honest uncertainty quantification and convergence diagnostics.

This is an educational portfolio project. It is not clinical software, and the simulated datasets are not drawn from real experiments.

Why Bayesian inference?

Most biological measurements are noisy: enzyme assays vary between days, gene-expression replicates disagree, and sample sizes are small. A point estimate (Vmax = 7.9, full stop) gives a false sense of precision. A posterior distribution instead answers the more useful question:

Given my prior belief, my model, and these noisy measurements, what values of the parameters are plausible, and how plausible is each?

Concretely:

A prior p(θ) encodes what you believed about the parameters before seeing the data — for instance, "enzyme rate constants are positive and rarely larger than 100".
A likelihood p(data | θ) says how compatible the observed data is with a hypothesized parameter value.
Bayes' rule combines them into the posterior p(θ | data) ∝ p(data | θ) · p(θ) — the updated belief after the data.
Because the posterior usually has no closed form, we approximate it by drawing samples from it. Markov Chain Monte Carlo is a family of algorithms that produces such samples by random-walking through parameter space in a way that statistically targets the posterior.

The samples can then be summarized in any way you like: posterior means, credible intervals, plots of fitted curves with uncertainty bands, etc.

Why uncertainty matters in biology

A drug-response model that says EC50 = 1.0 is much less useful than one that says EC50 = 1.0 with 95% credible interval [0.6, 1.7]. The interval tells you whether the next experiment is likely to confirm the result and whether two compounds are meaningfully different or only nominally so. For small-n biology — three replicates per dose, twelve enzyme readings — uncertainty bars are not optional decoration; they are the result.

The bundled use cases

1. Enzyme kinetics — Michaelis-Menten

The Michaelis-Menten equation describes the rate v of an enzyme-catalyzed reaction as a function of substrate concentration [S]:

v = Vmax · [S] / (Km + [S])

where Vmax is the maximum reaction velocity (achieved when the enzyme is saturated) and Km is the substrate concentration at which v = Vmax/2. BayesBio simulates noisy (S, v) pairs and infers Vmax, Km, and the measurement-noise scale σ.

2. Gene-expression dose-response — Hill equation

The Hill equation models a sigmoidal response of expression y to inducer dose d:

y = baseline + (Emax - baseline) · d^n / (EC50^n + d^n)

EC50 is the dose at half-maximal response, n is the Hill coefficient (steepness / cooperativity), and baseline/Emax set the floor and ceiling. BayesBio infers all four together.

3. Cancer-signaling toy classifier — Bayesian logistic regression

A small synthetic dataset of three "biomarker" features per sample and a binary outcome (e.g., tumor responds vs. does not respond). BayesBio fits a Bayesian logistic regression and reports posterior credible intervals on each coefficient — directly showing which features are clearly informative and which are not.

Installation

Clone the repo and install in editable mode:

git clone https://github.com/yourname/BayesBio.git
cd BayesBio
python -m pip install -e .

Or, if you only want the runtime dependencies:

python -m pip install -r requirements.txt

Python 3.10+ is required.

Quick start

# 1. Generate a synthetic dataset
python -m bayesbio.cli simulate \
    --model michaelis_menten \
    --out examples/enzyme_kinetics.csv

# 2. Fit the model and write samples + diagnostics + plots + report
python -m bayesbio.cli fit \
    --model michaelis_menten \
    --input examples/enzyme_kinetics.csv \
    --outdir outputs/enzyme_fit \
    --truth-json examples/enzyme_kinetics.truth.json

The other models work the same way:

python -m bayesbio.cli fit \
    --model hill_response \
    --input examples/dose_response.csv \
    --outdir outputs/dose_response_fit \
    --truth-json examples/dose_response.truth.json

python -m bayesbio.cli fit \
    --model logistic_regression \
    --input examples/cancer_signaling_toy.csv \
    --outdir outputs/cancer_model_fit \
    --truth-json examples/cancer_signaling_toy.truth.json

You can also re-summarize an existing posterior_samples.csv:

python -m bayesbio.cli diagnose \
    --samples outputs/enzyme_fit/posterior_samples.csv \
    --chains 4 \
    --out outputs/enzyme_fit/diagnostics.json

Expected outputs

Every fit command produces the following inside --outdir:

File	Contents
`posterior_samples.csv`	One row per draw, columns = `chain` + each parameter
`parameter_summary.csv`	Mean / sd / quantiles / ESS / R-hat per parameter
`diagnostics.json`	Acceptance rates, ESS, R-hat, free-text warnings
`posterior_histograms.png`	Marginal posterior for each parameter (truth overlaid if provided)
`trace_plots.png`	Per-chain trace, one row per parameter (eyeball convergence)
`model_fit.png`	Observed data + posterior draws of the model curve (continuous models)
`ppc.png`	Posterior predictive check vs. observed (continuous models)
`coefficient_intervals.png`	Forest plot of posterior intervals (logistic regression only)
`final_report.md`	Self-contained Markdown report tying everything together

How the MCMC works

BayesBio ships with a from-scratch random-walk Metropolis-Hastings sampler in bayesbio/samplers.py. The algorithm is heavily commented so you can read it as a study aid. Briefly:

Start at an initial parameter vector θ.
Propose θ_new = θ + N(0, S) where S is a per-parameter standard deviation vector (the proposal scale).
Compute the log posterior at θ_new.
Accept the move with probability min(1, exp(log_post_new - log_post_old)).
Two-stage warm-up adaptation:
- Stage 1 tunes a single global step size by diminishing-step Robbins-Monro (Andrieu & Thoms, 2008) toward a target acceptance rate of 0.234 (Roberts, Gelman & Gilks, 1997).
- Stage 2 additionally rescales S per-parameter using the running standard deviation of the warm-up samples, so parameters on very different scales (e.g. baseline ~ 5 and Emax ~ 80) all mix.
Adaptation is frozen at the end of warm-up — only the post-warmup draws are returned, keeping the sampling phase a valid Markov chain.
Multiple independent chains are run from different inits so the R-hat convergence diagnostic can flag poor mixing.

Random-walk MH with diagonal-covariance adaptation is still less efficient than gradient-based samplers like NUTS (PyMC, Stan) on strongly correlated posteriors; for those, consider running the same likelihood and priors through PyMC.

Logistic regression: feature standardization

The default Normal(0, 1.5) prior on logistic-regression coefficients is meaningful only for unit-scale features. The CLI therefore standardizes feature columns (zero mean, unit standard deviation) before sampling and back-transforms the posterior samples into the original feature space before saving and plotting them. The user-facing posterior_samples.csv and intervals are therefore directly interpretable as "expected change in log-odds per one-unit change in the raw feature".

Diagnostics

diagnostics.json always contains:

Acceptance rate per chain. Healthy is roughly 0.20-0.50 for random-walk MH.
Effective sample size (ESS) per parameter, computed via Geyer's initial monotone sequence estimator. ESS far below the raw draw count indicates strong autocorrelation.
Split-R-hat per parameter (only when ≥2 chains are run). Values meaningfully above 1.01 suggest the chains have not converged to the same distribution.
A list of free-text warnings when any of the above breach the rules of thumb (R-hat > 1.05, ESS < 100, acceptance outside [0.10, 0.70]).

Tests

pytest -q

The tests cover the deterministic models, log-likelihood functions, prior validation, sampler shape/recovery, and CLI argument parsing.

Limitations

Synthetic data only. All bundled datasets are drawn from the generative model itself. Real enzyme assays often violate Gaussian noise assumptions (multiplicative / proportional noise, plate effects, background subtraction); the included Gaussian-noise likelihood is the textbook teaching choice but should be revisited before being applied to real assays. A log-normal noise option would be the natural extension.
Random-walk Metropolis-Hastings is a teaching sampler. It is much less efficient than NUTS (PyMC, Stan) on strongly correlated or higher-dimensional posteriors, even with the per-parameter scale adaptation described above.
R-hat is the BDA3 split version, not the rank-normalized variant recommended by Vehtari et al. (2021). For production-grade diagnostics on harder problems, defer to arviz.rhat.
No model comparison. WAIC / LOO / Bayes factors are out of scope.
No hierarchical / partial-pooling models. Each fit is for a single experiment; pooling across plates / batches would be a natural extension.
Not clinical software. The "cancer signaling" example is an entirely synthetic teaching dataset.

Future improvements

An optional PyMC backend (NUTS) for the same models, for cross-checking.
Log-normal observation noise option for enzyme/dose-response data.
Hierarchical (partial-pooling) version of the dose-response model.
WAIC / LOO model comparison via arviz.
A small Snakemake or Nextflow workflow that runs all three fits and collects the reports into a single index.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
bayesbio		bayesbio
examples		examples
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BayesBio

Why Bayesian inference?

Why uncertainty matters in biology

The bundled use cases

1. Enzyme kinetics — Michaelis-Menten

2. Gene-expression dose-response — Hill equation

3. Cancer-signaling toy classifier — Bayesian logistic regression

Installation

Quick start

Expected outputs

How the MCMC works

Logistic regression: feature standardization

Diagnostics

Tests

Limitations

Future improvements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BayesBio

Why Bayesian inference?

Why uncertainty matters in biology

The bundled use cases

1. Enzyme kinetics — Michaelis-Menten

2. Gene-expression dose-response — Hill equation

3. Cancer-signaling toy classifier — Bayesian logistic regression

Installation

Quick start

Expected outputs

How the MCMC works

Logistic regression: feature standardization

Diagnostics

Tests

Limitations

Future improvements

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages