Skip to content

xX-its-amit-Xx/BayesBio

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BayesBio

Bayesian inference for small biological systems. A teaching-oriented Python toolkit that fits classical biological models — Michaelis-Menten enzyme kinetics, Hill dose-response, and a logistic classifier — to noisy experimental data using Markov Chain Monte Carlo (MCMC), with honest uncertainty quantification and convergence diagnostics.

This is an educational portfolio project. It is not clinical software, and the simulated datasets are not drawn from real experiments.


Why Bayesian inference?

Most biological measurements are noisy: enzyme assays vary between days, gene-expression replicates disagree, and sample sizes are small. A point estimate (Vmax = 7.9, full stop) gives a false sense of precision. A posterior distribution instead answers the more useful question:

Given my prior belief, my model, and these noisy measurements, what values of the parameters are plausible, and how plausible is each?

Concretely:

  • A prior p(θ) encodes what you believed about the parameters before seeing the data — for instance, "enzyme rate constants are positive and rarely larger than 100".
  • A likelihood p(data | θ) says how compatible the observed data is with a hypothesized parameter value.
  • Bayes' rule combines them into the posterior p(θ | data) ∝ p(data | θ) · p(θ) — the updated belief after the data.
  • Because the posterior usually has no closed form, we approximate it by drawing samples from it. Markov Chain Monte Carlo is a family of algorithms that produces such samples by random-walking through parameter space in a way that statistically targets the posterior.

The samples can then be summarized in any way you like: posterior means, credible intervals, plots of fitted curves with uncertainty bands, etc.

Why uncertainty matters in biology

A drug-response model that says EC50 = 1.0 is much less useful than one that says EC50 = 1.0 with 95% credible interval [0.6, 1.7]. The interval tells you whether the next experiment is likely to confirm the result and whether two compounds are meaningfully different or only nominally so. For small-n biology — three replicates per dose, twelve enzyme readings — uncertainty bars are not optional decoration; they are the result.


The bundled use cases

1. Enzyme kinetics — Michaelis-Menten

The Michaelis-Menten equation describes the rate v of an enzyme-catalyzed reaction as a function of substrate concentration [S]:

v = Vmax · [S] / (Km + [S])

where Vmax is the maximum reaction velocity (achieved when the enzyme is saturated) and Km is the substrate concentration at which v = Vmax/2. BayesBio simulates noisy (S, v) pairs and infers Vmax, Km, and the measurement-noise scale σ.

2. Gene-expression dose-response — Hill equation

The Hill equation models a sigmoidal response of expression y to inducer dose d:

y = baseline + (Emax - baseline) · d^n / (EC50^n + d^n)

EC50 is the dose at half-maximal response, n is the Hill coefficient (steepness / cooperativity), and baseline/Emax set the floor and ceiling. BayesBio infers all four together.

3. Cancer-signaling toy classifier — Bayesian logistic regression

A small synthetic dataset of three "biomarker" features per sample and a binary outcome (e.g., tumor responds vs. does not respond). BayesBio fits a Bayesian logistic regression and reports posterior credible intervals on each coefficient — directly showing which features are clearly informative and which are not.


Installation

Clone the repo and install in editable mode:

git clone https://github.com/yourname/BayesBio.git
cd BayesBio
python -m pip install -e .

Or, if you only want the runtime dependencies:

python -m pip install -r requirements.txt

Python 3.10+ is required.

Quick start

# 1. Generate a synthetic dataset
python -m bayesbio.cli simulate \
    --model michaelis_menten \
    --out examples/enzyme_kinetics.csv

# 2. Fit the model and write samples + diagnostics + plots + report
python -m bayesbio.cli fit \
    --model michaelis_menten \
    --input examples/enzyme_kinetics.csv \
    --outdir outputs/enzyme_fit \
    --truth-json examples/enzyme_kinetics.truth.json

The other models work the same way:

python -m bayesbio.cli fit \
    --model hill_response \
    --input examples/dose_response.csv \
    --outdir outputs/dose_response_fit \
    --truth-json examples/dose_response.truth.json

python -m bayesbio.cli fit \
    --model logistic_regression \
    --input examples/cancer_signaling_toy.csv \
    --outdir outputs/cancer_model_fit \
    --truth-json examples/cancer_signaling_toy.truth.json

You can also re-summarize an existing posterior_samples.csv:

python -m bayesbio.cli diagnose \
    --samples outputs/enzyme_fit/posterior_samples.csv \
    --chains 4 \
    --out outputs/enzyme_fit/diagnostics.json

Expected outputs

Every fit command produces the following inside --outdir:

File Contents
posterior_samples.csv One row per draw, columns = chain + each parameter
parameter_summary.csv Mean / sd / quantiles / ESS / R-hat per parameter
diagnostics.json Acceptance rates, ESS, R-hat, free-text warnings
posterior_histograms.png Marginal posterior for each parameter (truth overlaid if provided)
trace_plots.png Per-chain trace, one row per parameter (eyeball convergence)
model_fit.png Observed data + posterior draws of the model curve (continuous models)
ppc.png Posterior predictive check vs. observed (continuous models)
coefficient_intervals.png Forest plot of posterior intervals (logistic regression only)
final_report.md Self-contained Markdown report tying everything together

How the MCMC works

BayesBio ships with a from-scratch random-walk Metropolis-Hastings sampler in bayesbio/samplers.py. The algorithm is heavily commented so you can read it as a study aid. Briefly:

  1. Start at an initial parameter vector θ.
  2. Propose θ_new = θ + N(0, S) where S is a per-parameter standard deviation vector (the proposal scale).
  3. Compute the log posterior at θ_new.
  4. Accept the move with probability min(1, exp(log_post_new - log_post_old)).
  5. Two-stage warm-up adaptation:
    • Stage 1 tunes a single global step size by diminishing-step Robbins-Monro (Andrieu & Thoms, 2008) toward a target acceptance rate of 0.234 (Roberts, Gelman & Gilks, 1997).
    • Stage 2 additionally rescales S per-parameter using the running standard deviation of the warm-up samples, so parameters on very different scales (e.g. baseline ~ 5 and Emax ~ 80) all mix.
  6. Adaptation is frozen at the end of warm-up — only the post-warmup draws are returned, keeping the sampling phase a valid Markov chain.
  7. Multiple independent chains are run from different inits so the R-hat convergence diagnostic can flag poor mixing.

Random-walk MH with diagonal-covariance adaptation is still less efficient than gradient-based samplers like NUTS (PyMC, Stan) on strongly correlated posteriors; for those, consider running the same likelihood and priors through PyMC.

Logistic regression: feature standardization

The default Normal(0, 1.5) prior on logistic-regression coefficients is meaningful only for unit-scale features. The CLI therefore standardizes feature columns (zero mean, unit standard deviation) before sampling and back-transforms the posterior samples into the original feature space before saving and plotting them. The user-facing posterior_samples.csv and intervals are therefore directly interpretable as "expected change in log-odds per one-unit change in the raw feature".

Diagnostics

diagnostics.json always contains:

  • Acceptance rate per chain. Healthy is roughly 0.20-0.50 for random-walk MH.
  • Effective sample size (ESS) per parameter, computed via Geyer's initial monotone sequence estimator. ESS far below the raw draw count indicates strong autocorrelation.
  • Split-R-hat per parameter (only when ≥2 chains are run). Values meaningfully above 1.01 suggest the chains have not converged to the same distribution.
  • A list of free-text warnings when any of the above breach the rules of thumb (R-hat > 1.05, ESS < 100, acceptance outside [0.10, 0.70]).

Tests

pytest -q

The tests cover the deterministic models, log-likelihood functions, prior validation, sampler shape/recovery, and CLI argument parsing.


Limitations

  • Synthetic data only. All bundled datasets are drawn from the generative model itself. Real enzyme assays often violate Gaussian noise assumptions (multiplicative / proportional noise, plate effects, background subtraction); the included Gaussian-noise likelihood is the textbook teaching choice but should be revisited before being applied to real assays. A log-normal noise option would be the natural extension.
  • Random-walk Metropolis-Hastings is a teaching sampler. It is much less efficient than NUTS (PyMC, Stan) on strongly correlated or higher-dimensional posteriors, even with the per-parameter scale adaptation described above.
  • R-hat is the BDA3 split version, not the rank-normalized variant recommended by Vehtari et al. (2021). For production-grade diagnostics on harder problems, defer to arviz.rhat.
  • No model comparison. WAIC / LOO / Bayes factors are out of scope.
  • No hierarchical / partial-pooling models. Each fit is for a single experiment; pooling across plates / batches would be a natural extension.
  • Not clinical software. The "cancer signaling" example is an entirely synthetic teaching dataset.

Future improvements

  • An optional PyMC backend (NUTS) for the same models, for cross-checking.
  • Log-normal observation noise option for enzyme/dose-response data.
  • Hierarchical (partial-pooling) version of the dose-response model.
  • WAIC / LOO model comparison via arviz.
  • A small Snakemake or Nextflow workflow that runs all three fits and collects the reports into a single index.

License

MIT — see LICENSE.

About

Infers biological model parameters from noisy experimental data using Bayesian statistics, Markov Chain Monte Carlo, and clear visual diagnostics.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages