Local-first virtual gene perturbation agent for 10x single-cell MTX datasets.
- Windows 10/11
- Python 3.10+
- Internet access to the selected LLM provider endpoint
git clone https://github.com/tangaode/gene-perturb-agent.git
cd gene-perturb-agent
Set-ExecutionPolicy -Scope Process Bypass
.\scripts\setup_easy.ps1
.\scripts\start_easy.ps1
MTX_DIR supports:
- A single 10x folder containing matrix.mtx(.gz), features.tsv(.gz)/genes.tsv(.gz), and barcodes.tsv(.gz).
- A parent folder containing multiple 10x sample folders (recursive discovery). All detected samples are merged by union gene space; barcodes are prefixed by sample folder name.
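The two MTX_DIR layouts above can be told apart with a short directory scan. The following is a minimal sketch, not the launcher's actual code: the function names and the `_` separator used for barcode prefixes are assumptions made for illustration.

```python
from pathlib import Path

# Accepted file names, taken from the list above (plain or gzipped).
MATRIX_NAMES = ("matrix.mtx", "matrix.mtx.gz")
FEATURE_NAMES = ("features.tsv", "features.tsv.gz", "genes.tsv", "genes.tsv.gz")
BARCODE_NAMES = ("barcodes.tsv", "barcodes.tsv.gz")


def is_10x_folder(folder: Path) -> bool:
    """True if the folder holds a matrix, a features/genes file, and a barcodes file."""
    names = {p.name for p in folder.iterdir() if p.is_file()}
    return all(
        any(n in names for n in group)
        for group in (MATRIX_NAMES, FEATURE_NAMES, BARCODE_NAMES)
    )


def discover_samples(mtx_dir: str) -> list[Path]:
    """Return mtx_dir itself if it is a 10x folder, else every 10x folder below it."""
    root = Path(mtx_dir)
    if is_10x_folder(root):
        return [root]
    return sorted(p for p in root.rglob("*") if p.is_dir() and is_10x_folder(p))


def prefix_barcodes(sample: str, barcodes: list[str]) -> list[str]:
    """Prefix barcodes with the sample folder name (the "_" separator is hypothetical)."""
    return [f"{sample}_{bc}" for bc in barcodes]
```

Prefixing keeps barcodes unique after merging, since the same barcode sequence can appear in more than one sample.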
When clustering mode is enabled, the launcher performs:
- Cell QC filtering (n_genes_by_counts, total_counts, mitochondrial ratio, ribosomal ratio).
- Library-size normalization (target_sum=1e4) and log1p.
- Highly variable gene selection and scaling.
- PCA.
- Harmony integration by sample when multiple samples are present.
- Neighbors, UMAP, and Leiden clustering (flavor=igraph, n_iterations=2, resolution=0.5).
- Marker ranking per cluster by adjusted p-value (wilcoxon), with significant marker filter p < 0.05 and log2FC > 0.5.
- LLM-based cluster label suggestion.
- Optional manual label override in PowerShell.
- Target group selection by cluster:<id> or cell_type:<name>.
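The last step takes a selector of the form cluster:<id> or cell_type:<name>. A minimal sketch of how such a selector could be parsed and applied is shown here. `TargetGroup`, `parse_target_group`, `select_barcodes`, and the `barcode`/`cluster`/`cell_type` row keys are hypothetical names, not taken from the agent's code.

```python
from dataclasses import dataclass

VALID_KINDS = ("cluster", "cell_type")


@dataclass(frozen=True)
class TargetGroup:
    kind: str   # "cluster" or "cell_type"
    value: str  # cluster ID or cell-type name


def parse_target_group(selector: str) -> TargetGroup:
    """Split "kind:value" on the first colon and validate both parts."""
    kind, sep, value = selector.strip().partition(":")
    kind, value = kind.strip(), value.strip()
    if not sep or kind not in VALID_KINDS or not value:
        raise ValueError(
            f"expected cluster:<id> or cell_type:<name>, got {selector!r}"
        )
    return TargetGroup(kind, value)


def select_barcodes(group: TargetGroup, rows: list[dict]) -> list[str]:
    """Return barcodes whose row matches the group (row keys are assumed names)."""
    return [r["barcode"] for r in rows if str(r[group.kind]) == group.value]
```

Splitting on the first colon only means a cell-type name that itself contains a colon is still accepted.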
Default output directory: outputs/cellgroups/
Generated files:
- cluster_annotations.csv: barcode-level cluster and cell-type labels.
- umap_coords.csv: UMAP coordinates.
- umap_clusters_unannotated.png: UMAP colored by raw cluster IDs.
- umap_clusters_annotated.png: UMAP colored by final cell-type labels.
- umap_clusters.png: alias of the annotated UMAP for backward compatibility.
- qc_summary.csv: QC thresholds and retained cell/gene counts.
- cluster_markers_top100_for_llm.json: top-100 marker list per cluster used for LLM annotation.
- cluster_markers_top50.json: preview marker list per cluster.
- cluster_markers_significant.csv: all significant high-expression markers per cluster (p < 0.05 and log2FC > 0.5).
LLM cell-type annotation uses the top 100 genes per cluster from the significant marker ranking.
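To make the marker selection concrete, here is a small sketch of the filter-then-rank step in plain Python. It assumes the p < 0.05 cutoff is applied to the adjusted p-value and that ties are broken by larger log2FC. The `Marker` record and both function names are illustrative only.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    cluster: str
    gene: str
    p_adj: float   # adjusted p-value from the Wilcoxon test
    log2fc: float  # log2 fold change versus the remaining cells


def significant_markers(markers, p_max=0.05, min_log2fc=0.5):
    """Keep markers passing both cutoffs, ordered by cluster, then adjusted p-value."""
    keep = [m for m in markers if m.p_adj < p_max and m.log2fc > min_log2fc]
    return sorted(keep, key=lambda m: (m.cluster, m.p_adj, -m.log2fc))


def top_genes_per_cluster(markers, n=100):
    """Return up to n gene names per cluster, taken from the significant ranking."""
    top: dict[str, list[str]] = {}
    for m in significant_markers(markers):
        genes = top.setdefault(m.cluster, [])
        if len(genes) < n:
            genes.append(m.gene)
    return top
```

A cluster with fewer than n significant markers simply contributes a shorter list.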
start_easy.ps1 prompts for provider selection on each launch:
- deepseek: asks for base URL, model, and API key.
- openai: asks for base URL, model, and API key.
- gemini: asks for base URL, model, and API key (OpenAI-compatible endpoint).
For deepseek/openai/gemini, an API key is required.
Typical base URLs:
- DeepSeek: https://api.deepseek.com/v1
- OpenAI: https://api.openai.com/v1
- Gemini (OpenAI-compatible): https://generativelanguage.googleapis.com/v1beta/openai
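The provider prompts can be thought of as filling in a small settings record. The sketch below validates those values under two assumptions: that an empty base URL falls back to the typical URL listed above, and that trailing slashes are stripped. `ProviderConfig` and `build_provider_config` are hypothetical names, not part of start_easy.ps1.

```python
from dataclasses import dataclass

# Typical base URLs as listed above; using them as fallbacks is an assumption.
TYPICAL_BASE_URLS = {
    "deepseek": "https://api.deepseek.com/v1",
    "openai": "https://api.openai.com/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
}


@dataclass(frozen=True)
class ProviderConfig:
    provider: str
    base_url: str
    model: str
    api_key: str


def build_provider_config(provider: str, model: str, api_key: str,
                          base_url: str = "") -> ProviderConfig:
    """Normalize and validate the values collected at the provider prompt."""
    name = provider.strip().lower()
    if name not in TYPICAL_BASE_URLS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not model.strip():
        raise ValueError("a model name is required")
    if not api_key.strip():
        raise ValueError(f"an API key is required for {name}")
    url = (base_url.strip() or TYPICAL_BASE_URLS[name]).rstrip("/")
    return ProviderConfig(name, url, model.strip(), api_key.strip())
```

Returning a frozen record keeps the values in memory only, which matches the behavior of prompting again on each launch.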
Default final output is Top-5 upregulated and Top-5 downregulated genes (FINAL_TOPK=5).
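As an illustration of the Top-K selection, the sketch below picks the k most upregulated and k most downregulated genes from a gene-to-score mapping. It assumes the scores are signed predicted changes and that FINAL_TOPK is read from the environment. Both are assumptions, and `top_regulated` and `final_topk` are hypothetical helper names.

```python
import os
from typing import Mapping, Optional


def final_topk(env: Optional[Mapping[str, str]] = None) -> int:
    """Read FINAL_TOPK (default 5); reading it from the environment is an assumption."""
    env = os.environ if env is None else env
    k = int(env.get("FINAL_TOPK", "5"))
    if k < 1:
        raise ValueError("FINAL_TOPK must be a positive integer")
    return k


def top_regulated(scores: Mapping[str, float], k: int) -> tuple[list[str], list[str]]:
    """Return (top-k upregulated, top-k downregulated) gene names.

    Genes with a score of exactly zero appear in neither list.
    """
    up = sorted((g for g, s in scores.items() if s > 0),
                key=lambda g: scores[g], reverse=True)[:k]
    down = sorted((g for g, s in scores.items() if s < 0),
                  key=lambda g: scores[g])[:k]
    return up, down
```

For example, `top_regulated(scores, final_topk())` would yield at most five genes in each direction with the default setting.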
Build:
.\scripts\build_release.ps1
Package:
release/GenePerturbAgent.zip
End-user flow:
- Unzip GenePerturbAgent.zip.
- Run Run-Agent.bat.
- Select MTX_DIR at startup (press Enter to reuse last path).
- Optionally run clustering and select a cell group.
- Open http://localhost:3000.
Notes:
- start_easy.ps1 prompts for dataset path, provider, model, and API key on every startup.
- Cell clustering and cell-type annotation are recomputed on every startup when clustering mode is enabled.
- Runtime values are not persisted to .env.local by start_easy.ps1.
Service ports:
- agent_api: 8000
- virtualcell_service: 8001
- evidence_service: 8002
- web: 3000
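If the web page at http://localhost:3000 does not load, it can help to check which of the four services are accepting connections. The following is a minimal sketch using only the standard library. The port numbers come from the list above, while the helper names and the choice of a plain TCP connect (rather than any health endpoint) are assumptions.

```python
import socket

# Ports as listed above.
SERVICE_PORTS = {
    "agent_api": 8000,
    "virtualcell_service": 8001,
    "evidence_service": 8002,
    "web": 3000,
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(timeout: float = 0.5) -> dict[str, bool]:
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_listening(port, timeout=timeout)
            for name, port in SERVICE_PORTS.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:22s} {SERVICE_PORTS[name]:5d}  {'up' if up else 'down'}")
```

A successful connect only shows that something is bound to the port; it does not confirm that the expected service is the one answering.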
NCBI Gene, PubMed, GO:BP, GO:MF, GO:CC, Reactome, KEGG, WikiPathways, MSigDB Hallmark, STRING, BioGRID, CORUM.