Skip to content

tangaode/gene-perturb-agent

Repository files navigation

Gene Perturb Agent

Local-first virtual gene perturbation agent for 10x single-cell MTX datasets.

Requirements

  • Windows 10/11
  • Python 3.10+
  • Internet access to the selected LLM provider endpoint

Quick Start (PowerShell)

git clone https://github.com/tangaode/gene-perturb-agent.git
cd gene-perturb-agent
Set-ExecutionPolicy -Scope Process Bypass
.\scripts\setup_easy.ps1
.\scripts\start_easy.ps1

Input Data

MTX_DIR supports:

  • A single 10x folder containing matrix.mtx(.gz), features.tsv(.gz)/genes.tsv(.gz), and barcodes.tsv(.gz).
  • A parent folder containing multiple 10x sample folders (recursive discovery). All detected samples are merged by union gene space; barcodes are prefixed by sample folder name.

Clustering and Cell-Group Selection

When clustering mode is enabled, the launcher performs:

  1. Cell QC filtering (n_genes_by_counts, total_counts, mitochondrial ratio, ribosomal ratio).
  2. Library-size normalization (target_sum=1e4) and log1p.
  3. Highly variable gene selection and scaling.
  4. PCA.
  5. Harmony integration by sample when multiple samples are present.
  6. Neighbors, UMAP, and Leiden clustering (flavor=igraph, n_iterations=2, resolution=0.5).
  7. Marker ranking per cluster by adjusted p-value (wilcoxon), with significant marker filter p < 0.05 and log2FC > 0.5.
  8. LLM-based cluster label suggestion.
  9. Optional manual label override in PowerShell.
  10. Target group selection by cluster:<id> or cell_type:<name>.

Clustering Outputs

Default output directory: outputs/cellgroups/

Generated files:

  • cluster_annotations.csv: barcode-level cluster and cell-type labels.
  • umap_coords.csv: UMAP coordinates.
  • umap_clusters_unannotated.png: UMAP colored by raw cluster IDs.
  • umap_clusters_annotated.png: UMAP colored by final cell-type labels.
  • umap_clusters.png: alias of the annotated UMAP for backward compatibility.
  • qc_summary.csv: QC thresholds and retained cell/gene counts.
  • cluster_markers_top100_for_llm.json: top-100 marker list per cluster used for LLM annotation.
  • cluster_markers_top50.json: preview marker list per cluster.
  • cluster_markers_significant.csv: all significant high-expression markers per cluster (p < 0.05 and log2FC > 0.5).

LLM cell-type annotation uses the top 100 genes per cluster from the significant marker ranking.

LLM Provider Configuration

start_easy.ps1 prompts for provider selection on each launch:

  • deepseek: asks for base URL, model, and API key.
  • openai: asks for base URL, model, and API key.
  • gemini: asks for base URL, model, and API key (OpenAI-compatible endpoint).

For deepseek/openai/gemini, an API key is required.

Typical base URLs:

  • DeepSeek: https://api.deepseek.com/v1
  • OpenAI: https://api.openai.com/v1
  • Gemini (OpenAI-compatible): https://generativelanguage.googleapis.com/v1beta/openai

Prediction Output

Default final output is Top-5 upregulated and Top-5 downregulated genes (FINAL_TOPK=5).

One-Click Package

Build:

.\scripts\build_release.ps1

Package:

  • release/GenePerturbAgent.zip

End-user flow:

  1. Unzip GenePerturbAgent.zip.
  2. Run Run-Agent.bat.
  3. Select MTX_DIR at startup (press Enter to reuse last path).
  4. Optionally run clustering and select a cell group.
  5. Open http://localhost:3000.

Notes:

  • start_easy.ps1 prompts for dataset path, provider, model, and API key on every startup.
  • Cell clustering and cell-type annotation are recomputed on every startup when clustering mode is enabled.
  • Runtime values are not persisted to .env.local by start_easy.ps1.

Local Services / Ports

  • agent_api: 8000
  • virtualcell_service: 8001
  • evidence_service: 8002
  • web: 3000

Evidence Sources

NCBI Gene, PubMed, GO:BP, GO:MF, GO:CC, Reactome, KEGG, WikiPathways, MSigDB Hallmark, STRING, BioGRID, CORUM.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors