Binomial Z-scoring method for identifying enriched transcription factor binding sites in genomic regions compared to the background. The transcription factor (TF) data were obtained from ENCODE ChIP-seq experiments. Replicates for each TF were merged across multiple ENCODE sources to create a single comprehensive list of TFs. The GRanges objects for each TF are stored in two directories, TranscriptionFactors and TranscriptionFactors1, split due to GitHub’s 1000-file limit per directory. This R script performs TF enrichment analysis by comparing TF binding site overlap between experimental regions (e.g., Experiment sites) and background regions (e.g., Background sites). It calculates z-scores, p-values with FDR correction, and generates publication-ready visualizations.
library(GenomicRanges)
library(dplyr)
library(ggplot2)source("TranscriptionFactor_Enrichment.R")
exp <- readRDS("path/to/your/ipa_regions.rds") # Your experimental regions (GRanges)
bkd <- readRDS("path/to/your/polya_regions.rds") # Your background regions (GRanges)
tf_dir <- "path/to/tf_binding_sites/" # Directory with TF .rds files
output_plot <- "output/directory/" # Output directory
threshold_for_TFs_Enrichment <- 500 # Distance threshold label from center
results <- run_tf_enrichment_analysis(
ipa_gr = exp,
polya_gr = bkd,
tf_dir = tf_dir,
output_dir = output_plot,
exp_name = "Experiment",
bkd_name = "Background",
threshold_for_TFs_Enrichment = threshold_for_TFs_Enrichment,
significance_threshold = 0.001
)- Experimental regions: GRanges object (e.g., Experiment sites)
- Background regions: GRanges object (e.g., Background sites)
- TF binding sites: Directory containing
.rdsfiles, each with a GRanges object of TF binding sites
TF_enrichment_IPA_PolyA_500bp.csv- Complete results with statisticsTF_enrichment_IPA_PolyA_500bp.rds- R object for further analysisTF_enrichment_IPA_PolyA_500bp_volcano_plot.pdf- Volcano plot visualizationTF_enrichment_IPA_PolyA_500bp_top_barplot.pdf- Top 30 enriched TFs bar plot
| Column | Description |
|---|---|
| TF_name | Transcription factor name |
| Observed_in_Exp | Count in experimental regions |
| Observed_in_Bkd | Count in background regions |
| Expected_in_Exp | Expected count based on background |
| Background_probability | Proportion in background |
| Z_score | Enrichment z-score |
| P_value | Raw p-value |
| P_value_adjusted | FDR-corrected p-value |
| Significant | Yes/No based on threshold |
Core function that processes TF files and calculates enrichment statistics.
Wrapper function that runs the analysis and generates visualizations.
ipa_gr: Experimental regions (GRanges)polya_gr: Background regions (GRanges)tf_dir: Path to TF binding site files directoryoutput_dir: Output directory pathexp_name: Label for experimental regions (default: "IPA")bkd_name: Label for background regions (default: "PolyA")threshold_for_TFs_Enrichment: Distance threshold description (default: "500bp")significance_threshold: FDR cutoff (default: 0.05)min_overlap: Minimum base pair overlap (default: 1)
- Calculates background probability as the proportion of background regions overlapping TF sites
- Computes expected counts in experimental regions based on background frequency
- Calculates the z-score as the standardized difference between observed and expected
- Performs a one-tailed test for enrichment
- Applies Benjamini-Hochberg FDR correction for multiple testing
# Load libraries
library(GenomicRanges)
library(dplyr)
library(ggplot2)
source("tf_enrichment_analysis.R")
exp <- readRDS("data/ipa_sites.rds")
bkd <- readRDS("data/polya_sites.rds")
results <- run_tf_enrichment_analysis(
ipa_gr = exp,
polya_gr = bkd,
tf_dir = "tf_binding_sites/",
output_dir = "results/",
exp_name = "Experiment",
bkd_name = "Background",
threshold_for_TFs_Enrichment = 1000,
significance_threshold = 0.01
)
top_tfs <- results %>%
filter(Significant == "Yes") %>%
arrange(P_value) %>%
head(10)
print(top_tfs)The script provides detailed progress information:
- Number of TF files found
- Processing status for each TF
- Overlap counts for experimental and background regions
- Statistical metrics (z-score, p-value)
- Summary of significant TFs
- File save locations
- X-axis: Z-score (enrichment strength)
- Y-axis: -log10(P-value) (significance)
- Red points: Significant TFs (FDR < threshold)
- Gray points: Non-significant TFs
- Top 30 TFs ranked by enrichment
- Bar height: Z-score
- Bar color: Significance status
- No .rds files found: Check tf_dir path and file format
- Zero variance warning: Normal when TF has no background overlap
- Memory issues: Process TFs in batches for large datasets
TF binding site files must be:
- Saved as
.rdsformat - Contain valid GRanges objects
- Named by TF (e.g.,
CTCF.rds)

