RNA molecules localize to specific subcellular compartments to perform their functions. While most mRNAs are translated at the endoplasmic reticulum, many lncRNAs act within organelles. To better understand RNA localization mechanisms, we developed SRLE-seq (Screening RNA Localization Elements by Sequencing), a high-throughput method to identify functional localization elements.
Requires Python β₯ 3.8 and the following libraries:
pip install biopython numpy scipy tqdmThis section is designed to validate whether the constructed Kmer-based plasmid library is suitable for downstream experiments. The distribution of plasmid types corresponding to each k-mer should be relatively uniform. No single k-mer type should dominate the library in abundance, as this would indicate amplification bias or cloning artifacts.
The script accepts the following required command-line arguments:
| Argument | Type | Description | Default |
|---|---|---|---|
-fq1 |
str |
Path to the nuclear sample R1 (forward reads) FASTQ file | |
-fq2 |
str |
Path to the nuclear sample R2 (reverse reads) FASTQ file | None |
-s |
str |
Output directory for saving result figures and tables | |
-up_flanking |
str |
Sequence of the upstream (5') anchor primer | |
-down_flanking |
str |
Sequence of the downstream (3') anchor primer | |
-kmer_length |
str |
Length of k-mers to count |
β οΈ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.
Follow these steps to quickly analyse library diversity:
python kmer_diversity_validation.py \
-fq1 R1.fastq \
-fq2 R2.fastq \
-s ./ \
-up_flanking ATTATGAT \
-down_flanking GCTTAGTG \
-kmer_length 6 \
After completion, the output directory will contain:
kmer_diversity.validation.tsv β records the raw count and CPM (counts per million) for each k-mer.
This section is designed to validate whether the constructed fragment-based plasmid library is suitable for downstream experiments. The inserted fragments should collectively achieve broad and uniform coverage across the full-length source sequence. Ideally, all regions of the original sequence should be represented within the library, ensuring comprehensive functional screening.
The script accepts the following required command-line arguments:
| Argument | Type | Description | Default |
|---|---|---|---|
-fq1 |
str |
Path to the nuclear sample R1 (forward reads) FASTQ file | |
-fq2 |
str |
Path to the nuclear sample R2 (reverse reads) FASTQ file | |
-s |
str |
Output directory for saving result figures and tables | |
-up_flanking |
str |
Sequence of the upstream (5') anchor primer | |
-down_flanking |
str |
Sequence of the downstream (3') anchor primer | |
-gene_chr |
str |
Chromosome name of the gene locus (e.g., chr11) |
|
-gene_region_start |
int |
Start coordinate of the target gene region (1-based) | |
-gene_region_end |
int |
End coordinate of the target gene region (1-based) | |
-config |
str |
Path to a configuration file containing parameter presets | |
-thread |
int |
Number of threads to use for parallel processing |
β οΈ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.
Follow these steps to quickly analyse library diversity:
python randomfrag_diversity_evaluation.py \
-fq1 R1.fastq \
-fq2 R2.fastq \
-s ./ \
-up_flanking ATTATGAT \
-down_flanking GCTTAGTG \
-gene_chr chr11 \
-gene_region_start 65497688
-gene_region_end 65506516
-config ./config.yaml
-thread 64
After completion, the output directory will contain:
dna_fragment.coverage.bed - BED-format file showing per-base coverage of extracted fragments across the specified gene region.
dna_fragment.coverage.png - Coverage plot visualizing the distribution of extracted fragments.
This section aims to investigate the subcellular localization preferences of individual k-mers following plasmid transfection into cells. As a prerequisite, quality control (QC) is performed on the post-transfection k-mer library. A k-mer is considered to pass QC if it shows a count-per-million (CPM) value greater than 1 in at least one of the three compartments: Total, Nuclear (Nuc), or Cytoplasmic (Cyto). This criterion ensures that the k-mer has been successfully transfected and exhibits sufficient expression for downstream analysis.
The script accepts the following required command-line arguments:
| Argument | Type | Description | Default |
|---|---|---|---|
-nuc_fq1_file |
str |
Path to the nuclear sample R1 (forward reads) FASTQ file | |
-nuc_fq2_file |
str |
Path to the nuclear sample R2 (reverse reads) FASTQ file | |
-cyto_fq1_file |
str |
Path to the cytoplasm sample R1 (forward reads) FASTQ file | |
-cyto_fq2_file |
str |
Path to the cytoplasm sample R2 (forward reads) FASTQ file | |
-s |
str |
Output directory for saving result figures and tables | |
-up_flanking |
str |
Sequence of the upstream (5') anchor primer | |
-down_flanking |
str |
Sequence of the downstream (3') anchor primer | |
-mode |
str |
Processing mode: kmer_complexity or fragment_complexity |
|
-kmer_length |
int |
Length of k-mers to count |
β οΈ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.
Follow these steps to quickly analyse fragment location analysis:
python library_location_analysis.py \
-nuc_fq1_file nuc.R1.fastq \
-nuc_fq2_file nuc.R2.fastq \
-cyto_fq1_file cyto.R1.fastq \
-cyto_fq2_file cyto.R2.fastq \
-s ./ \
-up_flanking ATTATGAT \
-down_flanking GCTTAGTG \
-mode kmer_complexity \
-kmer_length 6 \
After completion, the output directory will contain:
kmer_complexity.kmer.count.png β A scatter plot comparing the abundance (CPM) of each k-mer between the cytoplasmic (x-axis) and nuclear (y-axis) compartments. This plot reveals subcellular localization patterns of individual k-mers.
kmer_complexity.info.csv β A summary table listing each k-mer's cytoplasmic and nuclear counts, logβ fold change (cytoplasm vs. nucleus), and Nuclear Enrichment Score (NES), providing quantitative metrics for assessing k-mer localization bias.
This section focuses on the subcellular localization patterns of full-length or partial sequence fragments after plasmid transfection. Quality control of the fragment library requires each fragment to have a minimum sequencing coverage of 100Γ, ensuring reliable detection and quantification. Only fragments meeting this coverage threshold are included in subsequent analyses to evaluate their abundance, CPM, and nuclear enrichment score (NES), which together provide insights into their subcellular distribution characteristics.
The script accepts the following required command-line arguments:
| Argument | Type | Description | Default |
|---|---|---|---|
-nuc_fq1 |
str |
Path to the nuclear sample R1 (forward reads) FASTQ file | |
-nuc_fq2 |
str |
Path to the nuclear sample R2 (reverse reads) FASTQ file | |
-cyto_fq1 |
str |
Path to the cytoplasm sample R1 (forward reads) FASTQ file | |
-cyto_fq2 |
str |
Path to the cytoplasm sample R2 (forward reads) FASTQ file | |
-s |
str |
Output directory for saving result figures and tables | |
-up_flanking |
str |
Sequence of the upstream (5') anchor primer | |
-down_flanking |
str |
Sequence of the downstream (3') anchor primer | |
-gene_chr |
str |
Processing mode: kmer_complexity or fragment_complexity |
|
-gene_region_start |
int |
Start coordinate of the target gene region (1-based) | |
-gene_region_end |
int |
End coordinate of the target gene region (1-based) | |
-config |
str |
Path to a configuration file containing parameter presets | |
-thread |
int |
Number of threads to use for parallel processing |
β οΈ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.
Follow these steps to quickly analyse fragment location analysis:
python randomfrag_location_analysis.py \
-nuc_fq1_file nuc.R1.fastq \
-nuc_fq2_file nuc.R2.fastq \
-cyto_fq1_file cyto.R1.fastq \
-cyto_fq2_file cyto.R2.fastq \
-s ./ \
-up_flanking ATTATGAT \
-down_flanking GCTTAGTG \
-mode kmer_complexity \
-gene_chr chr11 \
-gene_region_start 65497688 \
-gene_region_end 65506516 \
-config ./config.yaml \
-thread 64 \
After completion, the output directory will contain:
randomfrag_diversity.tsv β The abundance (read count and CPM) of each random fragment derived from the target gene.
coverage.bed β Base-level coverage information for each specified gene region, indicating how many reads cover each genomic position.
dna_fragment.coverage.png β A visualization of base-level sequencing coverage across each specified gene region, illustrating the distribution and depth of coverage.
Fragment.length.distribution.png β A scatter plot comparing the abundance of each k-mer in the cytoplasm (x-axis) versus the nucleus (y-axis), highlighting their subcellular localization tendencies.
If you use this script or parts of it in your research or project, please cite the repository or acknowledge the author appropriately. A suggested citation format:
Zeng. x. (2025). SRLE-seq k-mer comparison and visualization pipeline
Lin Yusen
Email: [email protected]
Feel free to open an issue or pull request for improvements or bug fixes.
Current contributors:
- Yusen Lin
- Xingqaun Zeng
- Jiajian Zhou
This project is licensed under the MIT License.
You are free to use, modify, and distribute it with attribution.