Skip to content

lysovosyl/SRLE-seq

Repository files navigation

🧬 Abstract

RNA molecules localize to specific subcellular compartments to perform their functions. While most mRNAs are translated at the endoplasmic reticulum, many lncRNAs act within organelles. To better understand RNA localization mechanisms, we developed SRLE-seq (Screening RNA Localization Elements by Sequencing), a high-throughput method to identify functional localization elements.


πŸ’» Installation

Requires Python β‰₯ 3.8 and the following libraries:

pip install biopython numpy scipy tqdm

βš™οΈ Functional

πŸ“š Kmer-based library diversity validation

1.Introduction

This section is designed to validate whether the constructed Kmer-based plasmid library is suitable for downstream experiments. The distribution of plasmid types corresponding to each k-mer should be relatively uniform. No single k-mer type should dominate the library in abundance, as this would indicate amplification bias or cloning artifacts.

2.Input Requirements

The script accepts the following required command-line arguments:

Argument Type Description Default
-fq1 str Path to the nuclear sample R1 (forward reads) FASTQ file
-fq2 str Path to the nuclear sample R2 (reverse reads) FASTQ file None
-s str Output directory for saving result figures and tables
-up_flanking str Sequence of the upstream (5') anchor primer
-down_flanking str Sequence of the downstream (3') anchor primer
-kmer_length str Length of k-mers to count

⚠️ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.

3.Quick Start

Follow these steps to quickly analyse library diversity:

python kmer_diversity_validation.py \
  -fq1 R1.fastq \
  -fq2 R2.fastq \
  -s ./ \
  -up_flanking ATTATGAT \
  -down_flanking GCTTAGTG \
  -kmer_length 6 \

4.Check the Output

After completion, the output directory will contain:

kmer_diversity.validation.tsv – records the raw count and CPM (counts per million) for each k-mer.


πŸ“š Fragment-based library diversity validation

1.Introduction

This section is designed to validate whether the constructed fragment-based plasmid library is suitable for downstream experiments. The inserted fragments should collectively achieve broad and uniform coverage across the full-length source sequence. Ideally, all regions of the original sequence should be represented within the library, ensuring comprehensive functional screening.

2.Input Requirements

The script accepts the following required command-line arguments:

Argument Type Description Default
-fq1 str Path to the nuclear sample R1 (forward reads) FASTQ file
-fq2 str Path to the nuclear sample R2 (reverse reads) FASTQ file
-s str Output directory for saving result figures and tables
-up_flanking str Sequence of the upstream (5') anchor primer
-down_flanking str Sequence of the downstream (3') anchor primer
-gene_chr str Chromosome name of the gene locus (e.g., chr11)
-gene_region_start int Start coordinate of the target gene region (1-based)
-gene_region_end int End coordinate of the target gene region (1-based)
-config str Path to a configuration file containing parameter presets
-thread int Number of threads to use for parallel processing

⚠️ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.

3.Quick Start

Follow these steps to quickly analyse library diversity:

python randomfrag_diversity_evaluation.py \
  -fq1 R1.fastq \
  -fq2 R2.fastq \
  -s ./ \
  -up_flanking ATTATGAT \
  -down_flanking GCTTAGTG \
  -gene_chr chr11 \
  -gene_region_start 65497688
  -gene_region_end 65506516
  -config ./config.yaml
  -thread 64

4.Check the Output

After completion, the output directory will contain:

dna_fragment.coverage.bed - BED-format file showing per-base coverage of extracted fragments across the specified gene region.

dna_fragment.coverage.png - Coverage plot visualizing the distribution of extracted fragments.


πŸ“ Kmer location analysis

1.Introduction

This section aims to investigate the subcellular localization preferences of individual k-mers following plasmid transfection into cells. As a prerequisite, quality control (QC) is performed on the post-transfection k-mer library. A k-mer is considered to pass QC if it shows a count-per-million (CPM) value greater than 1 in at least one of the three compartments: Total, Nuclear (Nuc), or Cytoplasmic (Cyto). This criterion ensures that the k-mer has been successfully transfected and exhibits sufficient expression for downstream analysis.

2.Input Requirements

The script accepts the following required command-line arguments:

Argument Type Description Default
-nuc_fq1_file str Path to the nuclear sample R1 (forward reads) FASTQ file
-nuc_fq2_file str Path to the nuclear sample R2 (reverse reads) FASTQ file
-cyto_fq1_file str Path to the cytoplasm sample R1 (forward reads) FASTQ file
-cyto_fq2_file str Path to the cytoplasm sample R2 (forward reads) FASTQ file
-s str Output directory for saving result figures and tables
-up_flanking str Sequence of the upstream (5') anchor primer
-down_flanking str Sequence of the downstream (3') anchor primer
-mode str Processing mode: kmer_complexity or fragment_complexity
-kmer_length int Length of k-mers to count

⚠️ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.

3.Quick Start

Follow these steps to quickly analyse fragment location analysis:

python library_location_analysis.py \
  -nuc_fq1_file nuc.R1.fastq \
  -nuc_fq2_file nuc.R2.fastq \
  -cyto_fq1_file cyto.R1.fastq \
  -cyto_fq2_file cyto.R2.fastq \
  -s ./ \
  -up_flanking ATTATGAT \
  -down_flanking GCTTAGTG \
  -mode kmer_complexity \
  -kmer_length 6 \

4.Check the Output

After completion, the output directory will contain:

kmer_complexity.kmer.count.png – A scatter plot comparing the abundance (CPM) of each k-mer between the cytoplasmic (x-axis) and nuclear (y-axis) compartments. This plot reveals subcellular localization patterns of individual k-mers.

kmer_complexity.info.csv – A summary table listing each k-mer's cytoplasmic and nuclear counts, logβ‚‚ fold change (cytoplasm vs. nucleus), and Nuclear Enrichment Score (NES), providing quantitative metrics for assessing k-mer localization bias.


πŸ“ Randomfrag location analysis

1.Introduction

This section focuses on the subcellular localization patterns of full-length or partial sequence fragments after plasmid transfection. Quality control of the fragment library requires each fragment to have a minimum sequencing coverage of 100Γ—, ensuring reliable detection and quantification. Only fragments meeting this coverage threshold are included in subsequent analyses to evaluate their abundance, CPM, and nuclear enrichment score (NES), which together provide insights into their subcellular distribution characteristics.

2.Input Requirements

The script accepts the following required command-line arguments:

Argument Type Description Default
-nuc_fq1 str Path to the nuclear sample R1 (forward reads) FASTQ file
-nuc_fq2 str Path to the nuclear sample R2 (reverse reads) FASTQ file
-cyto_fq1 str Path to the cytoplasm sample R1 (forward reads) FASTQ file
-cyto_fq2 str Path to the cytoplasm sample R2 (forward reads) FASTQ file
-s str Output directory for saving result figures and tables
-up_flanking str Sequence of the upstream (5') anchor primer
-down_flanking str Sequence of the downstream (3') anchor primer
-gene_chr str Processing mode: kmer_complexity or fragment_complexity
-gene_region_start int Start coordinate of the target gene region (1-based)
-gene_region_end int End coordinate of the target gene region (1-based)
-config str Path to a configuration file containing parameter presets
-thread int Number of threads to use for parallel processing

⚠️ For paired-end reads with Paired-End sequences, at least one of the primer-binding regions should be longer than 150 bp. Otherwise, the reads should be merged using PEAR before further processing.

3.Quick Start

Follow these steps to quickly analyse fragment location analysis:

python randomfrag_location_analysis.py \
  -nuc_fq1_file nuc.R1.fastq \
  -nuc_fq2_file nuc.R2.fastq \
  -cyto_fq1_file cyto.R1.fastq \
  -cyto_fq2_file cyto.R2.fastq \
  -s ./ \
  -up_flanking ATTATGAT \
  -down_flanking GCTTAGTG \
  -mode kmer_complexity \
  -gene_chr chr11 \
  -gene_region_start 65497688 \
  -gene_region_end 65506516 \
  -config ./config.yaml \
  -thread 64 \

4.Check the Output

After completion, the output directory will contain:

randomfrag_diversity.tsv – The abundance (read count and CPM) of each random fragment derived from the target gene.

coverage.bed – Base-level coverage information for each specified gene region, indicating how many reads cover each genomic position.

dna_fragment.coverage.png – A visualization of base-level sequencing coverage across each specified gene region, illustrating the distribution and depth of coverage.

Fragment.length.distribution.png – A scatter plot comparing the abundance of each k-mer in the cytoplasm (x-axis) versus the nucleus (y-axis), highlighting their subcellular localization tendencies.


πŸ“ Please Cite

If you use this script or parts of it in your research or project, please cite the repository or acknowledge the author appropriately. A suggested citation format:

Zeng. x. (2025). SRLE-seq k-mer comparison and visualization pipeline


πŸ‘¨β€πŸ’» Maintainer

Lin Yusen
Email: [email protected]

Feel free to open an issue or pull request for improvements or bug fixes.


🀝 Contributors

Current contributors:

  • Yusen Lin
  • Xingqaun Zeng
  • Jiajian Zhou

πŸ“„ License

This project is licensed under the MIT License.
You are free to use, modify, and distribute it with attribution.

About

Kmer Location Analysis from RNA-seq Fastq Files

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages