
[Pre-submission] bioquik: fast tool for counting CG-anchored DNA motifs in FASTA files. #276

@Rajkanwars15

Submitting Author: Rajkanwar Singh (@Rajkanwars15)
Package Name: bioquik
One-Line Description of Package: bioquik quickly finds and counts CG-anchored DNA motifs (sequence patterns built around a CG dinucleotide) in genome files in FASTA format.
Repository Link (if existing): https://github.com/Rajkanwars15/bioquik
EiC: TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:
    bioquik is an open-source Python toolkit for fast and reproducible quantification of CG-anchored DNA motifs in FASTA sequences. It automates motif expansion, efficient searching, and structured reporting to support downstream genomic and epigenomic analyses.

It leverages a high-performance FM-index backend (via pydivsufsort) to count motifs directly from large reference genomes and multi-sample FASTA datasets with low memory usage. It is designed for integration into bioinformatics pipelines, enabling parallel processing, rich progress reporting, and multiple machine-readable output formats (per-file CSV, combined summary CSV, and optional JSON). Optional visualization capabilities provide motif distribution plots and heatmaps to support exploratory analysis and quality control.
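As a rough illustration of the expand-then-count idea described above, here is a minimal sketch. It is not bioquik's API: the function names are hypothetical, "motif expansion" is assumed to mean expanding IUPAC wildcard codes into concrete sequences, and a naively built suffix array with binary search stands in for the FM-index backend (no pydivsufsort calls are used).

```python
from itertools import product

# IUPAC nucleotide codes mapped to the concrete bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_motif(pattern):
    """Expand a degenerate pattern such as 'NCG' into concrete motifs.

    Assumption: a CG-anchored pattern contains a literal 'CG'.
    """
    pattern = pattern.upper()
    if "CG" not in pattern:
        raise ValueError("pattern must contain a literal CG anchor")
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern))]


def build_suffix_array(text):
    """Naive suffix array construction; fine for a demo, far too slow for genomes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, sa, motif, upper):
    """Binary search for the first suffix whose length-m prefix is >= motif
    (or > motif when upper is True)."""
    lo, hi = 0, len(sa)
    m = len(motif)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < motif or (upper and prefix == motif):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text, sa, motif):
    """Count occurrences of motif in text, including overlapping ones."""
    return _bound(text, sa, motif, True) - _bound(text, sa, motif, False)


def count_motifs(text, pattern):
    """Expand the pattern, then count every concrete motif against one index."""
    sa = build_suffix_array(text)
    return {m: count_occurrences(text, sa, m) for m in expand_motif(pattern)}


print(count_motifs("ACGTCGACGA", "NCG"))
# {'ACG': 2, 'CCG': 0, 'GCG': 0, 'TCG': 1}
```

The point of the sketch is that the index is built once per sequence and then reused for every expanded motif, which is what makes an index-based approach attractive when the motif list is long.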

Community Partnerships

We partner with communities to support peer review with an additional layer of
checks that satisfy community requirements. If your package fits into an
existing community please check below:

Scope

  • Please indicate which category or categories this package falls under:

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). For community partnerships, check also their specific guidelines as documented in the links above. Please note any areas you are unsure of:
    bioquik extracts motif occurrence data directly from genomic FASTA files using an FM-index to enable scalable search on large datasets. It then aggregates and summarizes these counts into standard tabular outputs with optional visual analytics. These components support computational genomics workflows focused on DNA sequence motif frequency analysis.

  • Who is the target audience and what are the scientific applications of this package?
    The target audience is bioinformaticians and genomics researchers who analyze DNA sequences for motifs.
    Applications include gene studies and motif frequency analysis across genomes.

  • Are there other Python packages that accomplish similar things? If so, how does yours differ?
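    Biopython (SeqUtils for basic motif search) and scikit-bio (sequence tools) cover related ground. bioquik differs in three ways: it uses an FM-index, which makes counting many motifs across GB-scale files many times faster; it is built specifically around CG-anchored motifs; and it supports parallel processing natively.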

  • Any other questions or issues we should be aware of:

P.S. Have feedback/comments about our review process? Leave a comment here
