A comprehensive implementation of the Burrows-Wheeler Transform (BWT) and BWA algorithm for efficient DNA sequence alignment and pattern matching in bioinformatics applications.
The Burrows-Wheeler Transform is a fundamental algorithm in bioinformatics that enables efficient sequence alignment and data compression. This project implements the complete BWT pipeline including:
- String transformation using cyclic rotations and lexicographic sorting
- Lossless inversion to reconstruct original sequences
- Efficient pattern matching using the BWA (Burrows-Wheeler Aligner) algorithm
- Performance benchmarking against naive string search methods
- Compression analysis to demonstrate BWT's effectiveness
- β Complete BWT Implementation: From basic rotations to advanced pattern matching
- β Educational Focus: Well-documented code with detailed explanations
- β Performance Optimized: Efficient algorithms suitable for genomic-scale data
- β Comprehensive Testing: Includes validation and benchmarking tools
- β Real-world Application: Tested on actual genomic sequences and literary texts
- Python 3.8 or higher
- Basic understanding of string algorithms (helpful but not required)
- Clone the repository:
git clone https://github.com/your-username/bwt-dna-analysis.git
cd bwt-dna-analysis
- Install dependencies:
pip install -r requirements.txt
- Run the example:
from bwt_processor import BWTProcessor
# Initialize the processor
processor = BWTProcessor()
# Basic usage
sequence = "ACAGTGAT"
bwt_result = processor.bwt(sequence)
print(f"BWT of {sequence}: {bwt_result}")
# Pattern matching
matches = processor.bwa_search("TCGACGAT", "CGA")
print(f"Found {matches} matches")
The BWT-based search demonstrates significant performance improvements over naive string matching:
Method | Time Complexity | Space Complexity | Genomic Data (10KB) |
---|---|---|---|
Naive Search | O(nm) | O(1) | ~21.8 seconds |
BWT Search | O(m) | O(n) | ~0.7 milliseconds |
Where n = text length, m = pattern length
def bwt(text: str) -> str:
"""Transform string into BWT representation"""
rotations = cyclic_rotations(text + "$")
sorted_rotations = sorted(rotations)
return ''.join([rotation[-1] for rotation in sorted_rotations])
def invert_bwt(bwt_string: str) -> str:
"""Reconstruct original string from BWT"""
# Uses LF-mapping to trace back original sequence
# Demonstrates the reversible nature of BWT
def bwa_search(text: str, pattern: str) -> int:
"""Efficient pattern matching using BWT"""
# Combines BWT construction with backward search
# Achieves O(m) search time complexity
bwt-dna-analysis/
βββ README.md # Project documentation
βββ requirements.txt # Python dependencies
βββ LICENSE # MIT license
βββ bwt_processor.py # Main BWT implementation
βββ examples/
β βββ basic_usage.py # Simple usage examples
β βββ genomic_analysis.py # Real genomic data analysis
β βββ performance_demo.py # Benchmarking demonstrations
βββ tests/
β βββ test_bwt.py # Unit tests for BWT functions
β βββ test_search.py # Pattern matching tests
β βββ test_performance.py # Performance benchmarks
βββ data/
β βββ sample_sequences.txt # Sample DNA sequences
β βββ test_data.txt # Test data for validation
βββ notebooks/
β βββ BWT_Tutorial.ipynb # Interactive tutorial
β βββ Performance_Analysis.ipynb # Detailed analysis
βββ docs/
βββ algorithm_explanation.md # Detailed algorithm docs
βββ api_reference.md # Function documentation
- Genomic sequence alignment: Map short reads to reference genomes
- Variant detection: Identify genetic variations efficiently
- Phylogenetic analysis: Compare sequences across species
- Algorithm learning: Understand string processing concepts
- Bioinformatics training: Hands-on experience with real tools
- Computer science education: Study advanced data structures
- Text compression: Leverage BWT for general text compression
- Genomic data storage: Efficiently compress large sequence datasets
- Pattern analysis: Study repetitive structures in biological data
from bwt_processor import BWTProcessor
processor = BWTProcessor()
# Example 1: Simple DNA sequence
dna_seq = "GATTACA"
bwt_result = processor.bwt(dna_seq)
reconstructed = processor.invert_bwt(bwt_result)
print(f"Original: {dna_seq}")
print(f"BWT: {bwt_result}")
print(f"Reconstructed: {reconstructed}")
print(f"Perfect reconstruction: {dna_seq == reconstructed}")
# Example 2: Pattern search in genomic data
genome_fragment = "ATCGATCGATCGAATCGATCG"
pattern = "ATCG"
# BWA search
bwa_matches = processor.bwa_search(genome_fragment, pattern)
# Compare with naive search
bwt_time, naive_time, bwt_count, naive_count = processor.benchmark_search_methods(
genome_fragment, pattern
)
print(f"Pattern '{pattern}' found {bwa_matches} times")
print(f"BWA search time: {bwt_time:.6f}s")
print(f"Naive search time: {naive_time:.6f}s")
print(f"Speedup: {naive_time/bwt_time:.2f}x")
# Example 3: Analyze compression properties
text = "AAABBBCCCAAABBBAAA"
analysis = processor.analyze_compression(text)
print(f"Original runs: {analysis['original_runs']}")
print(f"BWT runs: {analysis['bwt_runs']}")
print(f"Run reduction: {analysis['run_reduction_ratio']:.2f}")
print(f"Entropy reduction: {analysis['entropy_reduction']:.3f}")
Run the test suite to validate implementation:
# Run all tests
python -m pytest tests/
# Run specific test categories
python -m pytest tests/test_bwt.py -v
python -m pytest tests/test_search.py -v
python -m pytest tests/test_performance.py -v
# Run with coverage report
python -m pytest tests/ --cov=bwt_processor --cov-report=html
The project includes comprehensive benchmarking tools:
# Basic performance test
python examples/performance_demo.py
# Genomic data analysis
python examples/genomic_analysis.py
# Memory usage analysis
python -m memory_profiler examples/memory_benchmark.py
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Make your changes and add tests
- Run the test suite:
python -m pytest
- Commit your changes:
git commit -m 'Add amazing feature'
- Push to the branch:
git push origin feature/amazing-feature
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include unit tests for new features
- Update documentation as needed
- Burrows, M. and Wheeler, D.J. (1994). "A block-sorting lossless data compression algorithm"
- Li, H. and Durbin, R. (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform"
- Ferragina, P. and Manzini, G. (2000). "Opportunistic data structures with applications"
- Bioinformatics Algorithms: Comprehensive textbook
- Rosalind: Bioinformatics programming challenges
- Coursera Bioinformatics: Online courses
This project is licensed under the MIT License - see the LICENSE file for details.
Devansh Shah
- π Biomedical Engineering Student & Healthcare Informatics
- π¬ Research Focus: Computational Biology & Bioinformatics
- π GitHub: @your-username
- πΌ LinkedIn: Your LinkedIn
- π§ Email: [email protected]
- Inspired by the seminal work of Michael Burrows and David Wheeler
- Built during coursework in Digital Health and Bioinformatics (DH607)
- Thanks to the bioinformatics community for open-source tools and datasets
β Star this repository if you found it helpful!
This project demonstrates advanced bioinformatics algorithms and is suitable for educational purposes, research applications, and as a foundation for more complex genomic analysis pipelines.