Digested Peptide Database

Overview

This is a Java library for creating and searching a digested peptide database with RocksDB key-value store. Digested peptides are generated from protein sequences using a specific enzyme (e.g., trypsin) and are stored in a database for rapid mass-based searches. This library is designed to be efficient and easy to use, with a focus on performance. The database is designed to be a compact and efficient storage of in-silico digested peptides. The primary goal is to support rapid mass-based searches, even on less powerful desktop machines.

This compact database layout enables fast peptide mass queries with minimal resource usage. For example, a Uniprot TrEMBL bacteria dataset (7 to 30 amino acids, 1 missed cleavage) occupies about 50 GB on disk and supports quick lookups by mass range on standard hardware.

Search example: masss from: 1900 Da to 1900.3Da gives results like:

{
  "totalResult": 318,
  "results": [
    {
      "mass": 1900.0012,
      "seq": "VSGSVSARTDVVVAGPGAGSK",
      "acc": [
        "A3PJ74"
      ]
    },
    {
      "mass": 1900.0013,
      "seq": "QLTDEVLALRLSENGSR",
      "acc": [
        "Q99Q94"
      ]
    },
    {
      "mass": 1900.0013,
      "seq": "SGVTQVEDIQVQVGRTGK",
      "acc": [
        "C1F7S7"
      ]
    },
    {
      "mass": 1900.0013,
      "seq": "LTLDDGVRQDLVSAGVSR",
      "acc": [
        "Q03WY1"
      ]
    },
    ...
    
  ] }

acc - Uniprot accession number, seq - peptide sequence, mass - peptide mass.

Key Features

Minimal and Fast database: Database of digested peptides designed for a small on-disk footprint
Mass-Oriented Search: Stores peptide data keyed by mass to expedite lookups based on precise mass ranges

Download Databases

Uniprot TrEMBL bacteria dataset. ~50GB , 7 to 30 amino acids, 1 miss cleavage.
- Download ...
Uniprot TrEMBL human dataset. (NOT FINISHED).

Library Integration

This database is designed as a Java library that can be easily embedded into existing software applications. The library provides a clean API for programmatic access to peptide data, making it ideal for integration with mass spectrometry analysis tools or custom bioinformatics pipelines.

New database creation

java -jar DigestedProteinDB-0.1-uber.jar create-db -c -d /path/to/db_bacteria_swisprot -u /path/to/uniprot_sprot_bacteria.xml.gz -n UniprotSwot_bacteria

Start web server for searching the database

java -jar DigestedProteinDB-0.1-uber.jar server --port 8080 --db-dir /path/to/db

Demo web: https://digestedproteindb.pbf.hr/

Create database workflow

The database is created from the uniprot .xml.gz or .xml file. Example class hr.pbf.digestdb.CreateDatabase is used to create the database.

Steps to create the database:

Parse Uniprot Data: Extract sequences and relevant metadata from uniprot .xml.gz database and NCBI Taxonomy.
Digest Proteins: Cleave the protein with a specific enzyme (trypsin, for example).
Organize Data: Sort and group peptides by mass for easy storage.
Build Database: Generate a mass-indexed key-value engine (RocksDB) and custom solution for quick searches.

Example build database on MacOS or Linux

Prepare uniprot xml protein as source data. The database is created from the uniprot .xml.gz or .xml file. The uniprot xml.gz file can be downloaded from the uniprot FTP website.

put in folder file workflow.properties

# relative path to the uniprot xml file
uniprot_xml_path=src/uniprot_sprot_bacteria.xml.gz
min_peptide_length=7
max_peptide_length=30
miss_cleavage=1
db_name=Uniprot Swis-Prot bacteria
enzyme_name=Trypsine
#sort_temp_dir=/tmp/temp_dir

TODO

Implement a simple command line interface for querying the database.
Create new database for additional proteins, enzymes and missed cleavages.
Create website for the project with:
- Documentation
- Downloadable databases
- Benchmarks

Name		Name	Last commit message	Last commit date
Latest commit History 100 Commits
php		php
src		src
.gitignore		.gitignore
AUTHORS		AUTHORS
DigestedProteinDB.iml		DigestedProteinDB.iml
LICENSE		LICENSE
README.md		README.md
lombok.config		lombok.config
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Digested Peptide Database

Overview

Key Features

Download Databases

Library Integration

New database creation

Start web server for searching the database

Create database workflow

Example build database on MacOS or Linux

TODO

About

Releases

Packages

Languages

License

jankod/DigestedProteinDB

Folders and files

Latest commit

History

Repository files navigation

Digested Peptide Database

Overview

Key Features

Download Databases

Library Integration

New database creation

Start web server for searching the database

Create database workflow

Example build database on MacOS or Linux

TODO

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages