A Graphical User Interface to Access and Output Protein Data

Tyler-Hostetler/BioSynthNexus

BioSynthNexus


Overview

A graphical user interface for efficiently accessing and extracting data from genome neighborhood networks.

General Functions:

Filters and extracts data from a Genome Neighborhood Network (GNN) SQLite file generated by the Genome Neighborhood Tool (GNT) from the Enzyme Function Initiative (EFI) Tools.1,2

Retrieves UniProt data for given protein accession ID(s).3


Running, Installation, Packaging

Running the Program Directly from an Executable (no dependencies needed)

Pre-packaged executables are provided in the latest release.

Download the file for your operating system, extract the executable to your preferred location, and double-click it to start.

Manual Installation / Packaging

An environment can be created with pipenv or conda using the provided Pipfile or requirements.txt. You must have pipenv or conda/miniconda/etc installed to utilize these options, which may require additional steps.

Pipenv

  • Open a terminal in the directory containing the repository
  • Run `pipenv install`
    • All required dependencies should then be installed
  • The program can then be run with `pipenv run python main.py`

Conda

  • Open a terminal in the directory containing the repository
  • Create a conda environment: `conda create -n env_name python=3.10 pip`
  • Activate the environment: `conda activate env_name`
  • Install the requirements: `pip install -r requirements.txt`
  • Run the program: `python main.py`

Packaging into an application (optional)

If you would like to package your modified code into a single executable, pyinstaller is included in the dev-packages of the Pipfile.

  • Open a terminal in the repository directory
  • Run `pipenv install -d`
    • Note: If you are using a conda environment, you have to install pyinstaller manually: `pip install pyinstaller`
  • Run `pyinstaller --windowed --onefile --add-data=ui_main_window.ui:./ --add-data=custom_ui_theme.xml:./ main.py`
    • A folder named `dist` will contain the packaged application


Input / Output Options:

UniProt Requests:

| Output Type | Input | Description |
| --- | --- | --- |
| FASTA | Accession ID(s) | Gets the FASTA formatted protein sequence(s) |
| Genome Accession ID in GenBank | Accession ID(s) | Gets the GenBank Genome Accession ID(s) |
| Protein Accession ID in GenBank | Accession ID(s) | Gets the GenBank Protein Accession ID(s) |
| ORF Name in Corresponding Genome | Accession ID(s) | Gets the GenBank Open Reading Frame (ORF) ID(s) |
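
For readers who want to script the FASTA request outside the GUI, the following is a minimal sketch that uses the public UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta`. It is not taken from the BioSynthNexus source code; the helper names `fasta_url`, `fetch_fasta`, and `parse_fasta` are illustrative assumptions.

```python
from urllib.request import urlopen

# Public UniProt REST base URL; one FASTA record is served per accession.
UNIPROT_BASE = "https://rest.uniprot.org/uniprotkb"


def fasta_url(accession: str) -> str:
    """Build the FASTA download URL for one UniProt accession ID."""
    return f"{UNIPROT_BASE}/{accession.strip()}.fasta"


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the FASTA text for one accession (requires network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may wrap over several lines
    return records
```

Calling `parse_fasta(fetch_fasta(accession))` for each accession ID in a list would give one header-to-sequence mapping per request. The other output types in the table depend on UniProt cross-reference fields and are not covered by this sketch.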

Genome Neighborhood Network Requests:

| Output Type | Input | Description |
| --- | --- | --- |
| Parent Accession ID | Pfam ID(s) | Gets Parent Accession IDs that correspond to neighborhoods that contain the given Pfam ID(s) |
| Genome Neighborhood ID | Pfam ID(s) | Gets the Genome Neighborhood ID(s) for those that contain the given Pfam ID(s) |
| Genome Neighborhood Pfams | Genome Neighborhood ID | Gets the Pfams for each of the proteins within a single BGC |
| Genome Neighborhood Accessions | Genome Neighborhood ID | Gets the Accession IDs for each of the proteins within a single BGC |
| Neighboring Gene Accessions by Pfam | Genome Neighborhood ID(s) + Single Pfam (Secondary Input) | Gets the Accession IDs for the protein within each BGC with the selected Pfam. Displays as 'Output Accession_(BGC ID)' |
| Genome Neighborhood Pfam Comparison | Pfam ID(s). Optional: Pfam in Secondary Input | Searches all Genome Neighborhoods for each given Pfam ID. Displays as 'Genome Neighborhood ID_(number of matching Pfams)'. The Secondary Input is an optional pre-filter that restricts the search to neighborhoods containing a gene from that Pfam ID |

Notes:

  1. The Neighboring Gene Accessions by Pfam output is formatted as Gene Accession_(Genome Neighborhood ID); however, if you use Replace Input with Output, only the Output Accession will be displayed.
  2. If the input field is empty, genome neighborhoods that contain unannotated proteins (Pfam = none) will be considered.
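
If you copy the Neighboring Gene Accessions by Pfam output into your own scripts, the `Accession_(Genome Neighborhood ID)` suffix has to be split off by hand. The sketch below shows one way to do that. It assumes one entry per line and that the suffix is always the final `_(...)` group; both are assumptions about the output text, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def split_output_entry(entry: str) -> Tuple[str, Optional[str]]:
    """Split 'ACCESSION_(NEIGHBORHOOD_ID)' into (accession, neighborhood_id).

    Entries without a trailing '_(...)' group are returned unchanged,
    with None in place of the neighborhood ID.
    """
    entry = entry.strip()
    accession, separator, rest = entry.rpartition("_(")
    if separator and rest.endswith(")"):
        return accession, rest[:-1]
    return entry, None


def accessions_only(output_text: str) -> List[str]:
    """Return only the accession IDs from multi-line output text."""
    return [
        split_output_entry(line)[0]
        for line in output_text.splitlines()
        if line.strip()  # ignore blank lines
    ]
```

For example, `accessions_only("ACC1_(7)\nACC2_(9)")` yields `["ACC1", "ACC2"]`; the IDs here are made-up placeholders, not real accessions.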

GNN Information

General construction

A sequence similarity network is first constructed from a query gene sequence using EFI-EST1,2 and subsequently processed by EFI-GNT1,2 to generate a genome neighborhood network (GNN). The GNN can be visualized using the genome neighborhood diagram (EFI-GND) available on the EFI website. For more information on GNN construction, please visit Enzyme Function Initiative (EFI) Tools.1,2

Uploading a GNN to BioSynthNexus:

  • Click the Upload button to select your GNN *.sqlite file
  • Change the Request Type to Genome Neighborhood Network
  • Select your desired output (see Genome Neighborhood Network Requests above for guidance)
  • Fill the left text field, labeled 'Input', with the corresponding input
  • Click Search
  • Output will be displayed in the right text field, labeled 'Output'
  • The text in the Output Box can be used as an Input by utilizing the Replace Input with Output Button
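
Before uploading, it can be useful to confirm that a file really is a readable SQLite database and to see which tables it contains. The sketch below only lists table and column names with Python's standard `sqlite3` module. It makes no assumption about the EFI-GNT schema, and `describe_sqlite` is a hypothetical helper, not part of BioSynthNexus.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def describe_sqlite(path: str) -> dict:
    """Return {table_name: [column names]} for every table in a SQLite file."""
    # Open read-only so a mistyped path does not create an empty database file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    schema = {}
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [column[1] for column in columns]
    return schema
```

Printing the result of `describe_sqlite("my_gnn.sqlite")` (the file name is a placeholder) shows the layout of the file you are about to upload.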

Usage:

The GNN consists of a list of genome neighborhoods, each containing a parent gene (a homolog of the initial query) and its neighboring genes.

Representative biosynthetic gene clusters (BGCs) used for the following demonstrations.

![Figure 1](/images/Figure S1.jpg) Examples and step-by-step instructions are detailed below. In all examples, Gene 1 was used as a query to generate a sequence similarity network (SSN) and a subsequent genome neighborhood network (GNN) via the EFI-EST and EFI-GNT websites1,2, respectively. A list of genome neighborhoods (GNs) can be visualized on the EFI-GNT website. Genes from the same protein family (Pfam) are represented in the same color, while genes not belonging to PF00001–PF00005 are shown in gray for clarity.

Retrieval of gene information from the UniProt database according to the selected output type.

![Figure 2](/images/Figure S2.jpg) Figure 2: In this example, the FASTA format of sequences is displayed as results, which can be used for further multiple sequence alignment.

Note: Other output types, such as Protein Accession ID in GenBank, Genome Accession ID in GenBank, and ORF Name in Corresponding Genome, can be selected for different purposes. This feature does not require uploading a GNN SQLite file.

Retrieval of genome neighborhoods containing neighboring gene(s) within the specific Pfam ID(s).

![Figure 3](/images/Figure S3.jpg) Figure 3: In this example, GN1–GN3 are displayed as output results because these genome neighborhoods include a neighboring gene from PF00002 (input) (Figure 1). This strategy was employed in this study. Note: The accession IDs of Gene 1 from GN1–GN3 will be displayed as results if “Parent Accession ID” is selected as the output type.
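
The filtering described for Figure 3 can be sketched on an in-memory model of a GNN. The classes, the toy neighborhoods, and the placeholder accessions below are invented for illustration and do not reflect how BioSynthNexus stores or queries the SQLite file. The sketch also assumes that when several Pfam IDs are given, a neighborhood matches if it contains any one of them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gene:
    accession: str
    pfam: str  # e.g. "PF00002", or "none" for an unannotated protein


@dataclass
class GenomeNeighborhood:
    neighborhood_id: str
    parent: Gene  # homolog of the initial query gene
    neighbors: List[Gene] = field(default_factory=list)


def neighborhoods_with_pfams(gnn, pfam_ids):
    """Keep neighborhoods with at least one neighbor in any given Pfam."""
    wanted = set(pfam_ids)
    return [gn for gn in gnn if any(g.pfam in wanted for g in gn.neighbors)]


# Toy data with placeholder accessions (not real UniProt IDs).
toy_gnn = [
    GenomeNeighborhood("GN1", Gene("PARENT_1", "PF00001"),
                       [Gene("ACC_1A", "PF00002"), Gene("ACC_1B", "PF00003")]),
    GenomeNeighborhood("GN2", Gene("PARENT_2", "PF00001"),
                       [Gene("ACC_2A", "PF00002")]),
    GenomeNeighborhood("GN3", Gene("PARENT_3", "PF00001"),
                       [Gene("ACC_3A", "PF00002"), Gene("ACC_3B", "none")]),
    GenomeNeighborhood("GN4", Gene("PARENT_4", "PF00001"),
                       [Gene("ACC_4A", "PF00004")]),
]

hits = neighborhoods_with_pfams(toy_gnn, ["PF00002"])
neighborhood_ids = [gn.neighborhood_id for gn in hits]    # "Genome Neighborhood ID" output
parent_accessions = [gn.parent.accession for gn in hits]  # "Parent Accession ID" output
```

With this toy data, `neighborhood_ids` holds GN1 to GN3 and `parent_accessions` holds the three matching parent placeholders, mirroring the two output types mentioned in the caption.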

Retrieval of accession IDs for neighboring genes within a designated Pfam ID from the given genome neighborhoods.

![Figure 4](/images/Figure S4.jpg) Figure 4: In this example, the accession IDs of Gene 2 (secondary input) from GN1–GN3 (input) are displayed as results, with the origin of each Gene 2 specified in parentheses indicating the corresponding genome neighborhood ID. Note: After filtering the Pfam ID of interest in Figure 3, clicking the “REPLACE INPUT WITH OUTPUT” button and following the instructions above enables efficient retrieval of the accession IDs of the targeted neighboring genes.

Retrieval of all Pfam IDs for neighboring genes from the given genome neighborhoods.

![Figure 5](/images/Figure S5.jpg) Figure 5: In this example, PF00002–PF00005 are displayed as results because these neighboring genes are within GN1 (input). Note: The accession IDs of GN1 neighboring genes will be displayed as results if “Genome Neighborhood Accessions” is selected as the output type. The number of retrieved neighboring genes depends on the neighborhood size specified when generating the GNN SQLite file from the EFI-GNT website.

Retrieval of genome neighborhood information from the given Pfam ID(s).

![Figure 6](/images/Figure S6.jpg) Figure 6: In this example, all genome neighborhoods are displayed as results, with the number of matching Pfam IDs in parentheses. Moreover, a CSV file can be retrieved to show which Pfam IDs match in each genome neighborhood. Note: The secondary input is an optional choice to pre-filter the Pfam IDs. If “PF00002” is applied as the secondary input, only GN1–GN3 will be displayed as results, as they are the only genome neighborhoods containing a neighboring gene from PF00002 (Figure 1).
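
The comparison described for Figure 6 amounts to counting, per genome neighborhood, how many of the query Pfam IDs are present, after an optional pre-filter. The following is a minimal sketch on toy data. The dictionary layout, function names, and CSV column layout are assumptions made for illustration; the CSV actually produced by BioSynthNexus may be organized differently.

```python
import csv
import io


def pfam_comparison(neighborhood_pfams, query_pfams, prefilter_pfam=None):
    """Return 'NeighborhoodID_(count)' lines, one per neighborhood.

    If prefilter_pfam is given, neighborhoods lacking that Pfam are skipped.
    Neighborhoods with zero matches are still reported.
    """
    query = set(query_pfams)
    lines = []
    for gn_id, pfams in neighborhood_pfams.items():
        if prefilter_pfam is not None and prefilter_pfam not in pfams:
            continue
        lines.append(f"{gn_id}_({len(query & set(pfams))})")
    return lines


def comparison_csv(neighborhood_pfams, query_pfams):
    """Build CSV text listing which query Pfams match in each neighborhood."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["neighborhood_id", "matching_pfams"])
    for gn_id, pfams in neighborhood_pfams.items():
        matched = sorted(set(query_pfams) & set(pfams))
        writer.writerow([gn_id, ";".join(matched)])
    return buffer.getvalue()


# Toy Pfam content per neighborhood (invented for this example).
toy = {
    "GN1": {"PF00002", "PF00003", "PF00004", "PF00005"},
    "GN2": {"PF00002", "PF00003"},
    "GN3": {"PF00002"},
    "GN4": {"PF00004"},
}

all_counts = pfam_comparison(toy, ["PF00002", "PF00003"])
filtered = pfam_comparison(toy, ["PF00002", "PF00003"], prefilter_pfam="PF00002")
```

Here `all_counts` covers every neighborhood, including GN4 with zero matches, while `filtered` keeps only GN1 to GN3 because of the PF00002 pre-filter.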

References

  1. Rémi Zallot, Nils Oberg, and John A. Gerlt, The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019 58 (41), 4169-4182.
  2. Nils Oberg, Rémi Zallot, and John A. Gerlt, EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023.
  3. The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2023, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D523–D531.