pbradleylab/phylogenize-db-prep

Generation of Protein-Clustered Pangenome Databases

For species-level pangenomic comparisons, it can be useful to have protein-clustered pangenomes in a taxa matrix. A matrix of clusters is also easier to manage than individual pangenomes, which makes it useful when developing pangenomic analysis tools. Phylogenize is a tool that, as described by its developers, "links genes in microbial genomes to either microbial prevalence in, or specificity for, a given environment, while also taking into account an important potential confounder: the phylogenetic relationships between microbes". Phylogenize uses protein-level databases, so a complementary way to generate new databases easily helps extend its reach. We developed this workflow to efficiently turn nucleotide pangenomes into new databases. We also encourage the community to contribute to this effort by submitting pull requests for databases to include, or by submitting the final databases generated via this workflow to the developers of Phylogenize or to this repository.

Running The Workflow

This workflow expects that you have conda installed before starting. Conda is generally easy to install and lets you install the other dependencies this workflow needs. You will then need to download this repository with git clone, for example git clone git@github.com:Kekananen/phylogenize-db-prep.git.

  1. Edit the raw_data: string in config/pepconfig.yml so that it points to where your files are located. Use the full path to avoid errors. All of your files need to be in the same directory for this workflow. If you have many files, you can symlink them into one directory to avoid using extra space. The workflow also assumes that the file names end with .ffn; files without this extension won't be seen. However, you can give the symlinks the .ffn extension when you create them, which avoids renaming the actual files before running the workflow.
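The symlink approach can be sketched as a small shell function. This is only an illustration: the function name link_as_ffn, its arguments, and the example source extension are assumptions, not part of the workflow. It links every file with a given extension into one directory under a .ffn name and prints that directory's full path, which is the value to use for raw_data:.

```shell
# Sketch: gather pangenome files into one directory as .ffn symlinks.
# Usage: link_as_ffn SRC_DIR DEST_DIR EXT
#   SRC_DIR  directory holding the original files (hypothetical)
#   DEST_DIR directory to fill with symlinks; created if missing
#   EXT      current extension of the originals, without the dot
# The body runs in a subshell, so its variables and cd calls stay local.
link_as_ffn() (
    src_dir=$(cd "$1" && pwd) || exit 1     # absolute path to the originals
    mkdir -p "$2" || exit 1
    dest_dir=$(cd "$2" && pwd) || exit 1    # absolute path for raw_data:
    ext=$3
    for f in "$src_dir"/*."$ext"; do
        [ -e "$f" ] || continue             # the glob matched nothing
        base=$(basename "$f" ".$ext")
        # Only the link gets the .ffn name; the original file is untouched.
        ln -sf "$f" "$dest_dir/$base.ffn"
    done
    echo "$dest_dir"
)
```

For example, link_as_ffn /path/to/pangenomes /path/to/raw_data fna would link every .fna file in the first directory into the second under a .ffn name. Both paths here are placeholders.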

Dependencies and Installation

You only need snakemake installed in an environment. The snakemake version must be 7.0 or newer; the workflow enforces this internally and won't run otherwise. If conda does not select the version you want, you can pin one explicitly, for example conda install -c bioconda snakemake=7.0.0. If the install takes a long time, you can install mamba into your snakemake environment and use mamba install instead. If mamba is installed, snakemake will use it internally to create the internal environments and install dependencies, which can save a lot of time on the initial run.
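Before launching the workflow, it may help to check which snakemake version the active environment provides. The sketch below is one way to do that. The helper name version_at_least is hypothetical, it relies on sort -V being available (GNU coreutils and recent BSD/macOS), and treating 7.0 itself as acceptable is an assumption based on the install example above.

```shell
# Sketch: check that the active environment has a new enough snakemake.
# version_at_least HAVE WANT succeeds when HAVE >= WANT in version order.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="7.0"   # assumption: minimum version taken from the text above

if command -v snakemake >/dev/null 2>&1; then
    installed=$(snakemake --version)
    if version_at_least "$installed" "$required"; then
        echo "snakemake $installed is new enough"
    else
        echo "snakemake $installed is older than $required"
        echo "try: conda install -c bioconda snakemake=7.0.0"
    fi
else
    echo "snakemake not found; activate the environment where it is installed"
fi
```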

Output Generated

The final results will be in results/your_database_name/final/, and all intermediate results are in results/your_database_name/specific_tool_name. If space is limited and you don't wish to retain some of these files, wrap the output of each rule whose files you don't need in temp(), like output: temp(rules.something.output), and the file will be kept only for as long as other rules need it. Temporary files are retained if a rule that requires them as input fails, and are removed once all rules requiring them succeed.
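When deciding which intermediate outputs are worth marking as temporary, it can help to see how much space each tool directory takes. The following is a rough sketch, not part of the workflow: the function name report_intermediate_sizes is hypothetical, and it assumes the results/your_database_name layout described above, skipping the final directory.

```shell
# Sketch: list the size of each intermediate tool directory, largest first.
# Usage: report_intermediate_sizes RESULTS_DIR
#   RESULTS_DIR is results/your_database_name (placeholder name).
# Output lines are "<size in KiB><TAB><directory name>".
report_intermediate_sizes() (
    cd "$1" || exit 1
    for d in */; do
        [ -d "$d" ] || continue                # no subdirectories present
        name=${d%/}
        [ "$name" = "final" ] && continue      # keep the final results out of the list
        du -sk "$name"
    done | sort -rn
)
```

For example, report_intermediate_sizes results/your_database_name prints one line per tool directory, so the largest ones are easy to spot.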

Submission Process

Please contact either Kathryn Kananen or Patrick Bradley if you wish to submit to the phylogenize databases being used. If you need help transferring the final matrix to us, please contact Kathryn Kananen, who can help facilitate the transfer of the larger file.

About

This workflow is meant to be a complement to the Phylogenize package by Bradley and Pollard, 2020. Databases are generated at the species level with protein clustering for the package to use. The workflow starts from nucleotide-level pangenomes and generates a taxonomy matrix of peptide clusters. It can then be run on the backend for databases
