Generation of Protein-Clustered Pangenome Databases

For species-level pangenomic comparisons, it can be useful to have protein-clustered pangenomes in a taxa matrix. A matrix of clusters is also easier to manage than individual pangenomes, which makes it useful when developing pangenomic analysis tools. Phylogenize is a tool that, as described by its developers, "links genes in microbial genomes to either microbial prevalence in, or specificity for, a given environment, while also taking into account an important potential confounder: the phylogenetic relationships between microbes". Phylogenize uses protein-level databases, so it is important to have a complementary way to generate new databases easily and extend its reach. We developed this workflow to efficiently process nucleotide pangenomes into new databases. We also encourage the community to contribute to this effort by submitting pull requests for databases to include, or by submitting the final databases generated via this workflow to the developers of Phylogenize or to this repository.

Running The Workflow

This workflow expects that you have conda installed before starting. Conda is generally easy to install and lets you install the other dependencies needed by this workflow. You will then need to download this repository with git clone, for example git clone git@github.com:Kekananen/phylogenize-db-prep.git.

Edit the raw_data: string in config/pepconfig.yml to point to the directory where your files are located. Use the full path to avoid errors. All your files need to be in the same directory for this workflow. If you have many files, you can symlink them into one directory to avoid using additional space. The workflow also assumes that the files end with .ffn; files without this extension won't be detected. However, you can give the symlinks names ending in .ffn when you create them, which avoids renaming the actual files before running the workflow.
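The symlinking approach described above could be sketched as follows. This is a minimal example and not part of the workflow itself: the helper name link_as_ffn, the example paths, and the fna source extension are all hypothetical placeholders.

```shell
# Hypothetical helper (not part of this repository): link every file in
# src_dir that ends in .<ext> into dest_dir, naming each link <name>.ffn.
# The body runs in a subshell, so its variables do not leak to the caller.
link_as_ffn() (
    src_dir=$1; dest_dir=$2; ext=$3
    mkdir -p "$dest_dir"
    # Resolve the source directory to an absolute path so the links stay
    # valid regardless of where the workflow is launched from.
    src_abs=$(cd "$src_dir" && pwd) || exit 1
    for f in "$src_abs"/*."$ext"; do
        [ -e "$f" ] || continue              # glob matched nothing
        name=$(basename "$f" ".$ext")
        ln -sf "$f" "$dest_dir/$name.ffn"    # original file is left untouched
    done
)

# Example call (placeholder paths):
# link_as_ffn /data/species_a /data/all_ffn fna
```

The destination directory is then the one to set as raw_data: in config/pepconfig.yml.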

Dependencies and Installation

You only need to have snakemake installed in an environment. The snakemake version must be at least 7.0, or the workflow won't run, since this is enforced internally. If conda selects a different version than you want, you can pin one explicitly, for example conda install -c bioconda snakemake=7.0.0. If this takes a long time, you can install mamba into your snakemake environment and use mamba install instead. If mamba is installed, snakemake will use it internally to create the internal environments and install dependencies, which can save a lot of time on the initial run.
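Before launching the workflow, the version requirement can be checked up front. The sketch below is only an illustration: snakemake_version_ok is a hypothetical helper, and it compares the major version number only, which is sufficient for a 7.0 minimum.

```shell
# Hypothetical helper: succeed when the major component of a version
# string (for example "7.0.0") is 7 or higher.
snakemake_version_ok() {
    major=${1%%.*}        # strip everything from the first dot onward
    [ "$major" -ge 7 ]
}

# Usage (assumes snakemake is on PATH in the active environment):
# snakemake_version_ok "$(snakemake --version)" \
#     || echo "snakemake 7.0 or newer is required" >&2
```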

Output Generated

The final results will be in results/your_database_name/final/. All intermediate results are in results/your_database_name/specific_tool_name. If space is limited and you don't wish to retain some of these files, wrap the output of each rule whose files you don't need in temp(), for example output: temp(rules.something.output), and the file will only be kept temporarily for the rules that need it. The files are retained if a rule that requires them as input fails, but are removed once all rules requiring them succeed.

Submission Process

Please contact either Kathryn Kananen or Patrick Bradley if you wish to submit a database for inclusion in the Phylogenize databases. If you need help transferring the final matrix to us, please contact Kathryn Kananen to arrange a transfer of the larger file.
