Regarding zol & fai usage on viral contigs #86

Open
ShailNair opened this issue Dec 2, 2024 · 2 comments


@ShailNair

Hi,

Thank you for the fantastic package! The preprint is very well-written and highlights many interesting applications of Zol and Fai. I am particularly interested in the viral aspect and had a question regarding the study mentioned in the preprint. Why was the analysis limited to detecting viral contigs across metagenomes, without further analyses into the evolutionary aspects of the analyzed viral contigs? For example, exploring genomic variability among similar viral contigs over time.

Additionally, I am interested in using Zol and Fai to investigate how viral contigs evolve over time: whether they acquire or lose genes or modify existing ones. I have viral contigs identified from metagenomes of the same sample collected across different time points. As I understand it, the steps are:

  1. Creating a prepTG database of assemblies:
    Should this database be created from a combined/coassembled assembly, or should a separate prepTG database be created for each individual assembly from individual metagenomes?

  2. Searching for viral contigs in the database using Fai:
    How should this be provided? Should I subset the viral.contigs.fasta/viral.contigs.faa file to include only similar viral contigs (at the species level), or can I provide the entire viral.contigs.fasta/viral.contigs.faa file?

Thank you and wish you good luck with the preprint.

@raufs
Contributor

raufs commented Dec 2, 2024

Hi Shail,

Thank you for your interest in the package, kind feedback, and great questions!

We did attempt to use zol to further investigate the instances of the virus we detected in the lake metagenomes in the preprint. However, among the confidently detected instances, sequence conservation was very high, even across the 2-3 month sampling period, and most coding genes were tricky to annotate functionally. Other viruses can of course carry auxiliary metabolic genes or other types of cargo whose conservation would be interesting to track.

Sounds great. Yes, you can run prepTG to first perform gene calling on your viral contigs from metagenomes and prepare files for searching with fai. I recommend pyrodigal-gv as the gene-calling method. If you have already identified viral contigs, providing only those should save disk space, since the prepTG output will then not replicate your full metagenome assemblies.

fai performs targeted detection, so, similar to what we did in the paper, you could identify distinct viruses in your earliest time samples and then search for them in the later time samples. Each virus identified across multiple timepoints can then be provided as input to zol for conservation/evolutionary analysis. This will be rather manual: you would need to set up a script to do this for each distinct virus in your earliest timepoint, and it will miss viruses that don't show up in your earliest sample.
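A minimal sketch of such a per-virus script, written in Python. It only builds one fai command per query GenBank file and does not run anything. The flags `-i` (query), `-tg` (prepTG database) and `-o` (output directory) are assumptions based on the usual zol suite usage, so check them against `fai -h` for your installed version; the function name `build_fai_commands` and the directory names in the usage note are hypothetical.

```python
from pathlib import Path


def build_fai_commands(query_dir, target_db, out_root, fai_exe="fai"):
    """Return one fai command (as an argument list) per query GenBank file.

    query_dir : directory of GenBank files, one per virus from the earliest timepoint
    target_db : prepTG database built from the later timepoints
    out_root  : parent directory; each query gets its own results subdirectory
    """
    commands = []
    for query in sorted(Path(query_dir).glob("*.gbk")):
        out_dir = Path(out_root) / query.stem
        # Assumed flags: -i query, -tg target database, -o output directory.
        commands.append(
            [fai_exe, "-i", str(query), "-tg", str(target_db), "-o", str(out_dir)]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the flags have been verified, for example with `build_fai_commands("earliest_timepoint_gbks", "prepTG_later_timepoints", "fai_results")`.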

Alternatively, a strategy that might interest you is to use the program zol-scape. Say you have all your viral contigs in nucleotide FASTA format, with no gene calling: you can run prepTG to perform gene calling and create a GenBank file for each. Your viral contigs will then be in GenBank format in the Genomic_Genbanks_Additional/ subdirectory of prepTG's results folder. Next, you can use BiG-SCAPE (not part of zol, but it can be set up separately using conda) to cluster your viral contigs. Note that BiG-SCAPE is meant to be run on BGC predictions from antiSMASH, but the algorithm should be robust and relies on domain similarity, so you can just adjust the parameters slightly by specifying: --include_gbk_str=* --mix. I found it more straightforward to run than virus-specific alternatives like vConTACT2. This will give you a clustering file in which instances of a virus from different metagenomes are clustered together, so you can see whether the same virus exists in multiple samples. You can then either run zol-scape to comprehensively investigate all viral clusters with zol, or manually run zol on viral gene clusters of interest one by one.
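To check which clusters contain the same virus in more than one sample, the clustering file can be filtered with a few lines of Python. This is a rough sketch under two assumptions: the clustering file is tab-separated with the contig name in the first column and the family number in the second, with comment lines starting with `#` (inspect your own BiG-SCAPE output to confirm), and contig names carry a sample prefix such as `MG-1_VC-1`. The function name `multi_sample_families` is hypothetical.

```python
from collections import defaultdict


def multi_sample_families(clustering_text, sample_sep="_VC-"):
    """Map family ID -> sorted contig names, keeping only families
    whose members come from more than one sample.

    The sample ID is taken as the part of the contig name before sample_sep.
    """
    families = defaultdict(list)
    for line in clustering_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        name, family = line.split("\t")[:2]
        families[family].append(name)

    shared = {}
    for family, names in families.items():
        samples = {name.split(sample_sep)[0] for name in names}
        if len(samples) > 1:
            shared[family] = sorted(names)
    return shared
```

Families that survive this filter are the natural candidates to pass to zol, since they have instances from at least two metagenomes to compare.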

> Should this database be created from a combined/coassembled assembly, or should a separate prepTG database be created for each individual assembly from individual metagenomes?

Great question. I think that if you want to track changes between assemblies, it is better to have separate assemblies for individual metagenomes. The tradeoff is that you will have less coverage and might assemble fewer viral contigs. Co-assembly is therefore probably better for detection (because mapping reads to the co-assembly tells you which individual metagenomic samples contain each contig/virus), but it complicates looking at evolutionary differences between samples for a virus with zol. There are also read-based approaches such as MetaPop (https://link.springer.com/article/10.1186/s40168-022-01231-0).

> How should this be provided? Should I subset the viral.contigs.fasta/viral.contigs.faa file to include only similar viral contigs (at the species level), or can I provide the entire viral.contigs.fasta/viral.contigs.faa file?

You can create a comprehensive database of viral contigs (e.g. predicted using VIBRANT or something similar) from all individual metagenomic samples. To make mapping easier, each file name can include a unique identifier for both the metagenomic assembly and the viral contig (e.g. MG-1_VC-1.fasta). You can then find homologous instances of a virus using fai with a single query virus. The query can be a FASTA of proteins or a GenBank file for the individual virus (e.g. one from the prepTG database). So you could write a wrapper script to search each virus from prepTG (in Genomic_Genbanks_Additional/) against the full prepTG database to see which other viruses it is similar to, or you could use the BiG-SCAPE/zol-scape based approach described above to perform more comprehensive clustering. Generally, we recommend loose parameters for maximal sensitivity when searching with fai, to help minimize false negatives.
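The naming convention can be applied with a small helper like the one below, which splits a multi-FASTA of viral contigs from one sample into one file per contig. It is only an illustration: the function name `split_viral_contigs` and the choice to number contigs in file order are assumptions, and the returned mapping is there so the original headers are not lost.

```python
from pathlib import Path


def split_viral_contigs(fasta_path, sample_id, out_dir):
    """Write each record of a multi-FASTA to <sample_id>_VC-<n>.fasta.

    Returns a dict mapping each new name to the original FASTA header.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    mapping = {}
    handle = None
    count = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                if handle:
                    handle.close()
                count += 1
                new_name = f"{sample_id}_VC-{count}"
                mapping[new_name] = line[1:].strip()
                handle = open(out_dir / f"{new_name}.fasta", "w")
                handle.write(f">{new_name}\n")  # rename the record as well
            elif handle:
                handle.write(line)  # sequence lines are copied unchanged
    if handle:
        handle.close()
    return mapping
```

Running this once per sample, for example `split_viral_contigs("viral.contigs.fasta", "MG-1", "viral_contigs_for_prepTG")`, yields a single directory of uniquely named files in which the sample of origin can be read directly from the name.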

@ShailNair
Author

ShailNair commented Dec 3, 2024

Thank you for the detailed explanation, @raufs. Using BiG-SCAPE followed by zol-scape sounds like a good approach. We've already set up the BiG-SCAPE Conda environment, so I'll give it a try with my viral contigs.
