Skip to content
xonq edited this page Jul 19, 2024 · 35 revisions

Recommended workflow overview

CLOCI fundamentally requires an input dataset of multiple genomes, and will predict homologous loci - and gene cluster families by extension - across the dataset. Single genome input options will be available in the future, though it is currently important to adequately sample a cluster's distribution to detect it. I generally recommend implementing CLOCI at least at the subphylum-level, though throughput will be greatly enhanced by focusing on a taxonomic order. The ideal taxonomic scale varies depending on the lineage's rate of microsynteny decay and the phylogenetic distance with which horizontal transfer occurs.

CLOCI can run without a reference phylogenomic tree and reference single-copy genes, although microsynteny tree reconstruction may fail to identify near single-copy genes on its own, or the topology of the microsynteny tree may be discordant with the true evolution of your inputted genomes. We thus recommend the following workflow:

  1. Generate a MycotoolsDB with your genomes of interest

  2. Generate a phylogenomic tree to constrain the topology of a microsynteny tree

  3. Prepare a file of gene accessions used for microsynteny construction, which consist of gene families that exist in single-copy or near single-copy abundance across the dataset. You can identify these de novo, or use a HMM search using proteins that were determined to be in near single-copy across Fungi in the CLOCI manuscript.

  4. Run CLOCI

  5. Add filtering parameters to CLOCI output to extract gene cluster families (GCFs) from homologous locus groups (HLGs)



Input dataset

I recommend generating a preassembled MycotoolsDB as input because -i input is still undergoing alpha testing.

Inputting genomes with a MycotoolsDB (-d)

To generate a MycotoolsDB of publicly available genomes from a subphylum:

mtdb u -i <INIT_DIRECTORY>/ -l <LINEAGE> -rk subphylum --ncbi_only

Adding your own genomes to the resulting MycotoolsDB is elaborated on in the Mycotools usage wiki. You can also initialize a MycotoolsDB solely from local genome data.

Inputting genomes without a MycotoolsDB (-i)

Alternatively, CLOCI inputs a tab-delimitted file of genome metadata with the following columns via -i:

#genus	species	strain	assembly_path	gffpath

Generating a robust microsynteny tree

CLOCI identifies syntenic loci while referencing a microsynteny phylogeny that accurately depicts divergence in gene order between genomes. While CLOCI attempts to automatically detect near single-copy gene family neighborhoods for reconstructing this tree, it is recommended to explicitly input these near single-copy genes using the -f argument referencing a file of reference genes, separated by lines. These are ideally single-copy genes inferred by OrthoFinder, near single-copy genes determined by db2hgs, or using a precompiled set of HMMs that are in near single-copy abundance across fungal genomes. NOTE: the gene accessions inputted here must correspond to the accessions used by CLOCI. If using -i for CLOCI, gene accessions will be modified from your input and must be translated to the new accessions used by CLOCI. You can find the modified GFFs after initially running CLOCI: cloci_<YYYYmmdd>/mycotoolsdb/data/gff3/*gff3

Additionally, microsynteny topology largely recapitulates observed phylogenomic relationships, though it is ideal to constrain the topology of the microsynteny tree to a robust, rooted, reference phylogenomic tree via the -c argument. The inputted tree can aditionally be rooted by using -r and selecting genome(s) to derive the outgroup branch from. NOTE: the phylogenomic tree tips must be the same as the genome codenames inputted into CLOCI. If using -i for CLOCI, genome accessions will be modified from your input and your phylogenomic tree tips must correspond to these new accessions, which can be located in the database generated after initially running CLOCI: cloci_<YYYYmmdd>/mycotoolsdb/mtdb/*.mtdb



Examples

1 Run CLOCI on a full preassembled MycotoolsDB with a reference phylogenomic tree

cloci -d $(mtdb) --cpus <CPUS> --root "<GENOME_1>,<GENOME_2>" \
-c <PHYLOGENOMIC_TREE> -f <FILE_OF_SINGLE_COPY_ACCESSIONS>

Once complete, evaluate and extract GCFs using proxies of coordinated gene evolution:

cloci -o <PREVIOUS_CLOCI_OUTPUT_DIR> -pt <PDS_THRESHOLD> -ct <CSB_THRESHOLD> \
-gt <GCL_THRESHOLD> -tt <TMD_THRESHOLD> --cpus <CPUS> --root \
"<GENOME_1>,<GENOME_2>" -c <PHYLOGENOMIC_TREE> \
-f <FILE_OF_SINGLE_COPY_ACCESSIONS>

2 Run CLOCI on a subset of a comprehensive MycotoolsDB without a constraint

mtdb e -l Agaricomycotina > agaricomycotina.mtdb

Run CLOCI rooting upon the MRCA of two inputted genomes

cloci -d agaricomycotina.mtdb --root "<OME1>,<OME2>"



Hyperparameters

Default values

CLOCI default parameters have been tuned for our initial dataset on ~2,250 fungi across the kingdom. These should suffice for circumscribing homologous loci in most analyses, though the gene cluster family filtering parameters are ideally determined referencing known clusters from your particular dataset. By default, thresholds for all proxies of coordinated gene evolution are set to 0, indicating that These thresholds will vary for the type of clusters of interest and the lineage, though some common reference cluster values are shown below. I recommend compiling a dataset of known cluster reference genes, running CLOCI, identifying those genes in the output, determining the values for the reference cluster proxies, and then implementing the thresholds.

There are numerous hyperparameters that will drastically affect output quality. I suspect our pilot study reached a local maximum in terms of output quality, though a global maximum perhaps lies with further hyperparameter tuning.

Reference gene cluster family characterization

TMD: Total microsynteny distance

GCL: Gene commitment to the locus

MMI: Mean minimum identity

MMP: Mean minimum positive

CSB: Conservative subsitution bias

PDS: Phylogenetic distribution sparsity

TMD: total microsynteny distance, PDS: phylogenetic distribution sparsity, GCL: Gene commitment to the locus, CSB: conservative subsitution bias


Options

Input options

-i --input: Tab-delimitted reference file with columns "genus, species, strain, assembly path, gff path". This step will curate inputted genomes into a MycotoolsDB in the CLOCI output directory. This input implementation will change the accessions and codenames of the inputted genomes, which makes it incompatible with downstream arguments --focal_genes, --constraint, and --root. To circumnavigate these incompatibilities, users can generate a MycotoolsDB de novo or by first using -i, then using the curated genomes in the MycotoolsDB for reference.

-d --database: Preexisting MycotoolsDB of genomes of interest. The MycotoolsDB must be linked, please review the wiki for Mycotools usage.

--pfam: Annotate outputted homologous locus groups (HLGs) use "Pfam-A.hmm"

--interpro: Annotate outputted HLGs using "interproscan.sh" [NOT IMPLEMENTED]


Microsynteny tree options

-f --focal_genes: File of genes to identify homology groups of for neighborhood extraction used in microsynteny tree reconstruction. [DOES NOT WORK FOR -i inputs]

-c --constraint: Phylogenomic constraint for microsynteny topology. This option is highly recommended if a precomputed microsynteny tree is not inputted. The tips of this constraint tree must be the same as the tips of the microsynteny tree. [DOES NOT WORK FOR -i inputs]. Mycotools automates a large portion of the leg-work for a multi-locus phylogenomic tree derived from single-copy orthologs determined from the dataset.

-r --root: Genome(s) to root upon, if multiple genomes are inputted then it will use the most recent common ancestor branch of the inputted genomes. If this MRCA branch overlaps with the current root then the tree will NOT be rerooted. By default, the tree will be rooted at the midpoint. [DOES NOT WORK FOR -i inputs]

-t --tree: Precomputed microsynteny tree, which can be generated via db2microsyntree


Detection options

-of --orthofinder: Use OrthoFinder orthogroups as homology groups [TESTING]

-g --homology_groups: Precomputed homology group results file (gene homology_group)

-l --linclust: Use mmseqs linclust as a fast homology group circumscription algorithm. By default, CLOCI uses mmseqs cluster

-w --window: Gene +/- sliding window for locus sizes at each step. By default, CLOCI uses -w 2, which entails a 5 gene sliding window.

-mmd --maximum_dist: Use maximum microsynteny distance (MMD) instead of total microsynteny distance (TMD). An argument for MMD is that it mitigates sampling bias and weights loci that have been conserved across larger distances. CLOCI defaults to TMD.

-us --unique_sp: Only consider one genome for species with multiple genomes in microsynteny distance calculations.


Null options

-nr --null_rank: Taxonomic rank used for generating null models for microsynteny decay. By default, CLOCI uses the species as microsynteny decay rate can change drastically within some genera.

-np --null_partitions: Manually curated set of lineages for partitioning null distributions from. The file contains a line depicting the lineage with a comment - #<LINEAGE> - followed by a line containing the genome names separated by spaces - <OME1> <OME2> <OME3>.

-ns --null_sample: Samples to obtain for null distributions; DEFAULT: 10,000


Homologous locus group aggregation parameters

-a --aligner: Alignment algorithm. CLOCI defaults to diamond, whereas other algorithms listed are under [TESTING]

-sa --sensitive_align: Use a more sensitive alignment option for the associated algorithm

-s --similarity: Similarity coefficient for second interation of HLG circumscription - [J]accard, [O]verlap, [S]orensen; DEFAULT: Sorensen. The Sorensen and Jaccard indices weight against missing HGs in locus-locus comparisons, with the Sorensen-Dice index weighting less strongly. These coefficients will dramatically alter output results.

-mg --minimum_gene_id: Minimum percent identity [30 < Value < 100] for gene-gene comparisons; DEFAULT: 45

-ml --minimum_loc_id: Minimum locus-wide percent identity [0 < Value < 100] to depict on locus-locus similarity graph; DEFAULT: 30

-ts --min_topology_sim: Percent [0 < Value < 100] Jaccard phylogenetic topology similarity minimum for merging singletons AND merging loci prior to the second iteration of HLG circumscription; DEFAULT: 25. This option will affect what singleton genes and adjacent loci may be combined based on the similarity of the phylogenetic distribution of the adjacent locus fragments.

-I1 --inflation_rnd1: Markov clustering (MCL) round 1 inflation value, which affects the granularity of locus domains and merged loci; DEFAULT 1.1

-I2 --inflation_rnd2: MCL round 2 inflation value, which affects the granularity of HLGs/GCFs; DEFAULT 1.3

-T --tune: Tune inflation values to a subset of reference clusters represented in a tab-delimited file: " <CONSERVED_HGS/GENES_IN_CLUSTER> <GENOMES_IN> <GENOMES_NOT_IN>" - this option will tune the granularity of HLGs to match the reference clusters depicted in the inputted file by iteratively running MCL until the desired granularity is obtained.

-hp --hgp_percentile: Percentile of homology group pair (HGp) distances to extract relative to the null distribution; DEFAULT 20. This percentile is set to increase downstream computational throughput while obtaining as comprehensive of a set of HGps as possible. HGps are used as seed loci for determining higher-order combinations.

-xp --hgx_percentile: Percentile of homology group combination (HGx) distances to extract relative to the null distribution of the lineage and HGx size; DEFAULT 61. This percentile affects the loci that are extracted and inputted into locus aggregation - if the threshold becomes too low then loci may be erroneously aggregated; if the threshold becomes too high then loci may be incomplete because only the most supported portions of loci are included, whereas genes on the flanks may be excluded.


GCF filtering options

The following filtering parameters are based on proxies for selection on coordinated gene evolution and involve math and explanations elaborated on in the methods and supplementary section of the original CLOCI manuscript. Please refer to this manuscript for elaboration.

-ip --id_percent: Mean minimum percent identity (MMI) to report a gene cluster family (GCF)

-pp --pos_percent: Mean minimum percent positive (MMP) to report a GCF

-ct --csb_threshold: Threshold [0 < Value < 1] of conservative substitution bias for to report a GCF

-pt --pds_threshold: Threshold [0 < Value < 1] of GCF phylogenetic distribution sparsity to report a GCF

-gt --gcl_treshold: Threshold [0 < Value < 1] of gene committment to the locus

-tt --md_threshold: Threshold [0 < Value < 1] of log-normalized GCF microsynteny distances to extract


Runtime options

--n50: Minimum assembly N50 to include genome in dataset

--stop: Export HG-HG alignment commands to parallelize and stop running

--skip: Ignore missing HG-HG alignments and assume they are failures

--fallback: Fallback to diamond when alignments fail

-n --new: Rerun with new parameters and overwrite incompatible data

--force: Force a rerun without bypassing to GCF filtration

--compress: Compress run [TESTING]

--cpus


Alternative directories

-o --output_dir: By default, CLOCI will output to cloci_<YYYYmmdd>

-hg --hg_dir: Directory of proteome fastas that correspond to each HG, named .faa

-hgx --hgx_dir: Directory of HGx alignment databases and results, with the databases labeled <HG>.dmnd/<HG>.mmseqs and results labeled <HG>.out