-
Notifications
You must be signed in to change notification settings - Fork 2
Usage
CLOCI fundamentally requires an input dataset of multiple genomes, and will predict homologous loci - and gene cluster families by extension - across the dataset. Single genome input options will be available in the future, though it is currently important to adequately sample a cluster's distribution to detect it. I generally recommend implementing CLOCI at least at the subphylum-level, though throughput will be greatly enhanced by focusing on a taxonomic order. The ideal taxonomic scale varies depending on the lineage's rate of microsynteny decay and the phylogenetic distance with which horizontal transfer occurs.
CLOCI can run without a reference phylogenomic tree and reference single-copy genes, although microsynteny tree reconstruction may fail to identify near single-copy genes on its own, or the topology of the microsynteny tree may be discordant with the true evolution of your inputted genomes. We thus recommend the following workflow:
-
Generate a MycotoolsDB with your genomes of interest
-
Generate a phylogenomic tree to constrain the topology of a microsynteny tree
-
Prepare a file of gene accessions used for microsynteny construction, which consist of gene families that exist in single-copy or near single-copy abundance across the dataset. You can identify these de novo, or use a HMM search using proteins that were determined to be in near single-copy across Fungi in the CLOCI manuscript.
-
Run CLOCI
-
Add filtering parameters to CLOCI output to extract gene cluster families (GCFs) from homologous locus groups (HLGs)
I recommend generating a preassembled
MycotoolsDB as input because -i
input is still undergoing alpha testing.
To generate a MycotoolsDB of publicly available genomes from a subphylum:
mtdb u -i <INIT_DIRECTORY>/ -l <LINEAGE> -rk subphylum --ncbi_only
Adding your own genomes to the resulting MycotoolsDB is elaborated on in the Mycotools usage wiki. You can also initialize a MycotoolsDB solely from local genome data.
Alternatively, CLOCI inputs a tab-delimitted file of genome metadata with the following
columns via -i
:
#genus species strain assembly_path gffpath
CLOCI identifies syntenic loci while referencing a microsynteny phylogeny that
accurately depicts divergence in gene order between genomes. While CLOCI
attempts to automatically detect near single-copy gene family neighborhoods for
reconstructing this tree, it is recommended to explicitly input these
near single-copy genes using the -f
argument referencing a file of reference genes,
separated by lines. These
are ideally single-copy genes inferred by OrthoFinder, near single-copy
genes determined by
db2hgs
,
or using a precompiled set of
HMMs
that are in near single-copy abundance across fungal genomes.
NOTE: the gene accessions inputted here must correspond to the accessions used by
CLOCI. If using -i
for CLOCI, gene accessions will be modified from your
input and must be translated to the new accessions used by CLOCI. You can
find the modified GFFs after initially running CLOCI:
cloci_<YYYYmmdd>/mycotoolsdb/data/gff3/*gff3
Additionally, microsynteny topology largely recapitulates
observed phylogenomic relationships, though it is ideal to constrain the
topology of the microsynteny tree to a robust, rooted, reference phylogenomic
tree via the -c
argument. The inputted tree can aditionally be rooted by
using -r
and selecting genome(s) to derive the outgroup branch from. NOTE: the phylogenomic tree tips must be the same as
the genome codenames inputted into CLOCI. If using -i
for CLOCI, genome
accessions will be modified from your input and your phylogenomic tree tips
must correspond to these new accessions, which can be located in the database
generated after initially running CLOCI: cloci_<YYYYmmdd>/mycotoolsdb/mtdb/*.mtdb
cloci -d $(mtdb) --cpus <CPUS> --root "<GENOME_1>,<GENOME_2>" \
-c <PHYLOGENOMIC_TREE> -f <FILE_OF_SINGLE_COPY_ACCESSIONS>
Once complete, evaluate and extract GCFs using proxies of coordinated gene evolution:
cloci -o <PREVIOUS_CLOCI_OUTPUT_DIR> -pt <PDS_THRESHOLD> -ct <CSB_THRESHOLD> \
-gt <GCL_THRESHOLD> -tt <TMD_THRESHOLD> --cpus <CPUS> --root \
"<GENOME_1>,<GENOME_2>" -c <PHYLOGENOMIC_TREE> \
-f <FILE_OF_SINGLE_COPY_ACCESSIONS>
mtdb e -l Agaricomycotina > agaricomycotina.mtdb
Run CLOCI rooting upon the MRCA of two inputted genomes
cloci -d agaricomycotina.mtdb --root "<OME1>,<OME2>"
CLOCI default parameters have been tuned for our initial dataset on ~2,250 fungi across the kingdom. These should suffice for circumscribing homologous loci in most analyses, though the gene cluster family filtering parameters are ideally determined referencing known clusters from your particular dataset. By default, thresholds for all proxies of coordinated gene evolution are set to 0, indicating that These thresholds will vary for the type of clusters of interest and the lineage, though some common reference cluster values are shown below. I recommend compiling a dataset of known cluster reference genes, running CLOCI, identifying those genes in the output, determining the values for the reference cluster proxies, and then implementing the thresholds.
There are numerous hyperparameters that will drastically affect output quality. I suspect our pilot study reached a local maximum in terms of output quality, though a global maximum perhaps lies with further hyperparameter tuning.
TMD: Total microsynteny distance
GCL: Gene commitment to the locus
MMI: Mean minimum identity
MMP: Mean minimum positive
CSB: Conservative subsitution bias
PDS: Phylogenetic distribution sparsity
-i --input
: Tab-delimitted reference file with columns "genus, species,
strain, assembly path, gff path". This step will curate inputted genomes into a
MycotoolsDB in the CLOCI output directory. This input implementation will
change the accessions and codenames of the inputted genomes, which makes it
incompatible with downstream arguments --focal_genes
,
--constraint
, and --root
. To circumnavigate these incompatibilities, users
can generate a MycotoolsDB de
novo or by first using -i
,
then using the curated genomes in the MycotoolsDB for reference.
-d --database
: Preexisting MycotoolsDB of genomes of interest. The
MycotoolsDB must be linked, please review the
wiki for Mycotools usage.
--pfam
: Annotate outputted homologous locus groups (HLGs) use "Pfam-A.hmm"
--interpro
: Annotate outputted HLGs using "interproscan.sh" [NOT IMPLEMENTED]
-f --focal_genes
: File of genes to identify homology groups of for
neighborhood extraction used in microsynteny tree reconstruction. [DOES NOT
WORK FOR -i inputs]
-c --constraint
: Phylogenomic constraint for microsynteny topology. This
option is highly recommended if a
precomputed microsynteny tree is not inputted. The tips of this constraint tree
must be the same as the tips of the microsynteny tree. [DOES NOT WORK FOR -i
inputs]. Mycotools automates a large
portion of the leg-work for a multi-locus phylogenomic
tree
derived from single-copy orthologs determined from the dataset.
-r --root
: Genome(s) to root upon, if multiple genomes are inputted then it
will use the most recent common ancestor branch of the inputted genomes. If
this MRCA branch overlaps with the current root then the tree will NOT be
rerooted. By default, the tree will be rooted at the midpoint. [DOES NOT WORK
FOR -i inputs]
-t --tree
: Precomputed microsynteny tree, which can be generated via
db2microsyntree
-of --orthofinder
: Use OrthoFinder orthogroups as homology groups [TESTING]
-g --homology_groups
: Precomputed homology group results file (gene
homology_group)
-l --linclust
: Use mmseqs linclust
as a fast homology group circumscription
algorithm. By default, CLOCI uses mmseqs cluster
-w --window
: Gene +/- sliding window for locus sizes at each step. By
default, CLOCI uses -w 2
, which entails a 5 gene sliding window.
-mmd --maximum_dist
: Use maximum microsynteny distance (MMD) instead of total
microsynteny distance (TMD). An argument for MMD is that it mitigates sampling
bias and weights loci that have been conserved across larger distances. CLOCI
defaults to TMD.
-us --unique_sp
: Only consider one genome for species with multiple genomes
in microsynteny distance calculations.
-nr --null_rank
: Taxonomic rank used for generating null models for
microsynteny decay. By default, CLOCI uses the species as microsynteny decay
rate can change drastically within some genera.
-np --null_partitions
: Manually curated set of lineages for partitioning null
distributions from. The file contains a line depicting the lineage with a
comment - #<LINEAGE>
- followed by a line containing the genome names
separated by spaces - <OME1> <OME2> <OME3>
.
-ns --null_sample
: Samples to obtain for null distributions; DEFAULT: 10,000
-a --aligner
: Alignment algorithm. CLOCI defaults to diamond
, whereas other
algorithms listed are under [TESTING]
-sa --sensitive_align
: Use a more sensitive alignment option for the associated
algorithm
-s --similarity
: Similarity coefficient for second interation of HLG
circumscription - [J]accard, [O]verlap,
[S]orensen; DEFAULT: Sorensen. The Sorensen and Jaccard indices weight against
missing HGs in locus-locus comparisons, with the Sorensen-Dice index weighting
less strongly. These coefficients will dramatically alter output results.
-mg --minimum_gene_id
: Minimum percent identity [30 < Value < 100] for
gene-gene comparisons; DEFAULT: 45
-ml --minimum_loc_id
: Minimum locus-wide percent identity [0 < Value < 100] to depict on locus-locus
similarity graph; DEFAULT: 30
-ts --min_topology_sim
: Percent [0 < Value < 100] Jaccard phylogenetic
topology similarity minimum for merging singletons AND merging loci prior to
the second iteration of HLG circumscription; DEFAULT: 25. This option will
affect what singleton genes and adjacent loci may be combined based on the
similarity of the phylogenetic distribution of the adjacent locus fragments.
-I1 --inflation_rnd1
: Markov clustering (MCL) round 1 inflation value, which
affects the granularity of locus domains and merged loci; DEFAULT 1.1
-I2 --inflation_rnd2
: MCL round 2 inflation value, which affects the
granularity of HLGs/GCFs; DEFAULT 1.3
-T --tune
: Tune inflation values to a subset of reference clusters
represented in a tab-delimited file: "
<CONSERVED_HGS/GENES_IN_CLUSTER> <GENOMES_IN> <GENOMES_NOT_IN>" - this option
will tune the granularity of HLGs to match the reference clusters depicted in
the inputted file by iteratively running MCL until the desired granularity is
obtained.
-hp --hgp_percentile
: Percentile of homology group pair (HGp) distances to
extract relative to the null distribution; DEFAULT 20. This percentile is set
to increase downstream computational throughput while obtaining as
comprehensive of a set of HGps as possible. HGps are used as seed loci for
determining higher-order combinations.
-xp --hgx_percentile
: Percentile of homology group combination (HGx)
distances to extract relative to the null distribution of the lineage and HGx
size; DEFAULT 61. This percentile affects the loci that are extracted and
inputted into locus aggregation - if the threshold becomes too low then loci
may be erroneously aggregated; if the threshold becomes too high then loci may
be incomplete because only the most supported portions of loci are included,
whereas genes on the flanks may be excluded.
The following filtering parameters are based on proxies for selection on coordinated gene evolution and involve math and explanations elaborated on in the methods and supplementary section of the original CLOCI manuscript. Please refer to this manuscript for elaboration.
-ip --id_percent
: Mean minimum percent identity (MMI) to report a gene
cluster family (GCF)
-pp --pos_percent
: Mean minimum percent positive (MMP) to report a GCF
-ct --csb_threshold
: Threshold [0 < Value < 1] of conservative substitution
bias for to report a GCF
-pt --pds_threshold
: Threshold [0 < Value < 1] of GCF phylogenetic
distribution sparsity to report a GCF
-gt --gcl_treshold
: Threshold [0 < Value < 1] of gene committment to the
locus
-tt --md_threshold
: Threshold [0 < Value < 1] of log-normalized GCF
microsynteny distances to extract
--n50
: Minimum assembly N50 to include genome in dataset
--stop
: Export HG-HG alignment commands to parallelize and stop running
--skip
: Ignore missing HG-HG alignments and assume they are failures
--fallback
: Fallback to diamond
when alignments fail
-n --new
: Rerun with new parameters and overwrite incompatible data
--force
: Force a rerun without bypassing to GCF filtration
--compress
: Compress run [TESTING]
--cpus
-o --output_dir
: By default, CLOCI will output to cloci_<YYYYmmdd>
-hg --hg_dir
: Directory of proteome fastas that correspond to each HG, named
.faa
-hgx --hgx_dir
: Directory of HGx alignment databases and results, with the
databases labeled <HG>.dmnd
/<HG>.mmseqs
and results labeled <HG>.out