This repository contains code and data related to CAMEOX (CAMEOs eXtended), a parallelized extension of CAMEOS (Constraining Adaptive Mutations using Engineered Overlapping Sequences) developed by LLNL (Lawrence Livermore National Laboratory). The original CAMEOS software was developed by Tom Blazejewski at Wang Lab (Columbia University). CAMEOX is the computational core of the GENTANGLE pipeline for automated design of gene entanglements.
The recommended installation method is as part of the GENTANGLE pipeline: either clone the GENTANGLE repository or, preferably, download the Singularity container, which simplifies setting up the many requirements of CAMEOX. Also get the DATANGLE repository, which provides data examples and templates. Please see this link for details on these approaches.
git clone https://github.com/BiosecSFA/cameox.git
The main improvements in CAMEOX relative to CAMEOS are:
- Parallelization: CAMEOS optimization is not parallelized, while the CAMEOX main optimization loop is parallelized using shared-memory threads with optimized garbage collection in Julia. This improvement allows for a larger number of variants to be evaluated in parallel.
- Dynamic stopping criteria: CAMEOS works with a fixed, static number of iterations, while in CAMEOX that number is variable below a given maximum: the main optimization loop is automatically stopped by a dynamic condition based on the relative number of variants evolving per iteration. This enhancement allows for a larger number of variants to be evaluated by limiting redundant calculations on an exponentially growing number of variants that stop evolving over time.
- Redundancy reduction: CAMEOS has only the “cull” mechanism using predefined NPLL (negative-pseudo-log-likelihood, aka anti-pseudo-log-likelihood or APLL) limits, while CAMEOX skips variants in each cycle according to their evolution over the iterations. This addition allows for a larger number of variants to be evaluated by limiting the redundancy of scoring the same variant.
- Numerical stability: CAMEOX resolves several bugs in CAMEOS, thereby greatly improving the numerical stability and robustness of the code. CAMEOX is able to run with an order of magnitude more sequences and for longer durations (more iterations for optimization) without crashing, thus allowing the generation of more candidate solutions.
- Generalized codon optimization: CAMEOS codon optimization is hardwired for E. coli, while CAMEOX includes a generalized embedded codon optimization by reading from an external database. This CAMEOX extension allows users to design genes for other microorganisms with different codon usage tables.
- Adjustable mutagenesis parameters: CAMEOS uses hardwired mutagenesis parameters, while CAMEOX gives users the ability to modify the mutagenesis parameters used in the optimization algorithm. This addition enables users to mutate each variant more aggressively and potentially explore a more diverse sequence space for candidate solutions.
- Detached data directory: CAMEOS requires data and code in the same pre-established directory structure, while in CAMEOX the data directory can be detached from the code directory and locations are flexible. This improvement allows for greater installation flexibility, since the data directories can become large and may need to be stored in a separate location from the software.
- Customizable optimization weights: CAMEOS has internal hardwired values for the NPLL optimization, while CAMEOX exposes a user option with several common choices offered (see details below in subsection about PLL weights for optimization). This improvement allows the user to customize their design criteria to weight the fitness importance of one gene over the other as needed.
- Reference sequence NPLL values: CAMEOX outputs NPLL values calculated for reference sequences (usually WT, wild type), which are used in downstream normalization to enable comparisons between different runs. This new feature allows fitness scores to be evaluated relative to the WT and can enable comparison between the output of different models used for the genes. CAMEOS does not provide these values.
- Comprehensive metadata: CAMEOX generates comprehensive metadata to enable downstream management of multiple pairs and runs. This expansion helps with running a larger-scale search over multiple gene pairs. CAMEOS does not provide metadata files.
- Pre-MRF optimization variants: CAMEOX can provide complete results for the variants post-HMM optimization but pre-MRF optimization to allow for checks or alternative optimization methods. This new feature is useful for evaluating differences between the HMM and MRF variants and potentially assessing different optimization techniques. CAMEOS lacks this capability.
- Entanglement frame awareness: CAMEOX is aware of the working entanglement frame relative to the longer gene (see subsection below with details). This enhancement allows the user to set and confirm whether the shorter gene is embedded in the second or third reading frame relative to the longer gene. CAMEOS does not account for this data.
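The dynamic stopping criterion from the list above can be illustrated with a small Python sketch. This is not the CAMEOX implementation (which is written in Julia): the function name `optimize`, the `step` callback, and the use of `<=` for the threshold comparison are assumptions made only for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

V = TypeVar("V")


def optimize(
    variants: List[V],
    step: Callable[[V], V],
    max_iters: int,
    rel_change_thr: float = 0.0,
) -> Tuple[List[V], int]:
    """Iterate `step` over all variants until few enough of them change.

    Hypothetical sketch: stops when the fraction of variants that changed
    in an iteration is <= rel_change_thr, or when max_iters is reached.
    Returns the final variants and the number of iterations performed.
    """
    iteration = 0
    if not variants:
        return variants, iteration
    for iteration in range(1, max_iters + 1):
        updated = [step(v) for v in variants]
        # Count the variants that actually evolved in this iteration.
        changed = sum(1 for old, new in zip(variants, updated) if new != old)
        variants = updated
        if changed / len(variants) <= rel_change_thr:
            break  # dynamic stop: the population is no longer evolving
    return variants, iteration


# Toy example: each "variant" is an integer that grows until it reaches 3.
result = optimize([0, 1, 3], lambda v: min(v + 1, 3), max_iters=10)
```

In this toy run the loop stops after 4 iterations instead of the maximum of 10, because no variant changes in the fourth pass. The sketch does not model the per-variant skipping described under redundancy reduction.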
The improvements in CAMEOX over CAMEOS have required some changes to the TSV input/parameters file, from column 7 onward relative to the CAMEOS format. Each line in the file should now have the following columns:
- Output dir: relative base directory where the output directory will be created.
- Mark gene name: gene ID string for 'mark' gene; needed as a key for looking up some values associated with genes in files.
- Deg gene name: gene ID string for the corresponding 'deg' gene.
- Mark JLD file: relative path to mark gene JLD file.
- Deg JLD file: relative path to deg gene JLD file.
- Mark HMM file: relative path to mark gene HMM directory and `.hmm` file.
- Deg HMM file: relative path to deg gene HMM directory and `.hmm` file.
- Population size: number of seeds that will enter the optimization loop, i.e. number of individual HMM solutions to greedily optimize.
- Frame (placeholder): p1/p2/p3, but the entanglement frame depends on the order of the genes in the input (see subsection below for details).
- Relative change threshold: minimum threshold for the relative number of variants changing, used for setting a dynamic limit on the number of iterations; typical value for standard CAMEOX runs is 0, or very close.
- Host taxid: NCBI Taxonomic ID for the host of the entanglement, used by the host generalization subsystem (the default value is 562, for E. coli; see subsection below for details).
- Pseudolikelihood weights for optimization: should be one of the following options: `equal`, `rand`, `close2mark`, `close2deg` (see subsection below for details).
Example of a single-line CAMEOX parameter file with Pseudomonas protegens Pf-5 (NCBI taxid: 220664) as host:
output/ aroB_pf5 infA_pf5 jlds/aroB_pf5.jld jlds/infA_pf5.jld hmms/aroB_pf5.hmm hmms/infA_pf5.hmm 20000 p1 0 220664
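As a rough illustration of the column layout described above, the following Python sketch parses one line of a parameter file into named fields. It is not part of CAMEOX: the names `CameoxRun` and `parse_param_line` are hypothetical, tab separation is assumed from the TSV description, and the weights column is treated as optional only because the example line above lists eleven fields (whether CAMEOX itself accepts that is not confirmed here).

```python
from dataclasses import dataclass
from typing import Optional

# Options listed in this README for the PLL weights column.
WEIGHT_CHOICES = {"equal", "rand", "close2mark", "close2deg"}


@dataclass
class CameoxRun:
    """Hypothetical container for one line of the parameter file."""
    out_dir: str
    mark_gene: str
    deg_gene: str
    mark_jld: str
    deg_jld: str
    mark_hmm: str
    deg_hmm: str
    pop_size: int
    frame: str               # placeholder column: p1/p2/p3
    rel_change_thr: float
    host_taxid: int
    pll_weights: Optional[str] = None  # None when the column is absent


def parse_param_line(line: str) -> CameoxRun:
    """Split a tab-separated line and convert the numeric columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (11, 12):
        raise ValueError(f"expected 11 or 12 columns, got {len(fields)}")
    if fields[8] not in {"p1", "p2", "p3"}:
        raise ValueError(f"unexpected frame placeholder: {fields[8]!r}")
    weights = fields[11] if len(fields) == 12 else None
    if weights is not None and weights not in WEIGHT_CHOICES:
        raise ValueError(f"unknown PLL weights option: {weights!r}")
    return CameoxRun(
        *fields[:7],
        pop_size=int(fields[7]),
        frame=fields[8],
        rel_change_thr=float(fields[9]),
        host_taxid=int(fields[10]),
        pll_weights=weights,
    )
```

A check like this can catch a mistyped option or a missing column before a long run is launched.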
As indicated above in the input format, the frame parameter in the parameter/input file is a placeholder, both in CAMEOS and CAMEOX. The effective way to select the entanglement frame is via the order of the genes in the input. Using CAMEOS terminology, typically, the "mark" gene is the shorter gene and the "deg" gene is the longer gene. Inverting that order changes the effective entanglement frame relative to the longer gene. CAMEOX is aware of the working entanglement frame and outputs that information at the start of any run to clarify the actual entanglement frame:
Processing entanglement [shorter_prot]⥂[longer_prot] in frame [real_frame]
where `[real_frame]` can be either `5'3'F2` or `5'3'F3`.
As previously mentioned, CAMEOS codon optimization is hardwired for E. coli, while CAMEOX includes a generalized embedded codon optimization by reading from an external database. This database is composed of one TSV file for each organism used as host for the entanglements. Each filename follows the format `CUT_{taxid}.tsv`, where CUT stands for Codon Usage Table and `taxid` is the taxonomic identifier for the organism in the NCBI Taxonomy database. Each TSV file needs two columns: 'codon' for the codons and 'freq' for the frequencies. As an example, please see the Pseudomonas protegens Pf-5 (NCBI taxid: 220664) CUT file. The DATANGLE repository also contains the E. coli (NCBI taxid: 562) CUT file, directly usable by CAMEOX.
If additional hosts are targeted, a quick way to get the CUT is to consult an online CoCoPUTs service, retrieve the CUT for the desired host with NCBI taxonomic identifier `hostTaxId`, and save it in the described format in the file `CUT_{hostTaxId}.tsv`, which should be placed in the root of the CAMEOX data directory.
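Before placing a new CUT file in the data directory, it can help to confirm that it has the two required columns. The Python sketch below is only an illustration and is not how CAMEOX reads the file: the helper names `cut_filename` and `read_cut` are hypothetical, and the checks on codon length and alphabet are assumptions added for safety.

```python
import csv
import io
from typing import Dict, TextIO


def cut_filename(taxid: int) -> str:
    """Build the CUT_{taxid}.tsv filename described in this README."""
    return f"CUT_{taxid}.tsv"


def read_cut(handle: TextIO) -> Dict[str, float]:
    """Read a tab-separated CUT with 'codon' and 'freq' columns into a dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or not {"codon", "freq"} <= set(reader.fieldnames):
        raise ValueError("CUT file needs 'codon' and 'freq' columns")
    table: Dict[str, float] = {}
    for row in reader:
        codon = row["codon"].strip().upper()
        # Assumed sanity check: codons are three nucleotide letters.
        if len(codon) != 3 or set(codon) - set("ACGTU"):
            raise ValueError(f"not a codon: {codon!r}")
        table[codon] = float(row["freq"])
    return table


# Illustrative content only; these frequencies are made up, not real data.
example = io.StringIO("codon\tfreq\nATG\t25.0\nTAA\t1.5\n")
table = read_cut(example)
```

For a real file, `open(cut_filename(220664), newline="")` can be passed to `read_cut` in place of the in-memory example.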
As indicated above in the input format, the last parameter indicates the pseudolikelihood (PLL) weights for optimization. Before the MRF optimization (main optimization loop), each gene of each pair of HMM seeds is assigned a weight. Within a pair, the weights sum to `1.0` and indicate the relative importance of each gene PLL (as calculated by the respective MRF models) for the total pair score. The options for this parameter are the following:
`equal`: The weight is always equal for both genes (`0.5`), so there is no optimization preference for one over the other regarding the PLL.
`rand`: For each pair of HMM seeds in the population of variants, the weight for one of the genes is drawn at random from a uniform probability distribution between 0 and 1, and the weight of the other is set so that both sum to `1.0`. Since it is very difficult to know a priori the relative importance of both genes for a successful entanglement, this is the preferred choice when working with a large number of variants, as it explores the space of solutions better and generates a workable Pareto front.
`close2mark`: The weight is always `1.0` for the mark gene and `0.0` for the deg gene, thus optimizing only for the mark gene. This may be useful in extreme entanglement cases where the relative importance of the mark gene is orders of magnitude above that of the deg gene.
`close2deg`: The weight is always `1.0` for the deg gene and `0.0` for the mark gene, thus optimizing only for the deg gene. This may be useful in extreme entanglement cases where the relative importance of the deg gene is orders of magnitude above that of the mark gene.
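The four options can be summarized with a short Python sketch. It is an illustration under stated assumptions, not the CAMEOX code: the function names are hypothetical, the `rand` draw is assigned to the mark gene arbitrarily (the text only says "one of the genes"), and combining the two PLL values as a weighted sum is an assumption about how the weights enter the total pair score.

```python
import random
from typing import Tuple


def pll_weights(choice: str, rng=random) -> Tuple[float, float]:
    """Return (mark_weight, deg_weight) for one pair of HMM seeds."""
    if choice == "equal":
        return 0.5, 0.5
    if choice == "rand":
        w = rng.random()          # uniform draw in [0, 1)
        return w, 1.0 - w         # the other weight completes the sum to 1.0
    if choice == "close2mark":
        return 1.0, 0.0
    if choice == "close2deg":
        return 0.0, 1.0
    raise ValueError(f"unknown PLL weights option: {choice!r}")


def pair_score(mark_pll: float, deg_pll: float,
               weights: Tuple[float, float]) -> float:
    """Assumed form of the total pair score: a weighted sum of both PLLs."""
    w_mark, w_deg = weights
    return w_mark * mark_pll + w_deg * deg_pll
```

With `rand`, calling `pll_weights` once per seed pair gives each pair its own trade-off between the two genes, which is what spreads the population along the Pareto front.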
- For mark and deg genes names, if you have used the upstream pipeline, please use the same strings here.
- You will need to use the same mark and deg genes names in the downstream pipeline.
- Please see the GENTANGLE wiki for useful documentation about the overall pipeline and the Singularity container that includes CAMEOX (recommended method for running the code).
- The original CAMEOS manual may still be useful.
- For related code, data, documentation, and notebooks specific to Livermore Computing (LC) you can take a look at this repo if you have access to LC.
CAMEOX is released as part of the GENTANGLE pipeline (LLNL-CODE-845475) and is distributed under the terms of the GNU Affero General Public License v3.0 (see LICENSE). CAMEOX is built upon CAMEOS, which was released under an MIT license (see LICENSE-CAMEOS).
SPDX-License-Identifier: AGPL-3.0-or-later
This work is supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Lawrence Livermore National Laboratory Secure Biosystems Design SFA “From Sequence to Cell to Population: Secure and Robust Biosystems Design for Environmental Microorganisms”. Work at LLNL is performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
If you use CAMEOX in your research, please cite the following papers. Thanks!
GENTANGLE: integrated computational design of gene entanglements
Jose Manuel Martí, Chloe Hsu, Charlotte Rochereau, Tomasz Blazejewski, Hunter Nisonoff, Sean P. Leonard, Christina S. Kang-Yun, Jennifer Chlebek, Dante P. Ricci, Dan Park, Harris Wang, Jennifer Listgarten, Yongqin Jiao, Jonathan E. Allen
bioRxiv 2023.11.09.565696; doi: https://doi.org/10.1101/2023.11.09.565696
Blazejewski T, Ho HI, Wang HH. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science. 2019 Aug 9;365(6453):595-8. https://doi.org/10.1126/science.aav5477