There are a few options for getting RASSP inference results on small molecules of your choice:
- Web API
  - We've set up a web API that runs FormulaNet/SubsetNet inference on molecules at spectroscopy.ai.
  - You can run inference on molecules with <= 48 atoms (for FormulaNet) or <= 64 atoms (for SubsetNet).
- Install RASSP locally.
- Build your own Docker image (TBD).
- Use our provided Docker image (TBD).
First, clone this repo into the directory of your choice, e.g. ROOTDIR=~/code/rassp-public.
If you have Anaconda already installed, great.
If not, install Miniconda and Mamba like so:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"
conda config --set always_yes yes --set changeps1 no
conda update -q conda
conda init bash
Set up a new Conda environment using rassp/environment.yml:
cd rassp
conda env create -q -n rassp -f environment.yml
conda activate rassp
Install rassp=1.0.0 as a local editable module (make sure to run it from the rassp-public root directory, where setup.py is located):
cd $ROOTDIR
python -m pip install -e .
Copy the expected files into their directories inside rassp:
rsync -razP models/ rassp/models/
rsync -razP sample_data/ rassp/sample_data/
Run the demo script that runs forward spectral prediction on a list of InChI strings inside sample_data/in.txt:
cd $ROOTDIR
- Follow the instructions in rassp/run_rassp.py
- Depending on how your Anaconda installation resolved the dependencies, you may run into the following issues:
  - Numba version incompatibility
    ImportError: Numba needs NumPy 1.21 or less
  - Tinygraph (JonasLab library) incompatibility
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
- The Numba error tells us that we should install numpy<=1.21
- The Tinygraph error is cryptic, but it has to do with a change in the Numpy API at version 1.20.0
- To resolve this, uninstalling Numpy and reinstalling it with a specific version should fix things:
  pip uninstall numpy
  pip install numpy==1.21
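The version ceiling above can be checked before importing Numba. The following is a minimal sketch using only the standard library; the helper name `numpy_version_ok` is hypothetical (it is not part of RASSP), and the 1.21 ceiling is taken from the Numba error message above.

```python
from importlib import metadata


def numpy_version_ok(version: str, max_minor: tuple = (1, 21)) -> bool:
    """Return True if `version` (e.g. '1.21.6') is at or below major.minor `max_minor`."""
    parts = version.split(".")
    major_minor = (int(parts[0]), int(parts[1]))
    return major_minor <= max_minor


if __name__ == "__main__":
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("numpy is not installed in this environment")
    else:
        if numpy_version_ok(installed):
            print(f"numpy {installed}: OK")
        else:
            print(f"numpy {installed}: too new for Numba, reinstall numpy==1.21")
```

Running this inside the activated `rassp` environment reports whether the installed Numpy falls within the range the Numba error asks for.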
TBD.
TBD.
rassp module:
- expconfig: YAML config files for specifying experiments and models
- msutil: Fast code for computing mass spectra and molecule subsets
- model: Model logic
- featurize: Molecule featurization
- dataset: Dataset object
- datagen: Datagen scripts
- docker: Running a Docker image to run inference using a given model checkpoint
- util.py: Generic utils
- netutil.py: Model utils
- forward_evaluate_pipeline.py: Batch forward inference script
- forward_train.py: Main training script
- run_rassp.py: Inference script for running a pre-trained model against molecules
- metrics.py: Metrics functions, including SDP, DP, and others
rassp-public module:
- library_match_pipeline.py: Library matching / database lookup metrics
- analysis_pipeline.py: Forward model metrics
- const.py: Configures analysis scripts
- sample_data: Parquet files containing sample datasets to train and eval against
  - Should be copied into rassp folder
- models: Pretrained model weights and checkpoints
  - Should be copied into rassp folder
Training, inference, and analysis artifacts (will be generated upon running scripts)
- checkpoints: PyTorch model checkpoints
- tblogs.formulae: Tensorboard logs
- forward.preds: Forward inference results
- results.metrics: Metrics of forward inference
- library_match_results.metrics.mass_filter_15.reg_dp: Library matching metrics
Our pretrained SubsetNet and FormulaNet model weights can be found in rassp-public/models.
All model weights and files can also be located here (TBD): https://people.cs.uchicago.edu/~ericj/rassp/
Pretrained model weights (both the .model and .meta files) should be downloaded to rassp/models for our scripts to work.
We take the first 100 molecules from the smallmols dataset [1].
We then run cfm-predict against them and save their spectra in sample_data/smallmols_cfm_pred_public_sample.parquet.
We split this 100 mol dataset into 2x 50 mol datasets saved as sample_data/smallmols_cfm_pred_public_sample.0.parquet
and sample_data/smallmols_cfm_pred_public_sample.1.parquet. These two non-overlapping datasets are used in the library matching pipeline later on.
The columns:
- mol_id: String
  - Generally can be String or Int, but smallmols labels them with strings indexing them against the NIST 2014 database they were pulled from.
- inchi: String
- inchi_key: String
  - Hash of the inchi string.
  - Generated from inchi via Chem.InchiToInchiKey(inchi).
- smiles: String
- rdmol: LargeBinary
  - A binary blob. To get an RDKit molecule from an rdmol binary, we need to do Chem.Mol(rdmol).
- cv_id: Int
  - Cross-val split index. Used to subdivide data into train and test sets.
- morgan4_crc32: Int
  - CRC32 checksum of the morgan4 fingerprint of the molecule, used to compute the cv_id.
- spect: List[Tuple[Float, Float]]
  - A spectrum is represented as a list of 2-tuple (mass, intensity) pairs.
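To make the `spect` representation concrete, here is a minimal sketch that converts a list of (mass, intensity) pairs into a dense vector indexed by integer mass and scales it so that the largest peak is 1. The function name `to_dense`, the 1 Da binning, the `max_mass` default, and the max-intensity scaling are illustrative assumptions, not necessarily what RASSP does internally.

```python
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (mass, intensity) pairs, as in the `spect` column


def to_dense(spect: Spectrum, max_mass: int = 512) -> List[float]:
    """Bin peaks to the nearest integer mass and scale so the largest peak is 1.0."""
    dense = [0.0] * (max_mass + 1)
    for mass, intensity in spect:
        idx = int(round(mass))
        if 0 <= idx <= max_mass:
            dense[idx] += intensity  # peaks rounding to the same bin are summed
    peak = max(dense)
    if peak > 0:
        dense = [v / peak for v in dense]
    return dense


# Made-up three-peak spectrum, for illustration only.
example = [(15.0, 20.0), (43.0, 100.0), (58.0, 50.0)]
vec = to_dense(example, max_mass=60)
print(vec[15], vec[43], vec[58])
```

A dense vector like this is convenient for element-wise comparisons between spectra, while the sparse pair list stays compact for storage.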
Run the following example command from the root of the rassp-public repo:
USE_CUDA=1 CUDA_VISIBLE_DEVICES='<GPU_ID>' python rassp/forward_train.py rassp/expconfig/demo.yaml first-test
Replace <GPU_ID> with an integer (0, 1, etc.) if you have multiple GPUs, or with an empty string if you are not running on a GPU.
Output:
- checkpoints - Location of model checkpoints
  - <yaml_basename>.<additional_name>.<timestamp>.<epoch>.model - path pattern
  - demo.first-test.48668593.00000000.model - example path
  - demo.first-test.48668593.00000000.state - example path
- tblogs.formulae - Location of Tensorboard intermediate results
Grab the model name from the checkpoint directory checkpoints. For example, our model name might look like: demo.first-test.48668593.
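Extracting the model name by hand is error-prone, so the path pattern above can be split programmatically. This is a minimal sketch; the helper `parse_checkpoint` is hypothetical and simply assumes the `<model_name>.<epoch>.model` / `.state` layout shown in the example paths.

```python
from pathlib import Path
from typing import Tuple


def parse_checkpoint(path: str) -> Tuple[str, int]:
    """Split '<model_name>.<epoch>.model' (or '.state') into (model_name, epoch)."""
    name = Path(path).name
    stem, suffix = name.rsplit(".", 1)  # drop the '.model' / '.state' extension
    if suffix not in ("model", "state"):
        raise ValueError(f"unexpected checkpoint suffix: {suffix!r}")
    model_name, epoch = stem.rsplit(".", 1)  # last field is the zero-padded epoch
    return model_name, int(epoch)


model_name, epoch = parse_checkpoint("checkpoints/demo.first-test.48668593.00000000.model")
print(model_name, epoch)
```

The returned model name is what goes after `checkpoints/` in the `checkpoint` field, and the integer is the value for `epoch`.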
In const.py, we'll add a new entry in FORWARD_EVAL_EXPERIMENTS:
FORWARD_EVAL_EXPERIMENTS = {
    'demo': {
        'dataset': './sample_data/smallmols_cfm_pred_public_sample.parquet',
        'cv_method': {
            'how': 'morgan_fingerprint_mod',
            'mod': 10,
            'test': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        },
        'normalize_pred': True,
        'streaming_save': True,
        'checkpoint': 'checkpoints/demo.first-test.48668593',
        'batch_size': 6,
        'epoch': 0,
        'mol_id_type': str,  # either str or int, depending on your input dataset's `mol_id` column dtype
    },
}
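One plausible reading of the `cv_method` entry is that each molecule's split index is reduced modulo `mod` and compared against the `test` list. The sketch below illustrates that reading only; the function name `assign_phase` is hypothetical, and the actual logic in `forward_evaluate_pipeline.py` may differ.

```python
from typing import Any, Dict


def assign_phase(cv_id: int, cv_method: Dict[str, Any]) -> str:
    """Return 'test' if the molecule's split bucket is listed under 'test', else 'train'."""
    bucket = cv_id % cv_method["mod"]
    return "test" if bucket in cv_method["test"] else "train"


# With every bucket 0..9 listed under 'test', all molecules are treated as test.
all_test = {"how": "morgan_fingerprint_mod", "mod": 10, "test": list(range(10))}
# Hypothetical alternative: hold out only buckets 0 and 1.
holdout = {"how": "morgan_fingerprint_mod", "mod": 10, "test": [0, 1]}

print([assign_phase(i, all_test) for i in (3, 11, 27)])
print([assign_phase(i, holdout) for i in (3, 11, 27)])
```

Under this reading, listing all ten buckets under `test` (as the demo entry does) means inference runs on every molecule in the dataset.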
To execute forward evaluation against all the experiments in FORWARD_EVAL_EXPERIMENTS, we'll run:
USE_CUDA=1 CUDA_VISIBLE_DEVICES='<GPU_ID>' python rassp/forward_evaluate_pipeline.py
In the example experiment demo we have provided, the output spectra will be saved to a .sqlite file at
forward.preds/demo.spect.sqlite.
The columns:
- mol_id: String | Int
  - Index labeling the molecule. Can be either a string or an integer, depending on the column in the input dataset. We assume integer by default; otherwise it needs to be explicitly specified as mol_id_type in const.py.
- spect: LargeBinary
  - Pickled list. When unpickled, we get List[Tuple[Float, Float]], where the inner 2-tuple comprises (mass, intensity) pairs.
- phase: String
  - Either 'train' or 'test'. If running inference on all molecules, we set all phases to 'test' by putting all splits in the cv_method dictionary in const.py.
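Reading predictions back amounts to querying the SQLite file and unpickling each `spect` blob. The sketch below builds a small in-memory database with the three columns described above so that it is self-contained. The table name `predictions` and the helper `load_spectra` are assumptions made for illustration; inspect the real file's schema (for example via `sqlite_master`) before relying on a table name. Only unpickle files you generated yourself, since `pickle.loads` can execute arbitrary code.

```python
import pickle
import sqlite3
from typing import Dict, List, Tuple

Spectrum = List[Tuple[float, float]]


def load_spectra(conn: sqlite3.Connection, table: str = "predictions") -> Dict[str, Spectrum]:
    """Map mol_id -> unpickled list of (mass, intensity) pairs."""
    rows = conn.execute(f"SELECT mol_id, spect FROM {table}")
    return {mol_id: pickle.loads(blob) for mol_id, blob in rows}


# In-memory stand-in for forward.preds/demo.spect.sqlite (table name is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (mol_id TEXT, spect BLOB, phase TEXT)")
conn.execute(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    ("mol-0", pickle.dumps([(43.0, 1.0), (58.0, 0.5)]), "test"),
)

spectra = load_spectra(conn)
print(spectra["mol-0"])
```

For a real run, replace the in-memory connection with `sqlite3.connect("forward.preds/demo.spect.sqlite")` and pass the actual table name.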
Edit const.py ANALYSIS_EXPERIMENTS with the pred_spect path pointing to the output from forward_evaluate_pipeline.py
Run and get metrics for the model by running:
python analysis_pipeline.py
Output goes into results.metrics.
Edit LIBRARY_MATCH_EXPERIMENTS in const.py, setting exp_name to the ANALYSIS_EXPERIMENTS key of the experiment you want to compute library matching metrics for.
Notes:
- Unlike previous pipelines, this pipeline assumes the existence of the inchi_key column in the main_library and query_library Parquet files.
- The main library and query library molecules must be strictly non-overlapping. There is an assert to check for this.
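The non-overlap requirement can be checked up front on the two `inchi_key` columns before launching the pipeline. This is a minimal sketch of such a check, not the pipeline's own assert; the function name `check_disjoint` is hypothetical and the keys below are placeholders rather than real InChI keys.

```python
from typing import Iterable, Set


def check_disjoint(main_keys: Iterable[str], query_keys: Iterable[str]) -> None:
    """Raise AssertionError if any InChI key appears in both libraries."""
    overlap: Set[str] = set(main_keys) & set(query_keys)
    assert not overlap, (
        f"{len(overlap)} molecule(s) appear in both libraries, e.g. {sorted(overlap)[:5]}"
    )


# Placeholder keys standing in for the inchi_key columns of the two Parquet files.
main_library_keys = ["KEY-A", "KEY-B", "KEY-C"]
query_library_keys = ["KEY-D", "KEY-E"]
check_disjoint(main_library_keys, query_library_keys)  # passes silently
print("libraries are disjoint")
```

Comparing on `inchi_key` rather than `mol_id` catches the same molecule appearing under different identifiers in the two files.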
Run and get metrics for the model by running:
python library_match_pipeline.py
Output goes into library_match_results.metrics.mass_filter_<MASS_FILTER_DELTA>.<DP_NAME>.
Example usage:
- Ensure that you are in the rassp directory.
- Copy the models directory to rassp/models.
- Copy the sample_data directory to rassp/sample_data.
- Write your molecules as SMILES / InChI strings to sample_data/in.txt
- Run run_rassp.py, using the instructions provided in the script.
- Spectra are stored in sample_data/out.txt (or whatever output_filename path you specified).
CPU train and eval:
# install conda environment per instructions
# install rassp as local package
pip install -e .
# train a model
USE_CUDA=0 python rassp/forward_train.py rassp/expconfig/demo.yaml first-test
# run model against mols to get predicted spectra
# edit const.py to point to the right model checkpoint, eg `checkpoints/demo.first-test.48755607`
USE_CUDA=0 python rassp/forward_evaluate_pipeline.py
# compute forward spectral metrics
python analysis_pipeline.py
# compute library matching metrics
python library_match_pipeline.py
If a GPU is available, PyTorch GPU will attempt to use it.
If multiple GPUs are available, you should explicitly specify the index of the device that you want to use by prepending CUDA_VISIBLE_DEVICES="<index:int>" to your python command.
Multi-GPU training is possible but finicky.
We recommend not doing so unless you're quite familiar with distributed GPU training.
If a GPU is not available and/or Nvidia drivers are not installed, you will need to set USE_CUDA=False inside rassp/forward_train.py and prepend the environment variables below to each python command, e.g.:
USE_CUDA=0 CUDA_VISIBLE_DEVICES="" python rassp/forward_train.py rassp/expconfig/demo.yaml first-test
USE_CUDA=0 CUDA_VISIBLE_DEVICES="" python rassp/forward_evaluate_pipeline.py
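When launching these scripts from Python rather than a shell, the same two variables can be set on the child process environment. A minimal sketch follows; the helper name `cuda_env` is hypothetical, and it only mirrors the `USE_CUDA` / `CUDA_VISIBLE_DEVICES` settings shown in the commands above.

```python
import os
from typing import Dict, Optional


def cuda_env(gpu_id: Optional[int]) -> Dict[str, str]:
    """Copy os.environ and set USE_CUDA / CUDA_VISIBLE_DEVICES for CPU (None) or one GPU."""
    env = dict(os.environ)
    if gpu_id is None:
        env["USE_CUDA"] = "0"
        env["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs from the child process
    else:
        env["USE_CUDA"] = "1"
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


# Usage (not executed here), equivalent to the first CPU command above:
# import subprocess, sys
# subprocess.run(
#     [sys.executable, "rassp/forward_train.py", "rassp/expconfig/demo.yaml", "first-test"],
#     env=cuda_env(None),
#     check=True,
# )
```

Passing an integer such as `cuda_env(0)` instead selects a single GPU, matching the advice to pin one device when several are available.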
- [1] RASSP. 2023. URL: https://spectroscopy.ai/papers/rassp/
- [2] CFM-ID. 2021. URL: https://cfmid.wishartlab.com/