This repository contains the code to reproduce the results of the SFCV paper, including model predictions, tables, and images. Efforts have been made to make this project reproducible; if you run into undefined behaviour or errors while installing or benchmarking, please open an issue.
Install the sfcv package with:

```
pip install sfcv
```

or

```
pip install git+https://github.com/Manas02/sfcv.git@main
```
This project uses the built-in venv module to manage the Python environment, with Python 3.11. The following command creates a virtual environment in the .venv directory:

```
python3.11 -m venv .venv
```
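Activate the environment before installing the requirements (this assumes a POSIX shell; on Windows use `.venv\Scripts\activate` instead):

```
source .venv/bin/activate
```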
Then install the dependencies:

```
pip install -r requirements.txt
```
Landrum & Riniker [Paper | Data]
Please open and run 00_Data_source_and_standardize.ipynb to download the above-mentioned dataset and to standardize the SMILES in those files.
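The exact standardization pipeline is defined in that notebook; the snippet below is only a sketch of a typical RDKit standardization (cleanup, largest-fragment selection, uncharging), not a claim about what 00_Data_source_and_standardize.ipynb does.

```python
# Minimal sketch of a common RDKit standardization routine (assumed steps,
# not necessarily the pipeline used in the notebook).
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize_smiles(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    mol = rdMolStandardize.Cleanup(mol)                          # sanitize, normalize, reionize
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)  # drop salts / solvents
    mol = rdMolStandardize.Uncharger().uncharge(mol)             # neutralize charges
    return Chem.MolToSmiles(mol)                                 # canonical SMILES

print(standardize_smiles("CC(=O)[O-].[Na+]"))  # expected: CC(=O)O
```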
Follow that by running 01_Data_add_LogP_LogD_MCE18.ipynb to predict and add CrippenLogP (rdkit) and LogD (Code) values and to compute MCE-18.
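For reference, the CrippenLogP value mentioned above can be computed with RDKit as shown below; this is only a sketch, and the LogD model and MCE-18 calculation referenced by the notebook are not reproduced here.

```python
# Wildman-Crippen LogP with RDKit (the "CrippenLogP (rdkit)" property above).
from rdkit import Chem
from rdkit.Chem import Crippen

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, used as an example
print(Crippen.MolLogP(mol))
```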
Follow this by running 02_Table_mol_per_target_before_after_standardization.ipynb to generate the table and parity plot. The results are saved in the benchmark/results/tables and benchmark/results/figures directories.
Run 03_Plots_Table_target_properties.ipynb to generate a summary table of the target properties and to plot their distributions.
Run 04_Implementation_SFCV.ipynb to visualise how SortedStepForwardCV and UnsortedStepForwardCV work.
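The actual SortedStepForwardCV and UnsortedStepForwardCV classes are defined in that notebook and in the package. As a rough analogy only, an expanding-window ("step forward") split over property-sorted molecules can be sketched with scikit-learn's TimeSeriesSplit; the sorting property, fold count, and the sorted/unsorted distinction below are assumptions for illustration, not the sfcv implementation.

```python
# Expanding-window split over molecules ordered by a property (assumed analogy).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
logp = rng.normal(2.5, 1.0, size=100)   # hypothetical sorting property
order = np.argsort(logp)                # "sorted" variant: order molecules by the property
# order = rng.permutation(len(logp))    # "unsorted" variant: keep a random order instead

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(order)):
    train, test = order[train_idx], order[test_idx]
    print(f"fold {fold}: train={len(train)} molecules, test={len(test)} molecules")
```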
Run 05_Implementation_ScaffoldSplitCV.ipynb to check how ScaffoldSplitCV works. The algorithm groups molecules by their chemical scaffolds, shuffles these groups, and sequentially assigns entire scaffold groups to the training set until a target fraction is reached, with the remaining groups forming the test set.
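The following is a minimal sketch of the procedure described above, assuming Bemis-Murcko scaffolds computed with RDKit; the function name, train fraction, and seed are illustrative and not the actual ScaffoldSplitCV API.

```python
# Group molecules by scaffold, shuffle the groups, and fill the training set
# with whole scaffold groups until the target fraction is reached.
import random
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, train_frac=0.8, seed=42):
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(i)

    # Shuffle scaffold groups, then assign entire groups to train until the
    # target fraction is reached; the remaining groups form the test set.
    scaffolds = list(groups)
    random.Random(seed).shuffle(scaffolds)
    train, test = [], []
    cutoff = train_frac * len(smiles_list)
    for scaffold in scaffolds:
        target = train if len(train) < cutoff else test
        target.extend(groups[scaffold])
    return train, test

train_idx, test_idx = scaffold_split(["CCO", "c1ccccc1O", "c1ccccc1N", "CC(=O)O"])
```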
Run 06_Implementation_RandomSplitCV.ipynb to check how RandomSplitCV works.
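As a generic point of comparison, repeated random splitting can be sketched with scikit-learn's ShuffleSplit; this is an analogy for what a random-split cross-validation typically does, not the RandomSplitCV implementation itself, and the fold count and test size are arbitrary.

```python
# Repeated random train/test splits (assumed analogy for RandomSplitCV).
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # placeholder molecule indices
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for fold, (train_idx, test_idx) in enumerate(splitter.split(X)):
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```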
Run 07_Validate_train_test_split.ipynb to visualise the number of molecules in the test set across folds and targets.
Run 08_Plots_chemical_space_across_split.ipynb to visualise the chemical space with respect to split type.
Run 09_Plots_Table_split_properties.ipynb to visualise the distributions of the sorting properties per fold for each split type, averaged over all targets.
Run 10_Implimentation_Discovery_Yield.ipynb to understand and visualise an illustrative example of discovery yield.
Run 11_Implimentation_Novelty_Error.ipynb to understand and visualise an illustrative example of novelty error.
Run 12_Implementation_Benchmark.ipynb to see how benchmarking was performed.
Run 13_Table_extract_results.ipynb to extract the results into a digestible format.
Run 14_Plots_results.ipynb, 15_Plots_Result_hERG.ipynb, 16_Plots_Result_MAPK.ipynb, and 17_Plots_Result_VEGFR.ipynb to visualise the results.