122 changes: 61 additions & 61 deletions README.md
@@ -10,38 +10,76 @@ neural network (GNN) that predicts an association to a target molecule, e.g., a
DeepFPlearn<sup>+</sup> is an extension of deepFPlearn[[2]](#2), which uses binary fingerprints to represent the
molecule's structure computationally.

## Setting up Python environment
## Installation

The DFPL package requires a particular Python environment to work properly.
It consists of a recent Python interpreter and packages for data science and neural networks.
The exact dependencies can be found in the
[`requirements.txt`](requirements.txt) (which is used when installing the package with pip)
and [`environment.yml`](environment.yml) (for installation with conda).

You have several ways to provide the correct environment to run code from the DFPL package.

1. Use the automatically built docker/Singularity containers
2. Build your own container [following the steps here](container/README.md)
3. Setup a python virtual environment
4. Set up a conda environment install the requirements via conda and the DFPL package via pip
1. Use Bioconda to install the package
2. Set up a Python virtual environment
3. Use the automatically built Docker container
4. Use the automatically built Singularity container

In the following, you find details for option 1., 3., and 4.
### Bioconda

The package is also available on Bioconda. You can find the Bioconda recipe and installation details
[here](http://bioconda.github.io/recipes/deepfplearn/README.html).

First create an environment with Python 3.8:

```shell
conda create -n dfpl python=3.8
conda activate dfpl
```

Then install the package:

```shell
conda install -c bioconda deepfplearn
```
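
To verify the installation, you can call the `dfpl` entry point (a quick sanity check, assuming the Bioconda package
exposes the same `dfpl` command used throughout this README):

```shell
# Prints usage information and the available subcommands if the installation succeeded
dfpl --help
```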

### Set up DFPL in a Python virtual environment

From within the `deepFPlearn` directory, call:

```shell
virtualenv -p python3 ENV_PATH
. ENV_PATH/bin/activate
pip install ./
```

Replace `ENV_PATH` with the directory where the Python virtual environment should be created.
If your system only has Python 3 installed, `-p python3` may be omitted.

In order to use the environment, it needs to be activated with `. ENV_PATH/bin/activate`.
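
For example, a minimal sketch assuming the environment should live under `~/venvs/dfpl` (a hypothetical path):

```shell
# Create the environment, activate it, and install DFPL from the repository root
virtualenv -p python3 ~/venvs/dfpl
. ~/venvs/dfpl/bin/activate
pip install ./
```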

### Docker container

You need docker installed on you machine.
You need Docker installed on your machine. If you don't have it installed yet, you can find the installation
instructions [here](https://docs.docker.com/engine/install/).

In order to run DFPL, pull the image using the following command:

In order to run DFPL use the following command line
```shell
docker pull quay.io/biocontainers/deepfplearn:TAG
```
Then run the container, mounting the directory containing the data you want to process:

```shell
docker run --gpus GPU_REQUEST registry.hzdr.de/department-computational-biology/deepfplearn/deepfplearn:TAG dfpl DFPL_ARGS
docker run -v /path/to/local/data:/data quay.io/biocontainers/deepfplearn:TAG dfpl DFPL_ARGS
```
To run with GPU support, add the `--gpus` flag:

```shell
docker run --gpus GPU_REQUEST quay.io/biocontainers/deepfplearn:TAG dfpl DFPL_ARGS
```

where you replace

- `TAG` by the version you want to use or `latest` if you want to use latest available version)
- You can see available tags
here https://gitlab.hzdr.de/department-computational-biology/deepfplearn/container_registry/5827.
- `TAG` by the version you want to use
- You can see available tags in [biocontainers](https://biocontainers.pro/tools/deepfplearn).
In general, a container should be available for each released version of DFPL.
- `GPU_REQUEST` by the GPUs you want to use, or `all` if all GPUs should be used (remove `--gpus GPU_REQUEST` if only
  the CPU should be used)
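
For example, a fully substituted call could look like the following sketch; the local data path, the `latest` tag, and
the `train -f` arguments are illustrative assumptions, not guaranteed defaults:

```shell
# Mount ./data into the container, use all GPUs, and train with a JSON config from the mounted directory
docker run -v "$PWD/data:/data" --gpus all \
  quay.io/biocontainers/deepfplearn:latest \
  dfpl train -f /data/train.json
```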
@@ -50,21 +88,22 @@ where you replace
In order to get an interactive bash shell in the container, use:

```shell
docker run -it --gpus GPU_REQUEST registry.hzdr.de/department-computational-biology/deepfplearn/deepfplearn:TAG bash
docker run -it --gpus GPU_REQUEST quay.io/biocontainers/deepfplearn:TAG bash
```


### Singularity container

You need Singularity installed on your machine. You can download a container with
You need Singularity installed on your machine. You can find the installation instructions
[here](https://apptainer.org/user-docs/master/quick_start.html).

```shell
singularity pull dfpl.TAG.sif docker://registry.hzdr.de/department-computational-biology/deepfplearn/deepfplearn:TAG
singularity pull dfpl.TAG.sif docker://quay.io/biocontainers/deepfplearn:TAG
```

- replace `TAG` by the version you want to use or `latest` if you want to use latest available version)
- replace `TAG` by the version you want to use
- You can see available tags
here https://gitlab.hzdr.de/department-computational-biology/deepfplearn/container_registry/5827.
In general a container should be available for each released version of DFPL.
[here](https://biocontainers.pro/tools/deepfplearn).
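
For example, to fetch a specific release (a sketch; `1.0` stands in for a real tag from the registry):

```shell
# Pull the container for the chosen version and store it as a local .sif file
singularity pull dfpl.1.0.sif docker://quay.io/biocontainers/deepfplearn:1.0
```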

This stores the container as a file `dfpl.TAG.sif`, which can be run as follows:

```shell script
singularity run --nv dfpl.TAG.sif dfpl DFPL_ARGS
```

or you can run a shell script inside the container (see [run-all-cases.sh](scripts/run-all-cases.sh) for an
example)

```shell script
singularity run --nv dfpl.sif ". ./example/run-multiple-cases.sh"
```

It's also possible to open an interactive shell inside the container:

```shell script
singularity shell --nv dfpl.TAG.sif
```

**Note:** The Singularity container is intended to be used on HPC clusters where your ability to install software might
be limited.
For local testing or development, setting up the conda environment is preferable.

### Set up DFPL in a python virtual environment

From within the `deepFPlearn` directory call

```
virtualenv -p python3 ENV_PATH
. ENV_PATH/bin/activate
pip install ./
```

replace `ENV_PATH` by the directory where the python virtual environment should be created.
If your system has only python3 installed `-p python3` may be removed.

In order to use the environment it needs to be activated with `. ENV_PATH/bin/activate`.

### Set up DFPL in a conda environment

To use this tool in a conda environment:

1. Create the conda env from scratch

From within the `deepFPlearn` directory, you can create the conda environment with the provided yaml file that
contains all information and necessary packages
For local testing or development, setting up the Bioconda environment is preferable.

```shell
conda env create -f environment.yml
```

2. Activate the `dfpl_env` environment with

```shell
conda activate dfpl_env
```

3. Install the local `dfpl` package by calling

```shell
pip install --no-deps ./
```

## Prepare data

115 changes: 24 additions & 91 deletions dfpl/__main__.py
@@ -17,108 +17,45 @@
from dfpl import vae as vae
from dfpl.utils import createArgsFromJson, createDirectory, makePathAbsolute

project_directory = pathlib.Path(".").parent.parent.absolute()
test_train_opts = options.Options(
inputFile=f"{project_directory}/input_datasets/S_dataset.pkl",
outputDir=f"{project_directory}/output_data/console_test",
ecWeightsFile=f"{project_directory}/output_data/case_00/AE_S/ae_S.encoder.hdf5",
ecModelDir=f"{project_directory}/output_data/case_00/AE_S/saved_model",
type="smiles",
fpType="topological",
epochs=100,
batchSize=1024,
fpSize=2048,
encFPSize=256,
enableMultiLabel=False,
testSize=0.2,
kFolds=2,
verbose=2,
trainAC=False,
trainFNN=True,
compressFeatures=True,
activationFunction="selu",
lossFunction="bce",
optimizer="Adam",
fnnType="FNN",
)

test_pred_opts = options.Options(
inputFile=f"{project_directory}/input_datasets/S_dataset.pkl",
outputDir=f"{project_directory}/output_data/console_test",
outputFile=f"{project_directory}/output_data/console_test/S_dataset.predictions_ER.csv",
ecModelDir=f"{project_directory}/output_data/case_00/AE_S/saved_model",
fnnModelDir=f"{project_directory}/output_data/console_test/ER_saved_model",
type="smiles",
fpType="topological",
)


def traindmpnn(opts: options.GnnOptions):
def traindmpnn(opts: options.GnnOptions) -> None:
"""
Train a D-MPNN model using the given options.
Args:
- opts: options.GnnOptions instance containing the details of the training
Returns:
- None
"""
os.environ["CUDA_VISIBLE_DEVICES"] = f"{opts.gpu}"
ignore_elements = ["py/object"]
# Load options from a JSON file and replace the relevant attributes in `opts`
arguments = createArgsFromJson(
opts.configFile, ignore_elements, return_json_object=False
)
arguments = createArgsFromJson(jsonFile=opts.configFile)
opts = cp.args.TrainArgs().parse_args(arguments)
logging.info("Training DMPNN...")
# Train the model and get the mean and standard deviation of AUC score from cross-validation
mean_score, std_score = cp.train.cross_validate(
args=opts, train_func=cp.train.run_training
)
logging.info(f"Results: {mean_score:.5f} +/- {std_score:.5f}")


def predictdmpnn(opts: options.GnnOptions, json_arg_path: str) -> None:
def predictdmpnn(opts: options.GnnOptions) -> None:
"""
Predict the values using a trained D-MPNN model with the given options.
Args:
- opts: options.GnnOptions instance containing the details of the prediction
- JSON_ARG_PATH: path to a JSON file containing additional arguments for prediction
Returns:
- None
"""
ignore_elements = [
"py/object",
"checkpoint_paths",
"save_dir",
"saving_name",
]
# Load options and additional arguments from a JSON file
arguments, data = createArgsFromJson(
json_arg_path, ignore_elements, return_json_object=True
)
arguments.append("--preds_path")
arguments.append("")
save_dir = data.get("save_dir")
name = data.get("saving_name")
# Replace relevant attributes in `opts` with loaded options
arguments = createArgsFromJson(jsonFile=opts.configFile)
opts = cp.args.PredictArgs().parse_args(arguments)
opts.preds_path = save_dir + "/" + name
df = pd.read_csv(opts.test_path)
smiles = []
for index, rows in df.iterrows():
my_list = [rows.smiles]
smiles.append(my_list)
# Make predictions and return the result
cp.train.make_predictions(args=opts, smiles=smiles)

cp.train.make_predictions(args=opts)


def train(opts: options.Options):
"""
Run the main training procedure
:param opts: Options defining the details of the training
"""

os.environ["CUDA_VISIBLE_DEVICES"] = f"{opts.gpu}"

# import data from file and create DataFrame
if "tsv" in opts.inputFile:
df = fp.importDataFile(
@@ -128,7 +65,7 @@ def train(opts: options.Options):
df = fp.importDataFile(
opts.inputFile, import_function=fp.importSmilesCSV, fp_size=opts.fpSize
)
# initialize encoders to None
# initialize (auto)encoders to None
encoder = None
autoencoder = None
if opts.trainAC:
@@ -142,26 +79,28 @@
# if feature compression is enabled
if opts.compressFeatures:
if not opts.trainAC:
if opts.aeType == "deterministic":
(autoencoder, encoder) = ac.define_ac_model(opts=options.Options())
elif opts.aeType == "variational":
if opts.aeType == "variational":
(autoencoder, encoder) = vae.define_vae_model(opts=options.Options())
elif opts.ecWeightsFile == "":
else:
(autoencoder, encoder) = ac.define_ac_model(opts=options.Options())

if opts.ecWeightsFile == "":
encoder = load_model(opts.ecModelDir)
else:
autoencoder.load_weights(
os.path.join(opts.ecModelDir, opts.ecWeightsFile)
)
# compress the fingerprints using the autoencoder
df = ac.compress_fingerprints(df, encoder)
# ac.visualize_fingerprints(
# df,
# before_col="fp",
# after_col="fpcompressed",
# train_indices=train_indices,
# test_indices=test_indices,
# save_as=f"UMAP_{opts.aeSplitType}.png",
# )
if opts.visualizeLatent:
ac.visualize_fingerprints(
df,
before_col="fp",
after_col="fpcompressed",
train_indices=train_indices,
test_indices=test_indices,
save_as=f"UMAP_{opts.aeSplitType}.png",
)
# train single label models if requested
if opts.trainFNN and not opts.enableMultiLabel:
sl.train_single_label_models(df=df, opts=opts)
@@ -257,7 +196,7 @@ def main():
raise ValueError("Input directory is not a directory")
elif prog_args.method == "traingnn":
traingnn_opts = options.GnnOptions.fromCmdArgs(prog_args)

createLogger("traingnn.log")
traindmpnn(traingnn_opts)

elif prog_args.method == "predictgnn":
@@ -267,12 +206,8 @@
test_path=makePathAbsolute(predictgnn_opts.test_path),
preds_path=makePathAbsolute(predictgnn_opts.preds_path),
)

logging.info(
f"The following arguments are received or filled with default values:\n{prog_args}"
)

predictdmpnn(fixed_opts, prog_args.configFile)
createLogger("predictgnn.log")
predictdmpnn(fixed_opts)

elif prog_args.method == "train":
train_opts = options.Options.fromCmdArgs(prog_args)
@@ -298,8 +233,6 @@ def main():
),
ecModelDir=makePathAbsolute(predict_opts.ecModelDir),
fnnModelDir=makePathAbsolute(predict_opts.fnnModelDir),
trainAC=False,
trainFNN=False,
)
createDirectory(fixed_opts.outputDir)
createLogger(path.join(fixed_opts.outputDir, "predict.log"))