This repository is the code base for the classification of organotropic metastases. Transcriptomic profiles of 7,011 cancer patients in the TCGA database were used to classify and analyze the seeding location of primary tumors. The sequencing data and all clinicopathologic reports for these patients were publicly available for bulk data mining through TCGAbiolinks.
We used multiple programming languages (Java, Python, and R) to construct the learning models and to perform biological analyses. Because this introduces many dependencies, we provide two ways to install our framework: a manual installation and a Docker installation.
Docker Installation
The docker image for this project can be pulled from the online Docker Hub repository or can be built using the Dockerfile included in the base directory of this project.
To pull the image from the Docker Hub repo, run the following command:
docker pull mskaro1/mot
To build the image using the Dockerfile, run the following command in the base directory of this project:
docker build --tag mskaro1/mot .
Manual Installation
For those seeking to manually install the project, all of the following dependencies must be satisfied prior to attempting the installation:
Python
- Version: Python >= 3.5
- Packages: All required Python packages are listed in the requirements.txt.
Java
R
- Version: >= 4.0
- Packages: All required R packages are listed in the session.info() output at the bottom of Enriched_features_Fisher_weighted_simulation.R.
Once all of the required dependencies are satisfied, run the following command in the base directory to install the project as a Python package:
pip install .
We have provided a sample dataset of TCGA data to demonstrate the effectiveness of our metastatic classification approach. The sample dataset consists of Colon Adenocarcinoma (COAD) tumors that metastasized to the colon, liver, or lung.
Recommended: Docker Approach
docker run --rm -it -v <output-directory>:/demo-outputs mskaro1/mot
Note: <output-directory> should be replaced with the path of a directory on the user's local machine; the outputs of the demo will be stored there.
Manual Approach
python3 -m mot.metastasis_pipeline -i ./samples/metastasis-demo/ -o <output-directory> -w ./lib/weka.jar -c ./classes -j /src/GainRatio.java
Note: This command should be run in the base directory of the project, and <output-directory> should be replaced with the path to a directory where the outputs will be stored.
Demo Outputs
├── <output-directory>/
│ ├── binary-datasets/
│ ├── oversampled-datasets/
│ ├── important-features/
│ ├── feature-selected-datasets/
│ ├── classification-results/
- binary-datasets: The multilabel COAD dataset is split into multiple binary datasets, and the binary datasets are stored in this directory.
- oversampled-datasets: The training and testing data generated from the binary datasets. The training data uses synthetic data generated by the SMOTE algorithm, while the testing data uses only real TCGA data.
- important-features: The top 1000 features (i.e. genes) of each training dataset ranked by their information gain ratio score.
- feature-selected-datasets: The training and testing datasets that only contain the top 1000 selected features.
- classification-results: The classification results of our Random Forest model on the feature-selected datasets.
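The first and third steps above (splitting the multilabel data into binary datasets, then ranking features by information gain ratio) can be illustrated with a small sketch. This is not the project's implementation, which relies on Weka and GainRatio.java; it is a pure-Python illustration under simplifying assumptions: features are already discretized, and the gene names, labels, and function names (`one_vs_rest`, `gain_ratio`, `rank_features`) are hypothetical.

```python
import math
from collections import Counter


def one_vs_rest(labels, positive):
    """Relabel metastatic sites as 1 (the chosen site) or 0 (any other site)."""
    return [1 if y == positive else 0 for y in labels]


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of a discrete feature divided by its split information."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    # Expected label entropy after splitting on the feature's values.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(labels) - conditional) / split_info


def rank_features(columns, labels, top_k):
    """Return the names of the top_k features by gain ratio, best first."""
    scores = {name: gain_ratio(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (assumed for illustration): six tumors, two discretized genes.
sites = ["liver", "lung", "colon", "liver", "colon", "lung"]
columns = {
    "GENE_A": ["high", "low", "low", "high", "low", "low"],
    "GENE_B": ["high", "high", "low", "low", "high", "low"],
}

liver = one_vs_rest(sites, "liver")  # one binary dataset per site
ranking = rank_features(columns, liver, top_k=1)
print(ranking)  # ['GENE_A']
```

In this toy example GENE_A separates liver from non-liver tumors perfectly, so its gain ratio is 1, while GENE_B is uninformative and scores 0. The actual pipeline keeps the top 1000 genes per training dataset rather than one.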
The entire metastatic pipeline can be run using the metastasis_pipeline script. This script is callable from the command line using the following command:
python -m mot.metastasis_pipeline
Use the -h flag to see all available options.
Additionally, each component of the pipeline can be called individually from the command line. For more information, read our wiki for a breakdown of each script's role in the pipeline.
Note: For those seeking to use the docker image to interact with our framework, run the following command to gain access to the shell of the docker image:
docker run --rm -it --entrypoint="" mskaro1/mot bash
The Docker image is available now! Check out our wiki for implementation details. Thanks!
Using our code or our model? Consider citing us