Cancifier

Cancer is the second leading cause of death globally and accounts for 10 million deaths annually, 90% of which are caused by metastatic cancers ("Cancer", 2018). The effectiveness of contemporary cancer treatments, including targeted therapies, largely depends on knowledge of the patient’s primary tumor. However, up to 5% of all cancer patients have metastatic tumors for which routine diagnostic tools cannot locate the primary site. This results in a diagnosis of cancer of unknown primary (CUP) (Wei et al., 2014; Zhao et al., 2020). CUP can arise for one of the following reasons:

The primary cancer can be too small to detect.
The body’s immune system can kill the primary cancer.
The primary cancer can also be removed during surgery for another condition and doctors may not know that the cancer had formed ("Carcinoma of Unknown Primary Treatment", 2018).

Most CUP patients have a dismal prognosis with a median overall survival of 8-11 months and one-year survival of only 25% ("Carcinoma of Unknown Primary Treatment", 2018). Therefore, novel diagnostic methods are required to improve both speed and accuracy of cancer tissue of origin identification.

Most metastatic samples retain the gene expression profile of the primary tumor, so this profile can be used to predict the primary site (note the clustering of gene expression in primary and metastatic tumor samples in the image on the right).

A number of techniques can be used to determine the transcriptomic profile of a cell. The most widely used are microarray platforms and newer RNA sequencing technologies, which measure the activity of thousands of genes at a time, creating a picture of cell fate and pathology. Since metastatic cancers are believed to retain the gene expression profile of their primary site, machine learning methods can be used to predict the primary site of a cancer.

The aim of our project Cancifier (Cancer classifier) was to develop a bioinformatic pipeline that can accurately identify the site of origin of a metastatic cancer based on its transcriptomic profile obtained via either RNA-seq or microarray technologies. More specifically, we aimed to make a predictive improvement to an existing method which had an average accuracy of 86.7% (Liu et al., 2020).

Methods and results

Twenty-three RNA microarray datasets were obtained programmatically from the Gene Expression Omnibus (GEO) database, and fifteen RNA-Seq datasets were downloaded from The Cancer Genome Atlas (TCGA) database. All dataset IDs were found using the Human Cancer Metastasis Database (HCMDB), an integrated database designed to store and analyze large-scale expression data of cancer metastasis.
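As an illustration of programmatic access to GEO, the sketch below builds the download URL of a series matrix file and parses its expression table. It is not taken from parse_geo_data.py: the function names are hypothetical, the URL pattern assumes the standard GEO FTP layout (series with several platforms use different file names), and the parser only handles the tab-separated table between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers.

```python
import csv
import io


def geo_series_matrix_url(accession: str) -> str:
    """Build the series matrix URL for a GEO accession such as 'GSE1234'.

    Assumes the usual GEO FTP layout, where the last three digits of the
    accession number are replaced by 'nnn' in the parent directory name.
    """
    digits = accession[3:]
    parent = f"GSE{digits[:-3]}nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{parent}/{accession}/matrix/{accession}_series_matrix.txt.gz"
    )


def _to_float(value: str) -> float:
    """Convert a table cell to float, mapping empty or non-numeric cells to NaN."""
    try:
        return float(value)
    except ValueError:
        return float("nan")


def parse_series_matrix(text: str):
    """Return (sample_ids, {probe_id: [expression values]}) from series matrix text."""
    sample_ids = []
    table = {}
    in_table = False
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        if row[0] == "!series_matrix_table_begin":
            in_table = True
        elif row[0] == "!series_matrix_table_end":
            break
        elif in_table and row[0] == "ID_REF":
            # Header row: the remaining cells are GSM sample identifiers.
            sample_ids = row[1:]
        elif in_table:
            table[row[0]] = [_to_float(v) for v in row[1:]]
    return sample_ids, table
```

The downloaded file is gzip-compressed, so in practice its contents would be decompressed (for example with the standard `gzip` module) before being passed to `parse_series_matrix`.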

The notebook preprocess.ipynb contains the code that was used to preprocess and normalize the data. The following steps were taken to preprocess the data:

Mapping of Affymetrix microarray probes to the corresponding gene names. The mapping was done using the DAVID database (Huang et al., 2009).
Quantile normalization and scaling to zero mean and unit variance. This was done separately for healthy and tumor samples, because quantile normalization assumes the same distribution of gene expression in each sample. The aim of this step was to remove inter-dataset variation and capture mean-variance relationships within each dataset.
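The normalization step can be sketched with NumPy as follows. This is a minimal illustration, not the code from preprocess.ipynb: the function names, the genes-by-samples matrix layout, the `is_tumor` mask, and the choice to z-score each gene across samples are all assumptions made for the example.

```python
import numpy as np


def quantile_normalize(expr: np.ndarray) -> np.ndarray:
    """Quantile-normalize a (genes x samples) matrix.

    After normalization every sample (column) has the same set of values,
    namely the mean of the sorted columns at each rank. Ties are broken by
    order of appearance rather than averaged.
    """
    order = np.argsort(expr, axis=0, kind="stable")
    # Inverting each column's sort permutation gives the rank of every gene.
    ranks = np.argsort(order, axis=0, kind="stable")
    reference = np.sort(expr, axis=0).mean(axis=1)
    return reference[ranks]


def zscore_genes(expr: np.ndarray) -> np.ndarray:
    """Scale each gene (row) to zero mean and unit variance across samples."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # leave constant genes at zero instead of dividing by 0
    return (expr - mean) / std


def normalize_by_group(expr: np.ndarray, is_tumor: np.ndarray) -> np.ndarray:
    """Normalize tumor and healthy samples separately, as described above."""
    out = np.empty_like(expr, dtype=float)
    for mask in (is_tumor, ~is_tumor):
        if mask.any():
            out[:, mask] = zscore_genes(quantile_normalize(expr[:, mask]))
    return out
```

Splitting by group before normalizing keeps the shared-distribution assumption of quantile normalization within each group, at the cost of removing any global shift between healthy and tumor samples.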

Feature selection for better model performance and shorter computation times. Since the majority of genes are housekeeping genes, not all of the genes present in the dataset will be useful in distinguishing between different tissues. Therefore, we performed differential gene expression analysis to select only features useful for the subsequent machine learning steps. To identify differentially expressed genes (genes that are expressed in one tissue, but not the others), we compared each subtype with the other subtypes of the same cancer type (p < 0.05), as outlined by Zhao et al. (2020). Bonferroni correction was used to avoid spurious positives, and a more conservative p-value was used to avoid filtering out some potentially useful features.
t-SNE and Principal component analysis (PCA) were used to analyze the effectiveness of preprocessing. Ideally, the data should segregate by tissue type, but the batch effect may be a confounding variable when different datasets are merged. Therefore, we included batch as a covariate to check whether our preprocessing successfully corrected for batch effect and whether true variation has been preserved. Since the number of features was large, t-SNE was performed on reduced data (20 principal components) to suppress any noise and speed up the computation of pairwise distances between samples.
Since our dataset was imbalanced, we used an SVC model, which assigns a weight to each class and so avoids the need for undersampling. With SVC, we obtained an accuracy of 0.98 in the internal 10-fold cross-validation step and an accuracy of 0.97 on the test dataset, which contained true metastatic cancer samples for which the tissue of primary origin was known.
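The feature selection and classification steps can be combined into a single scikit-learn pipeline, sketched below on synthetic data. This is an illustration under stated assumptions rather than the contents of ML.py: `SelectFwe` with an ANOVA F-test stands in for the per-subtype differential expression analysis (it keeps features whose p-value passes a Bonferroni-corrected threshold), and the linear kernel, `class_weight="balanced"`, and all function names are choices made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_classifier(alpha: float = 0.05):
    """Bonferroni-filtered feature selection followed by a class-weighted SVC."""
    return make_pipeline(
        # Keeps genes whose F-test p-value is below alpha / n_features.
        SelectFwe(score_func=f_classif, alpha=alpha),
        # 'balanced' weights each class inversely to its frequency.
        SVC(kernel="linear", class_weight="balanced"),
    )


def cross_validated_accuracy(X, y, n_splits: int = 10, seed: int = 0):
    """Per-fold accuracy from stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(build_classifier(), X, y, cv=cv, scoring="accuracy")


def make_demo_data(seed: int = 0):
    """Synthetic, imbalanced (samples x genes) data with three tissue classes."""
    rng = np.random.default_rng(seed)
    sizes = [60, 30, 15]
    y = np.repeat(np.arange(3), sizes)
    X = rng.normal(size=(y.size, 200))
    for cls in range(3):
        # Five marker genes per class carry the signal; the rest is noise.
        X[y == cls, cls * 5:(cls + 1) * 5] += 3.0
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    scores = cross_validated_accuracy(X, y)
    print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```

Placing the selection step inside the pipeline means it is refit on the training portion of every fold, so the held-out samples never influence which genes are kept.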

Conclusion

Our team developed a bioinformatic pipeline that can accurately identify the site of origin of a metastatic cancer based on its transcriptomic profile obtained via either RNA-seq or microarray technologies. According to our results, we made a predictive improvement to an existing method which had an average accuracy of 86.7% (Liu et al., 2020). In summary, our project demonstrates the utility of machine learning algorithms to decode gene expression profiles and better meet the clinical challenge of identifying the primary site of multiple cancers.

References

Cancer. (2018). Retrieved 6 February 2021, from https://www.who.int/news-room/fact-sheets/detail/cancer
Carcinoma of Unknown Primary Treatment. (2018). Retrieved 6 February 2021, from https://www.cancer.gov/types/unknown-primary/patient/unknown-primary-treatment-pdq
Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1), 44–57. https://doi.org/10.1038/nprot.2008.211
Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37(1), 1–13. https://doi.org/10.1093/nar/gkn923
Liu, X., Li, L., Peng, L., Wang, B., Lang, J., & Lu, Q. et al. (2020). Predicting Cancer Tissue-of-Origin by a Machine Learning Method Using DNA Somatic Mutation Data. Frontiers In Genetics, 11. doi: 10.3389/fgene.2020.00674
scikit-learn 0.24.1 documentation. Retrieved 6 February 2021, from https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Wei I H, Shi Y, Jiang H, Kumar-Sinha C, and Chinnaiyan A M (2014). RNA-Seq Accurately Identifies Cancer Biomarker Signatures to Distinguish Tissue of Origin. Neoplasia 16, 918–927
Zhao Y, Pan Z, Namburi S, Pattison A, Posner A, Balachander S, Paisie C A , Reddi H V, Rueter J, Gill A J, Fox S, Raghav K P S, Flynn W F, Tothill R W, Li S, Karuturi R K M, George J (2020). CUP-AI-Dx: A tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence. EBioMedicine 61, 103030

