-
Notifications
You must be signed in to change notification settings - Fork 9
Tutorial: Nested Cross Validation with Multiple Datatypes
In this example, we will use netDx to run nested cross-validation design using multiple datatypes and custom similarity metrics.
suppressWarnings(suppressMessages(require(netDx)))
suppressWarnings(suppressMessages(require(netDx.examples)))
# load example data
load(sprintf("%s/extdata/nestedCV_input.rda",
path.package("netDx.examples")))
# define custom similarity function - one for gene expression nets and another, for clinical
KIRC_makeNets <- function(dataList, groupList, netDir,...) {
netList <- c()
# make RNA nets: group by pathway
if (!is.null(groupList[["rna"]])) {
netList <- makePSN_NamedMatrix(dataList$rna,
rownames(dataList$rna),
groupList[["rna"]],netDir,verbose=FALSE,
writeProfiles=TRUE,...)
netList <- unlist(netList)
cat(sprintf("Made %i RNA pathway nets\n", length(netList)))
}
# make clinical nets,one net for each variable
netList2 <- c()
if (!is.null(groupList[["clinical"]])) {
netList2 <- makePSN_NamedMatrix(dataList$clinical,
rownames(dataList$clinical),
groupList[["clinical"]],netDir,
simMetric="custom",customFunc=netDx::normDiff, # custom function
sparsify=TRUE,verbose=TRUE,append=TRUE,...)
}
netList2 <- unlist(netList2)
cat(sprintf("Made %i clinical nets\n", length(netList2)))
netList <- c(netList,netList2)
cat(sprintf("Total of %i nets\n", length(netList)))
return(netList)
}
# run nested CV call
runPredictor_nestedCV(pheno,
dataList=dats,groupList=groupList,
makeNetFunc=KIRC_makeNets, ### custom network creation function
outDir=sprintf("%s/nestedCV_output",getwd()), ## absolute path
numCores=1L,nFoldCV=2L, CVcutoff=1L,numSplits=2L)
outDir <- sprintf("%s/nestedCV_output",getwd())
# look at patient labels for first train/test split
pred <- read.delim(sprintf("%s/rng1/predictionResults.txt",outDir),h=T,as.is=T)
head(pred)
# look at feature scores for first train/test split
sc <- read.delim(sprintf("%s/rng1/SURVIVEYES/GM_results/SURVIVEYES_pathway_CV_score.txt",outDir),
sep="\t",h=T,as.is=T)
head(sc)
- Introduction
- Setup
- Data
- Design custom patient similarity networks (features)
- Define patient similarity for each network
- Run nested cross-validation
- Examine results directory
- References
In the nested design, cross-validation is performed repeatedly over multiple random splits of the data into train and blind test partitions. Feature selected networks are those that consistently score highly across the multiple splits.
Conceptually, this is what the higher-level logic looks like for a nested cross-validation design with 10-fold CV in the inner loop, and 100 splits in the outer loop:
(Note: these aren’t real function calls; this block just serves to illustrate the concept of the nested CV design for our purposes)
outerLoop <- 100 # num times to split data into train/blind test samples
innerLoop <- 10 # num folds for cross-validation, also max score for a network
netScores <- list() # collect <outerLoop> set of netScores
perf <- list() # collect <outerLoop> set of test evaluations
for k in 1:outerLoop
[train, test] <- splitData(80:20) # split data using RNG seed
netScores[[k]] <- runCV(train)
perf[[k]] <- collectPerformance(netScores[[k]], test)
end
suppressWarnings(suppressMessages(require(netDx)))
suppressWarnings(suppressMessages(require(netDx.examples)))
In this example, we use data from The Cancer Genome Atlas (http://cancergenome.nih.gov/), downloaded from the PanCancer Survival project (https://www.synapse.org/#!Synapse:syn1710282). We integrate gene expression profiles and clinical variables from renal clear cell carcinoma tumours to predict poor and good survival (Refs 1-2). The data are from 150 tumours and the classes are SURVIVEYES
and SURVIVENO
.
load(sprintf("%s/extdata/nestedCV_input.rda",
path.package("netDx.examples")))
head(pheno)
## ID age grade stage STATUS_INT STATUS
## 1 TCGA-AK-3428 62 G2 Stage III 1 SURVIVEYES
## 2 TCGA-AK-3434 72 G2 Stage I 1 SURVIVEYES
## 3 TCGA-B0-4688 46 G4 Stage IV 0 SURVIVENO
## 4 TCGA-B0-4690 65 G3 Stage IV 0 SURVIVENO
## 5 TCGA-B0-4691 55 G3 Stage IV 0 SURVIVENO
## 6 TCGA-B0-4693 72 G4 Stage III 0 SURVIVENO
netDx allows the user to define a custom function that takes patient data and variable groupings as input, and returns a set of patient similarity networks (PSN) as output. The user can customize what datatypes are used, how they are grouped, and what defines patient similarity for a given datatype. When running the predictor (next section), the user simply passes this custom function as an input variable (the makeNetFunc
parameter).
Note: While netDx provides a high degree of flexibility in achieving your design of choice, it is up to the user to ensure that the design, i.e. the similarity metric and variable groupings, is appropriate for your application. Domain knowledge is almost likely required for good design.
netDx requires that this function take some generic parameters as input. These include:
-
dataList
: the patient data -
groupList
: sets of input data that would correspond to individual networks (e.g. genes grouped into pathways) -
netDir
: the directory where the resulting PSN would be stored. This section provides more details on thedataList
andgroupList
variables.
dataList
This contains the input patient data for the predictor. Each key is a datatype, while each value is the corresponding data matrix. Note that columns are patients and rows are unit names (e.g. genes for rna, or variable names for clinical data).
Important: The software expects the patient order in the columns to match the row order in the pheno
table.
The names are datatypes:
names(dats)
## [1] "clinical" "rna"
The rna
container has the patient gene expression matrix:
head(dats[["rna"]][,1:6])
## TCGA-AK-3428 TCGA-AK-3434 TCGA-B0-4688 TCGA-B0-4690 TCGA-B0-4691
## ZNF121 72.1289 47.3807 64.2683 81.1642 22.5
## OR2J3 0.0000 0.0000 0.0000 0.0000 0.5
## HMOX1 10726.8908 3753.7897 675.0007 6300.2797 7766.0
## SYT4 0.0000 0.0000 0.0000 0.0000 2.5
## GPR137C 39.9160 38.3558 409.1136 30.5297 17.5
## AKAP12 5680.6723 2663.4704 1041.5136 3684.7812 3173.5
## TCGA-B0-4693
## ZNF121 84.6178
## OR2J3 0.0000
## HMOX1 7550.1990
## SYT4 0.0000
## GPR137C 27.4436
## AKAP12 4267.9413
The clinical
entry has the clinical data matrix:
head(dats[["clinical"]][,1:6])
## TCGA-AK-3428 TCGA-AK-3434 TCGA-B0-4688 TCGA-B0-4690 TCGA-B0-4691
## age 62 72 46 65 55
## grade 1 1 3 2 2
## stage 3 1 4 4 4
## TCGA-B0-4693
## age 72
## grade 3
## stage 3
Look at the number of units per datatype:
lapply(dats, nrow)
## $clinical
## [1] 3
##
## $rna
## [1] 20530
This object tells the predictor how to group units when constructing a network. For examples, genes may be grouped into a network representing a pathway. This object is a list; the names match those of dataList
while each value is itself a list and reflects a potential network.
names(groupList)
## [1] "rna" "clinical"
For example, here are the networks to be created with RNA data. Genes corresponding to pathways are to be grouped into individual network. Such a groupList would create pathway-level networks:
groupList[["rna"]][1:3]
## $GUANOSINE_NUCLEOTIDES__I_DE_NOVO__I__BIOSYNTHESIS
## [1] "NME7" "NME6" "RRM2B" "GMPS" "NME2" "NME3" "NME4"
## [8] "NME5" "RRM2" "NME1" "GUK1" "RRM1" "IMPDH2" "IMPDH1"
##
## $RETINOL_BIOSYNTHESIS
## [1] "RDH10" "DHRS4" "LRAT" "LIPC" "CES5A" "DHRS9" "RDH11" "DHRS3"
## [9] "CES1" "RBP1" "CES4A" "RBP2" "PNLIP" "RBP5" "RBP4" "CES2"
##
## $`MUCIN_CORE_1_AND_CORE_2__I_O__I_-GLYCOSYLATION`
## [1] "GALNT1" "GCNT4" "GALNT7" "GCNT3" "GCNT7" "GALNT6" "GALNT4"
## [8] "GALNT5" "ST3GAL2" "ST3GAL1" "ST3GAL4" "GALNT10" "GALNT15" "GALNTL6"
## [15] "B3GNT3" "GALNT16" "GALNT18" "GALNT11" "GALNT12" "GCNT1" "C1GALT1"
## [22] "GALNT13" "GALNT14" "WBSCR17" "GALNT8" "GALNT9" "GALNT2" "GALNT3"
For clinical data, we want to keep each variable as its own network:
head(groupList[["clinical"]])
## $age
## [1] "age"
##
## $grade
## [1] "grade"
##
## $stage
## [1] "stage"
To speed up this example, reduce the number of input networks created by limiting to three pathways (prediction may not be great!):
groupList[["rna"]] <- groupList[["rna"]][1:3]
This function is defined by the user and tells the predictor how to create networks from the provided input data.
This function must take dataList
,groupList
, and netDir
as input variables. The residual ...
parameter serves to pass additional variables to makePSN_NamedMatrix()
, notably numCores
.
In this particular example, the custom similarity function does the following:
Creates _pathway-level networks from RNA _data using the default Pearson correlation measure makePSN_NamedMatrix(writeProfiles=TRUE,...)
Creates _variable-level networks from clinical _data using a custom similarity function of normalized difference: makePSN_NamedMatrix(writeProfiles=FALSE,simMetric="custom",customFunc=normDiff)
.
KIRC_makeNets <- function(dataList, groupList, netDir,...) {
netList <- c()
# make RNA nets: group by pathway
if (!is.null(groupList[["rna"]])) {
netList <- makePSN_NamedMatrix(dataList$rna,
rownames(dataList$rna),
groupList[["rna"]],netDir,verbose=FALSE,
writeProfiles=TRUE,...)
netList <- unlist(netList)
cat(sprintf("Made %i RNA pathway nets\n", length(netList)))
}
# make clinical nets,one net for each variable
netList2 <- c()
if (!is.null(groupList[["clinical"]])) {
netList2 <- makePSN_NamedMatrix(dataList$clinical,
rownames(dataList$clinical),
groupList[["clinical"]],netDir,
simMetric="custom",customFunc=netDx::normDiff, # custom function
sparsify=TRUE,verbose=TRUE,append=TRUE,...)
}
netList2 <- unlist(netList2)
cat(sprintf("Made %i clinical nets\n", length(netList2)))
netList <- c(netList,netList2)
cat(sprintf("Total of %i nets\n", length(netList)))
return(netList)
}
Note: dataList
and groupList
are generic containers that can contain whatever object the user requires to create PSN. The custom function gives the user complete flexibility in feature design. For instance, a dataset with mutation data may pass GenomicRanges
objects in dataList
, to represent patient mutations, and groupList
would contain GenomicRanges
corresponding to loci to be grouped together in the resulting network.
Finally we call the function that runs the netDx predictor. We provide:
number of train/test splits (outer loop of nested CV: numSplits
parameter),
number of folds for cross-validation (inner loop: nFoldCV
),
cutoff for networks within an inner loop (CVcutoff
),
and the information to create the PSN, including patient data (dataList
), how variables are to be grouped into networks (groupList
) and the custom function to generate features (makeNetFunc
).
Running the below takes a lot of time so we have commented it out. Feel free to uncomment and run. Change numCores
to match the number of cores available on your machine for parallel processing.
The call below runs nested two-fold cross-validation over 2 splits. Within each split, it:
splits data into train/blind test using the default split of 80:20
runs 2-fold cross-validation using the training samples
uses networks that score >=1 out of 2 (CVcutoff
) to classify blind test samples.
!!! Note:Setting numCores
>1 will not work in Rstudio. To increase numCores
, use knitr::purl("NestedCV_MultiData.Rmd")
to extract the R code from this notebook into its own file, change numCores
and run in an interactive R client (i.e. by calling “R” on command-line). !!!
runPredictor_nestedCV(pheno,
dataList=dats,groupList=groupList,
makeNetFunc=KIRC_makeNets, ### custom network creation function
outDir=sprintf("%s/nestedCV_output",getwd()), ## absolute path
numCores=1L,nFoldCV=2L, CVcutoff=1L,numSplits=2L)
## Predictor started at:
## [1] "2017-09-13 11:01:27 EDT"
## -------------------------------
## # patients = 150
## # classes = 2 { SURVIVEYES,SURVIVENO }
## Sample breakdown by class
##
## SURVIVENO SURVIVEYES
## 70 80
## Nested CV design = 2 CV x 2 splits
## Datapoints:
## clinical: 3 units
## rna: 20530 units
## # input nets provided:
##
##
## Custom function to generate input nets:
## function(dataList, groupList, netDir,...) {
## netList <- c()
## # make RNA nets: group by pathway
## if (!is.null(groupList[["rna"]])) {
## netList <- makePSN_NamedMatrix(dataList$rna,
## rownames(dataList$rna),
## groupList[["rna"]],netDir,verbose=FALSE,
## writeProfiles=TRUE,...)
## netList <- unlist(netList)
## cat(sprintf("Made %i RNA pathway nets\n", length(netList)))
## }
##
## # make clinical nets,one net for each variable
## netList2 <- c()
## if (!is.null(groupList[["clinical"]])) {
## netList2 <- makePSN_NamedMatrix(dataList$clinical,
## rownames(dataList$clinical),
## groupList[["clinical"]],netDir,
## simMetric="custom",customFunc=netDx::normDiff, # custom function
## sparsify=TRUE,verbose=TRUE,append=TRUE,...)
## }
## netList2 <- unlist(netList2)
## cat(sprintf("Made %i clinical nets\n", length(netList2)))
## netList <- c(netList,netList2)
## cat(sprintf("Total of %i nets\n", length(netList)))
## return(netList)
## }
## -------------------------------
##
## -------------------------------
## RNG seed = 1
## -------------------------------
## Setting seed for reproducibility: 5
## IS_TRAIN
## STATUS TRAIN TEST
## SURVIVENO 56 14
## SURVIVEYES 64 16
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
## * Creating placeholder files
## * Populating database files, recoding identifiers
## * Converting profiles to interaction networks
## user system elapsed
## 0.008 0.000 4.549
## Got 6 networks from 6 profiles
## * Build GeneMANIA index
## * Build GeneMANIA cache
## * Cleanup
## ******
## Class: SURVIVEYES
##
## nonpred SURVIVEYES <NA>
## 56 64 0
## Writing GM queries: Inner CV loop:set seed: 42
## Read 64 IDs
## 2-fold CV
## Each iter will sample 32 records, 32 will be test
## chunk 1: 32 test (1-32); 32 query
## chunk 2: 32 test (33-64); 32 query
## 1 2 Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.010 4.452 23.210 25.000 43.760 52.570
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.64 16.45 18.86 20.00 26.34 29.70
## filter_WtSum = 100.0; 5 of 5 networks left
##
## ******
## Class: SURVIVENO
##
## nonpred SURVIVENO <NA>
## 64 56 0
## Writing GM queries: Inner CV loop:set seed: 42
## Read 56 IDs
## 2-fold CV
## Each iter will sample 28 records, 28 will be test
## chunk 1: 28 test (1-28); 28 query
## chunk 2: 28 test (29-56); 28 query
## 1 2 Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.86 12.55 21.83 25.00 34.28 55.48
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.42 29.35 36.27 33.34 38.80 41.32
## filter_WtSum = 100.0; 3 of 3 networks left
## SURVIVEYES: 6 networks
## Filter set provided; only making nets for those provided here
## rna: 3 of 3 nets left
## clinical: 3 of 3 nets left
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
## * Creating placeholder files
## * Populating database files, recoding identifiers
## * Converting profiles to interaction networks
## user system elapsed
## 0.009 0.001 4.061
## Got 6 networks from 6 profiles
## * Build GeneMANIA index
## * Build GeneMANIA cache
## * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES/SURVIVEYES_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES/SURVIVEYES_query.log"
## * Attempt 1 : SURVIVEYES_query
## QueryRunner time taken: 6.5 s
## SURVIVENO: 5 networks
## Filter set provided; only making nets for those provided here
## rna: 2 of 3 nets left
## clinical: 3 of 3 nets left
## Made 2 RNA pathway nets
## Made 3 clinical nets
## Total of 5 nets
## Got 5 networks
## * Creating placeholder files
## * Populating database files, recoding identifiers
## * Converting profiles to interaction networks
## user system elapsed
## 0.009 0.000 3.775
## Got 5 networks from 5 profiles
## * Build GeneMANIA index
## * Build GeneMANIA cache
## * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO/SURVIVENO_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO/SURVIVENO_query.log"
## * Attempt 1 : SURVIVENO_query
## QueryRunner time taken: 4.3 s
## *** 120 rows have an NA prediction (probably query samples that were not not ranked
## Accuracy on 30 blind test subjects = 90.0%
## -------------------------------
## RNG seed = 2
## -------------------------------
## Setting seed for reproducibility: 10
## IS_TRAIN
## STATUS TRAIN TEST
## SURVIVENO 56 14
## SURVIVEYES 64 16
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
## * Creating placeholder files
## * Populating database files, recoding identifiers
## * Converting profiles to interaction networks
## user system elapsed
## 0.008 0.001 4.239
## Got 6 networks from 6 profiles
## * Build GeneMANIA index
## * Build GeneMANIA cache
## * Cleanup
## ******
## Class: SURVIVEYES
##
## nonpred SURVIVEYES <NA>
## 56 64 0
## Writing GM queries: Inner CV loop:set seed: 42
## Read 64 IDs
## 2-fold CV
## Each iter will sample 32 records, 32 will be test
## chunk 1: 32 test (1-32); 32 query
## chunk 2: 32 test (33-64); 32 query
## 1 2 Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.940 7.292 12.930 25.000 30.640 70.200
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.54 11.50 14.68 16.67 17.18 34.66
## filter_WtSum = 100.0; 6 of 6 networks left
##
## ******
## Class: SURVIVENO
##
## nonpred SURVIVENO <NA>
## 64 56 0
## Writing GM queries: Inner CV loop:set seed: 42
## Read 56 IDs
## 2-fold CV
## Each iter will sample 28 records, 28 will be test
## chunk 1: 28 test (1-28); 28 query
## chunk 2: 28 test (29-56); 28 query
## 1 2 Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.20 10.45 19.16 25.00 33.71 51.47
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.670 7.045 12.420 33.330 49.160 85.910
## filter_WtSum = 100.0; 3 of 3 networks left
## SURVIVEYES: 6 networks
## Filter set provided; only making nets for those provided here
## rna: 3 of 3 nets left
## clinical: 3 of 3 nets left
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
## * Creating placeholder files
## * Populating database files, recoding identifiers
## * Converting profiles to interaction networks
## user system elapsed
## 0.009 0.000 3.986
## Got 6 networks from 6 profiles
## * Build GeneMANIA index
## * Build GeneMANIA cache
## * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES/SURVIVEYES_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES/SURVIVEYES_query.log"
## * Attempt 1 : SURVIVEYES_query
## QueryRunner time taken: 7.1 s
## SURVIVENO: 5 networks
## Filter set provided; only making nets for those provided here
## rna: 3 of 3 nets left
## clinical: 2 of 3 nets left
## Made 3 RNA pathway nets
## Made 2 clinical nets
## Total of 5 nets
## Got 5 networks
## * Creating placeholder files
## * Populating database files, recoding identifiers
## * Converting profiles to interaction networks
## user system elapsed
## 0.009 0.000 4.070
## Got 5 networks from 5 profiles
## * Build GeneMANIA index
## * Build GeneMANIA cache
## * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO/SURVIVENO_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO/SURVIVENO_query.log"
## * Attempt 1 : SURVIVENO_query
## QueryRunner time taken: 5.7 s
## *** 120 rows have an NA prediction (probably query samples that were not not ranked
## Accuracy on 30 blind test subjects = 76.7%
## Predictor completed at:
## [1] "2017-09-13 11:05:16 EDT"
The results directory contains the log file (log.txt
) and results for each split are stored in its own directory (here, rng1/
and rng2/
).
outDir <- sprintf("%s/nestedCV_output",getwd())
dir(outDir)
## [1] "inputNets.txt" "log.txt" "rng1" "rng2"
Look into one of the split directories. predictionResults.txt
contains blind test prediction labels for that split. Each class has its own directory with corresponding cross-validation results.
dir(sprintf("%s/rng1",outDir))
## [1] "ids.txt" "predictionResults.txt" "SURVIVENO"
## [4] "SURVIVEYES"
pred <- read.delim(sprintf("%s/rng1/predictionResults.txt",outDir),h=T,as.is=T)
head(pred)
## ID age grade stage STATUS_INT STATUS TT_STATUS
## 1 TCGA-B0-4688 46 G4 Stage IV 0 SURVIVENO TEST
## 2 TCGA-B0-4693 72 G4 Stage III 0 SURVIVENO TEST
## 3 TCGA-B0-4696 58 G3 Stage III 0 SURVIVENO TEST
## 4 TCGA-B0-4712 76 G3 Stage IV 0 SURVIVENO TEST
## 5 TCGA-B0-4816 49 G3 Stage II 0 SURVIVENO TEST
## 6 TCGA-B0-4817 81 G3 Stage III 0 SURVIVENO TEST
## SURVIVEYES_SCORE SURVIVENO_SCORE PRED_CLASS
## 1 0.01162791 0.9893617 SURVIVENO
## 2 0.33720930 0.7127660 SURVIVENO
## 3 0.05813953 0.7021277 SURVIVENO
## 4 0.09302326 0.9787234 SURVIVENO
## 5 0.52325581 0.5425532 SURVIVENO
## 6 0.27906977 0.7659574 SURVIVENO
The cross-validation results are in the <className>/GM_results/
directory. The set of network scores for the split are in <className>/GM_results/<className>_pathway_CV_score.txt
.
dir(sprintf("%s/rng1/SURVIVEYES",outDir))
## [1] "GM_results"
## [2] "ids.txt"
## [3] "SURVIVEYES_query"
## [4] "SURVIVEYES_query-results.report.txt"
## [5] "SURVIVEYES_query-results.report.txt.NRANK"
## [6] "SURVIVEYES_query-results.report.txt.PRANK"
## [7] "SURVIVEYES_query.log"
## [8] "tmp"
sc <- read.delim(sprintf("%s/rng1/SURVIVEYES/GM_results/SURVIVEYES_pathway_CV_score.txt",outDir),
sep="\t",h=T,as.is=T)
head(sc)
## name score
## 1 stage_cont 2
## 2 grade_cont 2
## 3 age_cont 2
## 4 RETINOL_BIOSYNTHESIS.profile 1
## 5 GUANOSINE_NUCLEOTIDES__I_DE_NOVO__I__BIOSYNTHESIS.profile 1
## 6 MUCIN_CORE_1_AND_CORE_2__I_O__I_-GLYCOSYLATION.profile 1
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] netDx.examples_0.1 netDx_0.94 RColorBrewer_1.1-2
## [4] pracma_2.0.7 ROCR_1.0-7 gplots_3.0.1
## [7] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3 IRanges_2.8.2
## [10] S4Vectors_0.12.2 BiocGenerics_0.20.0 combinat_0.0-8
## [13] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3
## [16] bigmemory_4.5.19 bigmemory.sri_0.1.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 compiler_3.3.1 plyr_1.8.4
## [4] XVector_0.14.1 bitops_1.0-6 tools_3.3.1
## [7] zlibbioc_1.20.0 digest_0.6.12 tibble_1.3.3
## [10] evaluate_0.10.1 gtable_0.2.0 pkgconfig_2.0.1
## [13] rlang_0.1.2 igraph_1.1.2 yaml_2.1.14
## [16] httr_1.3.1 stringr_1.2.0 knitr_1.17
## [19] gtools_3.5.0 caTools_1.17.1 rprojroot_1.2
## [22] grid_3.3.1 r2cytoscape_0.0.3 R6_2.2.2
## [25] XML_3.98-1.9 rmarkdown_1.6 RJSONIO_1.3-0
## [28] gdata_2.18.0 reshape2_1.4.2 ggplot2_2.2.1
## [31] magrittr_1.5 backports_1.1.0 scales_0.4.1
## [34] codetools_0.2-15 htmltools_0.3.6 colorspace_1.3-2
## [37] quadprog_1.5-5 KernSmooth_2.23-15 stringi_1.1.5
## [40] lazyeval_0.2.0 munsell_0.4.3 RCurl_1.95-4.8
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] netDx.examples_0.1 netDx_0.94 RColorBrewer_1.1-2
## [4] pracma_2.0.7 ROCR_1.0-7 gplots_3.0.1
## [7] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3 IRanges_2.8.2
## [10] S4Vectors_0.12.2 BiocGenerics_0.20.0 combinat_0.0-8
## [13] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3
## [16] bigmemory_4.5.19 bigmemory.sri_0.1.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 compiler_3.3.1 plyr_1.8.4
## [4] XVector_0.14.1 bitops_1.0-6 tools_3.3.1
## [7] zlibbioc_1.20.0 digest_0.6.12 tibble_1.3.3
## [10] evaluate_0.10.1 gtable_0.2.0 pkgconfig_2.0.1
## [13] rlang_0.1.2 igraph_1.1.2 yaml_2.1.14
## [16] httr_1.3.1 stringr_1.2.0 knitr_1.17
## [19] gtools_3.5.0 caTools_1.17.1 rprojroot_1.2
## [22] grid_3.3.1 r2cytoscape_0.0.3 R6_2.2.2
## [25] XML_3.98-1.9 rmarkdown_1.6 RJSONIO_1.3-0
## [28] gdata_2.18.0 reshape2_1.4.2 ggplot2_2.2.1
## [31] magrittr_1.5 backports_1.1.0 scales_0.4.1
## [34] codetools_0.2-15 htmltools_0.3.6 colorspace_1.3-2
## [37] quadprog_1.5-5 KernSmooth_2.23-15 stringi_1.1.5
## [40] lazyeval_0.2.0 munsell_0.4.3 RCurl_1.95-4.8
- Yuan, Y. et al. (2014) Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol 32, 644-52.
- The Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43-9.