Tutorial: Nested Cross Validation with Multiple Datatypes

In this example, we will use netDx to run nested cross-validation design using multiple datatypes and custom similarity metrics.

TL;DR

suppressWarnings(suppressMessages(require(netDx)))
suppressWarnings(suppressMessages(require(netDx.examples)))

# load example data
load(sprintf("%s/extdata/nestedCV_input.rda",
             path.package("netDx.examples")))

# define custom similarity function - one for gene expression nets and another, for clinical
KIRC_makeNets <- function(dataList, groupList, netDir,...) {
    netList <- c()
    # make RNA nets: group by pathway
    if (!is.null(groupList[["rna"]])) { 
    netList <- makePSN_NamedMatrix(dataList$rna, 
                    rownames(dataList$rna),
                groupList[["rna"]],netDir,verbose=FALSE, 
                writeProfiles=TRUE,...) 
    netList <- unlist(netList)
    cat(sprintf("Made %i RNA pathway nets\n", length(netList)))
    }
    
    # make clinical nets,one net for each variable
    netList2 <- c()
    if (!is.null(groupList[["clinical"]])) {
    netList2 <- makePSN_NamedMatrix(dataList$clinical, 
        rownames(dataList$clinical),
        groupList[["clinical"]],netDir,
        simMetric="custom",customFunc=netDx::normDiff, # custom function
        sparsify=TRUE,verbose=TRUE,append=TRUE,...)
    }
    netList2 <- unlist(netList2)
    cat(sprintf("Made %i clinical nets\n", length(netList2)))
    netList <- c(netList,netList2) 
    cat(sprintf("Total of %i nets\n", length(netList)))
    return(netList)
}

# run nested CV call
runPredictor_nestedCV(pheno,
   dataList=dats,groupList=groupList,
   makeNetFunc=KIRC_makeNets, ### custom network creation function
   outDir=sprintf("%s/nestedCV_output",getwd()), ## absolute path
   numCores=1L,nFoldCV=2L, CVcutoff=1L,numSplits=2L)

outDir <- sprintf("%s/nestedCV_output",getwd())

# look at patient labels for first train/test split
pred <- read.delim(sprintf("%s/rng1/predictionResults.txt",outDir),h=T,as.is=T)
head(pred)

# look at feature scores for first train/test split
sc <- read.delim(sprintf("%s/rng1/SURVIVEYES/GM_results/SURVIVEYES_pathway_CV_score.txt",outDir),
   sep="\t",h=T,as.is=T)
head(sc)

Introduction

In the nested design, cross-validation is performed repeatedly over multiple random splits of the data into train and blind test partitions. Feature selected networks are those that consistently score highly across the multiple splits.

Conceptually, this is what the higher-level logic looks like for a nested cross-validation design with 10-fold CV in the inner loop, and 100 splits in the outer loop:

(Note: these aren’t real function calls; this block just serves to illustrate the concept of the nested CV design for our purposes)

outerLoop <- 100     # num times to split data into train/blind test samples
innerLoop <- 10      # num folds for cross-validation, also max score for a network
netScores <- list()  # collect <outerLoop> set of netScores
perf <- list()       # collect <outerLoop> set of test evaluations

for k in 1:outerLoop
 [train, test] <- splitData(80:20) # split data using RNG seed
 netScores[[k]] <- runCV(train)
 perf[[k]] <- collectPerformance(netScores[[k]], test)
end

Setup

suppressWarnings(suppressMessages(require(netDx)))
suppressWarnings(suppressMessages(require(netDx.examples)))

Data

In this example, we use data from The Cancer Genome Atlas (http://cancergenome.nih.gov/), downloaded from the PanCancer Survival project (https://www.synapse.org/#!Synapse:syn1710282). We integrate gene expression profiles and clinical variables from renal clear cell carcinoma tumours to predict poor and good survival (Refs 1-2). The data are from 150 tumours and the classes are SURVIVEYES and SURVIVENO.

load(sprintf("%s/extdata/nestedCV_input.rda",
             path.package("netDx.examples")))
head(pheno)
##             ID age grade     stage STATUS_INT     STATUS
## 1 TCGA-AK-3428  62    G2 Stage III          1 SURVIVEYES
## 2 TCGA-AK-3434  72    G2   Stage I          1 SURVIVEYES
## 3 TCGA-B0-4688  46    G4  Stage IV          0  SURVIVENO
## 4 TCGA-B0-4690  65    G3  Stage IV          0  SURVIVENO
## 5 TCGA-B0-4691  55    G3  Stage IV          0  SURVIVENO
## 6 TCGA-B0-4693  72    G4 Stage III          0  SURVIVENO

Design custom patient similarity networks (features)

netDx allows the user to define a custom function that takes patient data and variable groupings as input, and returns a set of patient similarity networks (PSN) as output. The user can customize what datatypes are used, how they are grouped, and what defines patient similarity for a given datatype. When running the predictor (next section), the user simply passes this custom function as an input variable (the makeNetFunc parameter).

Note: While netDx provides a high degree of flexibility in achieving your design of choice, it is up to the user to ensure that the design, i.e. the similarity metric and variable groupings, is appropriate for your application. Domain knowledge is almost likely required for good design.

netDx requires that this function take some generic parameters as input. These include:

dataList: the patient data
groupList: sets of input data that would correspond to individual networks (e.g. genes grouped into pathways)
netDir: the directory where the resulting PSN would be stored. This section provides more details on the dataList and groupList variables.

dataList This contains the input patient data for the predictor. Each key is a datatype, while each value is the corresponding data matrix. Note that columns are patients and rows are unit names (e.g. genes for rna, or variable names for clinical data).

Important: The software expects the patient order in the columns to match the row order in the pheno table.

The names are datatypes:

names(dats)

## [1] "clinical" "rna"

The rna container has the patient gene expression matrix:

head(dats[["rna"]][,1:6])

##         TCGA-AK-3428 TCGA-AK-3434 TCGA-B0-4688 TCGA-B0-4690 TCGA-B0-4691
## ZNF121       72.1289      47.3807      64.2683      81.1642         22.5
## OR2J3         0.0000       0.0000       0.0000       0.0000          0.5
## HMOX1     10726.8908    3753.7897     675.0007    6300.2797       7766.0
## SYT4          0.0000       0.0000       0.0000       0.0000          2.5
## GPR137C      39.9160      38.3558     409.1136      30.5297         17.5
## AKAP12     5680.6723    2663.4704    1041.5136    3684.7812       3173.5
##         TCGA-B0-4693
## ZNF121       84.6178
## OR2J3         0.0000
## HMOX1      7550.1990
## SYT4          0.0000
## GPR137C      27.4436
## AKAP12     4267.9413

The clinical entry has the clinical data matrix:

head(dats[["clinical"]][,1:6])

##       TCGA-AK-3428 TCGA-AK-3434 TCGA-B0-4688 TCGA-B0-4690 TCGA-B0-4691
## age             62           72           46           65           55
## grade            1            1            3            2            2
## stage            3            1            4            4            4
##       TCGA-B0-4693
## age             72
## grade            3
## stage            3

Look at the number of units per datatype:

lapply(dats, nrow)

## $clinical
## [1] 3
## 
## $rna
## [1] 20530

groupList

This object tells the predictor how to group units when constructing a network. For examples, genes may be grouped into a network representing a pathway. This object is a list; the names match those of dataList while each value is itself a list and reflects a potential network.

names(groupList)

## [1] "rna"      "clinical"

For example, here are the networks to be created with RNA data. Genes corresponding to pathways are to be grouped into individual network. Such a groupList would create pathway-level networks:

groupList[["rna"]][1:3]

## $GUANOSINE_NUCLEOTIDES__I_DE_NOVO__I__BIOSYNTHESIS
##  [1] "NME7"   "NME6"   "RRM2B"  "GMPS"   "NME2"   "NME3"   "NME4"  
##  [8] "NME5"   "RRM2"   "NME1"   "GUK1"   "RRM1"   "IMPDH2" "IMPDH1"
## 
## $RETINOL_BIOSYNTHESIS
##  [1] "RDH10" "DHRS4" "LRAT"  "LIPC"  "CES5A" "DHRS9" "RDH11" "DHRS3"
##  [9] "CES1"  "RBP1"  "CES4A" "RBP2"  "PNLIP" "RBP5"  "RBP4"  "CES2" 
## 
## $`MUCIN_CORE_1_AND_CORE_2__I_O__I_-GLYCOSYLATION`
##  [1] "GALNT1"  "GCNT4"   "GALNT7"  "GCNT3"   "GCNT7"   "GALNT6"  "GALNT4" 
##  [8] "GALNT5"  "ST3GAL2" "ST3GAL1" "ST3GAL4" "GALNT10" "GALNT15" "GALNTL6"
## [15] "B3GNT3"  "GALNT16" "GALNT18" "GALNT11" "GALNT12" "GCNT1"   "C1GALT1"
## [22] "GALNT13" "GALNT14" "WBSCR17" "GALNT8"  "GALNT9"  "GALNT2"  "GALNT3"

For clinical data, we want to keep each variable as its own network:

head(groupList[["clinical"]])

## $age
## [1] "age"
## 
## $grade
## [1] "grade"
## 
## $stage
## [1] "stage"

To speed up this example, reduce the number of input networks created by limiting to three pathways (prediction may not be great!):

groupList[["rna"]] <- groupList[["rna"]][1:3]

Define patient similarity for each network

This function is defined by the user and tells the predictor how to create networks from the provided input data.

This function must take dataList,groupList, and netDir as input variables. The residual ... parameter serves to pass additional variables to makePSN_NamedMatrix(), notably numCores.

In this particular example, the custom similarity function does the following:

Creates _pathway-level networks from RNA _data using the default Pearson correlation measure makePSN_NamedMatrix(writeProfiles=TRUE,...) Creates _variable-level networks from clinical _data using a custom similarity function of normalized difference: makePSN_NamedMatrix(writeProfiles=FALSE,simMetric="custom",customFunc=normDiff).

KIRC_makeNets <- function(dataList, groupList, netDir,...) {
    netList <- c()
    # make RNA nets: group by pathway
    if (!is.null(groupList[["rna"]])) { 
    netList <- makePSN_NamedMatrix(dataList$rna, 
                    rownames(dataList$rna),
                groupList[["rna"]],netDir,verbose=FALSE, 
                writeProfiles=TRUE,...) 
    netList <- unlist(netList)
    cat(sprintf("Made %i RNA pathway nets\n", length(netList)))
    }
    
    # make clinical nets,one net for each variable
    netList2 <- c()
    if (!is.null(groupList[["clinical"]])) {
    netList2 <- makePSN_NamedMatrix(dataList$clinical, 
        rownames(dataList$clinical),
        groupList[["clinical"]],netDir,
        simMetric="custom",customFunc=netDx::normDiff, # custom function
        sparsify=TRUE,verbose=TRUE,append=TRUE,...)
    }
    netList2 <- unlist(netList2)
    cat(sprintf("Made %i clinical nets\n", length(netList2)))
    netList <- c(netList,netList2) 
    cat(sprintf("Total of %i nets\n", length(netList)))
    return(netList)
}

Note: dataList and groupList are generic containers that can contain whatever object the user requires to create PSN. The custom function gives the user complete flexibility in feature design. For instance, a dataset with mutation data may pass GenomicRanges objects in dataList, to represent patient mutations, and groupList would contain GenomicRanges corresponding to loci to be grouped together in the resulting network.

Run nested cross-validation

Finally we call the function that runs the netDx predictor. We provide:

number of train/test splits (outer loop of nested CV: numSplits parameter), number of folds for cross-validation (inner loop: nFoldCV), cutoff for networks within an inner loop (CVcutoff), and the information to create the PSN, including patient data (dataList), how variables are to be grouped into networks (groupList) and the custom function to generate features (makeNetFunc).

Running the below takes a lot of time so we have commented it out. Feel free to uncomment and run. Change numCores to match the number of cores available on your machine for parallel processing.

The call below runs nested two-fold cross-validation over 2 splits. Within each split, it:

splits data into train/blind test using the default split of 80:20 runs 2-fold cross-validation using the training samples uses networks that score >=1 out of 2 (CVcutoff) to classify blind test samples.

!!! Note:Setting numCores>1 will not work in Rstudio. To increase numCores, use knitr::purl("NestedCV_MultiData.Rmd") to extract the R code from this notebook into its own file, change numCores and run in an interactive R client (i.e. by calling “R” on command-line). !!!

runPredictor_nestedCV(pheno,
   dataList=dats,groupList=groupList,
   makeNetFunc=KIRC_makeNets, ### custom network creation function
   outDir=sprintf("%s/nestedCV_output",getwd()), ## absolute path
   numCores=1L,nFoldCV=2L, CVcutoff=1L,numSplits=2L)

## Predictor started at:
## [1] "2017-09-13 11:01:27 EDT"
## -------------------------------
## # patients = 150
## # classes = 2 { SURVIVEYES,SURVIVENO }
## Sample breakdown by class
## 
##  SURVIVENO SURVIVEYES 
##         70         80 
## Nested CV design = 2 CV x 2 splits
## Datapoints:
##  clinical: 3 units
##  rna: 20530 units
## # input nets provided:
## 
## 
## Custom function to generate input nets:
## function(dataList, groupList, netDir,...) {
##  netList <- c()
##  # make RNA nets: group by pathway
##  if (!is.null(groupList[["rna"]])) { 
##  netList <- makePSN_NamedMatrix(dataList$rna, 
##                  rownames(dataList$rna),
##              groupList[["rna"]],netDir,verbose=FALSE, 
##              writeProfiles=TRUE,...) 
##  netList <- unlist(netList)
##  cat(sprintf("Made %i RNA pathway nets\n", length(netList)))
##  }
##  
##  # make clinical nets,one net for each variable
##  netList2 <- c()
##  if (!is.null(groupList[["clinical"]])) {
##  netList2 <- makePSN_NamedMatrix(dataList$clinical, 
##      rownames(dataList$clinical),
##      groupList[["clinical"]],netDir,
##      simMetric="custom",customFunc=netDx::normDiff, # custom function
##      sparsify=TRUE,verbose=TRUE,append=TRUE,...)
##  }
##  netList2 <- unlist(netList2)
##  cat(sprintf("Made %i clinical nets\n", length(netList2)))
##  netList <- c(netList,netList2) 
##  cat(sprintf("Total of %i nets\n", length(netList)))
##  return(netList)
## }
## -------------------------------
## 
## -------------------------------
## RNG seed = 1
## -------------------------------
## Setting seed for reproducibility: 5
##             IS_TRAIN
## STATUS       TRAIN TEST
##   SURVIVENO     56   14
##   SURVIVEYES    64   16
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
##  * Creating placeholder files
##  * Populating database files, recoding identifiers
##  * Converting profiles to interaction networks
##    user  system elapsed 
##   0.008   0.000   4.549 
## Got 6 networks from 6 profiles
##  * Build GeneMANIA index
##  * Build GeneMANIA cache
##   * Cleanup
## ******
## Class: SURVIVEYES
## 
##    nonpred SURVIVEYES       <NA> 
##         56         64          0 
##  Writing GM queries: Inner CV loop:set seed: 42
## Read 64 IDs
##  2-fold CV
##  Each iter will sample 32 records, 32 will be test
## chunk 1: 32 test (1-32); 32 query
## chunk 2: 32 test (33-64); 32 query
## 1 2 Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.010   4.452  23.210  25.000  43.760  52.570 
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.64   16.45   18.86   20.00   26.34   29.70 
## filter_WtSum = 100.0; 5 of 5 networks left
## 
## ******
## Class: SURVIVENO
## 
##   nonpred SURVIVENO      <NA> 
##        64        56         0 
##  Writing GM queries: Inner CV loop:set seed: 42
## Read 56 IDs
##  2-fold CV
##  Each iter will sample 28 records, 28 will be test
## chunk 1: 28 test (1-28); 28 query
## chunk 2: 28 test (29-56); 28 query
## 1 2 Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.86   12.55   21.83   25.00   34.28   55.48 
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.42   29.35   36.27   33.34   38.80   41.32 
## filter_WtSum = 100.0; 3 of 3 networks left
## SURVIVEYES: 6 networks
## Filter set provided; only making nets for those provided here
##  rna: 3 of 3 nets left
##  clinical: 3 of 3 nets left
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
##  * Creating placeholder files
##  * Populating database files, recoding identifiers
##  * Converting profiles to interaction networks
##    user  system elapsed 
##   0.009   0.001   4.061 
## Got 6 networks from 6 profiles
##  * Build GeneMANIA index
##  * Build GeneMANIA cache
##   * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner  --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES/SURVIVEYES_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVEYES/SURVIVEYES_query.log"
## * Attempt 1 : SURVIVEYES_query
## QueryRunner time taken: 6.5 s
## SURVIVENO: 5 networks
## Filter set provided; only making nets for those provided here
##  rna: 2 of 3 nets left
##  clinical: 3 of 3 nets left
## Made 2 RNA pathway nets
## Made 3 clinical nets
## Total of 5 nets
## Got 5 networks
##  * Creating placeholder files
##  * Populating database files, recoding identifiers
##  * Converting profiles to interaction networks
##    user  system elapsed 
##   0.009   0.000   3.775 
## Got 5 networks from 5 profiles
##  * Build GeneMANIA index
##  * Build GeneMANIA cache
##   * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner  --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO/SURVIVENO_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng1/SURVIVENO/SURVIVENO_query.log"
## * Attempt 1 : SURVIVENO_query
## QueryRunner time taken: 4.3 s
## *** 120 rows have an NA prediction (probably query samples that were not not ranked
## Accuracy on 30 blind test subjects = 90.0%
## -------------------------------
## RNG seed = 2
## -------------------------------
## Setting seed for reproducibility: 10
##             IS_TRAIN
## STATUS       TRAIN TEST
##   SURVIVENO     56   14
##   SURVIVEYES    64   16
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
##  * Creating placeholder files
##  * Populating database files, recoding identifiers
##  * Converting profiles to interaction networks
##    user  system elapsed 
##   0.008   0.001   4.239 
## Got 6 networks from 6 profiles
##  * Build GeneMANIA index
##  * Build GeneMANIA cache
##   * Cleanup
## ******
## Class: SURVIVEYES
## 
##    nonpred SURVIVEYES       <NA> 
##         56         64          0 
##  Writing GM queries: Inner CV loop:set seed: 42
## Read 64 IDs
##  2-fold CV
##  Each iter will sample 32 records, 32 will be test
## chunk 1: 32 test (1-32); 32 query
## chunk 2: 32 test (33-64); 32 query
## 1 2 Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.940   7.292  12.930  25.000  30.640  70.200 
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.54   11.50   14.68   16.67   17.18   34.66 
## filter_WtSum = 100.0; 6 of 6 networks left
## 
## ******
## Class: SURVIVENO
## 
##   nonpred SURVIVENO      <NA> 
##        64        56         0 
##  Writing GM queries: Inner CV loop:set seed: 42
## Read 56 IDs
##  2-fold CV
##  Each iter will sample 28 records, 28 will be test
## chunk 1: 28 test (1-28); 28 query
## chunk 2: 28 test (29-56); 28 query
## 1 2 Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.20   10.45   19.16   25.00   33.71   51.47 
## filter_WtSum = 100.0; 4 of 4 networks left
## Net weight distribution:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.670   7.045  12.420  33.330  49.160  85.910 
## filter_WtSum = 100.0; 3 of 3 networks left
## SURVIVEYES: 6 networks
## Filter set provided; only making nets for those provided here
##  rna: 3 of 3 nets left
##  clinical: 3 of 3 nets left
## Made 3 RNA pathway nets
## Made 3 clinical nets
## Total of 6 nets
## Got 6 networks
##  * Creating placeholder files
##  * Populating database files, recoding identifiers
##  * Converting profiles to interaction networks
##    user  system elapsed 
##   0.009   0.000   3.986 
## Got 6 networks from 6 profiles
##  * Build GeneMANIA index
##  * Build GeneMANIA cache
##   * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner  --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES/SURVIVEYES_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVEYES/SURVIVEYES_query.log"
## * Attempt 1 : SURVIVEYES_query
## QueryRunner time taken: 7.1 s
## SURVIVENO: 5 networks
## Filter set provided; only making nets for those provided here
##  rna: 3 of 3 nets left
##  clinical: 2 of 3 nets left
## Made 3 RNA pathway nets
## Made 2 clinical nets
## Total of 5 nets
## Got 5 networks
##  * Creating placeholder files
##  * Populating database files, recoding identifiers
##  * Converting profiles to interaction networks
##    user  system elapsed 
##   0.009   0.000   4.070 
## Got 5 networks from 5 profiles
##  * Build GeneMANIA index
##  * Build GeneMANIA cache
##   * Cleanup[1] "java -d64 -Xmx6G -cp /Library/Frameworks/R.framework/Versions/3.3/Resources/library/netDx/java/GeneMANIA-3.2B7.jar org.genemania.plugin.apps.QueryRunner  --data /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO/dataset --in flat --out flat --threads 1 --results /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO/SURVIVENO_query 2>&1 > /Users/shraddhapai/Documents/Software/netDx/examples/nestedCV_output/rng2/SURVIVENO/SURVIVENO_query.log"
## * Attempt 1 : SURVIVENO_query
## QueryRunner time taken: 5.7 s
## *** 120 rows have an NA prediction (probably query samples that were not not ranked
## Accuracy on 30 blind test subjects = 76.7%
## Predictor completed at:
## [1] "2017-09-13 11:05:16 EDT"

Examine results directory

The results directory contains the log file (log.txt) and results for each split are stored in its own directory (here, rng1/ and rng2/).

outDir <- sprintf("%s/nestedCV_output",getwd())
dir(outDir)

## [1] "inputNets.txt" "log.txt"       "rng1"          "rng2"

Look into one of the split directories. predictionResults.txt contains blind test prediction labels for that split. Each class has its own directory with corresponding cross-validation results.

dir(sprintf("%s/rng1",outDir))

## [1] "ids.txt"               "predictionResults.txt" "SURVIVENO"            
## [4] "SURVIVEYES"

pred <- read.delim(sprintf("%s/rng1/predictionResults.txt",outDir),h=T,as.is=T)
head(pred)

##             ID age grade     stage STATUS_INT    STATUS TT_STATUS
## 1 TCGA-B0-4688  46    G4  Stage IV          0 SURVIVENO      TEST
## 2 TCGA-B0-4693  72    G4 Stage III          0 SURVIVENO      TEST
## 3 TCGA-B0-4696  58    G3 Stage III          0 SURVIVENO      TEST
## 4 TCGA-B0-4712  76    G3  Stage IV          0 SURVIVENO      TEST
## 5 TCGA-B0-4816  49    G3  Stage II          0 SURVIVENO      TEST
## 6 TCGA-B0-4817  81    G3 Stage III          0 SURVIVENO      TEST
##   SURVIVEYES_SCORE SURVIVENO_SCORE PRED_CLASS
## 1       0.01162791       0.9893617  SURVIVENO
## 2       0.33720930       0.7127660  SURVIVENO
## 3       0.05813953       0.7021277  SURVIVENO
## 4       0.09302326       0.9787234  SURVIVENO
## 5       0.52325581       0.5425532  SURVIVENO
## 6       0.27906977       0.7659574  SURVIVENO

The cross-validation results are in the <className>/GM_results/ directory. The set of network scores for the split are in <className>/GM_results/<className>_pathway_CV_score.txt.

dir(sprintf("%s/rng1/SURVIVEYES",outDir))

## [1] "GM_results"                               
## [2] "ids.txt"                                  
## [3] "SURVIVEYES_query"                         
## [4] "SURVIVEYES_query-results.report.txt"      
## [5] "SURVIVEYES_query-results.report.txt.NRANK"
## [6] "SURVIVEYES_query-results.report.txt.PRANK"
## [7] "SURVIVEYES_query.log"                     
## [8] "tmp"

sc <- read.delim(sprintf("%s/rng1/SURVIVEYES/GM_results/SURVIVEYES_pathway_CV_score.txt",outDir),
   sep="\t",h=T,as.is=T)
head(sc)

##                                                        name score
## 1                                                stage_cont     2
## 2                                                grade_cont     2
## 3                                                  age_cont     2
## 4                              RETINOL_BIOSYNTHESIS.profile     1
## 5 GUANOSINE_NUCLEOTIDES__I_DE_NOVO__I__BIOSYNTHESIS.profile     1
## 6    MUCIN_CORE_1_AND_CORE_2__I_O__I_-GLYCOSYLATION.profile     1

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] netDx.examples_0.1   netDx_0.94           RColorBrewer_1.1-2  
##  [4] pracma_2.0.7         ROCR_1.0-7           gplots_3.0.1        
##  [7] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3  IRanges_2.8.2       
## [10] S4Vectors_0.12.2     BiocGenerics_0.20.0  combinat_0.0-8      
## [13] doParallel_1.0.10    iterators_1.0.8      foreach_1.4.3       
## [16] bigmemory_4.5.19     bigmemory.sri_0.1.3 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12       compiler_3.3.1     plyr_1.8.4        
##  [4] XVector_0.14.1     bitops_1.0-6       tools_3.3.1       
##  [7] zlibbioc_1.20.0    digest_0.6.12      tibble_1.3.3      
## [10] evaluate_0.10.1    gtable_0.2.0       pkgconfig_2.0.1   
## [13] rlang_0.1.2        igraph_1.1.2       yaml_2.1.14       
## [16] httr_1.3.1         stringr_1.2.0      knitr_1.17        
## [19] gtools_3.5.0       caTools_1.17.1     rprojroot_1.2     
## [22] grid_3.3.1         r2cytoscape_0.0.3  R6_2.2.2          
## [25] XML_3.98-1.9       rmarkdown_1.6      RJSONIO_1.3-0     
## [28] gdata_2.18.0       reshape2_1.4.2     ggplot2_2.2.1     
## [31] magrittr_1.5       backports_1.1.0    scales_0.4.1      
## [34] codetools_0.2-15   htmltools_0.3.6    colorspace_1.3-2  
## [37] quadprog_1.5-5     KernSmooth_2.23-15 stringi_1.1.5     
## [40] lazyeval_0.2.0     munsell_0.4.3      RCurl_1.95-4.8

sessionInfo

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] netDx.examples_0.1   netDx_0.94           RColorBrewer_1.1-2  
##  [4] pracma_2.0.7         ROCR_1.0-7           gplots_3.0.1        
##  [7] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3  IRanges_2.8.2       
## [10] S4Vectors_0.12.2     BiocGenerics_0.20.0  combinat_0.0-8      
## [13] doParallel_1.0.10    iterators_1.0.8      foreach_1.4.3       
## [16] bigmemory_4.5.19     bigmemory.sri_0.1.3 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12       compiler_3.3.1     plyr_1.8.4        
##  [4] XVector_0.14.1     bitops_1.0-6       tools_3.3.1       
##  [7] zlibbioc_1.20.0    digest_0.6.12      tibble_1.3.3      
## [10] evaluate_0.10.1    gtable_0.2.0       pkgconfig_2.0.1   
## [13] rlang_0.1.2        igraph_1.1.2       yaml_2.1.14       
## [16] httr_1.3.1         stringr_1.2.0      knitr_1.17        
## [19] gtools_3.5.0       caTools_1.17.1     rprojroot_1.2     
## [22] grid_3.3.1         r2cytoscape_0.0.3  R6_2.2.2          
## [25] XML_3.98-1.9       rmarkdown_1.6      RJSONIO_1.3-0     
## [28] gdata_2.18.0       reshape2_1.4.2     ggplot2_2.2.1     
## [31] magrittr_1.5       backports_1.1.0    scales_0.4.1      
## [34] codetools_0.2-15   htmltools_0.3.6    colorspace_1.3-2  
## [37] quadprog_1.5-5     KernSmooth_2.23-15 stringi_1.1.5     
## [40] lazyeval_0.2.0     munsell_0.4.3      RCurl_1.95-4.8

References

Yuan, Y. et al. (2014) Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol 32, 644-52.
The Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43-9.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tutorial: Nested Cross Validation with Multiple Datatypes

TL;DR

Table of Contents

Introduction

Setup

Data

Design custom patient similarity networks (features)

groupList

Define patient similarity for each network

Run nested cross-validation

Examine results directory

sessionInfo

References

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally