Data Mining, Analytics and Machine Learning Experiments: Metabolite Biomarkers for Wheat

While I am sharpening my machine learning skills and looking for non gene expression data from plant sciences, one recent article caught my attention. Authors were attempting to use metabolite data from multi-year, multi-varietal trail conducted with both conventional and Organic farming systems of wheat to predict farming system. It is an excellent research article with nice hypothesis, systematic data collection and analysis. Please refer to the article

Authors concluded article with nice suggestions for further analysis. I thought I could try my machine learning skills and apply Random Forest (rf) modeling with some feature engineering (Authors have done rf modeling already, but said the performance of model was only slightly better than the SVM and not many model tuning parameters were tested.)

Summary points from Supervised Learning/Classification perspective: 1. SVMs trained on entire data set (all years, all culitvars, both treatments) gave an accuracy of 0.9032 at p-value of 1.486e-11) 2. Better accuracies could be obtained while using the subsets of data. Ex. 2007 year data gave an accuracy of 0.9677 p-value of 3.746e-08. 3. Reducing the feature set to verified biological compounds minimizes the risk of systematic errors through compounds minimizes risk of systematic errors through background noise.

One thing I feel would make a big impact in biomarker discovery: Location would play siginificant component as trait is a function of G X E [Genotype by Environment]. So I would love to use the location data to make the biomarkers identified more robust is available. (However for now i am assuming that, all the trials were done at the same location across years.)

Set R environment and read files

setwd("C:/Users/rduvv/Desktop/Ind_Data/wheat")
library(caret)
library(ggplot2)
library(reshape2)
owm <- read.csv(file = "owm_m.csv", header = T)

Let us quickly explore the data:

dim(owm)

## [1] 313  40

colnames(owm)

##  [1] "Growth"                    "Variety"                  
##  [3] "Year"                      "X2.Methylcitrate"         
##  [5] "X2.Phospho.D.glycerate"    "X4.Aminobutanoate"        
##  [7] "Adenosine"                 "Citrate"                  
##  [9] "D.Gluconic.acid"           "D.Glucono.1.5.lactone"    
## [11] "Ethanolamine"              "Fructose"                 
## [13] "Fumarate"                  "Glucose"                  
## [15] "Glycine"                   "Inositol.1.phosphate"     
## [17] "L.Alanine"                 "L.Arginine"               
## [19] "L.Asparagine"              "L.Aspartate"              
## [21] "L.Citrulline"              "L.Isoleucine"             
## [23] "L.Leucine"                 "L.Methionine"             
## [25] "L.Phenylalanine"           "L.Proline"                
## [27] "L.Serine"                  "L.Threonine"              
## [29] "L.Tryptophan"              "L.Valine"                 
## [31] "Lysine"                    "Malate"                   
## [33] "Pantothenate"              "Ribitol"                  
## [35] "Succinate"                 "Sucrose"                  
## [37] "alpha.alpha.Trehalose"     "beta.Alanine"             
## [39] "myo.Inositol"              "trans.4.Hydroxy.L.proline"

table(owm$Year)

## 
## y2007 y2009 y2010 
##   160    16   137

table(owm$Variety)

## 
## Antonius  Caphorn      CCP       DJ      MC2   Probus      RdB    Runal 
##       31       31       29       14       15       31       31       45 
## Sandomir    Scaro   Titlis 
##       29       27       30

table(owm$Growth)

## 
## conventional      organic 
##          156          157

I am going to use the caret package as authors did for modeling

One of the key dimentionality reduction method available in Caret package is near zero variance and zero variance. Let us use this to find if there are any features with zero (or) near zero variance.

nzv <- nearZeroVar(owm [,4:40], saveMetrics = T)
table(nzv$nzv)

## 
## FALSE 
##    37

table(nzv$zeroVar)

## 
## FALSE 
##    37

As you could see there are no features with zero Variance or Ner zero variance (as all the 37 features are false for zero and nearZero Variance)

Now let us look for highly correlated features and eliminate them.

corOWM <- cor(owm[,4:40]) # find correlations between the features 4 through 40
summary(corOWM[upper.tri(corOWM)]) # Look for the correlations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3207  0.3298  0.5216  0.5076  0.7034  0.9925

It is clear that there are features that are highly correlated as indicated by Max correlation i.e 0.9925. Now I want to use the findCorrelations funciton to find the colIDs of the features and eliminate them from the data set.

highcor <- findCorrelation(corOWM, cutoff = 0.75)
sort(highcor)

##  [1]  1  3  8 11 12 14 15 19 20 22 23 24 25 27 30 32 35 36 37

By carefully looking through the above file I will remove the columns from my “owm_m.csv” and save as “owm_cor_rem.csv”. Note: This is because I could not figure out how to merge the features and response. I will work on it later. For now this will work. Now read the file into R and store in a variable called owm.data

owm.data <- read.csv(file = "owm_cor_rem.csv", header = T)
dim(owm.data)

## [1] 313  24

Now you could see that we have eliminated Sixteen features that are highly correlated. Let us do the modeling with this enhanced set of features. Removal of those highly correlated features will reduce noise in model and might improve performance.

Data Partition

set.seed(2)
owm.data.inTrain <- createDataPartition(y = owm.data$Growth, p = 0.8, list = F)
owm.data.train <- owm.data[owm.data.inTrain,]
owm.data.test <- owm.data[-owm.data.inTrain,]
dim(owm.data.train)

## [1] 251  24

dim(owm.data.test)

## [1] 62 24

Just check how the data partitioning worked.Training and testing data sets has 251 and 62 observations respectively.

table(owm.data.train$Growth)

## 
## conventional      organic 
##          125          126

table(owm.data.test$Growth)

## 
## conventional      organic 
##           31           31

Nice. Now let us begin modeling. In the research article, LGOCV is used and I would like to use the repeatedcv. Besides this I want to use preProc function.

# define model controls
ctrl.rf <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel = TRUE)
set.seed(351)
#Build the model
owm.rf <- train(owm.data.train$Growth~., data = owm.data.train, method = "rf",tuneLength = 15, trControl = ctrl.rf, preProc = c("center", "scale"))

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.

owm.rf # check  the model

## Random Forest 
## 
## 251 samples
##  23 predictor
##   2 classes: 'conventional', 'organic' 
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 225, 225, 226, 227, 226, 225, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec       ROC SD      Sens SD   
##    2    0.9414544  0.8940171  0.8863248  0.04076668  0.07222488
##    4    0.9487673  0.9123932  0.8916667  0.03720240  0.06984316
##    6    0.9493740  0.9153846  0.8914530  0.03684804  0.06518537
##    8    0.9522929  0.9230769  0.8863248  0.03547748  0.06718917
##   10    0.9491515  0.9205128  0.8779915  0.03721082  0.06241365
##   13    0.9482406  0.9228632  0.8858974  0.03904680  0.06353603
##   15    0.9480290  0.9200855  0.8777778  0.04103770  0.06560428
##   17    0.9452683  0.9175214  0.8696581  0.04272970  0.06697231
##   19    0.9460970  0.9096154  0.8666667  0.03911553  0.07363573
##   21    0.9434706  0.9147436  0.8777778  0.04341363  0.06871451
##   24    0.9422317  0.9094017  0.8696581  0.04633356  0.07456094
##   26    0.9413578  0.9040598  0.8613248  0.04749999  0.07334910
##   28    0.9392101  0.9036325  0.8566239  0.04914945  0.08040815
##   30    0.9384937  0.9091880  0.8670940  0.05028611  0.07173966
##   33    0.9405462  0.9038462  0.8617521  0.04888764  0.08537425
##   Spec SD   
##   0.09733191
##   0.08233865
##   0.09953435
##   0.10011631
##   0.09659468
##   0.10370989
##   0.10852397
##   0.11432777
##   0.10666004
##   0.09759360
##   0.09509934
##   0.11179812
##   0.10145962
##   0.10779033
##   0.10077579
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 8.

plot(owm.rf) # plot the model

plot of chunk unnamed-chunk-9

x<-varImp(owm.rf) # Ccheck Variable importance
plot(x, top = 20) # plot Variable importance

plot of chunk unnamed-chunk-9 The model #8 is the best model at ROC of 0.95229 with Sens of 0.9230 and Spec of 0.8863.

And the metabolites ** myo.Inositol, L.Tryptophan, Phospho-D-Glycerate, Aminobutanoate, LArginine,** seems to contribute to the more than 60 % of model performance.

Now validate the model with test data set.

owm.rf.predicted <- predict (owm.rf, owm.data.test)
confusionMatrix(owm.data.test$Growth, owm.rf.predicted)

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     conventional organic
##   conventional           29       2
##   organic                 1      30
##                                          
##                Accuracy : 0.9516         
##                  95% CI : (0.865, 0.9899)
##     No Information Rate : 0.5161         
##     P-Value [Acc > NIR] : 5.105e-14      
##                                          
##                   Kappa : 0.9032         
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9667         
##             Specificity : 0.9375         
##          Pos Pred Value : 0.9355         
##          Neg Pred Value : 0.9677         
##              Prevalence : 0.4839         
##          Detection Rate : 0.4677         
##    Detection Prevalence : 0.5000         
##       Balanced Accuracy : 0.9521         
##                                          
##        'Positive' Class : conventional   
##

Cool. Note the Model accuracy 0.9516 and p- value of 5.105e-14. Check the ROC curve.

#caret's ROC curve procedure
library(pROC)
roc0 <- roc(owm.data.test$Growth, predict (owm.rf, owm.data.test, type = "prob")[,1], levels = rev(levels(owm.data.test$Growth)))
plot(roc0, print.thres = c(0.5), type = "S", print.thres.pattern = "%.3f(Spec = %.2f, Sens = %.2f)", print.thres.cex = .8, legacy.axes =T, col = '#0080ff', grid = TRUE)

plot of chunk unnamed-chunk-11

## 
## Call:
## roc.default(response = owm.data.test$Growth, predictor = predict(owm.rf,     owm.data.test, type = "prob")[, 1], levels = rev(levels(owm.data.test$Growth)))
## 
## Data: predict(owm.rf, owm.data.test, type = "prob")[, 1] in 31 controls (owm.data.test$Growth organic) < 31 cases (owm.data.test$Growth conventional).
## Area under the curve: 0.9698

Nice to see the model perfomance with 0.9516 Accuracy over 0.9667 Sensitivity and 0.9375 Sensitivity.This looks randomForest could be used to refine model and our effrots of feature selection have paid off with slight increase in model accuracy to 0.9516 (as compared to 0.9032 from paper-across years).

So the top three Meabolites (80% VarImp) that could distinguish Farming system in the Varieties tested in this study are :

Myo.Inositol L.Tryptophan Phospho.D.Glycerate

melted <- melt(owm.data)
head(melted)

##    Growth  Variety  Year         variable      value
## 1 organic Antonius y2007 X2.Methylcitrate 0.01343727
## 2 organic Antonius y2007 X2.Methylcitrate 0.01374540
## 3 organic Antonius y2007 X2.Methylcitrate 0.01023164
## 4 organic Antonius y2007 X2.Methylcitrate 0.01012037
## 5 organic Antonius y2007 X2.Methylcitrate 0.01206357
## 6 organic Antonius y2007 X2.Methylcitrate 0.01116994

small.melted <- subset(melted, melted$variable == 'myo.Inositol' | melted$variable == 'L.Tryptophan')
qplot(small.melted$variable, y = small.melted$value, data = small.melted, color = Growth, facets = ~Variety, geom = "boxplot")

plot of chunk unnamed-chunk-12 Look at the box plots for individual metabolite summary statistic except for few cultivars these two metabolites can different between Organic and conventional farming systems. Hence I would recommend use of Myo-Inositol and L.Tryptophan as bio markers. Well these are also reported in research article. Only improvement is model accuracy.

There are still some more parameters that could be fine tuned. I will post them at a later date if time permits.

But one information I feel that could significantly enhance the Biomarker discovery could be the location information.

Disclaimer: Observations made here are purely based on my analysis and understanding of data. Any applications derived based on this should be carefully validated experimentally.

Thanks for reading. Your suggestions and feedback are more than welcome.

Data Mining, Analytics and Machine Learning Experiments

Wednesday, 27 January 2016

Metabolite Biomarkers for Wheat - Conventional Vs Organic farming systems

No comments:

Post a Comment

Total Pageviews