While I am sharpening my machine learning skills and looking for non gene expression data from plant sciences, one recent article caught my attention. Authors were attempting to use metabolite data from multi-year, multi-varietal trail conducted with both conventional and Organic farming systems of wheat to predict farming system. It is an excellent research article with nice hypothesis, systematic data collection and analysis. Please refer to the article
Authors concluded article with nice suggestions for further analysis. I thought I could try my machine learning skills and apply Random Forest (rf) modeling with some feature engineering (Authors have done rf modeling already, but said the performance of model was only slightly better than the SVM and not many model tuning parameters were tested.)
Summary points from Supervised Learning/Classification perspective: 1. SVMs trained on entire data set (all years, all culitvars, both treatments) gave an accuracy of 0.9032 at p-value of 1.486e-11) 2. Better accuracies could be obtained while using the subsets of data. Ex. 2007 year data gave an accuracy of 0.9677 p-value of 3.746e-08. 3. Reducing the feature set to verified biological compounds minimizes the risk of systematic errors through compounds minimizes risk of systematic errors through background noise.
One thing I feel would make a big impact in biomarker discovery: Location would play siginificant component as trait is a function of G X E [Genotype by Environment]. So I would love to use the location data to make the biomarkers identified more robust is available. (However for now i am assuming that, all the trials were done at the same location across years.)
Set R environment and read files
setwd("C:/Users/rduvv/Desktop/Ind_Data/wheat")
library(caret)
library(ggplot2)
library(reshape2)
owm <- read.csv(file = "owm_m.csv", header = T)
Let us quickly explore the data:
dim(owm)
## [1] 313 40
colnames(owm)
## [1] "Growth" "Variety"
## [3] "Year" "X2.Methylcitrate"
## [5] "X2.Phospho.D.glycerate" "X4.Aminobutanoate"
## [7] "Adenosine" "Citrate"
## [9] "D.Gluconic.acid" "D.Glucono.1.5.lactone"
## [11] "Ethanolamine" "Fructose"
## [13] "Fumarate" "Glucose"
## [15] "Glycine" "Inositol.1.phosphate"
## [17] "L.Alanine" "L.Arginine"
## [19] "L.Asparagine" "L.Aspartate"
## [21] "L.Citrulline" "L.Isoleucine"
## [23] "L.Leucine" "L.Methionine"
## [25] "L.Phenylalanine" "L.Proline"
## [27] "L.Serine" "L.Threonine"
## [29] "L.Tryptophan" "L.Valine"
## [31] "Lysine" "Malate"
## [33] "Pantothenate" "Ribitol"
## [35] "Succinate" "Sucrose"
## [37] "alpha.alpha.Trehalose" "beta.Alanine"
## [39] "myo.Inositol" "trans.4.Hydroxy.L.proline"
table(owm$Year)
##
## y2007 y2009 y2010
## 160 16 137
table(owm$Variety)
##
## Antonius Caphorn CCP DJ MC2 Probus RdB Runal
## 31 31 29 14 15 31 31 45
## Sandomir Scaro Titlis
## 29 27 30
table(owm$Growth)
##
## conventional organic
## 156 157
I am going to use the caret package as authors did for modeling
One of the key dimentionality reduction method available in Caret package is near zero variance and zero variance. Let us use this to find if there are any features with zero (or) near zero variance.
nzv <- nearZeroVar(owm [,4:40], saveMetrics = T)
table(nzv$nzv)
##
## FALSE
## 37
table(nzv$zeroVar)
##
## FALSE
## 37
As you could see there are no features with zero Variance or Ner zero variance (as all the 37 features are false for zero and nearZero Variance)
Now let us look for highly correlated features and eliminate them.
corOWM <- cor(owm[,4:40]) # find correlations between the features 4 through 40
summary(corOWM[upper.tri(corOWM)]) # Look for the correlations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3207 0.3298 0.5216 0.5076 0.7034 0.9925
It is clear that there are features that are highly correlated as indicated by Max correlation i.e 0.9925. Now I want to use the findCorrelations funciton to find the colIDs of the features and eliminate them from the data set.
highcor <- findCorrelation(corOWM, cutoff = 0.75)
sort(highcor)
## [1] 1 3 8 11 12 14 15 19 20 22 23 24 25 27 30 32 35 36 37
By carefully looking through the above file I will remove the columns from my “owm_m.csv” and save as “owm_cor_rem.csv”. Note: This is because I could not figure out how to merge the features and response. I will work on it later. For now this will work.
Now read the file into R and store in a variable called owm.data
owm.data <- read.csv(file = "owm_cor_rem.csv", header = T)
dim(owm.data)
## [1] 313 24
Now you could see that we have eliminated Sixteen features that are highly correlated. Let us do the modeling with this enhanced set of features. Removal of those highly correlated features will reduce noise in model and might improve performance.
Data Partition
set.seed(2)
owm.data.inTrain <- createDataPartition(y = owm.data$Growth, p = 0.8, list = F)
owm.data.train <- owm.data[owm.data.inTrain,]
owm.data.test <- owm.data[-owm.data.inTrain,]
dim(owm.data.train)
## [1] 251 24
dim(owm.data.test)
## [1] 62 24
Just check how the data partitioning worked.Training and testing data sets has 251 and 62 observations respectively.
table(owm.data.train$Growth)
##
## conventional organic
## 125 126
table(owm.data.test$Growth)
##
## conventional organic
## 31 31
Nice. Now let us begin modeling. In the research article, LGOCV is used and I would like to use the repeatedcv. Besides this I want to use preProc function.
# define model controls
ctrl.rf <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel = TRUE)
set.seed(351)
#Build the model
owm.rf <- train(owm.data.train$Growth~., data = owm.data.train, method = "rf",tuneLength = 15, trControl = ctrl.rf, preProc = c("center", "scale"))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
owm.rf # check the model
## Random Forest
##
## 251 samples
## 23 predictor
## 2 classes: 'conventional', 'organic'
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 225, 225, 226, 227, 226, 225, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec ROC SD Sens SD
## 2 0.9414544 0.8940171 0.8863248 0.04076668 0.07222488
## 4 0.9487673 0.9123932 0.8916667 0.03720240 0.06984316
## 6 0.9493740 0.9153846 0.8914530 0.03684804 0.06518537
## 8 0.9522929 0.9230769 0.8863248 0.03547748 0.06718917
## 10 0.9491515 0.9205128 0.8779915 0.03721082 0.06241365
## 13 0.9482406 0.9228632 0.8858974 0.03904680 0.06353603
## 15 0.9480290 0.9200855 0.8777778 0.04103770 0.06560428
## 17 0.9452683 0.9175214 0.8696581 0.04272970 0.06697231
## 19 0.9460970 0.9096154 0.8666667 0.03911553 0.07363573
## 21 0.9434706 0.9147436 0.8777778 0.04341363 0.06871451
## 24 0.9422317 0.9094017 0.8696581 0.04633356 0.07456094
## 26 0.9413578 0.9040598 0.8613248 0.04749999 0.07334910
## 28 0.9392101 0.9036325 0.8566239 0.04914945 0.08040815
## 30 0.9384937 0.9091880 0.8670940 0.05028611 0.07173966
## 33 0.9405462 0.9038462 0.8617521 0.04888764 0.08537425
## Spec SD
## 0.09733191
## 0.08233865
## 0.09953435
## 0.10011631
## 0.09659468
## 0.10370989
## 0.10852397
## 0.11432777
## 0.10666004
## 0.09759360
## 0.09509934
## 0.11179812
## 0.10145962
## 0.10779033
## 0.10077579
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
plot(owm.rf) # plot the model
x<-varImp(owm.rf) # Ccheck Variable importance
plot(x, top = 20) # plot Variable importance
The model #8 is the best model at ROC of 0.95229 with Sens of 0.9230 and Spec of 0.8863.
And the metabolites ** myo.Inositol, L.Tryptophan, Phospho-D-Glycerate, Aminobutanoate, LArginine,** seems to contribute to the more than 60 % of model performance.
Now validate the model with test data set.
owm.rf.predicted <- predict (owm.rf, owm.data.test)
confusionMatrix(owm.data.test$Growth, owm.rf.predicted)
## Confusion Matrix and Statistics
##
## Reference
## Prediction conventional organic
## conventional 29 2
## organic 1 30
##
## Accuracy : 0.9516
## 95% CI : (0.865, 0.9899)
## No Information Rate : 0.5161
## P-Value [Acc > NIR] : 5.105e-14
##
## Kappa : 0.9032
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9667
## Specificity : 0.9375
## Pos Pred Value : 0.9355
## Neg Pred Value : 0.9677
## Prevalence : 0.4839
## Detection Rate : 0.4677
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9521
##
## 'Positive' Class : conventional
##
Cool. Note the Model accuracy 0.9516 and p- value of 5.105e-14. Check the ROC curve.
#caret's ROC curve procedure
library(pROC)
roc0 <- roc(owm.data.test$Growth, predict (owm.rf, owm.data.test, type = "prob")[,1], levels = rev(levels(owm.data.test$Growth)))
plot(roc0, print.thres = c(0.5), type = "S", print.thres.pattern = "%.3f(Spec = %.2f, Sens = %.2f)", print.thres.cex = .8, legacy.axes =T, col = '#0080ff', grid = TRUE)
##
## Call:
## roc.default(response = owm.data.test$Growth, predictor = predict(owm.rf, owm.data.test, type = "prob")[, 1], levels = rev(levels(owm.data.test$Growth)))
##
## Data: predict(owm.rf, owm.data.test, type = "prob")[, 1] in 31 controls (owm.data.test$Growth organic) < 31 cases (owm.data.test$Growth conventional).
## Area under the curve: 0.9698
Nice to see the model perfomance with 0.9516 Accuracy over 0.9667 Sensitivity and 0.9375 Sensitivity.This looks randomForest could be used to refine model and our effrots of feature selection have paid off with slight increase in model accuracy to 0.9516 (as compared to 0.9032 from paper-across years).
So the top three Meabolites (80% VarImp) that could distinguish Farming system in the Varieties tested in this study are :
Myo.Inositol L.Tryptophan Phospho.D.Glycerate
melted <- melt(owm.data)
head(melted)
## Growth Variety Year variable value
## 1 organic Antonius y2007 X2.Methylcitrate 0.01343727
## 2 organic Antonius y2007 X2.Methylcitrate 0.01374540
## 3 organic Antonius y2007 X2.Methylcitrate 0.01023164
## 4 organic Antonius y2007 X2.Methylcitrate 0.01012037
## 5 organic Antonius y2007 X2.Methylcitrate 0.01206357
## 6 organic Antonius y2007 X2.Methylcitrate 0.01116994
small.melted <- subset(melted, melted$variable == 'myo.Inositol' | melted$variable == 'L.Tryptophan')
qplot(small.melted$variable, y = small.melted$value, data = small.melted, color = Growth, facets = ~Variety, geom = "boxplot")
Look at the box plots for individual metabolite summary statistic except for few cultivars these two metabolites can different between Organic and conventional farming systems. Hence I would recommend use of Myo-Inositol and L.Tryptophan as bio markers. Well these are also reported in research article. Only improvement is model accuracy.
There are still some more parameters that could be fine tuned. I will post them at a later date if time permits.
But one information I feel that could significantly enhance the Biomarker discovery could be the location information.
Disclaimer: Observations made here are purely based on my analysis and understanding of data. Any applications derived based on this should be carefully validated experimentally.
Thanks for reading. Your suggestions and feedback are more than welcome.
No comments:
Post a Comment