Data Mining, Analytics and Machine Learning Experiments

1. Data source

This is an attempt to model the change points in a time series data of cotton production and yields across three countries.

1. Data source

Data is sourced from the PDS and filtered for required columns (Market_Year, Country, Country_Name, Yield and harvested area)of data for required rows (where country_Names are labelled as China, India, United States). Set R working environment pointing to the folder where you have stored data.

Load required r packages and raw data.

library(reshape2)
library(ggfortify)
library(gridExtra)
library(tfplot)
library(changepoint)
library(forecast)
# read raw data
rawYield <- read.csv("threeCountryYield.csv", header = T)
rawArea <- read.csv("threeCountryHarvested.csv", header = T)

2. Are there signifiant differences.

Now we have data as two different data frames viz. rawYield and rawArea. Let us quickly check if there are any overall differences between countries with respect to harvested area and the yield. For this we could use the box plots.

par(mfrow = c(1,2))
boxplot(rawArea$Value.1000.HA.~rawArea$Country_Name, notch = F, col = "brown", main = "Harvested area (Hectares X 1000)")
boxplot(rawYield$Value.KG.HA. ~rawYield$Country_Name, notch = F, col = "green", main = "Yield (kg / Hectare)")

plot of chunk unnamed-chunk-3

On an average Indian farmers are cultivating 8,512,386 hectares with an average yield of 282.456 kg per hectare where as 691.59, 810.98 kg/hectare from 4,673,965 and 5,028,193 hectares in USA and China respectively. These numbers indicate the average values across countries. It is worth checking the yearly trends. We could use time series plots to observe the same. To do so first we must convert our data frames into time series objects.However, there is a little pre-processing step that needs to be done and we will do it with the help of dcast() function of reshape2 package.

# make data ts object compatible
area <- dcast(rawArea, Market_Year ~ Country_Name)
yield <- dcast(rawYield, Market_Year ~ Country_Name)
head(yield, n = 5)

##   Market_Year China India United States
## 1        1960   204   134           500
## 2        1961   208   112           492
## 3        1962   212   141           512
## 4        1963   272   134           579
## 5        1964   335   123           580

##3. Time series analysis plots Now the data is in wide format with proper date (Year indexes). using the ts() function now prepare time series objects.

#prepare time series objects
tsareaData <- ts(area, start = 1960, end = 2016, frequency = 1)
tsyieldData <- ts(yield, start = 1960, end = 2016, frequency = 1)

Cool we have our time series objects ready for analysis. We can begin our analysis with the plotting. Here I want to use the autoplot() function from ggfortify. ggfortify offers a single plotting interface and plots in a unified style using 'ggplot2'.

p1 <- autoplot(tsareaData[,2:4],ts.geom='line', stacked = F, facets = F, xlab = "year", ylab = "Harvested Area (X1000 Hectares)", main = "Changes in harvested area between 1960-2016")
p2 <- autoplot(tsyieldData[,2:4],ts.geom='line', stacked = F, facets = F, xlab = "year", ylab = "Yield (Kg/Ha)", main = "Changes in yield between 1960-2016")
grid.arrange(p1,p2, ncol = 2)

plot of chunk unnamed-chunk-5

One thing surprising is in both china and USA the harvested area seems to be going down where as in India harvested area is going up. This could be the sign of more and more framers are turning towards cotton cultivation. So let us examine changes in cultivated area

# percent change
p3<-autoplot(percentChange(tsyieldData[,2:4], lag = 30), ylab = "Percetn change in Yield", main = "Percentage change in yield over 30 year period")
p4<-autoplot(percentChange(tsareaData[,2:4], lag = 30), ylab = "Percent change in Harvested Area", main = "Percentage change in harvested area over 30 year period")
grid.arrange(p3,p4, ncol = 2)

plot of chunk unnamed-chunk-6

As it is clear in both China and USA though yield gains are not as growing much, the area under cultivation is going down as much as 25 percent.

4. Change Point Analysis

Let us look at the change points. Change point analysis typically used to find the changes in “mean”, “variance” or “both” in a time series data. It is implemented in changepoint. Refer to article : PDF for introduction to Change point analysis methods.

Let us first look at the changes in mean of yields. I am using the “Binary Segmentation” method. other methods such as PELT and various penalties could be explored as needed.

p5<-autoplot(cpt.mean(tsyieldData[,3], method = "BinSeg", Q = 4), xlab = "Year", ylab = "Yield (Kg/ha)", main = "Change points in India's Yiled (1960-2016)")
p6<-autoplot(cpt.mean(tsyieldData[,4], method = "BinSeg", Q = 4), xlab = "Year", ylab = "Yield (Kg/ha)", main = "Change points in USA's Yiled (1960-2016)" )
p7<-autoplot(cpt.mean(tsyieldData[,2], method = "BinSeg", Q = 4), xlab = "Year", ylab = "Yield (Kg/ha)", main = "Change points in Chaina's Yiled (1960-2016)")

p8<-autoplot(cpt.mean(tsareaData[,3], method = "BinSeg", Q = 4), xlab = "Year", ylab = "(X1000) Hectares", main = "Change points in India's cultivated area(1960-2016)")
p9<-autoplot(cpt.mean(tsareaData[,4], method = "BinSeg", Q = 4), xlab = "Year", ylab = "(X1000) Hectares", main = "Change points in USA's cultivated area(1960-2016)" )
p10<-autoplot(cpt.mean(tsareaData[,2], method = "BinSeg", Q = 4), xlab = "Year", ylab = "(X1000) Hectares", main = "Change points in Chaina's cultivated area(1960-2016)")

grid.arrange(p5,p8,p6, p9, p7, p10, ncol = 2, nrow = 3)

plot of chunk unnamed-chunk-7

The change points observed in India could possibly due to the insect protection seed technologies released during 2002 and 2006.

5. Yield potential or predictions/forecasts

Let us see the yield potential of India by using time series forecasts. I have extracted the yield values for India from yield table prepared a time series object with a frequency of 1

india <- yield$India # extracting yield values
tsIndia <- ts(india, start = 1960, end = 2016, frequency = 1)
autoplot(tsIndia)

plot of chunk unnamed-chunk-8

From the above time series plot it is evident that the Indian yield are displaying constant up ward trend with out any seasonality. So i will use the “holts” exponential smoothing on the data to make predictions.

hwfit <- HoltWinters(tsIndia, gamma = FALSE)
hwfit

## Holt-Winters exponential smoothing with trend and without seasonal component.
## 
## Call:
## HoltWinters(x = tsIndia, gamma = FALSE)
## 
## Smoothing parameters:
##  alpha: 0.8798596
##  beta : 0.1233282
##  gamma: FALSE
## 
## Coefficients:
##         [,1]
## a 514.497799
## b   4.664536

autoplot(hwfit$fitted)

plot of chunk unnamed-chunk-9

Out model looks it is able to predict the values closely to the actual observations. Now let us use the forecast package to predict yield potential of the cotton farmers in India for the next 10 years.

hwForecast <- forecast.HoltWinters(hwfit, h = 10)
hwForecast

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2017       519.1623 476.8412 561.4834 454.4378 583.8869
## 2018       523.8269 464.3228 583.3309 432.8232 614.8305
## 2019       528.4914 453.0218 603.9611 413.0705 643.9123
## 2020       533.1559 442.0623 624.2496 393.8402 672.4717
## 2021       537.8205 431.0962 644.5448 374.5998 701.0412
## 2022       542.4850 419.9538 665.0162 355.0897 729.8804
## 2023       547.1496 408.5432 685.7559 335.1695 759.1297
## 2024       551.8141 396.8118 706.8164 314.7586 788.8696
## 2025       556.4786 384.7287 728.2285 293.8098 819.1475
## 2026       561.1432 372.2758 750.0105 272.2954 849.9909

autoplot(hwForecast, ylab = "fitted")

plot of chunk unnamed-chunk-10

Even if the new area is not brought under cultivation and current technologies will work as expected, it seems out farmers should be able to harvest any where between 476.84 (lo of 80% confidence) and 583.88 (high of the 95 % confidence interval). But before making conclusions, let us check our predictions.

6. Model validation

Now that we got the predictions, let us do validation of our forecast.This can be done by examining the autocorrelaiton plots, Box-Ljung test of residuals or simply by observing the distribution of the residuals.

par(mfrow = c(2,2), mai = c(.5, .8, .5, .5))
plot(hwForecast, ylab = "Yield (kg/Hectare)", main = "Forecasted yields")
acf(hwForecast$residuals, main = "Residual Autocorrelation plot")
plot(hwForecast$residuals, ylab = "Residuals", main = "Residual varience")
plot(density(hwForecast$residuals), main = "Density")

plot of chunk unnamed-chunk-11

All the model evaluations are looking good. there are no auto correlations (ACF plot) and residuals seems to follow normal distributions (density plot).

I really hope, we will be able to bring the technologies other than the existing insect control and data driven agronomic recommendations available to the Indian farmers and they will adopt them in order to reach the deserved yield potential.

Thanks for reading.

Data Mining, Analytics and Machine Learning Experiments

Monday, 22 August 2016

Price gamble at Cotton seed trading markets : Interactive visualizations with plotly-R

Thursday, 14 July 2016

R Tutorial on Changepoint detection : Impact of Bt technology adoption - Sowing more - Harvesting Less

1. Data source

2. Are there signifiant differences.

4. Change Point Analysis

5. Yield potential or predictions/forecasts

6. Model validation

Total Pageviews