Introduction
When the monsoon season arrives, newspapers are flooded with predictions about the average rainfall expected during the upcoming monsoon. I was wondering how this is done. Through a little bit of net surfing I learned about Time Series Analysis (TSA). The prime reason we use TSA is to perform time series forecasting. It is primarily based on the assumption that we can predict/forecast future values based on past observations. The two approaches used in TSA are 1) the time domain approach and 2) the frequency domain approach. In this post we will limit ourselves to the time domain approach with univariate data (meaning we only have one set of observations). This approach focuses on predicting future values of a time series as a function of previous observations. Let us explore how this can be applied to predict rainfall in a given month based on the average monthly rainfall recorded between 1901 and 2013. The data required for predicting rainfall is obtained from the Indian Data repository.
Acquire data and transform
Set up your working environment and read the average monthly rainfall data, which is stored as a CSV file. Note the row.names option: I set this to the first column of my data frame. This is important because we need to convert the "monrain" object into a time series object for all subsequent steps.
library(forecast)
monrain <- read.csv(file = "monRain.csv", header = T, row.names = 1)
head(monrain, n = 5) # inspect the first five years of data
##       JAN  FEB  MAR  APR  MAY   JUN   JUL   AUG   SEP   OCT  NOV  DEC
## 1901 34.7 38.6 17.8 38.9 50.6 113.2 241.4 271.6 124.7  52.4 38.7  8.2
## 1902  7.4  4.2 19.0 44.1 48.8 111.7 284.9 201.0 200.2  62.5 29.4 25.2
## 1903 16.7  8.0 31.1 17.1 59.5 120.3 293.2 274.0 198.1 119.5 40.3 18.0
## 1904 14.9  9.7 31.4 33.7 73.8 165.5 260.3 207.7 130.8  69.8 11.2 16.4
## 1905 24.7 20.3 41.8 33.8 55.8  93.7 253.0 201.7 178.1  54.9  9.6 10.1
tail(monrain, n = 5)
##       JAN  FEB  MAR  APR  MAY   JUN   JUL   AUG   SEP   OCT  NOV  DEC
## 2009 12.0 12.0 14.2 25.1 56.0  85.7 280.7 192.5 139.4  71.4 53.7 11.1
## 2010  7.5 17.0 14.0 39.0 73.8 138.1 300.7 274.7 197.7  69.0 61.4 22.7
## 2011  6.8 25.8 22.4 41.1 53.1 183.5 246.0 284.9 186.9  38.1 20.1  7.6
## 2012 26.5 12.7 11.3 47.5 31.7 117.8 250.2 262.4 193.5  58.7 30.7 11.7
## 2013 11.3 40.1 15.7 30.4 57.8 219.8 310.0 254.7 152.7 129.4 14.0  6.7
There are several ways in which you could convert a regular data frame into a time series object; I am using the transpose option. To do so, you need to know when the first observation was taken and when the last one was recorded, along with the frequency of observations. In my data the first observation was taken in January 1901 and the last in December 2013, at a monthly frequency. With this information I will convert the data frame into a time series object using the ts() function from the R base package. Examine the ts object with summary() and other functions.
tsmonrain <- ts(as.vector(t(as.matrix(monrain))), start = c(1901, 1), end = c(2013, 12), frequency = 12)
str(tsmonrain) # to understand the structure of the tsmonrain data object
##  Time-Series [1:1356] from 1901 to 2014: 34.7 38.6 17.8 38.9 50.6 ...
summary(tsmonrain) # to obtain the summary of observations
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60   22.80   49.95   98.23  169.80  375.50
start(tsmonrain) # to explore the start of the observation time
## [1] 1901    1
end(tsmonrain) # to explore the end of observations
## [1] 2013   12
frequency(tsmonrain) # to explore the frequency of observations
## [1] 12
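As a quick optional sanity check on the imported series, the yearly totals can be computed directly from the ts object with aggregate(); this is not a required step for the analysis below.
# Collapse the monthly series to one total per year
annualrain <- aggregate(tsmonrain, nfrequency = 1, FUN = sum)
plot(annualrain, ylab = "Total rain fall (mm/year)", main = "Annual total rain fall, 1901-2013")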
Explore data for trend and seasonality
Like any other data analysis, the first and foremost thing to do in any TSA is to explore the data. However, in TSA you plot your data with a specific objective: to identify components such as trend, seasonality, and irregularities, if any. Why is it so important to find them? Well, if your data has any of those components it is said to be non-stationary, and such data may not be suitable for forecasting. In statistical terms, you need to stabilise the variance and mean of the data series.
Let us plot our data with time on the x axis and the observations (average monthly rainfall) on the y axis, and also add a trend line.
plot(tsmonrain, ylab= "Average rain fall (mm/Month)", main = "Monthly average rain fall from 1901-2013", type = "l" )
lines(lowess(tsmonrain), col = "blue")
From the above plot it appears that our data does not have any trend (neither upward nor downward) and looks stationary. That is good.
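If you prefer a formal check over an eyeball judgement, a unit-root test such as the Augmented Dickey-Fuller test can be run; the sketch below assumes the tseries package is installed, which is not otherwise used in this post.
library(tseries)
# ADF test: a small p-value supports stationarity
adf.test(tsmonrain)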
But from historic knowledge we know that monsoon rain in India is a seasonal phenomenon: there have to be periodic fluctuations within any given year. Let us plot a small window of the time series data and see if we can pick them up. We will use the window() function to subset the data for the chosen period.
plot(window(tsmonrain, start= 2000, end = 2002, frequency = 12), type = "b", col = "brown",ylab= "Average rain fall (mm/Month)", main = "Monthly average rain fall from 2000-2002")
Yes, indeed, we can see the fluctuation in the monthly rainfall. Let us explore this phenomenon a bit more: how is the average rainfall in a given month distributed across the years? We will use box plots.
boxplot(tsmonrain ~ cycle(tsmonrain), notch = T, col = "purple", xlab = "Calendar month", ylab = "Average rain fall (mm)", main = "Monthly distribution of rain between 1901-2013")
From the above plot it is evident that, over the past 100 years or so, we have received more rain during the second and third quarters, i.e. months 6, 7, 8 and 9, with a few outliers here and there. This tells us that there is a seasonal component to our data, so our data is not truly stationary.
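Another quick way to see the same pattern is monthplot() from the stats package, which draws one sub-series per calendar month together with its mean.
# One panel per calendar month; the horizontal bars mark the monthly means
monthplot(tsmonrain, ylab = "Average rain fall (mm)", main = "Month-wise sub-series, 1901-2013")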
Decompose data to make it stationary
To move ahead with TSA, we need to remove the seasonal component to make our data stationary. This process is often referred to as "decomposition". There are several ways to decompose your time series data, depending on the cause of the non-stationarity.
Let us first difference the data, using the diff() function, to eliminate any trend. I am only differencing the data at lags of 4 and 8 months, because with respect to seasonality the plot does not change much even when differencing at a lag of a single month or 12 months.
par(mfrow = c(2,1))
plot(diff(tsmonrain, 4), ylab = "lag-4 differenced rainfall", main = "Differencing at lag 4") # difference against the value 4 months earlier
plot(diff(tsmonrain, 8), ylab = "lag-8 differenced rainfall", main = "Differencing at lag 8") # difference against the value 8 months earlier
From the above plots it is evident that, irrespective of the lag used for differencing, it is hard to find a trend. So with respect to trend, the data is stationary.
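The forecast package also provides ndiffs() and nsdiffs(), which estimate how many ordinary and seasonal differences a series needs; the lines below are a quick cross-check rather than part of the original workflow.
ndiffs(tsmonrain)   # estimated number of ordinary differences required
nsdiffs(tsmonrain)  # estimated number of seasonal differences required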
But we are pretty sure, from the box plots, that our data has seasonality, so we must also decompose our data for seasonality. Let us first stabilise the variance using a log transformation.
plot(log10(tsmonrain))
We can combine diff() and log10() to address both the mean and the variance in our time series. The log transformation has to be applied before differencing: differences can be negative, and taking log10() of a negative value would only produce NaN warnings.
plot(diff(log10(tsmonrain)))
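Since the seasonal pattern repeats every 12 months, another common option is a seasonal difference at lag 12, i.e. comparing each month with the same month of the previous year; the line below is a small sketch of that idea.
# Seasonal differencing: subtract the value from the same month one year earlier
plot(diff(log10(tsmonrain), lag = 12), ylab = "seasonally differenced log10 rainfall")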
Alternatively, we could use Seasonal and Trend decomposition using Loess (stl()), available in the stats package that ships with base R, or decompose().
plot(stl(tsmonrain, s.window = "periodic"), main = "Decomposition of additive time series")
Have a look at this plot. The top panel shows the original data. The panel labelled "seasonal" shows the extracted seasonal component, and below that is the extracted trend component. The final panel is the remainder, which is what is often called white noise or the residuals.
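If you want to work with the individual components, they can be pulled straight out of the stl fit; a minimal sketch that builds a seasonally adjusted series by subtracting the seasonal component:
stlFit <- stl(tsmonrain, s.window = "periodic")
# Components are stored in stlFit$time.series: "seasonal", "trend", "remainder"
deseasoned <- tsmonrain - stlFit$time.series[, "seasonal"]
plot(deseasoned, ylab = "Average rain fall (mm/Month)", main = "Seasonally adjusted series")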
Model building
Now we can go ahead with model building. The candidate models include AR (autoregression), MA (moving average), ARMA, ARIMA, seasonal ARIMA, etc. To build these models you need to supply the orders p, d and q for the autoregressive, differencing and moving-average parts. To obtain this information we use the autocorrelation (acf()) and partial autocorrelation (pacf()) functions: the ACF helps suggest the MA order (q), while the PACF helps suggest the AR order (p) for AR(p)/ARIMA models. Please refer to the above links for a detailed explanation.
par(mfrow = c(1,2))
acf(log10(tsmonrain), main = "auto correlation plot")
pacf(log10(tsmonrain), main = "partial auto correlation plot")
From the above plots you can figure out the lag values to use in model building if you are fitting AR(p)/MA(q) or ARMA/ARIMA models by hand. Thanks to the R community, here I would like to use the auto.arima() function available in the forecast package. One good reason I picked it is that I don't need to work out these orders myself: the function call tries several options and returns the best model based on the Akaike information criterion (AIC). So I will pass the log10-transformed data to auto.arima().
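Before running auto.arima(), here is a hand-specified fit with arima() from the stats package for comparison; the orders simply mirror the ARIMA(2,0,0)(1,0,0)[12] model that auto.arima() ends up selecting below, so treat this only as an illustration of how you would plug in orders read off the ACF/PACF.
# Hand-specified seasonal ARIMA on the log10-transformed series
manualFit <- arima(log10(tsmonrain), order = c(2, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
manualFit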
# ARIMA models - Box-Jenkins approach: 1) model identification, 2) parameter estimation, 3) diagnostic checking
arimaFit <- auto.arima(log10(tsmonrain), approximation = F, trace = F)
summary(arimaFit)
## Series: log10(tsmonrain) 
## ARIMA(2,0,0)(1,0,0)[12] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2    sar1  intercept
##       0.1301  -0.0601  0.8590     1.7436
## s.e.  0.0304   0.0280  0.0157     0.0464
## 
## sigma^2 estimated as 0.05575:  log likelihood=27.13
## AIC=-44.25   AICc=-44.21   BIC=-18.19
## 
## Training set error measures:
##                         ME      RMSE       MAE       MPE     MAPE
## Training set -0.0001024415 0.2357753 0.1715335 -3.755715 13.42712
##                   MASE        ACF1
## Training set 0.9892733 0.001706884
The seasonal AR(1) term [sar1] in the model looks good, so we can proceed to predictions. Also explore the training set error measures above.
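The third Box-Jenkins step, diagnostic checking, is worth a quick look before forecasting; a minimal sketch using base-R functions on the residuals of the fitted model:
# Residuals of a good model should behave like white noise
acf(residuals(arimaFit), main = "ACF of ARIMA residuals")
# Ljung-Box test: a large p-value indicates no significant leftover autocorrelation
Box.test(residuals(arimaFit), lag = 24, type = "Ljung-Box")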
Predict for future
pred <- predict(arimaFit, n.ahead = 48)
pred$pred
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2014 1.1292263 1.6404653 1.2766853 1.5190472 1.7590512 2.2576379 2.3859335
## 2015 1.2158589 1.6550068 1.3425244 1.5507102 1.7568707 2.1851505 2.2953547
## 2016 1.2902753 1.6674979 1.3990794 1.5779084 1.7549976 2.1228847 2.2175488
## 2017 1.3541980 1.6782275 1.4476593 1.6012712 1.7533887 2.0693991 2.1507144
##            Aug       Sep       Oct       Nov       Dec
## 2014 2.3126189 2.1217574 2.0599926 1.2303773 0.9554556
## 2015 2.2323784 2.0684308 2.0153756 1.3027464 1.0665920
## 2016 2.1634529 2.0226239 1.9770502 1.3649105 1.1620569
## 2017 2.1042467 1.9832764 1.9441291 1.4183087 1.2440600
plot(tsmonrain, type = "l", xlim=c(2012,2017), xlab = "year", ylab = "rain")
lines(10^(pred$pred), col= "blue")
lines(10^(pred$pred + 2*pred$se),col= "red", type = "c", lty=2)
lines(10^(pred$pred-2*pred$se),col= "green", type = "c", lty=2)
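As an aside, the forecast package can produce this kind of plot in one step with forecast(), which wraps predict() and adds 80% and 95% prediction intervals automatically; note the values are still on the log10 scale in this sketch.
fc <- forecast(arimaFit, h = 48)  # 48 months ahead, same horizon as above
plot(fc, xlim = c(2012, 2018), xlab = "year", ylab = "log10(rain)")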
Now let us take a closer look at the monthly forecasts for 2016.
# Try some beautification
current <- 10^( window(pred$pred, start = 2016.000, end = 2016.917, frequency = 12))
current
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2016  19.51081  46.50481  25.06567  37.83627  56.88498 132.70420 165.02463
##            Aug       Sep       Oct       Nov       Dec
## 2016 145.69777 105.34743  94.85280  23.16917  14.52302
plot(10^(window(pred$pred, start = 2016.000, end = 2016.917, frequency = 12)), type = "b", col = "blue", xlab = "Year.Calender Month", ylab = "Rainfall in mm", main = "Monthly predicted rainfall for 2016")
lines(10^(window(pred$pred + pred$se, start = 2016.000, end = 2016.917, frequency = 12)),col= "brown", type = "c", lty=2)
lines(10^(window(pred$pred - pred$se, start = 2016.000, end = 2016.917, frequency = 12)),col= "red", type = "c", lty=2)
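If you want some reassurance about how well such a model forecasts, one simple option is to hold out the last few years, refit on the remainder, and compare the forecasts against the held-out data; a rough sketch, reusing the same model settings as above:
# Hold out 2010-2013, fit on 1901-2009, and score the forecasts on the log10 scale
train <- window(tsmonrain, end = c(2009, 12))
test  <- window(tsmonrain, start = c(2010, 1))
holdFit <- auto.arima(log10(train), approximation = F, trace = F)
accuracy(forecast(holdFit, h = length(test)), log10(test))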