Monday, 6 April 2015

Part 4a: Modelling - predicting the amount of rain

In the fourth and last part of this series, we will build several predictive models and evaluate their accuracies. In Part 4a, our dependent value will be continuous, and we will be predicting the daily amount of rain. Then, in Part 4b, we will deal with the case of a binary outcome, which means we will assign probabilities to the occurrence of rain on a given day. 

In both the continuous and binary cases, we will try to fit the following models:
  • Baseline model - usually, this means we assume there are no predictors (i.e., independent variables). Thus, we have to make an educated guess (not a random one), based on the value of the dependent value alone. This model is important because it will allow us to determine how good, or how bad, are the other ones. For example, imagine a fancy model with 97% of accuracy - is it necessarily good and worth implementing? No, it depends; if the baseline accuracy is 60%, it's probably a good model, but if the baseline is 96.7% it doesn't seem to add much to what we already know, and therefore its implementation will depend on how much we value this 0.3% edge. It turns out that, in real life, there are many instances where the models, no matter how simple or complex, barely beat the baseline. A simple example: try to predict whether some index of the stock market is going up or down tomorrow, based on the movements of the last N days; you may even add other variables, representing the volatility index, commodities, and so on. Even if you build a neural network with lots of neurons, I'm not expecting you to do much better than simply consider that the direction of tomorrow's movement will be the same as today's (in fact, the accuracy of your model can even be worse, due to overfitting!). Predicting stock market movements is a really tough problem;
  • A model from inferential statistics - this will be a (generalised) linear model. In the case of a continuous outcome (Part 4a), we will fit a multiple linear regression; for the binary outcome (Part 4b), the model will be a multiple logistic regression;
  • Two models from machine learning - we will first build a decision tree (regression tree for the continuous outcome, and classification tree for the binary case); these models usually offer high interpretability and decent accuracy; then, we will build random forests, a very popular method, where there is often a gain in accuracy, at the expense of interpretability.

Let's now start working on the models for the continuous outcome (i.e., the amount of rain).


Train and test sets


Here is one of the golden rules of machine learning and modelling in general: models are built using training data, and evaluated on testing data. The reason is overfitting: most models' accuracy can be artificially increased to a point where they "learn" every single detail of the data used to build them; unfortunately, it usually means they lose the capability to generalise. That's why we need unseen data (i.e., the testing set): if we overfit the training data, the performance on the testing data will be poor. In real life, simple models often beat complex ones, because they can generalise much better. We will do a random 70:30 split in our data set (70% will be for training models, 30% to evaluate them). For reproducibility, we will need to set the seed of the random number generator (it means every time I run the code, I'll get the same train and test sets. In case you're following along with the tutorial, you'll get the same sets, too). Here's the code:

> # For reproducibility; 123 has no particular meaning
> set.seed(123
> # randomly pick 70% of the number of observations (365)
> index <- sample(1:nrow(weather),size = 0.7*nrow(weather)) 
> # subset weather to include only the elements in the index
> train <- weather[index,] 
> # subset weather to include all but the elements in the index
> test <- weather [-index,] 
> nrow(train)
[1] 255 
> nrow(test)
[1] 110

Here is a plot showing which points belong to which set (train or test).


# Create a dataframe with train and test indicator...
group <- rep(NA,365)
group <- ifelse(seq(1,365) %in% index,"Train","Test")
df <- data.frame(date=weather$date,rain=weather$rain,group)

# ...and plot it
ggplot(df,aes(x = date,y = rain, color = group)) + geom_point() +
  scale_color_discrete(name="") + theme(legend.position="top")


Evaluation metrics


For the continuous outcome, the main error metric we will use to evaluate our models is the RMSE (root mean squared error). This error measure gives more weight to larger residuals than smaller ones (a residual is the difference between the predicted and the observed value). What this means is that we consider that missing the prediction for the amount of rain by 20 mm, on a given day, is not only twice as bad as missing by 10 mm, but worse than that.

We will use the MAE (mean absolute error) as a secondary error metric. It gives equal weight to the residuals, which means 20 mm is actually twice as bad as 10 mm. One of the advantages of this error measure is that it is easy to interpret: it tells us, on average, the magnitude of the error we get by using the model when compared to the actual observed values.

Both metrics are valid, although the RMSE appears to be more popular, possibly because it amplifies the differences between models' performances in situations where the MAE could lead us to believe they were about equal. If you want to know more about the comparison between the RMSE and the MAE, here is an interesting article.

Let's now build and evaluate some models.


Baseline model

In the absence of any predictor, all we have is the dependent variable (rain amount). What would be our best guess if we had to predict the amount of rain, on a given day, in the test set? I would say that the mean of the rain values, in the training data, is the best value we can come up with. Recall we can only use the training data to build models; the testing data is only there to evaluate them, after making the predictions.

> # Baseline model - predict the mean of the training data
> best.guess <- mean(train$rain) 
> # Evaluate RMSE and MAE on the testing data
> RMSE.baseline <- sqrt(mean((best.guess-test$rain)^2))
> RMSE.baseline
[1] 13.38996 
> MAE.baseline <- mean(abs(best.guess-test$rain))
> MAE.baseline
[1] 8.219123


Multiple linear regression

We will now fit a (multiple) linear regression, which is probably the best known statistical model.  In simple terms, the dependent variable is assumed to be a linear function of several independent variables (predictors), where each of them has a weight (regression coefficient) that is expected to be statistically significant in the final model. The results are usually highly interpretable and, provided some conditions are met, have good accuracy. Before showing the results, here are some important notes:
  • Linear models do not require variables to have a Gaussian distribution (only the errors / residuals must be normally distributed); they do require, however, a linear relation between the dependent and independent variables. As we saw in Part 3b, the distribution of the amount of rain is right-skewed, and the relation with some other variables is highly non-linear. For this reason of linearity, and also to help fixing the problem with residuals having non-constant variance across the range of predictions (called heteroscedasticity), we will do the usual log transformation to the dependent variable. Since we have zeros (days without rain), we can't do a simple ln(x) transformation, but we can do ln(x+1), where x is the rain amount. Why do we choose to apply a logarithmic function? Simply because the regression coefficients can still be interpreted, although in a different way when compared with a pure linear regression. This model we will fit is often called log-linear;
  • What I'm showing below is the final model. I started with all the variables as potential predictors and then eliminated from the model, one by one, those that were not statistically significant (p < 0.05). We need to do it one by one because of multicollinearity (i.e., correlation between independent variables). Some of the variables in our data are highly correlated (for instance, the minimum, average, and maximum temperature on a given day), which means that sometimes when we eliminate a non-significant variable from the model, another one that was previously non-significant becomes statistically significant. This iterative process of backward elimination stops when all the variables in the model are significant (in the case of factors, here we consider that at least one level must be significant);
  • Our dependent variable has lots of zeros and can only take positive values; if you're an expert statistician, perhaps you would like to fit very specific models that can deal better with count data, such as negative binomial, zero-inflated and hurdle models. There are several packages to do it in R. For simplicity, we'll stay with the linear regression model in this tutorial.

> # Create a multiple (log)linear regression model using the training data
> lin.reg <- lm(log(rain+1) ~ season +  h.temp + ave.temp + ave.wind + gust.wind +
              dir.wind + as.numeric(gust.wind.hour), data = train)
> # Inspect the model
> summary(lin.reg)

lm(formula = log(rain + 1) ~ season + h.temp + ave.temp + ave.wind + 
    gust.wind + dir.wind + as.numeric(gust.wind.hour), data = train)

     Min       1Q   Median       3Q      Max 
-1.93482 -0.48296 -0.05619  0.39849  2.21069 

                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 0.028020   0.445699   0.063 0.949926    
seasonSummer                0.370488   0.151962   2.438 0.015523 *  
seasonAutumn                0.696291   0.153209   4.545 8.87e-06 ***
seasonWinter                0.297473   0.159602   1.864 0.063612 .  
h.temp                     -0.197772   0.045925  -4.306 2.45e-05 ***
ave.temp                    0.165975   0.054907   3.023 0.002786 ** 
ave.wind                   -0.104736   0.041431  -2.528 0.012140 *  
gust.wind                   0.062181   0.008612   7.220 7.44e-12 ***
dir.windENE                 0.251267   0.340192   0.739 0.460898    
dir.windESE                 0.303855   0.796890   0.381 0.703330    
dir.windN                   0.405041   0.333188   1.216 0.225357    
dir.windNE                  0.358754   0.339188   1.058 0.291303    
dir.windNNE                 1.803748   0.508091   3.550 0.000467 ***
dir.windNNW                 0.390606   0.309850   1.261 0.208715    
dir.windNW                  0.346885   0.290911   1.192 0.234324    
dir.windS                   1.060690   0.334293   3.173 0.001714 ** 
dir.windSE                  1.210005   0.326271   3.709 0.000261 ***
dir.windSSE                 1.488857   0.331211   4.495 1.10e-05 ***
dir.windSSW                 1.292246   0.349473   3.698 0.000272 ***
dir.windSW                  0.852924   0.384445   2.219 0.027488 *  
dir.windW                   0.680786   0.467912   1.455 0.147042    
dir.windWNW                 0.627473   0.342009   1.835 0.067841 .  
dir.windWSW                 1.279782   0.799807   1.600 0.110940    
as.numeric(gust.wind.hour) -0.026972   0.011341  -2.378 0.018204 *  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.741 on 231 degrees of freedom
Multiple R-squared:  0.6637, Adjusted R-squared:  0.6302 
F-statistic: 19.82 on 23 and 231 DF,  p-value: < 2.2e-16 

> # What is the multiplicative effect of the wind variable?
> exp(lin.reg$coefficients["gust.wind"])

> # Apply the model to the testing data (i.e., make predictions) ...
> # (Don't forget to exponentiate the results to revert the log transformation)
> test.pred.lin <- exp(predict(lin.reg,test))-1
> # ...and evaluate the accuracy
> RMSE.lin.reg <- sqrt(mean((test.pred.lin-test$rain)^2))
> RMSE.lin.reg
[1] 10.26189
> MAE.lin.reg <- mean(abs(test.pred.lin-test$rain))
> MAE.lin.reg
[1] 4.713084

Here are the main conclusions about the model we have just built:
  • The R-squared is 0.66, which means that 66% of the variance in our dependent variable can be explained by the set of predictors in the model; at the same time, the adjusted R-squared is not far from that number, meaning that the original R-squared has not been artificially increased by adding variables to the model. Note that the R-squared can only increase or stay the same by adding variables, whereas the adjusted R-squared can even decrease if the variable added doesn't help the model more than what is expected by chance;
  • All the variables are statistically significant (p < 0.05), as expected from the way the model was built, and the most significant predictor is the wind gust (p = 7.44e-12). The advantage of doing a log transformation is that, if the regression coefficient is small (i.e. -0.1 to 0.1), a unit increase in the independent variable yields an increase of approximately coeff*100% in the dependent variable. To be clear, the coefficient of the wind gust is 0.062181. It means that a unit increase in the gust wind (i.e., increasing the wind by 1 km/h), increases the predicted amount of rain by approximately 6.22%. You can always exponentiate to get the exact value (as I did), and the result is 6.42%. By the same token, for each degree (ÂșC) the daily high temperature increases, the predicted rain increases by exp(-0.197772) = 0.82 (i.e., it decreases by 18%);
  • Both the RMSE and MAE have decreased significantly when compared with the baseline model, which means that this linear model, despite all the linearity issues and the fact that it predicts negative values of rain in some days, is still much better, overall, than our best guess. 
We will see later, when we compare the fitted vs actual values for all models, that this model has an interesting characteristic: it predicts reasonably well daily rain amounts between 0 and 25 mm, but the predicting capability degrades significantly in the 25 to 70mm range.


Decision trees


A decision tree (also known as regression tree for continuous outcome variables) is a simple and popular machine learning algorithm, with a few interesting advantages over linear models: they make no assumptions about the relation between the outcome and predictors (i.e., they allow for linear and non-linear relations); the interpretability of a decision tree could not be higher - at the end of the process, a set of rules, in natural language, relating the outcome to the explanatory variables, can be easily derived from the tree. Furthermore, a decision tree is the basis of a very powerful method that we will also use in this tutorial, called random forest.

How do we grow trees, then? In very simple terms, we start with a root node, which contains all the training data, and split it into two new nodes based on the most important variable (i.e., the variable that better separates the outcome into two groups). For each new node, a split is attempted (always based on the variable that better splits that particular node, which is not necessarily the same variable we started with), until any of the stop crtiteria is met (for instance, in the R implementation, one of the defaults is that the node must have at least 20 observations, otherwise a new split is not attempted). The disadvantage of a decision tree is that they tend to overfit; before a stop criterion is reached, the algorithm keeps splitting nodes over and over, learning all the details of the training data. Big trees are often built that perform very well in the training data, but can't generalise well and therefore do very poorly on the test set.

What to do, then? This is where the extremely important concept of cross-validation comes into play (just as understanding the difference between the training and testing data, it is also of paramount importance to understand what cross-validation is).  When we grow a tree in R (or any other software, for that matter), it internally performs a 10-fold cross-validation. What this means is that 10 cycles are run, and in each of them 90% of the training data is used for building the model, and the remaining 10%, even though they're part of the training data, are only used to predict and evaluate the results. In another words, these 10% are momentarily a test set. It should be obvious that, after the 10 cycles, 10 different chunks of data are used to test the model, which means every single observation is used, at some point, not only to train but also to test the model. Since we are testing at the same time we're growing a tree, we have a error measurement, that we use to find the optimal number of splits. We then prune the original tree, and keep only the optimal number of splits. In sum, when growing a tree, this is what we do:
  • Grow a full tree, usually with the default settings;
  • Examine the cross-validation error (x-error), and find the optimal number of splits. This corresponds, in R, to a value of cp (complexity parameter);
  • Prune the tree using the complexity parameter above.
What usually happens is that either the pruned tree performs better on the test set or it performs about the same as the full-grown tree. Even in the latter case, it is useful to prune the tree, because less splits means less decision rules and higher interpretability, for the same level of performance. Now, let's do all of this in practice.

> # Needed to grow a tree
> library(rpart)
> # To draw a pretty tree (fancyRpartPlot function)
> library(rattle)
> # rpart function applied to a numeric variable => regression tree
> rt <- rpart(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind +
              gust.wind + dir.wind + dir.wind.8 + as.numeric(h.temp.hour)+
              as.numeric(l.temp.hour)+ as.numeric(gust.wind.hour), data=train)
> # Full-grown tree with 8 splits using 6 different variables 
> # (Not running the line below - do it to see the tree)
> # fancyRpartPlot(rt)
> # As always, predict and evaluate on the test set
> test.pred.rtree <- predict(rt,test) 
> RMSE.rtree <- sqrt(mean((test.pred.rtree-test$rain)^2))
> RMSE.rtree
[1] 11.37786
> MAE.rtree <- mean(abs(test.pred.rtree-test$rain))
> MAE.rtree
[1] 6.140937

Now that we have a full-grown tree, let's see if it's possible to prune it...
> # Check cross-validation results (xerror column)
> # It corresponds to 2 splits and cp = 0.088147
> printcp(rt)

Regression tree:
rpart(formula = rain ~ month + season + l.temp + h.temp + ave.temp + 
      ave.wind + gust.wind + dir.wind + dir.wind.8 + as.numeric(h.temp.hour) + 
      as.numeric(l.temp.hour) + as.numeric(gust.wind.hour), data = train)

Variables actually used in tree construction:
[1] as.numeric(h.temp.hour) ave.temp                dir.wind               
[4] gust.wind               h.temp                  month                  

Root node error: 32619/255 = 127.92

n= 255 

        CP nsplit rel error  xerror    xstd
1 0.358496      0   1.00000 1.00815 0.22993
2 0.102497      1   0.64150 0.72131 0.19653
3 0.088147      2   0.53901 0.71921 0.19592
4 0.045689      3   0.45086 0.75017 0.19972
5 0.028308      4   0.40517 0.79740 0.20348
6 0.024935      5   0.37686 0.80947 0.19980
7 0.011454      6   0.35193 0.79036 0.19880
8 0.010600      7   0.34047 0.75514 0.17788
9 0.010000      8   0.32987 0.75514 0.17788 
> # Get the optimal CP programmatically...
> min.xerror <- rt$cptable[which.min(rt$cptable[,"xerror"]),"CP"]
> min.xerror
[1] 0.08814737 
> # ...and use it to prune the tree
> rt.pruned <- prune(rt,cp = min.xerror) 
> # Plot the pruned tree
> fancyRpartPlot(rt.pruned)

> # Evaluate the new pruned tree on the test set
> test.pred.rtree.p <- predict(rt.pruned,test)
> RMSE.rtree.pruned <- sqrt(mean((test.pred.rtree.p-test$rain)^2))
> RMSE.rtree.pruned
[1] 11.51547
> MAE.rtree.pruned <- mean(abs(test.pred.rtree.p-test$rain))
> MAE.rtree.pruned
[1] 6.138062
As you can see, we were able to prune our tree, from the initial 8 splits on six variables, to only 2 splits on one variable (the maximum wind speed), gaining simplicity without losing performance (RMSE and MAE are about equivalent in both cases). In the final tree, only the wind gust speed is considered relevant to predict the amount of rain on a given day, and the generated rules are as follows (using natural language):
  • If the daily maximum wind speed exceeds 52 km/h (4% of the days), predict a very wet day (37 mm);
  • If the daily maximum wind is between 36 and 52 km/h (23% of the days), predict a wet day (10mm);
  • If the daily maximum wind stays below 36 km/h (73% of the days), predict a dry day (1.8 mm); 
The accuracy of this extremely simple model is only a bit worse than the much more complicated linear regression. Next, instead of growing only one tree, we will grow the whole forest, a method that is very powerful and, more often than not, yields in very good results.


Random forests


What if, instead of growing a single tree, we grow many (ranging from a few hundred to a few thousand), and introduce some sources of randomness, so that each tree is most likely different from the others? What we get is a random forest (Leo Breiman, 2001), a popular algorithm that pretty much every data scientist in the world knows. In fact, when it comes to problems where the focus is not so much in understanding the data, but in make predictions, random forests are often used immediately after preparing the data set, skipping entirely the EDA stage. This is basically how the algorithm works:
  • To grow each tree, a random sample of the training data, with the size of the training data itself, is taken, with replacement. This means that some observations might appear several times in the sample, and others are left out (the proportion of discarded observations converges to 36.8%, for very large sample sizes). In another words, each tree will be grown with its own version of the training data;
  • The variables that are used to grow each tree are also randomly sampled; the sample size is 1/3 and the square root of the number of independent variables, for regression and classification problems, respectively;
  • Each tree is then fully grown, without any pruning, using its own training set and variable set;
  • In classification problems, each tree casts a vote for the class it predicts and the final prediction is the class that has the majority of votes. In regression problems, the predicted value is a weighted average of the value predicted by each tree. 
Some of most interesting characteristics of random forests are:
  • They do not overfit. Even though each component of the forest (i.e. each tree) is probably overfitting a lot, since they are all different, the mistakes they make are in all directions; when the errors are averaged, they kind of cancel each other. Although each classifier is weak (recall the training set and variables are randomly sampled), when put together they become a strong classifier (this is the concept of ensemble learning);
  • There is no need for cross-validation. Remember those 36 to 37% of observations that are left out when sampling from the training set? Well, these data (called out-of-bag) are used not only to estimate the error, but also to measure the importance of each variable. All of this is happening at the same time the model is being built;
  • We can grow as many tree as we want (the limit is the computational power). What usually happens, however, is that the estimated error can no longer be improved after a certain number of trees is grown. Typical number for error convergence is between 100 and 2000 trees, depending on the complexity of the problem.
  • Although we usually improve accuracy, it comes at a cost: interpretability. A random forest is akin to a black box model, where it is hard to understand what's going on inside; anyway, we still have an estimate for variable importance, which is more than some other models can offer.
 Now, let's see the random forest working in practice. 

> # Needed to run the algorithm
> library(randomForest)
> # Convert some factor variables to numeric (train and test sets)
> train$h.temp.hour <- as.numeric(train$h.temp.hour)
> train$l.temp.hour <- as.numeric(train$l.temp.hour)
> train$gust.wind.hour <- as.numeric(train$gust.wind.hour)
> test$h.temp.hour <- as.numeric(test$h.temp.hour)
> test$l.temp.hour <- as.numeric(test$l.temp.hour)
> test$gust.wind.hour <- as.numeric(test$gust.wind.hour)
> # For reproducibility; 123 has no particular meaning
> # Run this immediately before creating the random forest
> set.seed(123)
> # Create a random forest with 1000 trees
> rf <- randomForest(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind +
              gust.wind + dir.wind + dir.wind.8 + h.temp.hour + l.temp.hour +
              gust.wind.hour, data = train, importance = TRUE, ntree=1000)
> # How many trees are needed to reach the minimum error estimate? 
> # This is a simple problem; it appears that about 100 trees would be enough. 
> which.min(rf$mse)
[1] 93
> # Plot rf to see the estimated error as a function of the number of trees
> # (not running it here)
> # plot(rf) 
> # Using the importance()  function to calculate the importance of each variable
> imp <-[,1],decreasing = TRUE),optional = T)
> names(imp) <- "% Inc MSE"
> imp
               % Inc MSE
gust.wind      21.241930
dir.wind       12.185181
ave.wind       11.864757
h.temp          9.499869
month           9.039023
ave.temp        8.420760
dir.wind.8      7.635521
l.temp          5.789758
season          5.262452
gust.wind.hour  4.725368
h.temp.hour     3.961188
l.temp.hour     1.979231

> # As usual, predict and evaluate on the test set
> test.pred.forest <- predict(rf,test)
> RMSE.forest <- sqrt(mean((test.pred.forest-test$rain)^2))
> RMSE.forest
[1] 10.20872 
> MAE.forest <- mean(abs(test.pred.forest-test$rain))
> MAE.forest
[1] 5.065036

We can see the accuracy improved when compared to the decision tree model, and is just about equal to the performance of the linear regression model. The gust wind speed was, once again, considered the most important predictor; it is estimated that, in the absence of that variable, the error would increase by 21.2%. 


Putting it all together

We have just built and evaluated the accuracy of five different models: baseline, linear regression, fully-grown decision tree, pruned decision tree, and random forest. Let's create a data frame with the RMSE and MAE for each of these methods.

> # Create a data frame with the error metrics for each method
> accuracy <- data.frame(Method = c("Baseline","Linear Regression","Full tree","Pruned tree","Random forest"),
                         RMSE   = c(RMSE.baseline,RMSE.lin.reg,RMSE.rtree,RMSE.rtree.pruned,RMSE.forest),
                         MAE    = c(MAE.baseline,MAE.lin.reg,MAE.rtree,MAE.rtree.pruned,MAE.forest)) 
> # Round the values and print the table
> accuracy$RMSE <- round(accuracy$RMSE,2)
> accuracy$MAE <- round(accuracy$MAE,2
> accuracy
             Method  RMSE  MAE
1          Baseline 13.39 8.22
2 Linear Regression 10.26 4.71
3         Full tree 11.38 6.14
4       Pruned tree 11.52 6.14
5     Random forest 10.21 5.07
All methods beat the baseline, regardless of the error metric, with the random forest and linear regression offering the best performance. It would be interesting, still, to compare the fitted vs. actual values for each model.

# Create a data frame with the predictions for each method
all.predictions <- data.frame(actual = test$rain,
                              baseline = best.guess,
                              linear.regression = test.pred.lin,
                              full.tree = test.pred.rtree,
                              pruned.tree = test.pred.rtree.p,
                              random.forest = test.pred.forest)
> # First six observations 
> head(all.predictions)
   actual baseline linear.regression  full.tree pruned.tree random.forest
2    64.8 5.315294         7.2638856 13.8058824  13.8058824     11.424860
4    20.1 5.315294        20.3352957 37.2090909  37.2090909     24.821385
5     9.4 5.315294        13.4020265 13.8058824  13.8058824     21.777313
6    38.9 5.315294        30.8604212 37.2090909  37.2090909     28.978410
7     2.0 5.315294         8.0923795 13.8058824  13.8058824     11.085088
10    0.0 5.315294         0.7544513  0.8281046   0.8281046      1.999272
> # Needed to melt the columns with the gather() function 
> # tidyr is an alternative to the reshape2 package (see the end of Part3a) 
> library(tidyr)
> # Gather the prediction variables (columns) into a single row (i.e., wide to long)
> # Recall the ggplot2 prefers the long data format
> all.predictions <- gather(all.predictions,key = model,value = predictions,2:6)
> head(all.predictions)
  actual    model predictions
1   64.8 baseline    5.315294
2   20.1 baseline    5.315294
3    9.4 baseline    5.315294
4   38.9 baseline    5.315294
5    2.0 baseline    5.315294
6    0.0 baseline    5.315294 
> tail (all.predictions)
    actual         model predictions
545    2.3 random.forest   12.788688
546    0.0 random.forest    0.370295
547    4.3 random.forest    2.159183
548    0.3 random.forest    6.752708
549    0.0 random.forest    3.310518
550    0.0 random.forest    2.126990 
# Predicted vs. actual for each model
ggplot(data = all.predictions,aes(x = actual, y = predictions)) + 
  geom_point(colour = "blue") + 
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  geom_vline(xintercept = 23, colour = "green", linetype = "dashed") +
  facet_wrap(~ model,ncol = 2) + 
  coord_cartesian(xlim = c(0,70),ylim = c(0,70)) +
  ggtitle("Predicted vs. Actual, by model")

The graph shows that none of the models can predict accurately values over 25 mm of daily rain. In the range from 0 to approximately 22 mm (vertical dashed line for reference), the random forest seems to be the method that better approximates the diagonal line (i.e., smaller prediction error), followed closely by the linear regression. Even though these models can not be considered more than fair, they still do a much better job when compared with the baseline prediction.

In many cases, we build models not only to predict, but also to gain insight from the data. When, in Part 3b, we performed EDA, we saw that the maximum wind speed seemed to be an important predictor. Well, the models confirmed it three times: this variable was the most statistically significant in the linear model, was the only one used to predict in the regression tree, and the one that reduced the estimated error the most in the random forest.

In Part 4b, we will continue building models, this time considering the rain as a binary outcome. This means all models will assign probabilities to the occurrence of rain, for each day in the test set.

Tuesday, 17 March 2015

Part 3b: EDA with ggplot2

In Part 3a I have introduced the plotting system ggplot2. I talked about its concept and syntax with some detail, and then created a few general plots, using the weather data set we've been working with in this series of tutorials. My goal was to show that, in ggplot2, a nice looking and interesting graph can usually be created with just a few lines of code. Given the positive feedback that post has received, I believe I had some success with what I had in mind. Who knows if some of the readers will decide to give ggplot2 a try using their own data? It is often said that a picture is worth a thousand words, but I often see some highly experienced R users, after doing really complex data analyses, plotting the results in a way that falls short of the expectations, given the quality of their analytical work.

In Part 3b, we will continue to rely on visualisations to explore the data, but now with the goal of tackling the following question: Are there any good predictors, in our data, for the occurrence of rain on a given day? The numeric variable representing the amount of rain will be our dependent variable, also known as response or outcome variable, and all the remaining ones will be potential independent variables, also known as predictors or explanatory variables.

After framing the question, and before fitting any model, EDA techniques (such as visualisation) are used to gain insight from the data. Here are the steps I usually take at this stage:
  • Analyse the (continuous) dependent variable - at least, make an histogram of the distribution; when applicable, plot the variable against time to see the trend. Can the continuous variable be transformed to a binary one (i.e., did it rain or not on a particular day?), or to a multicategorical one (i.e., was the rain none, light, moderate, or heavy on a particular day?).
  • Search for correlations between the dependent variable and the continuous independent variables - are there any strong correlations? Are they linear or non-linear?
  • Do the same as above but try to control for other variables (faceting in ggplot2 is very useful  to do this), in order to assess for confounding and effect modification. Does the association between two continuous variables hold for different levels of a third variable, or is modified by them? (e.g., if there is a strong positive correlation between the rain amount and the wind gust maximum speed, does that hold regardless of the season of the year, or does it happen only in the winter?)
  • Search for associations between the dependent variable and the categorical independent variables (factors) - does the mean or median of the dependent variable change depending on the level of the factor? What about the outliers, are they evenly distributed across all levels, or seem to be present in only a few of them? 
So, let's now do some analyses, having the framework described above in mind.


Exploring the dependent variable - daily rain amount



# Time series of the daily rain amount, with smoother curve

ggplot(weather, aes(date,rain)) +
  geom_point(aes(colour = rain)) +
  geom_smooth(colour = "blue", size = 1) +
  scale_colour_gradient2(low = "green", mid = "orange",high = "red", midpoint = 20) +
  scale_y_continuous(breaks = seq(0,80,20)) +
  xlab("Date") +
  ylab("Rain (mm)") +
  ggtitle("Daily rain amount")


# Histogram of the daily rain amount

ggplot(weather,aes(rain)) + 
  geom_histogram(binwidth = 1,colour = "blue", fill = "darkgrey") +
  scale_x_continuous(breaks = seq(0,80,5)) +
  scale_y_continuous(breaks = seq(0,225,25)) +
  xlab("Rain (mm)") +
  ylab ("Frequency (days)") +
  ggtitle("Daily rain amount distribution")
The time series plot shows that the daily rain amount varies wildly throughout the year. There are many dry days interspersed with stretches of consecutive wet ones, often severe, especially in the autumn and winter seasons. The histogram not only confirms what was said above, but also shows that the distribution is extremely right-skewed. As shown below, both informally (comparing the mean to the median), and formally (calculating the actual value), the positive skewness remains even after removing all the days where it did not rain.

> # Heavily right-skewed distribution
> summary(weather$rain)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.300   5.843   5.300  74.900 
> # Right-skewness is still there after removing all the dry days
> summary(subset(weather, rain > 0)$rain)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.30    1.15    5.10   11.17   16.10   74.90 
> # Formal calculation of skewness (e1071 package)
> library(e1071) 
> skewness(weather$rain)
[1] 2.99
> skewness(subset(weather, rain >0)$rain)
[1] 2.04
It should be clear at this point that one possible approach would be to dichotomise the dependent variable (rain vs. no rain).  Note that it is common to consider days with rain only those where the total amount was at least 1mm (to allow for measurement errors), and that's the criterion we will adopt here. Here is the code to do it and a few interesting summaries.

> # Create binary outcome (rained = {Yes, No})
> # Number of dry days
> nrow(subset(weather, rain == 0))
[1] 174 
> # Number of days that will also be considered dry days
> nrow(subset(weather, rain <1 & rain >0))
[1] 45
> # The new binary variable is called "rained"
> weather$rained <- ifelse(weather$rain >= 1, "Yes", "No")
> # Dry and wet days (absolute)
> table(rained = weather$rained) 
 No Yes 
219 146 
> # Dry and wet days (relative)
> prop.table(table(rained = weather$rained)) 
 No Yes 
0.6 0.4
Porto is one of the wettest cities in Europe for a reason. There was rain in exactly 40% of the days of the year, and this is considering those days with rain < 1mm as dry. Should we set the cutoff at 0 mm, and more than 50% of the days in 2014 would have been considered wet.


Looking at the association between rain and season of the year

The time series plot seems to indicate that season of the year might be a good predictor for the occurrence of rain. Let's start by investigating this relation, with both the continuous rain variable and the binary one.

Rain amount (continuous) by season

# Jitter plot - Rain amount by season 
 ggplot(weather, aes(season,rain)) +
  geom_jitter(aes(colour=rain), position = position_jitter(width = 0.2)) +
  scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = 30) +
  scale_y_continuous(breaks = seq(0,80,20)) +
  xlab("Season") +
  ylab ("Rain (mm)") +
  ggtitle("Daily rain amount by season")

> # Rain amount by season
> # {tapply(x,y,z) applies function z to x, for each level of y} 
> tapply(weather$rain,weather$season,summary) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   2.934   2.550  28.200 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   2.936   1.350  68.300 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.300   7.164   6.450  74.900 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    3.95   10.45   14.65   64.80

We can see that most of the extreme values (outliers) are in the winter and autumn. There are still, however, many dry days in both these seasons, and hence the means are too close from each other. Fitting a model to predict the actual amount of rain (not the same as probability of occurrence) based on the season alone might not give the greatest results.

Rain occurrence (binary) by season

# Bar plot - dry and wet days by season (relative)
ggplot(weather,aes(season)) +
  geom_bar(aes(fill = rained), position = "fill") +
  geom_hline(aes(yintercept = prop.table(table(weather$rained))["No"]),
             colour = "blue",linetype = "dashed", size = 1) +
  annotate("text", x = 1, y = 0.65, label = "yr. w/o = 0.60", colour = "blue") +
  xlab("Season") +
  ylab ("Proportion") +
  ggtitle("Proportion of days without and with rain, by season")

> round(prop.table(table(season = weather$season, rained= weather$rained),1),2
season     No  Yes
  Spring 0.74 0.26
  Summer 0.72 0.28
  Autumn 0.57 0.43
  Winter 0.37 0.63

It appears that, when it comes to calculate the likelihood of raining on a particular day, the season of the year may have some predictive power. This is especially true for the winter season, where the rainy days (63%) are well above the yearly average (40%).


Looking at the correlations between rain and all numeric variables

We are now going to calculate the linear (Pearson) correlations between the continuous outcome variable (daily rain amount) and all the numeric variables - both the actual numeric ones (such as the temperature and wind speed), and those that are actually factor variables, but still make sense if we want to see them as numeric (for instance, the hour of the day in which some event occurred).

Note that we are not modelling at this stage, just trying to gain some insight from the data, and therefore we are neither concerned about whether the correlations are significant (p-values and/or confidence intervals) nor if the the relation between any two variables is in fact linear (we will be able to check this later, anyway, when we create the scatter plots with the lowess curves on top).

Here is the code (the comments should make it easier to understand what we are trying to do).

> # Create a new data frame with only the variables than can be numeric
> weather.num <- weather[c("rain","l.temp","h.temp","ave.temp","ave.wind","gust.wind",
+ "l.temp.hour","h.temp.hour","gust.wind.hour")] 
> # Convert the following factor variables to numeric
> weather.num$l.temp.hour <- as.numeric(weather.num$l.temp.hour)
> weather.num$h.temp.hour <- as.numeric(weather.num$h.temp.hour)
> weather.num$gust.wind.hour <- as.numeric(weather.num$gust.wind.hour) 
> # Pass the entire data frame to the cor() function to create a big correlation matrix
> # Pick only the first row of the matrix [1,] ==> correlation between rain and all the other variables
> round(cor(weather.num),2)[1,]
          rain         l.temp         h.temp       ave.temp       ave.wind 
          1.00          -0.14          -0.29          -0.21           0.48 
     gust.wind    l.temp.hour    h.temp.hour gust.wind.hour 
          0.61           0.19          -0.25          -0.16

There seems to be a promising positive correlation between the rain amount and the wind variables, especially the wind gust maximum speed (moderate correlation value of 0.61) , i.e., higher wind speeds tend to be associated with higher amounts of precipitation. Always bear in mind that correlation does not imply causation, therefore while it is true that the wind correlates with rain, this does not necessarily mean that the wind itself is causing the rain. It could actually be the other way around or, what is very common, both of the variables are caused by a third one that we are not even considering.

There are also some negative correlations with the temperatures (especially the daily high) that, even though not as strong as the wind ones, are still worth looking at. It appears higher amounts of rain are correlated with lower high temperatures. But let's think about it for a minute: we saw that it is in the winter when it rains the most, and in Part 3a we also saw that the temperatures were lower in the winter. Here is an example of a potential (yet to be confirmed) interaction with a third variable: it may not be the lower temperature by itself that causes more rain, but the fact the both the precipitation and lower temperatures tend to occur during a specific period of the year.

Since the season seems to have an impact on these variables, I would like to explore it a bit further, calculating all these correlations by season and check whether the values hold. If the correlation between rain and some other variable is very dissimilar across all seasons, then there is the proof for an interaction.

> # Let's split the data frame in four parts, one for each season
> weather.num.season <- split(weather.num,weather$season) 
> # The result is a list...
> class(weather.num.season)
[1] "list" 
> # ...with 4 elements, where...
> length(weather.num.season)
[1] 4 
> # ...each element of the list is a data frame (the seasons), with nine variables
> summary(weather.num.season)
       Length Class      Mode
Spring 9      data.frame list
Summer 9      data.frame list
Autumn 9      data.frame list
Winter 9      data.frame list 
> # Here are the names of each of the data frames of the list
> attributes(weather.num.season)
[1] "Spring" "Summer" "Autumn" "Winter"

> # *apply family of functions are arguably the most powerful in base R, but also the most difficult to master
> # {sapply(x,z) applies function z to each element of x}
> # First go over the elements of the list and calculate the correlation matrix (all against all)
> # For each season, return only the correlation between "rain" and everything else
> sapply(weather.num.season, function (x) round(cor(x)["rain",],2))
               Spring Summer Autumn Winter
rain             1.00   1.00   1.00   1.00
l.temp          -0.33   0.06   0.13   0.05
h.temp          -0.45  -0.26  -0.07  -0.26
ave.temp        -0.43  -0.14   0.06  -0.09
ave.wind         0.34   0.50   0.38   0.66
gust.wind        0.37   0.45   0.66   0.71
l.temp.hour      0.24   0.16   0.01   0.33
h.temp.hour     -0.13  -0.22  -0.16  -0.30
gust.wind.hour  -0.26  -0.34  -0.18   0.06 
> # Not going to run it, but here is an alternative that might be easier (or not) to understand
> # It actually shows the correlations for each element of the list
> # lapply(weather.num.season, function (x) round(cor(x)["rain",],2)) 

What do you conclude from the table above? 

The correlation between the rain and wind varies, but keeps moderately strong, regardless of the season of the year, making this the most promising variable in out data set; the correlation between the rain and daily high temperature does not confirm what I had hypothesised above. In fact, the correlation is even stronger in the spring than in the winter, and we would have to go even deeper if we really needed to understand what is going on (keep in mind we are just analysing one of the possible interactions - the season - but in practice there can be multiple ones). For the purpose of what we are doing now, it is enough to be aware that this correlation is not stable throughout the year, and actually goes to zero during the autumn. Lastly, the correlations between rain and the hour of the events (low and high temperatures, and wind gust) are rather weak but show some stability (see the l.temp.hour and h.temp.hour). They might have some predictive power, at least in a linear model.
Now that we know that the wind has the strongest correlation with the rain, and that it holds true across all seasons, it's time to plot these variables, because we want to learn something we still don't know: what is the shape of this relation? Linear, piecewise linear, curvilinear? Let's find out using ggplot2 and the concept of faceting.


Looking deeper at the wind and rain correlation - faceting technique

The idea of faceting (also called Trellis plots) is a simple but important one. A graphic, of any type and mapping any number of variables, can be divided into several panels, depending on the value of a conditioning variable. This value is either each of the levels of a categorical variable, or a number of ranges of a numeric variable. This helps us check whether there are consistent patterns across all panels. In ggplot2 there are two functions to create facets - facet_wrap() and facet_grid() - that are used when we have one or two conditioning variables, respectively. As everything in ggplot2, this is just another layer in the plot, which means all geometries, mappings, and settings will be replicated across all panels.

Let's then assess the linearity of the correlation of the amount of rain and the maximum wind gust speed, conditioning on the season of the year.

# Amount of rain vs. wind, by season
ggplot(weather,aes(gust.wind,rain)) +
  geom_point(colour = "firebrick") +
  geom_smooth(size = 0.75, se = F) +
  facet_wrap(~season) +
  xlab("Maximum wind speed (km/h)") +
  ylab ("Rain (mm)") +
  ggtitle("Amount of rain vs. maximum wind speed, by season")

This plot confirms what we had already discovered: there is a positive correlation between rain and wind, and the association holds regardless of the season. But now we know more: this correlation is non-linear. In fact, if we were to generalise, we could say there is no correlation at all when the maximum wind speed is below 25 km/h. For values higher than that, there seems to be a linear association in the autumn and winter, not so linear in the spring, and definitely non-linear during the summer. If we wanted to model this relation, we would either fit a non-linear model (such as a regression tree) or we could try to force a piecewise linear model (linear spline), where the equation relating the outcome to the predictors would, itself, be different depending on the value of the wind.


Occurrence of rain - more complex faceting to visualise variable association


To finish this part of the series, let's check whether the variable that seemed to have some predictive power for the amount of rain (maximum wind speed), is also good in the case of a binary outcome (occurrence of rain), but now we will not only control for the season, but also for the daily high temperature (because, as we have seen before, this variable was interacting with both the rain and the season). We will do this simultaneously, using the faceting technique on two variables. But, since the daily high temperature variable is continuous, we need to transform it to categorical first. A common strategy is to split the continuous variables in four groups of (roughly) the same size, i.e., the quartiles. This is very simple to do in R, combining the cut() and quantile() functions.

> # Using the defaults of the quantiles function returns 4 intervals (quartiles)
> quantile(weather$h.temp)
  0%  25%  50%  75% 100% 
 9.8 14.4 19.1 23.3 31.5 
> # All we need to do is to define the quartiles as the breaks of the cut function
> # and label the intervals accordingly
> weather$h.temp.quant <- cut(weather$h.temp, breaks = quantile(weather$h.temp),
                            labels = c("Cool","Mild","Warm","Hot"),include.lowest = T)
> # The result 
> table(weather$h.temp.quant)

Cool Mild Warm  Hot 
  92   91   94   88  


# Occurrence of rain, by season and daily high temperature 

ggplot(weather,aes(rained,gust.wind)) +
  geom_boxplot(aes(colour=rained)) +
  facet_grid(h.temp.quant~season) +
  xlab("Occurrence of rain") +
  ylab ("Maximum wind speed (km/h)") +
  ggtitle("Occurrence of rain, by season and daily high temperature")

The graph reveals a clear pattern: the median of the maximum wind speed is always higher when it rains, and this is not affected by the range the of daily high temperature, even after controlling for the temperature variation within each season of the year.

I think we now have a much better understanding of the data. We know which variables matter the most, and which ones seem to be useless, when it comes to predict the rain, either the actual amount or the probability of its occurrence. Please note that it would be impossible to write about the analysis of every single variable and show every plot; behind the scenes, much more work than what I've shown here has been done.

In Part 4, the last of this series, we will be using a few machine learning algorithms to find out how well the rain can be predicted, and the contribution of each variable to the accuracy and precision of each model.

In the meanwhile, if you have done any kind of analysis with this data set, feel free to share your own findings, either by commenting on the post or by sending me a private message.