Monday, 6 April 2015

Part 4a: Modelling - predicting the amount of rain

In the fourth and last part of this series, we will build several predictive models and evaluate their accuracies. In Part 4a, our dependent value will be continuous, and we will be predicting the daily amount of rain. Then, in Part 4b, we will deal with the case of a binary outcome, which means we will assign probabilities to the occurrence of rain on a given day. 

In both the continuous and binary cases, we will try to fit the following models:
  • Baseline model - usually, this means we assume there are no predictors (i.e., independent variables). Thus, we have to make an educated guess (not a random one), based on the value of the dependent value alone. This model is important because it will allow us to determine how good, or how bad, are the other ones. For example, imagine a fancy model with 97% of accuracy - is it necessarily good and worth implementing? No, it depends; if the baseline accuracy is 60%, it's probably a good model, but if the baseline is 96.7% it doesn't seem to add much to what we already know, and therefore its implementation will depend on how much we value this 0.3% edge. It turns out that, in real life, there are many instances where the models, no matter how simple or complex, barely beat the baseline. A simple example: try to predict whether some index of the stock market is going up or down tomorrow, based on the movements of the last N days; you may even add other variables, representing the volatility index, commodities, and so on. Even if you build a neural network with lots of neurons, I'm not expecting you to do much better than simply consider that the direction of tomorrow's movement will be the same as today's (in fact, the accuracy of your model can even be worse, due to overfitting!). Predicting stock market movements is a really tough problem;
  • A model from inferential statistics - this will be a (generalised) linear model. In the case of a continuous outcome (Part 4a), we will fit a multiple linear regression; for the binary outcome (Part 4b), the model will be a multiple logistic regression;
  • Two models from machine learning - we will first build a decision tree (regression tree for the continuous outcome, and classification tree for the binary case); these models usually offer high interpretability and decent accuracy; then, we will build random forests, a very popular method, where there is often a gain in accuracy, at the expense of interpretability.

Let's now start working on the models for the continuous outcome (i.e., the amount of rain).


Train and test sets


Here is one of the golden rules of machine learning and modelling in general: models are built using training data, and evaluated on testing data. The reason is overfitting: most models' accuracy can be artificially increased to a point where they "learn" every single detail of the data used to build them; unfortunately, it usually means they lose the capability to generalise. That's why we need unseen data (i.e., the testing set): if we overfit the training data, the performance on the testing data will be poor. In real life, simple models often beat complex ones, because they can generalise much better. We will do a random 70:30 split in our data set (70% will be for training models, 30% to evaluate them). For reproducibility, we will need to set the seed of the random number generator (it means every time I run the code, I'll get the same train and test sets. In case you're following along with the tutorial, you'll get the same sets, too). Here's the code:

> # For reproducibility; 123 has no particular meaning
> set.seed(123
> # randomly pick 70% of the number of observations (365)
> index <- sample(1:nrow(weather),size = 0.7*nrow(weather)) 
> # subset weather to include only the elements in the index
> train <- weather[index,] 
> # subset weather to include all but the elements in the index
> test <- weather [-index,] 
> nrow(train)
[1] 255 
> nrow(test)
[1] 110

Here is a plot showing which points belong to which set (train or test).


# Create a dataframe with train and test indicator...
group <- rep(NA,365)
group <- ifelse(seq(1,365) %in% index,"Train","Test")
df <- data.frame(date=weather$date,rain=weather$rain,group)

# ...and plot it
ggplot(df,aes(x = date,y = rain, color = group)) + geom_point() +
  scale_color_discrete(name="") + theme(legend.position="top")


Evaluation metrics


For the continuous outcome, the main error metric we will use to evaluate our models is the RMSE (root mean squared error). This error measure gives more weight to larger residuals than smaller ones (a residual is the difference between the predicted and the observed value). What this means is that we consider that missing the prediction for the amount of rain by 20 mm, on a given day, is not only twice as bad as missing by 10 mm, but worse than that.

We will use the MAE (mean absolute error) as a secondary error metric. It gives equal weight to the residuals, which means 20 mm is actually twice as bad as 10 mm. One of the advantages of this error measure is that it is easy to interpret: it tells us, on average, the magnitude of the error we get by using the model when compared to the actual observed values.

Both metrics are valid, although the RMSE appears to be more popular, possibly because it amplifies the differences between models' performances in situations where the MAE could lead us to believe they were about equal. If you want to know more about the comparison between the RMSE and the MAE, here is an interesting article.

Let's now build and evaluate some models.


Baseline model

In the absence of any predictor, all we have is the dependent variable (rain amount). What would be our best guess if we had to predict the amount of rain, on a given day, in the test set? I would say that the mean of the rain values, in the training data, is the best value we can come up with. Recall we can only use the training data to build models; the testing data is only there to evaluate them, after making the predictions.

> # Baseline model - predict the mean of the training data
> best.guess <- mean(train$rain) 
> # Evaluate RMSE and MAE on the testing data
> RMSE.baseline <- sqrt(mean((best.guess-test$rain)^2))
> RMSE.baseline
[1] 13.38996 
> MAE.baseline <- mean(abs(best.guess-test$rain))
> MAE.baseline
[1] 8.219123


Multiple linear regression

We will now fit a (multiple) linear regression, which is probably the best known statistical model.  In simple terms, the dependent variable is assumed to be a linear function of several independent variables (predictors), where each of them has a weight (regression coefficient) that is expected to be statistically significant in the final model. The results are usually highly interpretable and, provided some conditions are met, have good accuracy. Before showing the results, here are some important notes:
  • Linear models do not require variables to have a Gaussian distribution (only the errors / residuals must be normally distributed); they do require, however, a linear relation between the dependent and independent variables. As we saw in Part 3b, the distribution of the amount of rain is right-skewed, and the relation with some other variables is highly non-linear. For this reason of linearity, and also to help fixing the problem with residuals having non-constant variance across the range of predictions (called heteroscedasticity), we will do the usual log transformation to the dependent variable. Since we have zeros (days without rain), we can't do a simple ln(x) transformation, but we can do ln(x+1), where x is the rain amount. Why do we choose to apply a logarithmic function? Simply because the regression coefficients can still be interpreted, although in a different way when compared with a pure linear regression. This model we will fit is often called log-linear;
  • What I'm showing below is the final model. I started with all the variables as potential predictors and then eliminated from the model, one by one, those that were not statistically significant (p < 0.05). We need to do it one by one because of multicollinearity (i.e., correlation between independent variables). Some of the variables in our data are highly correlated (for instance, the minimum, average, and maximum temperature on a given day), which means that sometimes when we eliminate a non-significant variable from the model, another one that was previously non-significant becomes statistically significant. This iterative process of backward elimination stops when all the variables in the model are significant (in the case of factors, here we consider that at least one level must be significant);
  • Our dependent variable has lots of zeros and can only take positive values; if you're an expert statistician, perhaps you would like to fit very specific models that can deal better with count data, such as negative binomial, zero-inflated and hurdle models. There are several packages to do it in R. For simplicity, we'll stay with the linear regression model in this tutorial.

> # Create a multiple (log)linear regression model using the training data
> lin.reg <- lm(log(rain+1) ~ season +  h.temp + ave.temp + ave.wind + gust.wind +
              dir.wind + as.numeric(gust.wind.hour), data = train)
> # Inspect the model
> summary(lin.reg)

lm(formula = log(rain + 1) ~ season + h.temp + ave.temp + ave.wind + 
    gust.wind + dir.wind + as.numeric(gust.wind.hour), data = train)

     Min       1Q   Median       3Q      Max 
-1.93482 -0.48296 -0.05619  0.39849  2.21069 

                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 0.028020   0.445699   0.063 0.949926    
seasonSummer                0.370488   0.151962   2.438 0.015523 *  
seasonAutumn                0.696291   0.153209   4.545 8.87e-06 ***
seasonWinter                0.297473   0.159602   1.864 0.063612 .  
h.temp                     -0.197772   0.045925  -4.306 2.45e-05 ***
ave.temp                    0.165975   0.054907   3.023 0.002786 ** 
ave.wind                   -0.104736   0.041431  -2.528 0.012140 *  
gust.wind                   0.062181   0.008612   7.220 7.44e-12 ***
dir.windENE                 0.251267   0.340192   0.739 0.460898    
dir.windESE                 0.303855   0.796890   0.381 0.703330    
dir.windN                   0.405041   0.333188   1.216 0.225357    
dir.windNE                  0.358754   0.339188   1.058 0.291303    
dir.windNNE                 1.803748   0.508091   3.550 0.000467 ***
dir.windNNW                 0.390606   0.309850   1.261 0.208715    
dir.windNW                  0.346885   0.290911   1.192 0.234324    
dir.windS                   1.060690   0.334293   3.173 0.001714 ** 
dir.windSE                  1.210005   0.326271   3.709 0.000261 ***
dir.windSSE                 1.488857   0.331211   4.495 1.10e-05 ***
dir.windSSW                 1.292246   0.349473   3.698 0.000272 ***
dir.windSW                  0.852924   0.384445   2.219 0.027488 *  
dir.windW                   0.680786   0.467912   1.455 0.147042    
dir.windWNW                 0.627473   0.342009   1.835 0.067841 .  
dir.windWSW                 1.279782   0.799807   1.600 0.110940    
as.numeric(gust.wind.hour) -0.026972   0.011341  -2.378 0.018204 *  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.741 on 231 degrees of freedom
Multiple R-squared:  0.6637, Adjusted R-squared:  0.6302 
F-statistic: 19.82 on 23 and 231 DF,  p-value: < 2.2e-16 

> # What is the multiplicative effect of the wind variable?
> exp(lin.reg$coefficients["gust.wind"])

> # Apply the model to the testing data (i.e., make predictions) ...
> # (Don't forget to exponentiate the results to revert the log transformation)
> test.pred.lin <- exp(predict(lin.reg,test))-1
> # ...and evaluate the accuracy
> RMSE.lin.reg <- sqrt(mean((test.pred.lin-test$rain)^2))
> RMSE.lin.reg
[1] 10.26189
> MAE.lin.reg <- mean(abs(test.pred.lin-test$rain))
> MAE.lin.reg
[1] 4.713084

Here are the main conclusions about the model we have just built:
  • The R-squared is 0.66, which means that 66% of the variance in our dependent variable can be explained by the set of predictors in the model; at the same time, the adjusted R-squared is not far from that number, meaning that the original R-squared has not been artificially increased by adding variables to the model. Note that the R-squared can only increase or stay the same by adding variables, whereas the adjusted R-squared can even decrease if the variable added doesn't help the model more than what is expected by chance;
  • All the variables are statistically significant (p < 0.05), as expected from the way the model was built, and the most significant predictor is the wind gust (p = 7.44e-12). The advantage of doing a log transformation is that, if the regression coefficient is small (i.e. -0.1 to 0.1), a unit increase in the independent variable yields an increase of approximately coeff*100% in the dependent variable. To be clear, the coefficient of the wind gust is 0.062181. It means that a unit increase in the gust wind (i.e., increasing the wind by 1 km/h), increases the predicted amount of rain by approximately 6.22%. You can always exponentiate to get the exact value (as I did), and the result is 6.42%. By the same token, for each degree (ºC) the daily high temperature increases, the predicted rain increases by exp(-0.197772) = 0.82 (i.e., it decreases by 18%);
  • Both the RMSE and MAE have decreased significantly when compared with the baseline model, which means that this linear model, despite all the linearity issues and the fact that it predicts negative values of rain in some days, is still much better, overall, than our best guess. 
We will see later, when we compare the fitted vs actual values for all models, that this model has an interesting characteristic: it predicts reasonably well daily rain amounts between 0 and 25 mm, but the predicting capability degrades significantly in the 25 to 70mm range.


Decision trees


A decision tree (also known as regression tree for continuous outcome variables) is a simple and popular machine learning algorithm, with a few interesting advantages over linear models: they make no assumptions about the relation between the outcome and predictors (i.e., they allow for linear and non-linear relations); the interpretability of a decision tree could not be higher - at the end of the process, a set of rules, in natural language, relating the outcome to the explanatory variables, can be easily derived from the tree. Furthermore, a decision tree is the basis of a very powerful method that we will also use in this tutorial, called random forest.

How do we grow trees, then? In very simple terms, we start with a root node, which contains all the training data, and split it into two new nodes based on the most important variable (i.e., the variable that better separates the outcome into two groups). For each new node, a split is attempted (always based on the variable that better splits that particular node, which is not necessarily the same variable we started with), until any of the stop crtiteria is met (for instance, in the R implementation, one of the defaults is that the node must have at least 20 observations, otherwise a new split is not attempted). The disadvantage of a decision tree is that they tend to overfit; before a stop criterion is reached, the algorithm keeps splitting nodes over and over, learning all the details of the training data. Big trees are often built that perform very well in the training data, but can't generalise well and therefore do very poorly on the test set.

What to do, then? This is where the extremely important concept of cross-validation comes into play (just as understanding the difference between the training and testing data, it is also of paramount importance to understand what cross-validation is).  When we grow a tree in R (or any other software, for that matter), it internally performs a 10-fold cross-validation. What this means is that 10 cycles are run, and in each of them 90% of the training data is used for building the model, and the remaining 10%, even though they're part of the training data, are only used to predict and evaluate the results. In another words, these 10% are momentarily a test set. It should be obvious that, after the 10 cycles, 10 different chunks of data are used to test the model, which means every single observation is used, at some point, not only to train but also to test the model. Since we are testing at the same time we're growing a tree, we have a error measurement, that we use to find the optimal number of splits. We then prune the original tree, and keep only the optimal number of splits. In sum, when growing a tree, this is what we do:
  • Grow a full tree, usually with the default settings;
  • Examine the cross-validation error (x-error), and find the optimal number of splits. This corresponds, in R, to a value of cp (complexity parameter);
  • Prune the tree using the complexity parameter above.
What usually happens is that either the pruned tree performs better on the test set or it performs about the same as the full-grown tree. Even in the latter case, it is useful to prune the tree, because less splits means less decision rules and higher interpretability, for the same level of performance. Now, let's do all of this in practice.

> # Needed to grow a tree
> library(rpart)
> # To draw a pretty tree (fancyRpartPlot function)
> library(rattle)
> # rpart function applied to a numeric variable => regression tree
> rt <- rpart(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind +
              gust.wind + dir.wind + dir.wind.8 + as.numeric(h.temp.hour)+
              as.numeric(l.temp.hour)+ as.numeric(gust.wind.hour), data=train)
> # Full-grown tree with 8 splits using 6 different variables 
> # (Not running the line below - do it to see the tree)
> # fancyRpartPlot(rt)
> # As always, predict and evaluate on the test set
> test.pred.rtree <- predict(rt,test) 
> RMSE.rtree <- sqrt(mean((test.pred.rtree-test$rain)^2))
> RMSE.rtree
[1] 11.37786
> MAE.rtree <- mean(abs(test.pred.rtree-test$rain))
> MAE.rtree
[1] 6.140937

Now that we have a full-grown tree, let's see if it's possible to prune it...
> # Check cross-validation results (xerror column)
> # It corresponds to 2 splits and cp = 0.088147
> printcp(rt)

Regression tree:
rpart(formula = rain ~ month + season + l.temp + h.temp + ave.temp + 
      ave.wind + gust.wind + dir.wind + dir.wind.8 + as.numeric(h.temp.hour) + 
      as.numeric(l.temp.hour) + as.numeric(gust.wind.hour), data = train)

Variables actually used in tree construction:
[1] as.numeric(h.temp.hour) ave.temp                dir.wind               
[4] gust.wind               h.temp                  month                  

Root node error: 32619/255 = 127.92

n= 255 

        CP nsplit rel error  xerror    xstd
1 0.358496      0   1.00000 1.00815 0.22993
2 0.102497      1   0.64150 0.72131 0.19653
3 0.088147      2   0.53901 0.71921 0.19592
4 0.045689      3   0.45086 0.75017 0.19972
5 0.028308      4   0.40517 0.79740 0.20348
6 0.024935      5   0.37686 0.80947 0.19980
7 0.011454      6   0.35193 0.79036 0.19880
8 0.010600      7   0.34047 0.75514 0.17788
9 0.010000      8   0.32987 0.75514 0.17788 
> # Get the optimal CP programmatically...
> min.xerror <- rt$cptable[which.min(rt$cptable[,"xerror"]),"CP"]
> min.xerror
[1] 0.08814737 
> # ...and use it to prune the tree
> rt.pruned <- prune(rt,cp = min.xerror) 
> # Plot the pruned tree
> fancyRpartPlot(rt.pruned)

> # Evaluate the new pruned tree on the test set
> test.pred.rtree.p <- predict(rt.pruned,test)
> RMSE.rtree.pruned <- sqrt(mean((test.pred.rtree.p-test$rain)^2))
> RMSE.rtree.pruned
[1] 11.51547
> MAE.rtree.pruned <- mean(abs(test.pred.rtree.p-test$rain))
> MAE.rtree.pruned
[1] 6.138062
As you can see, we were able to prune our tree, from the initial 8 splits on six variables, to only 2 splits on one variable (the maximum wind speed), gaining simplicity without losing performance (RMSE and MAE are about equivalent in both cases). In the final tree, only the wind gust speed is considered relevant to predict the amount of rain on a given day, and the generated rules are as follows (using natural language):
  • If the daily maximum wind speed exceeds 52 km/h (4% of the days), predict a very wet day (37 mm);
  • If the daily maximum wind is between 36 and 52 km/h (23% of the days), predict a wet day (10mm);
  • If the daily maximum wind stays below 36 km/h (73% of the days), predict a dry day (1.8 mm); 
The accuracy of this extremely simple model is only a bit worse than the much more complicated linear regression. Next, instead of growing only one tree, we will grow the whole forest, a method that is very powerful and, more often than not, yields in very good results.


Random forests


What if, instead of growing a single tree, we grow many (ranging from a few hundred to a few thousand), and introduce some sources of randomness, so that each tree is most likely different from the others? What we get is a random forest (Leo Breiman, 2001), a popular algorithm that pretty much every data scientist in the world knows. In fact, when it comes to problems where the focus is not so much in understanding the data, but in make predictions, random forests are often used immediately after preparing the data set, skipping entirely the EDA stage. This is basically how the algorithm works:
  • To grow each tree, a random sample of the training data, with the size of the training data itself, is taken, with replacement. This means that some observations might appear several times in the sample, and others are left out (the proportion of discarded observations converges to 36.8%, for very large sample sizes). In another words, each tree will be grown with its own version of the training data;
  • The variables that are used to grow each tree are also randomly sampled; the sample size is 1/3 and the square root of the number of independent variables, for regression and classification problems, respectively;
  • Each tree is then fully grown, without any pruning, using its own training set and variable set;
  • In classification problems, each tree casts a vote for the class it predicts and the final prediction is the class that has the majority of votes. In regression problems, the predicted value is a weighted average of the value predicted by each tree. 
Some of most interesting characteristics of random forests are:
  • They do not overfit. Even though each component of the forest (i.e. each tree) is probably overfitting a lot, since they are all different, the mistakes they make are in all directions; when the errors are averaged, they kind of cancel each other. Although each classifier is weak (recall the training set and variables are randomly sampled), when put together they become a strong classifier (this is the concept of ensemble learning);
  • There is no need for cross-validation. Remember those 36 to 37% of observations that are left out when sampling from the training set? Well, these data (called out-of-bag) are used not only to estimate the error, but also to measure the importance of each variable. All of this is happening at the same time the model is being built;
  • We can grow as many tree as we want (the limit is the computational power). What usually happens, however, is that the estimated error can no longer be improved after a certain number of trees is grown. Typical number for error convergence is between 100 and 2000 trees, depending on the complexity of the problem.
  • Although we usually improve accuracy, it comes at a cost: interpretability. A random forest is akin to a black box model, where it is hard to understand what's going on inside; anyway, we still have an estimate for variable importance, which is more than some other models can offer.
 Now, let's see the random forest working in practice. 

> # Needed to run the algorithm
> library(randomForest)
> # Convert some factor variables to numeric (train and test sets)
> train$h.temp.hour <- as.numeric(train$h.temp.hour)
> train$l.temp.hour <- as.numeric(train$l.temp.hour)
> train$gust.wind.hour <- as.numeric(train$gust.wind.hour)
> test$h.temp.hour <- as.numeric(test$h.temp.hour)
> test$l.temp.hour <- as.numeric(test$l.temp.hour)
> test$gust.wind.hour <- as.numeric(test$gust.wind.hour)
> # For reproducibility; 123 has no particular meaning
> # Run this immediately before creating the random forest
> set.seed(123)
> # Create a random forest with 1000 trees
> rf <- randomForest(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind +
              gust.wind + dir.wind + dir.wind.8 + h.temp.hour + l.temp.hour +
              gust.wind.hour, data = train, importance = TRUE, ntree=1000)
> # How many trees are needed to reach the minimum error estimate? 
> # This is a simple problem; it appears that about 100 trees would be enough. 
> which.min(rf$mse)
[1] 93
> # Plot rf to see the estimated error as a function of the number of trees
> # (not running it here)
> # plot(rf) 
> # Using the importance()  function to calculate the importance of each variable
> imp <-[,1],decreasing = TRUE),optional = T)
> names(imp) <- "% Inc MSE"
> imp
               % Inc MSE
gust.wind      21.241930
dir.wind       12.185181
ave.wind       11.864757
h.temp          9.499869
month           9.039023
ave.temp        8.420760
dir.wind.8      7.635521
l.temp          5.789758
season          5.262452
gust.wind.hour  4.725368
h.temp.hour     3.961188
l.temp.hour     1.979231

> # As usual, predict and evaluate on the test set
> test.pred.forest <- predict(rf,test)
> RMSE.forest <- sqrt(mean((test.pred.forest-test$rain)^2))
> RMSE.forest
[1] 10.20872 
> MAE.forest <- mean(abs(test.pred.forest-test$rain))
> MAE.forest
[1] 5.065036

We can see the accuracy improved when compared to the decision tree model, and is just about equal to the performance of the linear regression model. The gust wind speed was, once again, considered the most important predictor; it is estimated that, in the absence of that variable, the error would increase by 21.2%. 


Putting it all together

We have just built and evaluated the accuracy of five different models: baseline, linear regression, fully-grown decision tree, pruned decision tree, and random forest. Let's create a data frame with the RMSE and MAE for each of these methods.

> # Create a data frame with the error metrics for each method
> accuracy <- data.frame(Method = c("Baseline","Linear Regression","Full tree","Pruned tree","Random forest"),
                         RMSE   = c(RMSE.baseline,RMSE.lin.reg,RMSE.rtree,RMSE.rtree.pruned,RMSE.forest),
                         MAE    = c(MAE.baseline,MAE.lin.reg,MAE.rtree,MAE.rtree.pruned,MAE.forest)) 
> # Round the values and print the table
> accuracy$RMSE <- round(accuracy$RMSE,2)
> accuracy$MAE <- round(accuracy$MAE,2
> accuracy
             Method  RMSE  MAE
1          Baseline 13.39 8.22
2 Linear Regression 10.26 4.71
3         Full tree 11.38 6.14
4       Pruned tree 11.52 6.14
5     Random forest 10.21 5.07
All methods beat the baseline, regardless of the error metric, with the random forest and linear regression offering the best performance. It would be interesting, still, to compare the fitted vs. actual values for each model.

# Create a data frame with the predictions for each method
all.predictions <- data.frame(actual = test$rain,
                              baseline = best.guess,
                              linear.regression = test.pred.lin,
                              full.tree = test.pred.rtree,
                              pruned.tree = test.pred.rtree.p,
                              random.forest = test.pred.forest)
> # First six observations 
> head(all.predictions)
   actual baseline linear.regression  full.tree pruned.tree random.forest
2    64.8 5.315294         7.2638856 13.8058824  13.8058824     11.424860
4    20.1 5.315294        20.3352957 37.2090909  37.2090909     24.821385
5     9.4 5.315294        13.4020265 13.8058824  13.8058824     21.777313
6    38.9 5.315294        30.8604212 37.2090909  37.2090909     28.978410
7     2.0 5.315294         8.0923795 13.8058824  13.8058824     11.085088
10    0.0 5.315294         0.7544513  0.8281046   0.8281046      1.999272
> # Needed to melt the columns with the gather() function 
> # tidyr is an alternative to the reshape2 package (see the end of Part3a) 
> library(tidyr)
> # Gather the prediction variables (columns) into a single row (i.e., wide to long)
> # Recall the ggplot2 prefers the long data format
> all.predictions <- gather(all.predictions,key = model,value = predictions,2:6)
> head(all.predictions)
  actual    model predictions
1   64.8 baseline    5.315294
2   20.1 baseline    5.315294
3    9.4 baseline    5.315294
4   38.9 baseline    5.315294
5    2.0 baseline    5.315294
6    0.0 baseline    5.315294 
> tail (all.predictions)
    actual         model predictions
545    2.3 random.forest   12.788688
546    0.0 random.forest    0.370295
547    4.3 random.forest    2.159183
548    0.3 random.forest    6.752708
549    0.0 random.forest    3.310518
550    0.0 random.forest    2.126990 
# Predicted vs. actual for each model
ggplot(data = all.predictions,aes(x = actual, y = predictions)) + 
  geom_point(colour = "blue") + 
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  geom_vline(xintercept = 23, colour = "green", linetype = "dashed") +
  facet_wrap(~ model,ncol = 2) + 
  coord_cartesian(xlim = c(0,70),ylim = c(0,70)) +
  ggtitle("Predicted vs. Actual, by model")

The graph shows that none of the models can predict accurately values over 25 mm of daily rain. In the range from 0 to approximately 22 mm (vertical dashed line for reference), the random forest seems to be the method that better approximates the diagonal line (i.e., smaller prediction error), followed closely by the linear regression. Even though these models can not be considered more than fair, they still do a much better job when compared with the baseline prediction.

In many cases, we build models not only to predict, but also to gain insight from the data. When, in Part 3b, we performed EDA, we saw that the maximum wind speed seemed to be an important predictor. Well, the models confirmed it three times: this variable was the most statistically significant in the linear model, was the only one used to predict in the regression tree, and the one that reduced the estimated error the most in the random forest.

In Part 4b, we will continue building models, this time considering the rain as a binary outcome. This means all models will assign probabilities to the occurrence of rain, for each day in the test set.


  1. Thanks for a beautiful write up on using R for climate analysis using R, please add a few more pages on the topic, thanks Dr.Samuel, Bangalore India

  2. Hi Pedro, Congrats for amazing and very instructive post. Can you upload a pdf copy of this post and related ones, i.e. from Part 1 to Part 4, if you don't mind?
    Looking forward to seeing the pdf uploaded
    Ayubo, Mozambique

    1. Thank you! I will consider the pdf once the tutorial is complete.

  3. Thank you for creating this tutorial on ggplot, very well done. I especially like that you are using weather data for the analysis.

  4. Thanks Pedro. Very insightful and easy to follow.

  5. Thanks Pedro for this nice tutorial. Could you include a gam and glm to your rain data?

  6. Thank you for the detailed tutorial. Your articles helped me to reinforce what I am learning in the Machine Learning course offered by Coursera.

    1. Yes, there are some great machine learn courses online. Other than Coursera's, I also liked Stanford's Statistical Learning and MIT's Analytics Edge.

  7. Just followed all the tutorials—great job. It really has some great explanations of different issues that often is skipped in other tutorials. Thanks. When can I expect to see the last post from you?

  8. This is excellent - very clear and comprehensive! Can i ask when 4b will be released?
    Thanks !!

  9. Pedro, part 4b please!!!! This series is way too good not to be finished with some classification models. Muito obrigado!

    1. Thank you for your comment. I've been working on the last part for the past couple of weeks and I hope to be able to publish it in March.

  10. This comment has been removed by the author.

  11. Thank You Pedro for this great series.
    Hope you will do more analysis and share with us

  12. Thank you very much for your time and effort by giving these examples so clearly. Pedro, part 4b please... Muito obrigado!

  13. Pretty good tutorial, thank you for your time and effort! :)
    It's a shame that rest of the tutorial is missing, but even this is cool. :)
    Thank you!

  14. Thanks for this tutorial - I really learned a lot!! Looking forward to seeing Part 4b

  15. Dear Pedro,

    That is quite informative. I benefited many more from this interesting lesson. Please keep promising! How we can manage for data with morethan a year? and pleased to see your Part 4b.

    Thank You,
    Mamo, Ethiopia

  16. Great tutorials Pedro. You're able to communicate your thought process succinctly and makes for awesome posts and minimal code.


  17. Great work. It is very helpfull, thank you!

  18. Fantastic post.

    Really enjoyed reading it and it held my attention all the way through! Keep it up.

    Read my Latest Post

  19. Wonderful articles, I was struggle to find a good example for my class that have to use rainfall data and came across your article. Very well written and I will cite your work when I use some parts of it in my lecture.

    btw, I havent seen much of your post since this one. I hope you are doing all good till now.