Analytical Minds: March 2015

Tuesday, 17 March 2015

Part 3b: EDA with ggplot2

In Part 3a I have introduced the plotting system ggplot2. I talked about its concept and syntax with some detail, and then created a few general plots, using the weather data set we've been working with in this series of tutorials. My goal was to show that, in ggplot2, a nice looking and interesting graph can usually be created with just a few lines of code. Given the positive feedback that post has received, I believe I had some success with what I had in mind. Who knows if some of the readers will decide to give ggplot2 a try using their own data? It is often said that a picture is worth a thousand words, but I often see some highly experienced R users, after doing really complex data analyses, plotting the results in a way that falls short of the expectations, given the quality of their analytical work.

In Part 3b, we will continue to rely on visualisations to explore the data, but now with the goal of tackling the following question: Are there any good predictors, in our data, for the occurrence of rain on a given day? The numeric variable representing the amount of rain will be our dependent variable, also known as response or outcome variable, and all the remaining ones will be potential independent variables, also known as predictors or explanatory variables.

After framing the question, and before fitting any model, EDA techniques (such as visualisation) are used to gain insight from the data. Here are the steps I usually take at this stage:

Analyse the (continuous) dependent variable - at least, make an histogram of the distribution; when applicable, plot the variable against time to see the trend. Can the continuous variable be transformed to a binary one (i.e., did it rain or not on a particular day?), or to a multicategorical one (i.e., was the rain none, light, moderate, or heavy on a particular day?).
Search for correlations between the dependent variable and the continuous independent variables - are there any strong correlations? Are they linear or non-linear?
Do the same as above but try to control for other variables (faceting in ggplot2 is very useful to do this), in order to assess for confounding and effect modification. Does the association between two continuous variables hold for different levels of a third variable, or is modified by them? (e.g., if there is a strong positive correlation between the rain amount and the wind gust maximum speed, does that hold regardless of the season of the year, or does it happen only in the winter?)
Search for associations between the dependent variable and the categorical independent variables (factors) - does the mean or median of the dependent variable change depending on the level of the factor? What about the outliers, are they evenly distributed across all levels, or seem to be present in only a few of them?

So, let's now do some analyses, having the framework described above in mind.

Exploring the dependent variable - daily rain amount

# Time series of the daily rain amount, with smoother curve

ggplot(weather, aes(date,rain)) +
  geom_point(aes(colour = rain)) +
  geom_smooth(colour = "blue", size = 1) +
  scale_colour_gradient2(low = "green", mid = "orange",high = "red", midpoint = 20) +
  scale_y_continuous(breaks = seq(0,80,20)) +
  xlab("Date") +
  ylab("Rain (mm)") +
  ggtitle("Daily rain amount")

# Histogram of the daily rain amount

ggplot(weather,aes(rain)) + 
  geom_histogram(binwidth = 1,colour = "blue", fill = "darkgrey") +
  scale_x_continuous(breaks = seq(0,80,5)) +
  scale_y_continuous(breaks = seq(0,225,25)) +
  xlab("Rain (mm)") +
  ylab ("Frequency (days)") +
  ggtitle("Daily rain amount distribution")

The time series plot shows that the daily rain amount varies wildly throughout the year. There are many dry days interspersed with stretches of consecutive wet ones, often severe, especially in the autumn and winter seasons. The histogram not only confirms what was said above, but also shows that the distribution is extremely right-skewed. As shown below, both informally (comparing the mean to the median), and formally (calculating the actual value), the positive skewness remains even after removing all the days where it did not rain.

> # Heavily right-skewed distribution
> summary(weather$rain)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.300   5.843   5.300  74.900

> # Right-skewness is still there after removing all the dry days
> summary(subset(weather, rain > 0)$rain)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.30    1.15    5.10   11.17   16.10   74.90

> # Formal calculation of skewness (e1071 package)
> library(e1071)

> skewness(weather$rain)
[1] 2.99
> skewness(subset(weather, rain >0)$rain)
[1] 2.04

It should be clear at this point that one possible approach would be to dichotomise the dependent variable (rain vs. no rain). Note that it is common to consider days with rain only those where the total amount was at least 1mm (to allow for measurement errors), and that's the criterion we will adopt here. Here is the code to do it and a few interesting summaries.

> # Create binary outcome (rained = {Yes, No})
 
> # Number of dry days
> nrow(subset(weather, rain == 0))
[1] 174

> # Number of days that will also be considered dry days
> nrow(subset(weather, rain <1 & rain >0))
[1] 45
 
> # The new binary variable is called "rained"
> weather$rained <- ifelse(weather$rain >= 1, "Yes", "No")
 
> # Dry and wet days (absolute)
> table(rained = weather$rained)

rained
 No Yes 
219 146

> # Dry and wet days (relative)
> prop.table(table(rained = weather$rained))

rained
 No Yes 
0.6 0.4

Porto is one of the wettest cities in Europe for a reason. There was rain in exactly 40% of the days of the year, and this is considering those days with rain < 1mm as dry. Should we set the cutoff at 0 mm, and more than 50% of the days in 2014 would have been considered wet.

Looking at the association between rain and season of the year

The time series plot seems to indicate that season of the year might be a good predictor for the occurrence of rain. Let's start by investigating this relation, with both the continuous rain variable and the binary one.

Rain amount (continuous) by season

# Jitter plot - Rain amount by season

 ggplot(weather, aes(season,rain)) +
  geom_jitter(aes(colour=rain), position = position_jitter(width = 0.2)) +
  scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = 30) +
  scale_y_continuous(breaks = seq(0,80,20)) +
  xlab("Season") +
  ylab ("Rain (mm)") +
  ggtitle("Daily rain amount by season")

> # Rain amount by season
> # {tapply(x,y,z) applies function z to x, for each level of y}

> tapply(weather$rain,weather$season,summary)

$Spring
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   2.934   2.550  28.200 

$Summer
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   2.936   1.350  68.300 

$Autumn
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.300   7.164   6.450  74.900 

$Winter
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    3.95   10.45   14.65   64.80

We can see that most of the extreme values (outliers) are in the winter and autumn. There are still, however, many dry days in both these seasons, and hence the means are too close from each other. Fitting a model to predict the actual amount of rain (not the same as probability of occurrence) based on the season alone might not give the greatest results.

Rain occurrence (binary) by season

# Bar plot - dry and wet days by season (relative)
ggplot(weather,aes(season)) +
  geom_bar(aes(fill = rained), position = "fill") +
  geom_hline(aes(yintercept = prop.table(table(weather$rained))["No"]),
             colour = "blue",linetype = "dashed", size = 1) +
  annotate("text", x = 1, y = 0.65, label = "yr. w/o = 0.60", colour = "blue") +
  xlab("Season") +
  ylab ("Proportion") +
  ggtitle("Proportion of days without and with rain, by season")

> round(prop.table(table(season = weather$season, rained= weather$rained),1),2)

         rained
season     No  Yes
  Spring 0.74 0.26
  Summer 0.72 0.28
  Autumn 0.57 0.43
  Winter 0.37 0.63

It appears that, when it comes to calculate the likelihood of raining on a particular day, the season of the year may have some predictive power. This is especially true for the winter season, where the rainy days (63%) are well above the yearly average (40%).

Looking at the correlations between rain and all numeric variables

We are now going to calculate the linear (Pearson) correlations between the continuous outcome variable (daily rain amount) and all the numeric variables - both the actual numeric ones (such as the temperature and wind speed), and those that are actually factor variables, but still make sense if we want to see them as numeric (for instance, the hour of the day in which some event occurred).

Note that we are not modelling at this stage, just trying to gain some insight from the data, and therefore we are neither concerned about whether the correlations are significant (p-values and/or confidence intervals) nor if the the relation between any two variables is in fact linear (we will be able to check this later, anyway, when we create the scatter plots with the lowess curves on top).

Here is the code (the comments should make it easier to understand what we are trying to do).

> # Create a new data frame with only the variables than can be numeric
> weather.num <- weather[c("rain","l.temp","h.temp","ave.temp","ave.wind","gust.wind",
+ "l.temp.hour","h.temp.hour","gust.wind.hour")]

> # Convert the following factor variables to numeric
> weather.num$l.temp.hour <- as.numeric(weather.num$l.temp.hour)
> weather.num$h.temp.hour <- as.numeric(weather.num$h.temp.hour)
> weather.num$gust.wind.hour <- as.numeric(weather.num$gust.wind.hour)

> # Pass the entire data frame to the cor() function to create a big correlation matrix
> # Pick only the first row of the matrix [1,] ==> correlation between rain and all the other variables
> round(cor(weather.num),2)[1,]
          rain         l.temp         h.temp       ave.temp       ave.wind 
          1.00          -0.14          -0.29          -0.21           0.48 
     gust.wind    l.temp.hour    h.temp.hour gust.wind.hour 
          0.61           0.19          -0.25          -0.16

There seems to be a promising positive correlation between the rain amount and the wind variables, especially the wind gust maximum speed (moderate correlation value of 0.61) , i.e., higher wind speeds tend to be associated with higher amounts of precipitation. Always bear in mind that correlation does not imply causation, therefore while it is true that the wind correlates with rain, this does not necessarily mean that the wind itself is causing the rain. It could actually be the other way around or, what is very common, both of the variables are caused by a third one that we are not even considering.

There are also some negative correlations with the temperatures (especially the daily high) that, even though not as strong as the wind ones, are still worth looking at. It appears higher amounts of rain are correlated with lower high temperatures. But let's think about it for a minute: we saw that it is in the winter when it rains the most, and in Part 3a we also saw that the temperatures were lower in the winter. Here is an example of a potential (yet to be confirmed) interaction with a third variable: it may not be the lower temperature by itself that causes more rain, but the fact the both the precipitation and lower temperatures tend to occur during a specific period of the year.

Since the season seems to have an impact on these variables, I would like to explore it a bit further, calculating all these correlations by season and check whether the values hold. If the correlation between rain and some other variable is very dissimilar across all seasons, then there is the proof for an interaction.

> # Let's split the data frame in four parts, one for each season
> weather.num.season <- split(weather.num,weather$season)

> # The result is a list...
> class(weather.num.season)
[1] "list"

> # ...with 4 elements, where...
> length(weather.num.season)
[1] 4

> # ...each element of the list is a data frame (the seasons), with nine variables
> summary(weather.num.season)
       Length Class      Mode
Spring 9      data.frame list
Summer 9      data.frame list
Autumn 9      data.frame list
Winter 9      data.frame list

> # Here are the names of each of the data frames of the list
> attributes(weather.num.season)
$names
[1] "Spring" "Summer" "Autumn" "Winter"

> # *apply family of functions are arguably the most powerful in base R, but also the most difficult to master
> # {sapply(x,z) applies function z to each element of x}
> # First go over the elements of the list and calculate the correlation matrix (all against all)
> # For each season, return only the correlation between "rain" and everything else
> sapply(weather.num.season, function (x) round(cor(x)["rain",],2))
               Spring Summer Autumn Winter
rain             1.00   1.00   1.00   1.00
l.temp          -0.33   0.06   0.13   0.05
h.temp          -0.45  -0.26  -0.07  -0.26
ave.temp        -0.43  -0.14   0.06  -0.09
ave.wind         0.34   0.50   0.38   0.66
gust.wind        0.37   0.45   0.66   0.71
l.temp.hour      0.24   0.16   0.01   0.33
h.temp.hour     -0.13  -0.22  -0.16  -0.30
gust.wind.hour  -0.26  -0.34  -0.18   0.06

> # Not going to run it, but here is an alternative that might be easier (or not) to understand
> # It actually shows the correlations for each element of the list
> # lapply(weather.num.season, function (x) round(cor(x)["rain",],2))

What do you conclude from the table above?

The correlation between the rain and wind varies, but keeps moderately strong, regardless of the season of the year, making this the most promising variable in out data set; the correlation between the rain and daily high temperature does not confirm what I had hypothesised above. In fact, the correlation is even stronger in the spring than in the winter, and we would have to go even deeper if we really needed to understand what is going on (keep in mind we are just analysing one of the possible interactions - the season - but in practice there can be multiple ones). For the purpose of what we are doing now, it is enough to be aware that this correlation is not stable throughout the year, and actually goes to zero during the autumn. Lastly, the correlations between rain and the hour of the events (low and high temperatures, and wind gust) are rather weak but show some stability (see the l.temp.hour and h.temp.hour). They might have some predictive power, at least in a linear model.

Now that we know that the wind has the strongest correlation with the rain, and that it holds true across all seasons, it's time to plot these variables, because we want to learn something we still don't know: what is the shape of this relation? Linear, piecewise linear, curvilinear? Let's find out using ggplot2 and the concept of faceting.

Looking deeper at the wind and rain correlation - faceting technique

The idea of faceting (also called Trellis plots) is a simple but important one. A graphic, of any type and mapping any number of variables, can be divided into several panels, depending on the value of a conditioning variable. This value is either each of the levels of a categorical variable, or a number of ranges of a numeric variable. This helps us check whether there are consistent patterns across all panels. In ggplot2 there are two functions to create facets - facet_wrap() and facet_grid() - that are used when we have one or two conditioning variables, respectively. As everything in ggplot2, this is just another layer in the plot, which means all geometries, mappings, and settings will be replicated across all panels.

Let's then assess the linearity of the correlation of the amount of rain and the maximum wind gust speed, conditioning on the season of the year.

# Amount of rain vs. wind, by season

ggplot(weather,aes(gust.wind,rain)) +
  geom_point(colour = "firebrick") +
  geom_smooth(size = 0.75, se = F) +
  facet_wrap(~season) +
  xlab("Maximum wind speed (km/h)") +
  ylab ("Rain (mm)") +
  ggtitle("Amount of rain vs. maximum wind speed, by season")

This plot confirms what we had already discovered: there is a positive correlation between rain and wind, and the association holds regardless of the season. But now we know more: this correlation is non-linear. In fact, if we were to generalise, we could say there is no correlation at all when the maximum wind speed is below 25 km/h. For values higher than that, there seems to be a linear association in the autumn and winter, not so linear in the spring, and definitely non-linear during the summer. If we wanted to model this relation, we would either fit a non-linear model (such as a regression tree) or we could try to force a piecewise linear model (linear spline), where the equation relating the outcome to the predictors would, itself, be different depending on the value of the wind.

Occurrence of rain - more complex faceting to visualise variable association

To finish this part of the series, let's check whether the variable that seemed to have some predictive power for the amount of rain (maximum wind speed), is also good in the case of a binary outcome (occurrence of rain), but now we will not only control for the season, but also for the daily high temperature (because, as we have seen before, this variable was interacting with both the rain and the season). We will do this simultaneously, using the faceting technique on two variables. But, since the daily high temperature variable is continuous, we need to transform it to categorical first. A common strategy is to split the continuous variables in four groups of (roughly) the same size, i.e., the quartiles. This is very simple to do in R, combining the cut() and quantile() functions.

> # Using the defaults of the quantiles function returns 4 intervals (quartiles)
> quantile(weather$h.temp)
  0%  25%  50%  75% 100% 
 9.8 14.4 19.1 23.3 31.5

> # All we need to do is to define the quartiles as the breaks of the cut function

> # and label the intervals accordingly
> weather$h.temp.quant <- cut(weather$h.temp, breaks = quantile(weather$h.temp),
                            labels = c("Cool","Mild","Warm","Hot"),include.lowest = T)

> # The result

> table(weather$h.temp.quant)

Cool Mild Warm  Hot 
  92   91   94   88

# Occurrence of rain, by season and daily high temperature 

ggplot(weather,aes(rained,gust.wind)) +
  geom_boxplot(aes(colour=rained)) +
  facet_grid(h.temp.quant~season) +
  xlab("Occurrence of rain") +
  ylab ("Maximum wind speed (km/h)") +
  ggtitle("Occurrence of rain, by season and daily high temperature")

The graph reveals a clear pattern: the median of the maximum wind speed is always higher when it rains, and this is not affected by the range the of daily high temperature, even after controlling for the temperature variation within each season of the year.

I think we now have a much better understanding of the data. We know which variables matter the most, and which ones seem to be useless, when it comes to predict the rain, either the actual amount or the probability of its occurrence. Please note that it would be impossible to write about the analysis of every single variable and show every plot; behind the scenes, much more work than what I've shown here has been done.

In Part 4, the last of this series, we will be using a few machine learning algorithms to find out how well the rain can be predicted, and the contribution of each variable to the accuracy and precision of each model.

In the meanwhile, if you have done any kind of analysis with this data set, feel free to share your own findings, either by commenting on the post or by sending me a private message.

Thursday, 5 March 2015

Part 3a: Plotting with ggplot2

We will start off this first section of Part 3 with a brief introduction of the plotting system ggplot2. Then, with the attention focused mainly on the syntax, we will create a few graphs, based on the weather data we have prepared previously.

Next, in Part 3b, where we will be doing actual EDA, specific visualisations using ggplot2 will be developed, in order to address the following question: Are there any good predictors, in our data, for the occurrence of rain on a given day?

Lastly, in Part 4, we will use Machine Learning algorithms to check to which extent rain can be predicted based on the variables available to us.

But for now, let's see what this ggplot2 system is all about and create a few nice looking figures.

The concept of ggplot2

The R package ggplot2, created by Hadley Wickham, is an implementation of Leland Wilkinson's Grammar of Graphics, which is a systematic approach to describe the components of a graphic. In maintenance mode (i.e., no active development) since February 2014, ggplot2 it is the most downloaded R package of all time.

In ggplot2, the graphics are constructed out of layers. The data being plotted, coordinate system, scales, facets, labels and annotations, are all examples of layers. This means that any graphic, regardless of how complex it might be, can be built from scratch by adding simple layers, one after another, until one is satisfied with the result. Each layer of the plot may have several different components, such as: data, aesthetics (mapping), statistics (stats), etc. Besides the data (i.e., the R data frame from where we want to plot something), aesthetics and geometries are particularly important, and we should make a clear distinction between them:

Aesthetics are visual elements mapped to data; the most common aesthetic attributes are the x and y values, colour, size, and shape;
Geometries are the actual elements used to plot the data, for example: point, line, bar, map, etc.

So, what does this mean in practice? Suppose we want to draw a scatter plot to show the relationship between the minimum and maximum temperatures on each day, controlling for the season of the year using colour (good for categorical variables), and the rain using size (good for continuous variables). We would map the aesthetics x, y, colour and size to the respective data variables, and then draw the actual graphic using a geometry - in this case, geom_point().

Here are some important aspects to have in mind:

Not all aesthetics of a certain geometry have to be mapped by the user, only the required ones. For example, in the case above we didn't map the shape of the point, and therefore the system would use the default value (circle); it should be obvious, however, that x and y have no defaults and therefore need to be either mapped or set;
Some aesthetics are specific of certain geometries - for instance, ymin and ymax are required aesthetics for geom_errorbar(), but don't make sense in the context of geom_point();
We can map an aesthetic to data (for example, colour = season), but we can also set it to a constant value (for example, colour = "blue"). There are a few subtleties when writing the actual code to map or set an aesthetic, as we will see below.

The syntax of ggplot2

A basic graphic in ggplot2 consists of initializing an object with the function ggplot() and then add the geometry layer.

ggplot() - this function is always used first, and initializes a ggplot object. Here we should declare all the components that are intended to be common to all the subsequent layers. In general, these components are the data (the R data frame) and some of the aesthetics (mapping visual elements to data, via the aes() function). As an example:

ggplot(data = dataframe, aes(x = var1, y= var2, colour = var3, size = var4))

geom_xxx() - this layer is added after ggplot() to actually draw the graphic. If all the required aesthetics were already declared in ggplot(), it can be called without any arguments. If an aesthetic previously declared in ggplot() is also declared in the geom_xxx layer, the latter overwrites the former. This is also the place to map or set specific components specific to the layer. A few real examples from our weather data should make it clear:

# This draws a scatter plot where season controls the colour of the point

ggplot(data = weather, aes(x = l.temp, y= h.temp, colour = season) + geom_point()

# This draws a scatter plot but colour is now controlled by dir.wind (overwrites initial definition)

ggplot(data = weather, aes(x = l.temp, y= h.temp, colour = season) + geom_point(aes(colour=dir.wind))

# This sets the parameter colour to a constant instead of mapping it to data. This is outside aes()!

ggplot(data = weather, aes(x = h.temp, y= l.temp) + geom_point(colour = "blue")

# Mapping to data (aes) and setting to a constant inside the geom layer

ggplot(data = weather, aes(x = l.temp, y= h.temp) + geom_point(aes(size=rain), colour = "blue")

# This is a MISTAKE - mapping with aes() when intending to set an aesthetic. It will create a column in the data named "blue", which is not what we want.

ggplot(data = weather, aes(x = l.temp, y= h.temp) + geom_point(aes (colour = "blue"))

When learning the ggplot2 system, this is a probably a good way to make the process easier:

Decide what geom you need - go here, identify the geom best suited for your data, and then check which aesthetics it understands (both the required, that you will need to specify, and the optional ones, that more often than not are useful);
Initialize the plot with ggplot() and map the aesthetics that you intend to be common to all the subsequent layers (don't worry too much about this, you can always overwrite aesthetics if needed);
Make sure you understand if you want an aesthetic to be mapped to data or set to a constant. Only in the former the aes() function is used.

Although aesthetics and geometries are probably the most important part of ggplot2, there are more layers/components that we will be using often, such as scales (both colour scales and axes-controlling ones), statistics, and faceting (this one surely in Part 3b). Now, let's make some plots using different geometries!

Plotting a time series - point geom with smoother curve

# Time series of average daily temperature, with smoother curve

ggplot(weather,aes(x = date,y = ave.temp)) +
  geom_point(colour = "blue") +
  geom_smooth(colour = "red",size = 1) +
  scale_y_continuous(limits = c(5,30), breaks = seq(5,30,5)) +
  ggtitle ("Daily average temperature") +
  xlab("Date") +  ylab ("Average Temperature ( ºC )")

Instead of setting the colour to blue, we can map it to the value of the temperature itself, using the aes() function. Moreover, we can define the color gradient we want - in this case, I told that cold days should be blue and warm days red. I forced the gradient to go through green, and pure green should occur when the temperature is 16 ºC. There are many ways to work with colour, and it would be impossible for us to cover everything.

# Same but with colour varying

ggplot(weather,aes(x = date,y = ave.temp)) + 
  geom_point(aes(colour = ave.temp)) +
  scale_colour_gradient2(low = "blue", mid = "green" , high = "red", midpoint = 16) + 
  geom_smooth(color = "red",size = 1) +
  scale_y_continuous(limits = c(5,30), breaks = seq(5,30,5)) +
  ggtitle ("Daily average temperature") +
  xlab("Date") +  ylab ("Average Temperature ( ºC )")

The smoother curve (technically a loess) drawn on the graphs above shows the typical pattern for a city in the northern hemisphere, i.e., higher temperatures in July (summer) and lower in January (winter).

Analysing the temperature by season - density geom

# Distribution of the average temperature by season - density plot

ggplot(weather,aes(x = ave.temp, colour = season)) +
  geom_density() +
  scale_x_continuous(limits = c(5,30), breaks = seq(5,30,5)) +
  ggtitle ("Temperature distribution by season") +
  xlab("Average temperature ( ºC )") +  ylab ("Probability")

I find this graph interesting. Spring and autumn seasons are often seen as the transition from cold to warm and warm to cold days, respectively. The spread of their distributions reflect the high thermal amplitude of these seasons. On the other hand, winter and summer average temperatures are much more concentrated around a few values, and hence the peaks shown on the graph.

Analysing the temperature by month - violin geom with jittered points overlaid

# Label the months - Jan...Dec is better than 1...12

weather$month = factor(weather$month,
                       labels = c("Jan","Fev","Mar","Apr",
                                  "May","Jun","Jul","Aug","Sep",
                                  "Oct","Nov","Dec"))

# Distribution of the average temperature by month - violin plot,
# with a jittered point layer on top, and with size mapped to amount of rain

ggplot(weather,aes(x = month, y = ave.temp)) +
  geom_violin(fill = "orange") +
  geom_point(aes(size = rain), colour = "blue", position = "jitter") +
  ggtitle ("Temperature distribution by month") +
  xlab("Month") +  ylab ("Average temperature ( ºC )")

A violin plot is sort of a mixture of a box plot with a histogram (or even better, a rotated density plot). I also added a point layer on top (with jitter for better visualisation), in which the size of the circle is mapped to the continuous variable representing the amount of rain. As can be seen, there were several days in the autumn and winter of 2014 with over 60 mm of precipitation, which is more than enough to cause floods in some vulnerable zones. In subsequent parts, we will try to learn more about rain and its predictors, both with visualisation techniques and machine learning algorithms.

Analysing the correlation between low and high temperatures

# Scatter plot of low vs high daily temperatures, with a smoother curve for each season

ggplot(weather,aes(x = l.temp, y = h.temp)) +
  geom_point(colour = "firebrick", alpha = 0.3) + 
  geom_smooth(aes(colour = season),se= F, size = 1.1) +
  ggtitle ("Daily low and high temperatures") +
  xlab("Daily low temperature ( ºC )") +  ylab ("Daily high temperature ( ºC )")

The scatter plot shows a positive correlation between the low and high temperatures for a given day, and it holds true regardless of the season of the year. This is a good example of a graphic where we set an aesthetic for one geom (point colour) and mapped an aesthetic for another (smoother curve colour).

Distribution of low and high temperatures by the time of day

In this final example, we would like to plot the frequencies of two data series (i.e. two variables) on the same graph, namely, the lowest and highest daily temperatures against the hour when they occurred. The way these variables are represented in our data set is called wide data form, which is used, for instance, in MS Excel, but it is not the first choice in ggplot2, where the long data form is preferred.

Let's think a bit about this important, albeit theoretical, concept. When we plotted the violin graph before, did we have wide data (12 numerical variables, each one containing the temperatures for each month), or long data (1 numerical variable with all temperatures and 1 grouping variable with the month information)? Definitely the latter. This is exactly need what we need to do to our 2 temperature variables: create a grouping variable that, for each row, represents either a low temperature or a high temperature, and a numerical variable indicating the corresponding hour of the day. The example in the code below should help understanding this concept (for further information on this topic, please have a look at the tidy data article mentioned in Part 1 of this tutorial).

> # We will use the melt function from reshape2 to convert from wide to long form
> library(reshape2)

> # select only the variables that are needed (day and temperatures) and assign to a new data frame
> temperatures <- weather[c("day.count","h.temp.hour","l.temp.hour")]

> # Temperature variables in columns
> head(temperatures)
  day.count h.temp.hour l.temp.hour
1         1           0           1
2         2          11           8
3         3          14          21
4         4           2          11
5         5          13           2
6         6           0          20

> dim(temperatures)
[1] 365   3

> # The temperatures are molten into a single variable called l.h.temp

> temperatures <- melt(temperatures,id.vars = "day.count",
                       variable.name = "l.h.temp", value.name = "hour")

> # See the difference?
> head(temperatures)
  day.count    l.h.temp hour
1         1 h.temp.hour    0
2         2 h.temp.hour   11
3         3 h.temp.hour   14
4         4 h.temp.hour    2
5         5 h.temp.hour   13
6         6 h.temp.hour    0

> tail(temperatures)
    day.count    l.h.temp hour
725       360 l.temp.hour    7
726       361 l.temp.hour    4
727       362 l.temp.hour   23
728       363 l.temp.hour    8
729       364 l.temp.hour    5
730       365 l.temp.hour    8

> # We now have twice the rows. Each day has an observation for low and high temperatures
> dim(temperatures)
[1] 730   3

> # Needed to force the order of the factor's level
> temperatures$hour <- factor(temperatures$hour,levels=0:23)

# Now we can just fill the colour by the grouping variable to visualise the two distributions 
ggplot(temperatures) +
  geom_bar(aes(x = hour, fill = l.h.temp)) +
  scale_fill_discrete(name= "", labels = c("Daily high","Daily low")) +
  scale_y_continuous(limits = c(0,100)) +
  ggtitle ("Low and high temperatures - time of the day") +
  xlab("Hour") +  ylab ("Frequency")

The graph clearly shows with we know from experience: the daily lowest temperatures tend to occur early in the morning and the highest in the afternoon. But can you see those outliers in the distribution? There were a few days were the highest temperature was during the night, and a few others where the lowest was during the day. Can this somehow be correlated to rain? This is the kind of questions we will be exploring in Part 3b, where we will use EDA techniques (visualisations) to identify potential predictors for the occurrence of rain. Then, in Part 4, using data mining and machine learning algorithms, not only we will find the best predictors for rain (if any) and determine the accuracy of the models, but also the individual contribution of each variable in our data set.

More to come soon... stay tuned!