Saturday, 14 February 2015

Part 1: Introduction


The ultimate goal of every data scientist is to extract as much valuable information as possible from a given data set. We want to be able to predict the future based on the past, to uncover deep, hidden patterns in the data, and to expand the current base of knowledge in some specific domain. With this in mind, several machine learning algorithms, from Neural Networks to Support Vector Machines, from Naïve Bayes to Random Forests, have been developed over the years. In many situations, when correctly applied, these algorithms can extract greater insight from the data than any human could, however clever they might be and however sharp their analytical skills.
 
Daily experience shows, however, that the process of analysing data seldom goes beyond Exploratory Data Analysis (EDA). In fact, more often than not, the classical EDA approach of summarising and plotting is sufficient for the analyst to build a solid intuition of what the data is trying to convey, and sometimes to raise new questions that can be addressed afterwards, if a model is to be developed.

Unfortunately, in many cases we cannot perform EDA right away. Raw data is often messy, unstructured, badly coded, inconsistent, or just plain wrong. It is often said that the analyst spends 80% of the time preparing the data, and only 20% actually doing analysis and modelling. This initial data wrangling consists of handling missing values, removing duplicates, transforming variables, converting variable types, recoding values, detecting outliers to ascertain data integrity, and so on. Our goal is to have what is called tidy data by the end of the process.

Data analysis using R


In this series of intermediate-level tutorials, I will guide you through the process of analysing an actual data set. We will start by preparing the data and then use a few EDA techniques to get a grasp of what’s in it. To accomplish this, we will be using the lingua franca of data science, the R programming language. There are many advantages and disadvantages to using R, and I summarise below the ones I personally deem the most relevant.

Pros
  • There is a big community of R users – as of 2014, there were more than 5 000 user-contributed packages in the main repository (CRAN) alone, and around 150 000 R functions (software popularity). If you need a function for some specific purpose, it is highly likely that someone else has already written it. If you have any doubt about the R syntax, you will probably find the answer online easily (stackoverflow);
  • The superb graphics capabilities offered by the ggplot2 package – this plotting system implements the grammar of graphics, a new way of thinking about the visual representation of data. If you had to pick a single topic to learn in R, ggplot2 would arguably be the best option. It is a language in itself that will allow you to produce high-quality plots in a short amount of time (see the short sketch after this list);
  •  It’s free! (This is not unique to R, though).
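
To give a flavour of the grammar of graphics mentioned above, here is a minimal sketch using R's built-in mtcars data set (not our weather data); the variables and layers chosen are purely illustrative:

library(ggplot2)

# A plot is built from data, aesthetic mappings and layers (geoms):
# map engine displacement to x, fuel consumption to y, colour the
# points by number of cylinders, and add a linear trend per group
ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Displacement (cu. in.)", y = "Miles per gallon", colour = "Cylinders")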

Cons
  • R has a steep learning curve – the paradigm is different from that of mainstream languages and even from other statistical packages. It is highly interactive: to complete an analysis, you often call one function, take the result and use it to feed the next function, and so on, in a cycle that can become quite long;
  • Some of the syntax is far from intuitive and can even be cumbersome – for example, using the intuitive sort() function to sort a data frame yields unexpected results; you need to use the order() function instead, not directly, but in a rather convoluted way (see the sketch after this list). It must be said, however, that several packages have been developed to make the analysis much easier, namely plyr/dplyr to manipulate (filter, transform, summarise) data, lubridate and stringr to lessen the burden when dealing with dates and strings, respectively, and sqldf to run SQL statements directly over R data frames, among a few others;
  • R can be a bit slow, but that matters more when developing algorithms than when performing interactive EDA. In fact, the bottleneck of the process is usually the analyst: the time we spend thinking about what information to extract and how to visualise it is several orders of magnitude greater than the time R takes to actually draw the graphics.
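
As an illustration of the sort() versus order() point above, here is a minimal sketch on a toy data frame (the column names and values are made up for the example):

# sort() cannot reorder the rows of a data frame directly;
# order() returns the permutation of row indices that sorts a column,
# which we then use to index the rows
df <- data.frame(city = c("Porto", "Lisboa", "Braga"),
                 temp = c(14.2, 16.5, 12.8))

df[order(df$temp), ]                     # ascending by temperature
df[order(df$temp, decreasing = TRUE), ]  # descending by temperature

# With dplyr, the same operation reads much more naturally
library(dplyr)
arrange(df, temp)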

Without further ado, let’s have a look at the data set we will be using in this series of tutorials.

The weather data set


The data set consists of daily records of several meteorological parameters, measured in the city of Porto throughout 2014. We have, therefore, 365 observations for each of the following 14 variables:

day.count – number of days passed since the beginning of the year
day – day of the month
month – month of the year
season – season of the year
l.temp, h.temp, ave.temp – lowest, highest and average temperature for the day (in °C)
l.temp.time, h.temp.time – hour of the day when l.temp and h.temp occurred
rain – amount of precipitation (in mm)
ave.wind – average wind speed for the day (in km/h)
gust.wind – maximum wind speed for the day (in km/h)
gust.wind.time – hour of the day when gust.wind occurred
dir.wind – dominant wind direction for the day
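
Assuming the data is stored in a CSV file (the file name weather_2014.csv below is just a placeholder; loading and cleaning are covered in Part 2), a first look at its structure could go along these lines:

# Read the raw data and run some basic sanity checks
weather <- read.csv("weather_2014.csv", stringsAsFactors = FALSE)

dim(weather)      # expect 365 rows and 14 columns
str(weather)      # variable names and types
summary(weather)  # ranges, missing values, obvious oddities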

Now, let’s move on to Part 2 of this tutorial, where we will start by inspecting and preparing the data, so we can then proceed to perform EDA.
