The ultimate goal of every data scientist is to extract as much valuable
information as possible from a given data set. We want to be able to predict
the future based on the past, to discover very deep and hidden patterns in the
data, and to expand the current base of knowledge in some specific domain. With
this in mind, several machine learning algorithms, from Neural Networks to
Support Vector Machines, from Naïve Bayes to Random Forests, have been
developed over the years. In many situations, when correctly applied, these can
provide greater insight into the data than any human could, however clever they might be or however sharp their analytical skills.
Daily experience shows, however, that the process of analysing data
seldom goes past the Exploratory Data Analysis (EDA). In fact, more often than
not, the classical EDA approach of summarising and plotting is sufficient for
the analyst to build a solid intuition of what the data is trying to convey, and
sometimes to raise new questions that can be addressed later, if a model
is to be developed.
Unfortunately, in many cases we cannot perform EDA right away. Raw data is
often messy, unstructured, badly coded, inconsistent, or just plain wrong. It
is often said that the analyst spends 80% of the time preparing the data, and
only 20% actually doing analysis and modelling. This initial data wrangling
consists of handling missing values, removing duplicates, transforming variables,
formatting variable types, recoding values, detecting outliers to ascertain data
integrity, and so on. Our goal is to have what is called tidy data by the end
of the process.
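To make these steps concrete, here is a minimal sketch in base R, using a small hypothetical data frame (not the tutorial’s actual data):

# A hypothetical messy data frame with a missing value, a duplicate row,
# dates stored as text, and an implausible temperature reading
raw <- data.frame(
  temp = c(12.5, NA, 14.1, 14.1, 99.9),
  date = c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03", "2014-01-04"),
  stringsAsFactors = FALSE
)

clean <- raw[!duplicated(raw), ]       # remove duplicate rows
clean <- clean[!is.na(clean$temp), ]   # handle missing values (here, by dropping)
clean$date <- as.Date(clean$date)      # format variable types
clean$temp[clean$temp > 50] <- NA      # mark implausible outliers as missing
clean$temp.f <- clean$temp * 9/5 + 32  # transform variables (ºC to ºF)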
Data analysis using R
In this series of intermediate-level tutorials, I will guide you through
the process of analysing an actual data set. We will start by preparing the
data and then use a few EDA techniques to get a grasp of what’s in it. To
accomplish this, we will be using the lingua
franca of data science, the R programming language. There are many
advantages and disadvantages to using R, and I summarise below the ones I
personally deem the most relevant.
Pros
- There is a big community of R users – as of 2014, there were more than 5 000 user-contributed packages in the main repository (CRAN) alone, and around 150 000 R functions (software popularity). If you need a function for some specific purpose, it is highly likely that someone else has already created it. If you have any doubts about the R syntax, you’ll probably find the answer online easily (Stack Overflow);
- The superb graphics capabilities offered by the ggplot2 package – this plotting system implements the grammar of graphics, a new way of thinking about the visual representation of data. If you have to pick a single topic to learn in R, ggplot2 is arguably the best option. It is a language in itself that will allow you to produce high-quality plots in a short amount of time (see the short sketch after this list);
- It’s free! (This is not unique to R, though).
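As a taste of the grammar of graphics, here is a minimal sketch using the built-in mtcars data set (not the weather data we will analyse later); each layer is added to the plot with the + operator:

# Scatter plot of fuel consumption vs. weight, coloured by number of
# cylinders, with one linear trend line per group
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")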
Cons
- R has a steep learning curve – the paradigm is different from the mainstream languages and even from other statistical packages. It is highly interactive: to complete an analysis, you often call one function, take its result, and feed it to the next function, and so on, in a cycle that can be quite long;
- Some of the syntax is far from intuitive and even cumbersome – for example, using the intuitive sort() function to sort a data frame yields nasty results; you need to use the order() function instead, not directly, but in a rather convoluted way (see the sketch after this list). It must be said, however, that several packages have been developed to make the analysis much easier, namely plyr/dplyr to manipulate (filter, transform, summarise) data, lubridate and stringr to lessen the burden when dealing with dates and strings, respectively, and sqldf to run SQL statements directly over R data frames, among a few others;
- R can be a bit slow, but that matters more when developing algorithms than when performing interactive EDA. In fact, the bottleneck of the process is the analyst himself: the time we spend thinking about what information to extract and how we want to visualise it is several orders of magnitude greater than the time it takes R to actually draw the graphics.
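To illustrate the sorting quirk mentioned above, here is a minimal sketch, again using the built-in mtcars data set:

sort(mtcars$mpg)             # fine: sort() works on a single vector

# sort(mtcars)               # error: sort() does not work on data frames
mtcars[order(mtcars$mpg), ]  # the convoluted but correct base R way

# The dplyr alternative is far more readable (assuming dplyr is installed):
# library(dplyr)
# arrange(mtcars, mpg)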
Without further ado, let’s have a look at the data set we will be using in this series of tutorials.
The weather data set
The data set consists of daily records of several meteorological parameters,
measured in the city of Porto
over the year of 2014. We have, then, 365 observations for each of the
following 14 variables:
- day.count – number of days passed since the beginning of the year
- day – day of the month
- month – month of the year
- season – season of the year
- l.temp, h.temp, ave.temp – lowest, highest, and average temperature for the day (in ºC)
- l.temp.time, h.temp.time – hour of the day when l.temp and h.temp occurred
- rain – amount of precipitation (in mm)
- ave.wind – average wind speed for the day (in km/h)
- gust.wind – maximum wind speed for the day (in km/h)
- gust.wind.time – hour of the day when gust.wind occurred
- dir.wind – dominant wind direction for the day
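As a quick preview of the first step, here is a minimal sketch of how the data might be loaded and inspected, assuming it has been saved to a CSV file (the file name below is hypothetical):

# Load the data and take a first look at its structure
weather <- read.csv("weather_2014.csv", stringsAsFactors = FALSE)

str(weather)      # dimensions and variable types (365 obs. of 14 variables)
summary(weather)  # quick numerical summaries for each variable
head(weather)     # inspect the first few rows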
Now, let’s move on to Part 2 of this tutorial, where we will start by
inspecting and preparing the data, so we can then proceed to the EDA.