Friday 20 February 2015

Part 2: Data Preparation


In Part 1 I have introduced the weather data set we will be using in this series of tutorials. We are now going to have the data prepared for the subsequent EDA. We will recode and transform variables, change their types, and perform some basic data checks. Feel free to follow along with the analysis (click here to download the weather data), bearing in mind you can type ?function_name to get help about some specific R function, for instance, ?head

Importing the data


To start off, let's read in the data to an R data frame and run some basic commands:

# Make sure the file is in your current working directory
 
> weather <- read.csv("weather_2014.csv",sep=";",stringsAsFactors=FALSE)
 
> dim(weather)
[1] 365  14

> names(weather)
 [1] "day.count"      "day"            "month"          "season"        
 [5] "l.temp"         "h.temp"         "ave.temp"       "l.temp.time"   
 [9] "h.temp.time"    "rain"           "ave.wind"       "gust.wind"     
[13] "gust.wind.time" "dir.wind"
 
> head(weather)
  day.count day month season l.temp h.temp ave.temp l.temp.time h.temp.time rain
1         1   1     1 Winter   12.7   14.0     13.4       01:25       23:50 32.0
2         2   2     1 Winter   11.3   14.7     13.5       07:30       11:15 64.8
3         3   3     1 Winter   12.6   14.7     13.6       21:00       14:00 12.7
4         4   4     1 Winter    7.7   13.9     11.3       10:35       01:50 20.1
5         5   5     1 Winter    8.8   14.6     13.0       01:40       12:55  9.4
6         6   6     1 Winter   11.8   14.4     13.1       19:35       00:05 38.9
  ave.wind gust.wind gust.wind.time dir.wind
1     11.4      53.1          15:45        S
2      5.6      41.8          22:25        S
3      4.3      38.6          00:00      SSW
4     10.3      66.0          09:05       SW
5     11.6      51.5          13:50      SSE
6      9.9      57.9          08:10      SSE 
 
 
It seems we have correctly loaded the data into R. Notice that the variables in the data file are separated not by a comma, but by a semicolon, and hence the need to set the sep = ";" argument. We also told R not to import strings as factors (i.e., categorical variables). In many cases some character variables are indeed strings and some integer variables are actually factors. This means there is almost always manual work to be done after importing, and therefore I prefer to read in the variables without any initial processing.

Let's now have a look at the structure of the weather data frame, using one of the most useful R functions, str().

> str(weather)
'data.frame': 365 obs. of  14 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ season        : chr  "Winter" "Winter" "Winter" "Winter" ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : chr  "S" "S" "SSW" "SW" ...

Based on the output of this function, and having in the mind the goal of producing visualisations and potentially build models, how would you change the variables in the data set? Here are my thoughts:
  • Day and month are coded as integers but they should be factors (categorical variables); the same applies to the character variables representing the season and wind direction;
  • Looking at the first values for the wind direction - "S" "S" "SSW" "SW" - it seems that a 16-wind compass rose has been used. Since there are only 365 days in the year, do we have sufficient observations for each of the 16 groups, or could we try to group them into 8 principal winds or even only the 4 cardinal directions?
  • The day count (number of days since the beginning of the year) is useful, but we would like to have dates on the x axis when plotting instead of an index. This variable should therefore be transformed;
  • The three time variables show the exact minute where the corresponding event occurred. We would most likely benefit by doing some aggregation, and rounding to the nearest hour seems a good option. After that we would convert the hour variable to factor.

It is also quite common to check for missing values (coded as NA by default). As with almost everything, there are many ways to do it in R. Here are just two of them:

# One way (sum of NA over the entire data set)
> sum(is.na(weather))
[1] 0

# Another way (number of complete observations)
> nrow(weather)
[1] 365 
 
> sum(complete.cases(weather))
[1] 365 
 
> nrow(weather) == sum(complete.cases(weather))
[1] TRUE
 

 

Create factors

 

Variables can be easily coerced to factors using the as.factor() function, which is an abbreviated form of the main function, factor(). The first one, however, will order the levels by the alphabet when the original variable is a string (class character in R), which might be something we don't want for some variables, for example, the season of the year. The code below shows how factors are created and why they differ from the other types of variables.

> # Before (365 independent strings)
> class(weather$season)
[1] "character"

> summary(weather$season)
   Length     Class      Mode 
      365 character character 

> weather$season <- factor(weather$season,
                    levels = c("Spring","Summer","Autumn","Winter"))

> # After (4 categories, ordered by "levels")
> class(weather$season)
[1] "factor"

> summary(weather$season)
Spring Summer Autumn Winter 
    92     92     91     90
 
> # Using as.factor() when the order doesn't matter or original var. is integer

> weather$day <- as.factor(weather$day)
> weather$month <- as.factor(weather$month)
> weather$dir.wind <- as.factor(weather$dir.wind)
 

 

Dealing with the wind


Let's start by checking whether there are actually 16 directions in the dir.wind variable and, if so, determine whether we have sufficient number of observations in each group.

> # Number of unique values
> length(unique(weather$dir.wind))
[1] 16 
 
> # Absolute frequency (table function)
> table(weather$dir.wind)

  E ENE ESE   N  NE NNE NNW  NW   S  SE SSE SSW  SW   W WNW WSW 
 11  15   2  18  25   8  37 108  26  24  31  17  11   5  24   3 
 
> # Making it relative (prop.table function)
> rel <- round(prop.table(table(weather$dir.wind))*100,1)
> rel

   E  ENE  ESE    N   NE  NNE  NNW   NW    S   SE  SSE  SSW   SW    W  WNW  WSW 
 3.0  4.1  0.5  4.9  6.8  2.2 10.1 29.6  7.1  6.6  8.5  4.7  3.0  1.4  6.6  0.8 
 
> # Bringing some order to the table
> sort(rel,decreasing = TRUE)

  NW  NNW  SSE    S   NE   SE  WNW    N  SSW  ENE    E   SW  NNE    W  WSW  ESE 
29.6 10.1  8.5  7.1  6.8  6.6  6.6  4.9  4.7  4.1  3.0  3.0  2.2  1.4  0.8  0.5
 
It can be seen that the relative frequency is less than 5% for more than half of the groups. Unfortunately, we don't have the actual wind direction in degrees. But, making use of some domain knowledge and the information in the table above, let's try to give a reasonable answer to the following question: Is it more likely for a value in the NNW group to be closer to NW to N?
(assuming the direction isn't exactly the midpoint between NW and N, in which case it would be a pure NNW). The numbers show that NW would be far more likely. Even though this logic doesn't necessarily apply to all of the directions we are trying to eliminate, let's assume this criterion of recode them as the closest ordinal direction (by definition, "NW","NE","SE", "SW" are called ordinal) in a new variable.

As long as the analyst knows to explain why and how some new variable was created, and bears in mind it may lack accuracy, it is perfectly fine to add it to the data set. It may be useful or useless, and that is something he will try to figure out during the stages of visualisation or modelling.

> # Transforming wind direction variable: from 16 to 8 principal winds 
 
> # Create a copy from the original variable...
> weather$dir.wind.8 <- weather$dir.wind 
 
> # ...and then simply recode some of the variables
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in%  c("NNE","ENE"),
                               "NE",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("NNW","WNW"),
                               "NW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("WSW","SSW"),
                               "SW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("ESE","SSE"),
                               "SE",as.character(weather$dir.wind.8)) 
 
> # create factors, ordered by "levels" 
> weather$dir.wind.8 <- factor(weather$dir.wind.8,
                        levels = c("N","NE","E","SE","S","SW","W","NW"))
 
 
> # Checking the length of the new variable 
> length(unique(weather$dir.wind.8))
[1] 8 
 
> # A 2-way table (direction vs season), with relative frequencies calculated
    over margin = 2 (i.e., the columns)  
 
> round(prop.table(table(weather$dir.wind.8,weather$season),margin = 2)*100,1)
    
     Spring Summer Autumn Winter
  N     1.1    3.3   12.1    3.3
  NE   14.1    5.4   20.9   12.2
  E     0.0    0.0    5.5    6.7
  SE   13.0   14.1   20.9   14.4
  S     5.4   12.0    4.4    6.7
  SW    6.5    8.7    2.2   16.7
  W     2.2    0.0    1.1    2.2
  NW   57.6   56.5   33.0   37.8 


The function ifelse() is one of the classical ways to populate a new variable based on the value, or calculation over the value, of any other(s). When the original variable is numeric, cut() is often simpler and used instead.

Just as a side note, it is uncommon and considered bad practice to use for loops and if statements in R when the goal is to loop through the rows of a column and apply some function when a certain condition is met. R supports vectorisation, which is a much more efficient way to accomplish the same thing. In fact, the ifelse() function is the vectorised way that makes the for-if-else construct unnecessary.

 

 

We need a date


To create a date in R, we just need to pass to the function as.Date() a string with an appropriate format. We can then add and subtract days using the usual math operators. Here is the code to calculate the date based on the day.count variable.

> first.day <- "2014-01-01"
> class(first.day)
[1] "character" 
 
> first.day <- as.Date(first.day)
> class(first.day)
[1] "Date" 
 
> Here is where we actually calculate the date 
> weather$date  <- first.day + weather$day.count - 1 
 
> head(weather$day.count)
[1] 1 2 3 4 5 6 
 
> head(weather$date)
[1] "2014-01-01" "2014-01-02" "2014-01-03" "2014-01-04" "2014-01-05" "2014-01-06"

 

 

Not so hard times


The last thing we need to do is to round (to the nearest hour) the time at which a certain event occurred  (lower temperature, higher temperature, and wind gust). Working with times in R is a bit more complicated than working with dates, with the former having two alternative classes to represent it: POSIXct and POSIXlt. The first stores the date and time as a simple number, representing the seconds since the UNIX epoch (Jan 1, 1970); the second stores the date and time in a list, with elements for seconds, hours, years, among others. Since we are interested in extracting the hour information, after rounding, we will use the more complete POSIXlt class.

> # Store date and time as POSIXlt class
> l.temp.time.date <- as.POSIXlt(paste(weather$date,weather$l.temp.time))
> head(l.temp.time.date)
[1] "2014-01-01 01:25:00 GMT" "2014-01-02 07:30:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 10:35:00 GMT" "2014-01-05 01:40:00 GMT" "2014-01-06 19:35:00 GMT" 
 
> # Round to the nearest hour
> l.temp.time.date <- round(l.temp.time.date,"hours")
> head(l.temp.time.date)
[1] "2014-01-01 01:00:00 GMT" "2014-01-02 08:00:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 11:00:00 GMT" "2014-01-05 02:00:00 GMT" "2014-01-06 20:00:00 GMT" 
 
> # Which attributes are stored in the POSIXlt time variable?
> attributes(l.temp.time.date)
$names
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"   "isdst" 
[10] "zone"   "gmtoff"

$class
[1] "POSIXlt" "POSIXt" 

$tzone
[1] ""    "GMT" "BST"

> # Extract the value of the hour attribute as a number and add it to the data set
> weather$l.temp.hour <- l.temp.time.date [["hour"]]
> head(weather$l.temp.hour)
[1]  1  8 21 11  2 20 
 
> # Lastly, the integer is converted to factor
> weather$l.temp.hour <- as.factor(weather$l.temp.hour)
> head(weather$l.temp.hour)
[1] 1  8  21 11 2  20
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 23

 

 

The prepared data set


After all the processing we have done, let's call str() again to see what our final data set looks like. We now have a date variable (to plot a time series) and several factor variables (commonly used to identify different groups of a numerical variable when creating a data visualisation).

> str(weather)
'data.frame': 365 obs. of  19 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ month         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ season        : Factor w/ 4 levels "Spring","Summer",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : Factor w/ 16 levels "E","ENE","ESE",..: 9 9 12 13 11 11 10 10 4 7 ...
 $ dir.wind.8    : Factor w/ 8 levels "N","NE","E","SE",..: 5 5 6 6 4 4 4 4 1 8 ...
 $ date          : Date, format: "2014-01-01" "2014-01-02" ...
 $ l.temp.hour   : Factor w/ 20 levels "0","1","2","3",..: 2 9 18 12 3 17 8 1 8 9 ...
 $ h.temp.hour   : Factor w/ 19 levels "0","1","2","3",..: 1 8 11 3 10 1 12 10 11 9 ...
 $ gust.wind.hour: Factor w/ 24 levels "0","1","2","3",..: 17 23 1 10 15 9 13 15 15 15 ...

  

 

Final notes on data "wrangling" and R


As we have seen in Part 1, the process of cleaning and transforming the raw data is almost always required prior to starting the actual analysis. In fact, in many real life cases, the visualisation and modeling stages are easier and less time-consuming than having the data ready to be explored.

The R language is extremely powerful to create visualisations and build models. It is often considered, however, that the language is not the most user-friendly when it comes to prepare the data (still true, but not as much as a few years ago). Here are some of the alternatives we, the analysts, have at our disposal:
  • For reasonably simple and tidy data sets, like the one we have been using in this tutorial, Excel is usually sufficient; a combination of Pivot Tables, Vlookup() and/or Index()/Match() and a few basic formatting functions would have accomplished the same we have done here;
  • When some more advanced processing is required (for instance, the use of regex), and even though R supports regular expressions and provides functions in its base package, some prefer to use other languages (in this case, Perl or Python would be good options);
  • An ever increasing alternative option is to use other R packages that provide wrapper functions for the ones in its base. For instance, we would probably have dealt with our three time variables easily using the lubridate package than using the base functions; had we needed to format strings and the stringr package would provide us with several consistent wrappers, making it simpler to process the text.

Onward and upward to Part 3 of this series of tutorials, where we will be creating a few  visualisations to gain insight from our weather data.        

7 comments:

  1. Hey,
    Nicely written and explained. I hope you'll keep up the good work and the series will continue
    All the best
    Slawek

    ReplyDelete
  2. Awesome walk through of real world data sciencing. I have seen articles and video tutorials about "tidy data" but can't remember such a practical example of transforming variables and why. I've always disliked R's automatic preprocessing strings into factors. thanks for offering a lucid alternative.

    ReplyDelete
    Replies
    1. Thanks Jeff! It's always nice to have our work validated by the readers.

      Delete
  3. Thanks for your offer, it's so useful. Please keep up.

    ReplyDelete
  4. Pavithran TM
    nice work.. super stuff told in a simple way.. keep the good work going:)

    ReplyDelete
  5. Thanks for a very well thought out set of examples. I think it is really useful.

    In the final structure print out of the 'weather' data frame, there is an item called 'h.temp.hour'. For completeness, you may want to generate that in the same section as 'l.temp.hour' was generated.

    Thanks again for the hard work you did in creating this.

    ReplyDelete