class: center, middle, title-slide

.title[
# Exploratory Data Analysis
]

.subtitle[
## JSC 370: Data Science II
]

---

<!--Yeah... I have really long code chunks, so I just changed the default size :)-->

<style type="text/css">
pre{
  font-size:20px;
}
code.r,code.cpp{
  font-size:large
}
</style>

## Exploratory Data Analysis

* Exploratory data analysis is the process of summarizing data
* It should be the first step in your analysis pipeline
* It involves:

--

  + 1) checking data (import issues, outliers, missing values, data errors)
  + 2) cleaning data
  + 3) summary statistics of key variables (univariate and bivariate)
  + 4) basic plots and graphs

---

## Pipeline

.center[
  ![](img/data-science.png)
]

EDA involves Import -> Tidy -> Transform -> Visualize. Essentially, it is everything we do before modeling, prediction, or inference.

---

## EDA Checklist

The goal of EDA is to better understand your data. Let's use the **checklist**:

--

1. Formulate a question

--

2. Read in the data

--

3. Check the dimensions and the headers and footers of the data

--

4. Check the variable types in the data

--

5. Take a closer look at some/all of the variables

--

6. Validate with an external source

--

7. Conduct some summary statistics to answer the initial question

--

8. Make exploratory graphs

---

## Case study

We are going to use a dataset created from the National Center for Environmental Information (https://www.ncei.noaa.gov/). The data are 2019 hourly measurements from weather stations across the continental U.S.

.center[
  ![:scale 60%](img/weather.png)
]

---

## Formulate a Question

It is a good idea to first have a question, such as:

--

* what weather stations reported the hottest and coldest daily temperatures?
* what day of the month was, on average, the hottest?
* is there covariation between temperature and humidity in my dataset?
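
---

## Formulate a Question

A question like the first one maps directly onto code once the data are loaded. Here is a minimal sketch with toy stand-in data; the station ID (`USAFID`) and temperature (`temp`) column names match the real dataset introduced in the next slides:

```r
# Sketch: translating "which stations reported the hottest and coldest
# temperatures?" into code, using toy stand-in data.
library(data.table)
met_toy <- data.table(
  USAFID = c(1L, 1L, 2L, 2L, 3L, 3L),
  temp   = c(30, 35, -5, 10, 20, 25)
)
# Hottest and coldest reading per station
extremes <- met_toy[, .(max_temp = max(temp), min_temp = min(temp)), by = USAFID]
extremes[which.max(max_temp)]  # station with the hottest reading
extremes[which.min(min_temp)]  # station with the coldest reading
```

Having the computation in mind up front helps decide which variables to check and clean first.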

---

## Read in the Data

There are several ways to read in data (some depend on the type of data you have):

* `read.table` or `read.csv` in base R for delimited files
* `readRDS` if you have a .rds dataset
* `read_csv`, `read_csv2`, `read_delim`, `read_fwf` from `library(readr)`, which is part of the tidyverse
* `read_excel` from `library(readxl)` for .xls and .xlsx files
* `read_sas`, `read_spss`, `read_stata` from `library(haven)`
* `fread` from `library(data.table)` for efficiently importing large datasets that are regular delimited files

---

## Read in the Data

We will focus on base R, `data.table`, and the `tidyverse`. Let's load the libraries we need to read in the data:

```r
library(data.table)
library(tidyverse)
```

Let's load in the data with `data.table`. I have it stored locally, but we will see how to load it straight from GitHub in the lab.

```r
met <- data.table::fread("met_all.gz")
```

---

## Read in the Data

`data.table` is a more efficient version of base R's `data.frame`. We create a `data.table` from an external data source by reading it in with `fread()`.

We can convert existing `data.frame` or `list` objects to `data.table` with `setDT()`.

One nice thing about `data.table` is that it handles large datasets efficiently and never converts character variables to factors, which can be a pain with `data.frame`.

Another nice thing is that many base R functions work on a `data.table` just as if it were a `data.frame`.

---

## Check the data

We should check the dimensions of the data set. This can be done several ways:

```r
dim(met)
```

```
## [1] 2377343      30
```

```r
nrow(met)
```

```
## [1] 2377343
```

```r
ncol(met)
```

```
## [1] 30
```

---

## Check the data

* We see that there are 2,377,343 records of hourly temperature in August 2019 from all of the weather stations in the US. The data set has 30 variables.
* We should also check the top and bottom of the dataset to make sure that it imported correctly.
Use `head(met)` and `tail(met)` for this.
* Next we can take a deeper dive into the contents of the data with `str()`

---

## Check variables

```r
str(met)
```

```
## Classes 'data.table' and 'data.frame':	2377343 obs. of  30 variables:
##  $ USAFID           : int  690150 690150 690150 690150 690150 690150 690150 690150 690150 690150 ...
##  $ WBAN             : int  93121 93121 93121 93121 93121 93121 93121 93121 93121 93121 ...
##  $ year             : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
##  $ month            : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ day              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour             : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ min              : int  56 56 56 56 56 56 56 56 56 56 ...
##  $ lat              : num  34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 ...
##  $ lon              : num  -116 -116 -116 -116 -116 ...
##  $ elev             : int  696 696 696 696 696 696 696 696 696 696 ...
##  $ wind.dir         : int  220 230 230 210 120 NA 320 10 320 350 ...
##  $ wind.dir.qc      : chr  "5" "5" "5" "5" ...
##  $ wind.type.code   : chr  "N" "N" "N" "N" ...
##  $ wind.sp          : num  5.7 8.2 6.7 5.1 2.1 0 1.5 2.1 2.6 1.5 ...
##  $ wind.sp.qc       : chr  "5" "5" "5" "5" ...
##  $ ceiling.ht       : int  22000 22000 22000 22000 22000 22000 22000 22000 22000 22000 ...
##  $ ceiling.ht.qc    : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ ceiling.ht.method: chr  "9" "9" "9" "9" ...
##  $ sky.cond         : chr  "N" "N" "N" "N" ...
##  $ vis.dist         : int  16093 16093 16093 16093 16093 16093 16093 16093 16093 16093 ...
##  $ vis.dist.qc      : chr  "5" "5" "5" "5" ...
##  $ vis.var          : chr  "N" "N" "N" "N" ...
##  $ vis.var.qc       : chr  "5" "5" "5" "5" ...
##  $ temp             : num  37.2 35.6 34.4 33.3 32.8 31.1 29.4 28.9 27.2 26.7 ...
##  $ temp.qc          : chr  "5" "5" "5" "5" ...
##  $ dew.point        : num  10.6 10.6 7.2 5 5 5.6 6.1 6.7 7.8 7.8 ...
##  $ dew.point.qc     : chr  "5" "5" "5" "5" ...
##  $ atm.press        : num  1010 1010 1011 1012 1013 ...
##  $ atm.press.qc     : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ rh               : num  19.9 21.8 18.5 16.9 17.4 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
```

---

## Check variables

* First, we see that `str()` gives us the class of the data, which in this case is both `data.table` and `data.frame`, as well as the dimensions of the data
* We also see the variable names and their type (integer, numeric, character)
* We can identify major problems with the data at this stage (e.g. a variable that has all missing values)

--

We can get summary statistics on our `data.table` using `summary()`, just as we would with a regular `data.frame`

---

## Check variables

```r
summary(met[,8:13])
```

```
##       lat             lon               elev          wind.dir  
##  Min.   :24.55   Min.   :-124.29   Min.   : -13.0   Min.   :  3     
##  1st Qu.:33.97   1st Qu.: -98.02   1st Qu.: 101.0   1st Qu.:120     
##  Median :38.35   Median : -91.71   Median : 252.0   Median :180     
##  Mean   :37.94   Mean   : -92.15   Mean   : 415.8   Mean   :185     
##  3rd Qu.:41.94   3rd Qu.: -82.99   3rd Qu.: 400.0   3rd Qu.:260     
##  Max.   :48.94   Max.   : -68.31   Max.   :9999.0   Max.   :360     
##                                                     NA's   :785290  
##  wind.dir.qc        wind.type.code    
##  Length:2377343     Length:2377343    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##                                       
```

---

## Check variables more closely

We know that we are supposed to have hourly measurements of weather data for the month of August 2019 for the entire US. We should check that we have all of these components.

Let us check:

* the year
* the month
* the hours
* the range of locations (latitude and longitude)

---

## Check variables more closely

We can generate tables for integer values:

```r
table(met$hour)
```

```
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  99434  93482  93770  96703 110504 112128 106235 101985 100310 102915 101880 
##     11     12     13     14     15     16     17     18     19     20     21 
## 100470 103605  97004  96507  97635  94942  94184 100179  94604  94928  96070 
##     22     23 
##  94046  93823 
```

```r
table(met$month)
```

```
## 
##       8 
## 2377343 
```

---

## Check variables more closely

We can also tabulate the year. For numeric values we should use a summary to see the quantiles, the max, and the min:

```r
table(met$year)
```

```
## 
##    2019 
## 2377343 
```

```r
summary(met$lat)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.55   33.97   38.35   37.94   41.94   48.94 
```

```r
summary(met$lon)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -124.29  -98.02  -91.71  -92.15  -82.99  -68.31 
```

---

## Check variables more closely

If we return to our initial question (what weather stations reported the hottest and coldest temperatures?), we should take a closer look at our key variable, temperature (`temp`):

```r
summary(met$temp)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -40.00   19.60   23.50   23.59   27.80   56.00   60089 
```

It looks like the temperatures are in Celsius. A temperature of -40 in August is really cold, so we should check whether this is an implausible value.

---

## Check variables more closely

It also looks like there is a lot of missing data. Let us check the proportion of missing values:

```r
mean(is.na(met$temp))
```

```
## [1] 0.0252757
```

About 2.5% of the temperature values are missing, which is not a huge amount.

---

## Check variables more closely

In `data.table` we can easily subset the data and select certain columns:

```r
met_ss <- met[temp == -40.00, .(hour, lat, lon, elev, wind.sp)]
dim(met_ss)
```

```
## [1] 36  5
```

```r
summary(met_ss)
```

```
##       hour            lat             lon             elev      wind.sp   
##  Min.   : 0.00   Min.   :29.12   Min.   :-89.55   Min.   :36   Min.   : NA  
##  1st Qu.: 2.75   1st Qu.:29.12   1st Qu.:-89.55   1st Qu.:36   1st Qu.: NA  
##  Median : 5.50   Median :29.12   Median :-89.55   Median :36   Median : NA  
##  Mean   : 5.50   Mean   :29.12   Mean   :-89.55   Mean   :36   Mean   :NaN  
##  3rd Qu.: 8.25   3rd Qu.:29.12   3rd Qu.:-89.55   3rd Qu.:36   3rd Qu.: NA  
##  Max.   :11.00   Max.   :29.12   Max.   :-89.55   Max.   :36   Max.   : NA  
##                                                                NA's   :36   
```

---

## Check variables more closely

In `dplyr` we can do a similar subset using `filter` and `select`:

```r
met_ss <- filter(met, temp == -40.00) %>% 
  select(USAFID, day, hour, lat, lon, elev, wind.sp)
dim(met_ss)
```

```
## [1] 36  7
```

```r
summary(met_ss)
```

```
##      USAFID            day         hour            lat             lon        
##  Min.   :720717   Min.   :1   Min.   : 0.00   Min.   :29.12   Min.   :-89.55  
##  1st Qu.:720717   1st Qu.:1   1st Qu.: 2.75   1st Qu.:29.12   1st Qu.:-89.55  
##  Median :720717   Median :1   Median : 5.50   Median :29.12   Median :-89.55  
##  Mean   :720717   Mean   :1   Mean   : 5.50   Mean   :29.12   Mean   :-89.55  
##  3rd Qu.:720717   3rd Qu.:1   3rd Qu.: 8.25   3rd Qu.:29.12   3rd Qu.:-89.55  
##  Max.   :720717   Max.   :1   Max.   :11.00   Max.   :29.12   Max.   :-89.55  
##                                                                               
##       elev      wind.sp   
##  Min.   :36   Min.   : NA  
##  1st Qu.:36   1st Qu.: NA  
##  Median :36   Median : NA  
##  Mean   :36   Mean   :NaN  
##  3rd Qu.:36   3rd Qu.: NA  
##  Max.   :36   Max.   : NA  
##               NA's   :36   
```

---

## Validate against an external source

We should check outside sources to make sure that our data make sense. For example, the observation with -40C is suspicious, so we should look up the location of the weather station.

Go to [Google Maps](https://www.google.com/maps/) and enter the coordinates for the site with -40C (29.12, -89.55)

.center[
  ![:scale 30%](img/map.png)
]

It doesn't make much sense to have a -40C reading in the Gulf of Mexico off the coast of Louisiana.

---

## Summary statistics

If we return to our initial question, we need to generate a list of weather stations ordered from highest to lowest temperature. We can then examine the top and bottom of this new dataset.

First let us remove the -40C observations and then sort.

In `data.table` we can use `setorder()` (or `order()` inside `[ ]`) to sort. With over 2 million observations, `data.table` is a good choice since it is more computationally efficient.

```r
met <- met[temp > -40]
setorder(met, temp)
```

This is equivalent to

```r
met <- met[temp > -40][order(temp)]
```

---

## Summary statistics

```r
head(met)[,c(1,8:10,24)]
```

```
##    USAFID    lat    lon elev  temp
## 1: 722817 38.767 -104.3 1838 -17.2
## 2: 722817 38.767 -104.3 1838 -17.2
## 3: 722817 38.767 -104.3 1838 -17.2
## 4: 722817 38.767 -104.3 1838 -17.2
## 5: 722817 38.767 -104.3 1838 -17.2
## 6: 722817 38.767 -104.3 1838 -17.2
```

```r
tail(met)[,c(1,8:10,24)]
```

```
##    USAFID    lat      lon elev temp
## 1: 720267 38.955 -121.081  467 52.0
## 2: 690150 34.300 -116.166  696 52.8
## 3: 690150 34.296 -116.162  625 52.8
## 4: 690150 34.300 -116.166  696 53.9
## 5: 690150 34.300 -116.166  696 54.4
## 6: 720267 38.955 -121.081  467 56.0
```

---

## Summary statistics

The maximum hourly temperature is 56C at site 720267, and the minimum hourly temperature is -17.2C at site 722817.

---

## Summary statistics

We need to transform our data to answer our initial question.

Let's compute the **daily** mean temperatures by station in `data.table`, then look at the smallest and largest values:

```r
met_daily <- met[, .(
  temp = mean(temp),
  lat  = mean(lat),
  lon  = mean(lon),
  elev = mean(elev)
), by = c("USAFID", "day")][order(temp)]
head(met_daily)
```

```
##    USAFID day       temp    lat    lon elev
## 1: 722817   3 -17.200000 38.767 -104.3 1838
## 2: 722817   1 -17.133333 38.767 -104.3 1838
## 3: 722817   6 -17.066667 38.767 -104.3 1838
## 4: 726130  11   4.278261 44.270  -71.3 1909
## 5: 726130  31   4.304348 44.270  -71.3 1909
## 6: 726130  10   4.583333 44.270  -71.3 1909
```

```r
tail(met_daily)
```

```
##    USAFID day     temp      lat       lon     elev
## 1: 722749   5 40.85714 33.26900 -111.8120 379.0000
## 2: 723805   5 40.97500 34.76800 -114.6180 279.0000
## 3: 720339  14 41.00000 32.14600 -111.1710 737.0000
## 4: 723805   4 41.18333 34.76800 -114.6180 279.0000
## 5: 722787   5 41.35714 33.52700 -112.2950 325.0000
## 6: 690150  31 41.71667 34.29967 -116.1657 690.0833
```

---

## Summary statistics

Let's compute the **daily** mean temperatures in `dplyr`:

```r
met_daily_dplyr <- met %>% 
  group_by(USAFID, day) %>% 
  summarize(temp = mean(temp)) %>% 
  arrange(desc(temp))
```

(try it yourself)

---

## Summary statistics

The maximum **daily** mean temperature is 41.72C at site 690150 and the minimum daily mean temperature is -17.2C at site 722817.

---

## Exploratory graphs

With exploratory graphs we aim to:

* debug any issues remaining in the data
* understand properties of the data
* look for patterns in the data
* inform modeling strategies

Exploratory graphs do not need to be perfect; we will look at presentation-ready plots next week.
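
---

## Exploratory graphs

Before plotting, a quick missingness overview helps with the first aim above (debugging remaining issues). A minimal sketch on toy stand-in data; on the real data the same call would be `colMeans(is.na(met))`:

```r
# Sketch: proportion of missing values per column, a quick debugging
# check before making exploratory plots. Toy stand-in for the weather data.
library(data.table)
met_toy <- data.table(
  temp    = c(21.5, NA, 23.0, NA),
  wind.sp = c(3.1, 2.2, NA, 4.0)
)
colMeans(is.na(met_toy))  # temp = 0.50, wind.sp = 0.25
```

Columns with unusually high missingness are worth investigating before they distort any plot.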

---

## Exploratory graphs

Examples of exploratory graphs include:

* histograms
* boxplots
* scatterplots
* simple maps

---

## Exploratory Graphs

Focusing on our variable of interest, temperature, let's look at its distribution (after removing -40C):

```r
hist(met$temp)
```

<img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

Let's look at the daily data:

```r
hist(met_daily$temp)
```

<img src="slides_files/figure-html/unnamed-chunk-15-1.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

A boxplot gives us an idea of the quantiles of the distribution and any outliers:

```r
boxplot(met$temp, col = "blue")
```

<img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="60%" height="60%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

Let's look at the daily data:

```r
boxplot(met_daily$temp, col = "blue")
```

<img src="slides_files/figure-html/unnamed-chunk-17-1.png" width="60%" height="60%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

A map will show us where the weather stations are located. First let's get the unique latitudes and longitudes and see how many meteorological sites there are:

```r
met_stations <- unique(met[, c("lat", "lon")])
dim(met_stations)
```

```
## [1] 2827    2
```

---

## Exploratory Graphs

A map will show us where the weather stations are located. First let's get the unique latitudes and longitudes and see how many meteorological sites there are.

```r
library(leaflet)
leaflet(met_stations) %>% 
  addProviderTiles('CartoDB.Positron') %>% 
  addCircles(lat = ~lat, lng = ~lon, opacity = 1, fillOpacity = 1, radius = 400)
```
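
---

## Exploratory Graphs

One caveat: unique latitude/longitude pairs can overcount stations when a station reports slightly different coordinates, as site 690150 does in the earlier `tail(met)` output. Counting by station ID is a useful cross-check. A minimal sketch with toy data built from that output:

```r
# Sketch: coordinates vs. station IDs. Site 690150 reports two slightly
# different coordinate pairs, so unique lat/lon overcounts sites.
library(data.table)
met_toy <- data.table(
  USAFID = c(690150L, 690150L, 720267L),
  lat    = c(34.300, 34.296, 38.955),
  lon    = c(-116.166, -116.162, -121.081)
)
nrow(unique(met_toy[, .(lat, lon)]))  # 3 coordinate pairs
uniqueN(met_toy, by = "USAFID")       # 2 stations
```

If the two counts disagree on the real data, the map is showing coordinate reports, not stations.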

---

## Exploratory Graphs

Let's map the locations of the max and min daily temperatures.

```r
min <- met_daily[1]  # First observation.
max <- met_daily[.N] # Last obs, .N is a special symbol in data.table

leaflet() %>% 
  addProviderTiles('CartoDB.Positron') %>% 
  addCircles(
    data = min,
    lat = ~lat, lng = ~lon, popup = "Min temp.",
    opacity = 1, fillOpacity = 1, radius = 400, color = "blue"
  ) %>% 
  addCircles(
    data = max,
    lat = ~lat, lng = ~lon, popup = "Max temp.",
    opacity = 1, fillOpacity = 1, radius = 400, color = "red"
  )
```

(next slide)

---

---

## Exploratory Graphs

Scatterplots help us look at pairwise relationships. Let's see if there is any trend in temperature with latitude:

```r
plot(met_daily$lat, met_daily$temp, pch = 19, cex = 0.5)
```

<img src="slides_files/figure-html/unnamed-chunk-21-1.png" width="40%" height="40%" style="display: block; margin: auto;" />

---

equivalent to

```r
met_daily[, plot(lat, temp, pch = 19, cex = 0.5)]
```

<img src="slides_files/figure-html/plot_daily-bis-1.png" width="40%" style="display: block; margin: auto;" />

There is a clear decrease in temperature as latitude increases (i.e. as you go north).

---

## Exploratory Graphs

We can add a simple linear regression line to this plot using `lm()` and `abline()`. We can also add a title and change the axis labels.

```r
mod <- lm(temp ~ lat, data = met_daily)
met_daily[, plot(
  lat, temp, pch = 19, cex = 0.5,
  main = "Temperature and Latitude",
  xlab = "Latitude", ylab = "Temperature (deg C)")
]
abline(mod, lwd = 2, col = "red")
```

(next slide)

---

<img src="slides_files/figure-html/unnamed-chunk-22-1.png" width="50%" height="40%" style="display: block; margin: auto;" />

---

Using `ggplot2` (next class):

```r
library(ggplot2)
ggplot(data = met_daily, mapping = aes(x = lat, y = temp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Temperature and Latitude",
       x = "Latitude",
       y = "Temperature (deg C)")
```

```
## `geom_smooth()` using formula 'y ~ x'
```

<img src="slides_files/figure-html/last-ggplot2-1.png" width="50%" style="display: block; margin: auto;" />

---

# Summary

In EDA we:

* have an initial question that we aim to answer
* import, check, and clean the data
* perform any data transformations needed to answer the initial question
* make some basic graphs to examine the data and shed light on the initial question
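
These steps can be sketched end-to-end on toy stand-in data (the real analysis uses `met_all.gz` as in the earlier slides):

```r
# Sketch of the pipeline on toy data: import/check, clean, transform, summarize.
library(data.table)
met_toy <- data.table(
  USAFID = c(1L, 1L, 2L, 2L),
  day    = c(1L, 2L, 1L, 2L),
  temp   = c(20, 22, -40, 35)
)
met_toy <- met_toy[temp > -40]  # clean: drop implausible values
daily   <- met_toy[, .(temp = mean(temp)), by = .(USAFID, day)][order(temp)]
daily   # sorted daily mean temperatures, ready for plotting
```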