#install.packages(c("data.table","leaflet"))
library(data.table)
library(leaflet)
library(tidyverse)
We will work with the meteorological data presented in lecture. Recall the dataset consists of weather station readings in the continental US.
The objective of the lab is to find the weather station with the highest elevation and look at patterns in the time series of its wind speed and temperature.
First download and then read in with data.table:fread()
fn <- "https://raw.githubusercontent.com/JSC370/jsc370-2023/main/labs/lab03/met_all.gz"
if (!file.exists("met_all.gz"))
download.file(fn, destfile = "met_all.gz", method = "curl", timeout = 60)
met <- data.table::fread("met_all.gz")
str(met)
## Classes 'data.table' and 'data.frame': 2377343 obs. of 30 variables:
## $ USAFID : int 690150 690150 690150 690150 690150 690150 690150 690150 690150 690150 ...
## $ WBAN : int 93121 93121 93121 93121 93121 93121 93121 93121 93121 93121 ...
## $ year : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
## $ month : int 8 8 8 8 8 8 8 8 8 8 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ min : int 56 56 56 56 56 56 56 56 56 56 ...
## $ lat : num 34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 ...
## $ lon : num -116 -116 -116 -116 -116 ...
## $ elev : int 696 696 696 696 696 696 696 696 696 696 ...
## $ wind.dir : int 220 230 230 210 120 NA 320 10 320 350 ...
## $ wind.dir.qc : chr "5" "5" "5" "5" ...
## $ wind.type.code : chr "N" "N" "N" "N" ...
## $ wind.sp : num 5.7 8.2 6.7 5.1 2.1 0 1.5 2.1 2.6 1.5 ...
## $ wind.sp.qc : chr "5" "5" "5" "5" ...
## $ ceiling.ht : int 22000 22000 22000 22000 22000 22000 22000 22000 22000 22000 ...
## $ ceiling.ht.qc : int 5 5 5 5 5 5 5 5 5 5 ...
## $ ceiling.ht.method: chr "9" "9" "9" "9" ...
## $ sky.cond : chr "N" "N" "N" "N" ...
## $ vis.dist : int 16093 16093 16093 16093 16093 16093 16093 16093 16093 16093 ...
## $ vis.dist.qc : chr "5" "5" "5" "5" ...
## $ vis.var : chr "N" "N" "N" "N" ...
## $ vis.var.qc : chr "5" "5" "5" "5" ...
## $ temp : num 37.2 35.6 34.4 33.3 32.8 31.1 29.4 28.9 27.2 26.7 ...
## $ temp.qc : chr "5" "5" "5" "5" ...
## $ dew.point : num 10.6 10.6 7.2 5 5 5.6 6.1 6.7 7.8 7.8 ...
## $ dew.point.qc : chr "5" "5" "5" "5" ...
## $ atm.press : num 1010 1010 1011 1012 1013 ...
## $ atm.press.qc : int 5 5 5 5 5 5 5 5 5 5 ...
## $ rh : num 19.9 21.8 18.5 16.9 17.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
table(met$year)
##
## 2019
## 2377343
table(met$day)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 75975 75923 76915 76594 76332 76734 77677 77766 75366 75450 76187 75052 76906
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 77852 76217 78015 78219 79191 76709 75527 75786 78312 77413 76965 76806 79114
## 27 28 29 30 31
## 79789 77059 71712 74931 74849
table(met$hour)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 99434 93482 93770 96703 110504 112128 106235 101985 100310 102915 101880
## 11 12 13 14 15 16 17 18 19 20 21
## 100470 103605 97004 96507 97635 94942 94184 100179 94604 94928 96070
## 22 23
## 94046 93823
summary(met$temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -40.00 19.60 23.50 23.59 27.80 56.00 60089
summary(met$elev)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13.0 101.0 252.0 415.8 400.0 9999.0
summary(met$wind.sp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 2.10 2.46 3.60 36.00 79693
It looks like the elevation variable has observations with 9999.0, which is probably an indicator for missing. We should take a deeper look at the data dictionary to confirm. The wind speed variable is ok but there are a lot of missing data.
After checking the data we should make the appropriate modifications.
Replace elevations with 9999 as NA
.
# in base R
met$elev[met$elev == 9999] <- NA
summary(met$elev)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -13 101 252 413 400 4113 710
# in data.table
met[elev == 9999.0, elev := NA]
# in tidyverse
met <- met |>
mutate(elev = ifelse(elev == 9999, NA, elev))
At what elevation is the highest weather station?
NA
.We also have the issue of the minimum temperature being -40C, so we should remove those observations.
table(met$temp == -40, useNA = "always")
##
## FALSE TRUE <NA>
## 2317218 36 60089
met <- met[temp > -40]
dim(met)
## [1] 2317218 30
We can check that the correct number of records are kept.
sum(is.na(met$temp))
## [1] 0
Note that the >
filter removed all rows with NA temp
values
We should check the suspicious temperature value (where is it located?) and validate that the range of elevations make sense (-13 m to 4113 m).
Google is your friend here.
Fix any problems that arise in your checks.
summary(met$temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -17.20 19.60 23.50 23.59 27.80 56.00
We again notice that there is a -17.2C temperature reading that seems suspicious.
met2 <- met[order(temp)]
head(met2[ , .(temp, lat, lon, elev, wind.sp)], 20)
## temp lat lon elev wind.sp
## 1: -17.2 38.767 -104.300 1838 7.2
## 2: -17.2 38.767 -104.300 1838 7.7
## 3: -17.2 38.767 -104.300 1838 0.0
## 4: -17.2 38.767 -104.300 1838 0.0
## 5: -17.2 38.767 -104.300 1838 2.6
## 6: -17.2 38.767 -104.300 1838 7.7
## 7: -17.0 27.901 -98.052 78 8.8
## 8: -17.0 27.901 -98.052 78 8.2
## 9: -17.0 27.901 -98.052 78 5.1
## 10: -17.0 38.767 -104.300 1838 5.1
## 11: -17.0 38.767 -104.300 1838 1.5
## 12: -17.0 38.767 -104.300 1838 11.8
## 13: -17.0 38.767 -104.300 1838 1.5
## 14: -17.0 38.767 -104.300 1838 6.7
## 15: -3.0 44.683 -111.116 2025 0.0
## 16: -3.0 44.683 -111.116 2025 0.0
## 17: -3.0 44.683 -111.116 2025 0.0
## 18: -3.0 44.683 -111.116 2025 0.0
## 19: -2.4 36.422 -105.290 2554 0.0
## 20: -2.0 44.683 -111.116 2025 0.0
-17.0C is also suspicious. Checking the locations by the latitude and longitude values on Google Maps show locations in Colorado and Texas. Considering that these values were summer observations, they are likely errorneous or outliers.
met <- met[temp > -15]
met2 <- met[order(temp)]
head(unique(met2[ , .(lat, lon, elev, temp)]))
## lat lon elev temp
## 1: 44.683 -111.116 2025 -3.0
## 2: 36.422 -105.290 2554 -2.4
## 3: 44.683 -111.116 2025 -2.0
## 4: 44.544 -110.421 2388 -1.7
## 5: 37.633 -118.850 2173 -1.5
## 6: 44.544 -110.421 2388 -1.1
Remember to keep the initial question in mind. We want to pick out the weather station with maximum elevation and examine its windspeed and temperature.
Some ideas: select the weather station with maximum elevation; look at the correlation between temperature and wind speed; look at the correlation between temperature and wind speed with hour and day of the month.
highest <- met[elev == max(met$elev, na.rm = TRUE)]
summary(highest[ , .(elev, lat, lon, month, day, hour, min, temp, wind.sp)])
## elev lat lon month day
## Min. :4113 Min. :39.8 Min. :-105.8 Min. :8 Min. : 1.0
## 1st Qu.:4113 1st Qu.:39.8 1st Qu.:-105.8 1st Qu.:8 1st Qu.: 8.0
## Median :4113 Median :39.8 Median :-105.8 Median :8 Median :16.0
## Mean :4113 Mean :39.8 Mean :-105.8 Mean :8 Mean :16.1
## 3rd Qu.:4113 3rd Qu.:39.8 3rd Qu.:-105.8 3rd Qu.:8 3rd Qu.:24.0
## Max. :4113 Max. :39.8 Max. :-105.8 Max. :8 Max. :31.0
##
## hour min temp wind.sp
## Min. : 0.00 Min. : 6.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:13.00 1st Qu.: 6.00 1st Qu.: 4.100
## Median :12.00 Median :36.00 Median : 8.00 Median : 6.700
## Mean :11.66 Mean :34.38 Mean : 8.13 Mean : 7.245
## 3rd Qu.:18.00 3rd Qu.:53.00 3rd Qu.:10.00 3rd Qu.: 9.800
## Max. :23.00 Max. :59.00 Max. :15.00 Max. :21.100
## NA's :168
You can compute the correlations individually . . .
cor(highest$temp, highest$wind.sp, use = "complete")
## [1] -0.09373843
cor(highest$temp, highest$hour, use = "complete")
## [1] 0.4397261
cor(highest$wind.sp, highest$day, use = "complete")
## [1] 0.3643079
cor(highest$wind.sp, highest$hour, use = "complete")
## [1] 0.08807315
cor(highest$temp, highest$day, use = "complete")
## [1] -0.003857766
or in a table. (The correlation between day and hour is meaningless.)
cor(highest[ , .(temp, wind.sp, day, hour)], use = "complete")
## temp wind.sp day hour
## temp 1.000000000 -0.09373843 -0.006130763 0.435680110
## wind.sp -0.093738431 1.00000000 0.364307915 0.088073152
## day -0.006130763 0.36430791 1.000000000 -0.004546462
## hour 0.435680110 0.08807315 -0.004546462 1.000000000
We should look at the distributions of all of the key variables to make sure there are no remaining issues with the data.
hist(met$elev, breaks=100)
hist(met$temp)
hist(met$wind.sp)
One thing we should consider for later analyses is to log transform wind speed and elevation as the are very skewed.
hist(log(met$elev))
## Warning in log(met$elev): NaNs produced
hist(log(met$wind.sp))
Look at where the weather station with highest elevation is located.
leaflet(highest) %>%
addProviderTiles('OpenStreetMap') %>%
addCircles(lat = ~lat, lng = ~lon,
opacity = 1, fillOpacity = 1, radius = 100)
Look at the time series of temperature and wind speed at this location. For this we will need to create a date-time variable for the x-axis.
library(lubridate)
highest$date <- with(highest, ymd_h(paste(year, month, day, hour, sep= ' ')))
summary(highest$date)
## Min. 1st Qu.
## "2019-08-01 00:00:00.0000" "2019-08-08 11:00:00.0000"
## Median Mean
## "2019-08-16 22:00:00.0000" "2019-08-16 14:09:56.8823"
## 3rd Qu. Max.
## "2019-08-24 11:00:00.0000" "2019-08-31 22:00:00.0000"
highest <- highest[order(date)]
head(highest[ , .(date, year, month, day, hour)])
## date year month day hour
## 1: 2019-08-01 00:00:00 2019 8 1 0
## 2: 2019-08-01 00:00:00 2019 8 1 0
## 3: 2019-08-01 01:00:00 2019 8 1 1
## 4: 2019-08-01 01:00:00 2019 8 1 1
## 5: 2019-08-01 01:00:00 2019 8 1 1
## 6: 2019-08-01 02:00:00 2019 8 1 2
With the date-time variable we can plot the time series of temperature and wind speed.
plot(highest$date, highest$temp, type = 'l')
plot(highest$date, highest$wind.sp, type = 'l')
To highlight daily patterns of the temperature, we may consider overlaying daily temperatures or plotting their averages, min, and max over a day.
ggplot(highest, aes(x = hour, y = temp, group = day)) +
theme_minimal() +
geom_line()
highest |>
group_by(hour) |>
summarise(mean_temp = mean(temp), max_temp = max(temp), min_temp = min(temp)) |>
ggplot(aes(x = hour, y = mean_temp)) +
theme_minimal() +
geom_ribbon(aes(ymin = min_temp, ymax = max_temp),
alpha = .5, fill = "lightgrey") +
geom_line()
You will have more chances to explore different plots throughout the course.