This lab uses the plot_ly() and ggplotly() functions, along with plot_geo().
We will work with COVID data downloaded from the New York Times. The dataset consists of COVID-19 cases and deaths in each US state during the course of the COVID epidemic.
The objective of this lab is to explore relationships between cases, deaths, and population sizes of US states, and to plot the data to demonstrate these relationships.
library(data.table)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second() masks data.table::second()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(knitr)
library(widgetframe)
## Loading required package: htmlwidgets
cv_states_readin <-
data.table::fread("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
state_pops <- data.table::fread("https://raw.githubusercontent.com/COVID19Tracking/associated-data/master/us_census_data/us_census_2018_population_estimates_states.csv")
state_pops$abb <- state_pops$state
state_pops$state <- state_pops$state_name
state_pops$state_name <- NULL
cv_states <- merge(cv_states_readin, state_pops, by="state")
Inspect the dim, head, and tail of the data
dim(cv_states)
head(cv_states)
tail(cv_states)
str(cv_states)
cv_states$date <- as.Date(cv_states$date, format="%Y-%m-%d")
state_list <- unique(cv_states$state)
cv_states$state <- factor(cv_states$state, levels = state_list)
abb_list <- unique(cv_states$abb)
cv_states$abb <- factor(cv_states$abb, levels = abb_list)
cv_states = cv_states[order(cv_states$state, cv_states$date),]
str(cv_states)
head(cv_states)
tail(cv_states)
head(cv_states)
summary(cv_states)
min(cv_states$date)
max(cv_states$date)
Add new_cases and new_deaths and correct outliers
Add variables for new cases, new_cases, and new deaths, new_deaths:
Set new_cases equal to the difference between cases on date i and date i-1, starting on date i=2
Filter to dates after October 1, 2022
Use ggplotly for EDA: See if there are outliers or values that don’t make sense for new_cases and new_deaths. Which states and which dates have strange values?
Correct outliers: Set negative values for new_cases or new_deaths to 0
Inspect data again interactively
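The differencing and outlier-correction steps could be sketched as follows. The table `toy` and its values are hypothetical stand-ins for cv_states, and the choice to treat the first date of each state as having no earlier count (so its new value equals the cumulative value) is an assumption, not something the lab text specifies.

```r
library(data.table)

# Hypothetical stand-in for cv_states; values are made up for illustration
toy <- data.table(
  state  = rep(c("A", "B"), each = 4),
  date   = rep(as.Date("2022-10-01") + 0:3, times = 2),
  cases  = c(10, 15, 14, 20, 5, 5, 9, 12),
  deaths = c(1, 1, 2, 2, 0, 1, 1, 0)
)
setorder(toy, state, date)

# Difference between date i and date i-1 within each state.
# Assumption: the first date has no predecessor, so the lagged value is 0.
toy[, new_cases  := cases  - shift(cases,  n = 1, fill = 0), by = state]
toy[, new_deaths := deaths - shift(deaths, n = 1, fill = 0), by = state]

# Correct outliers: negative daily counts are set to 0
toy[new_cases  < 0, new_cases  := 0]
toy[new_deaths < 0, new_deaths := 0]

# Keep only dates after October 1, 2022
toy_recent <- toy[date > as.Date("2022-10-01")]
```

Differencing before filtering means the first retained date still gets a proper day-over-day difference.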
Add population-normalized (by 100,000) variables for each variable type (rounded to 1 decimal place). Make sure the variables you calculate are in the correct format (numeric). You can use the following variable names:
per100k = cases per 100,000 population
newper100k = new cases per 100,000
deathsper100k = deaths per 100,000
newdeathsper100k = new deaths per 100,000
Add a “naive CFR” variable representing deaths / cases on each date for each state
Create a dataframe representing values on the most recent date, cv_states_today
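A minimal sketch of the normalization, naive CFR, and most-recent-date steps follows. The table `toy` and its values are hypothetical, and the column name `population` is an assumption about what the merged census data provides. The CFR is computed here as the plain ratio deaths / cases; multiplying by 100 would express it as a percentage.

```r
library(data.table)

# Hypothetical stand-in for cv_states; `population` is an assumed column name
toy <- data.table(
  state      = rep(c("A", "B"), each = 2),
  date       = rep(as.Date(c("2023-03-14", "2023-03-15")), times = 2),
  cases      = c(1000, 1100, 400, 420),
  new_cases  = c(90, 100, 15, 20),
  deaths     = c(10, 11, 8, 8),
  new_deaths = c(1, 1, 0, 0),
  population = c(1e6, 1e6, 2e5, 2e5)
)

# Population-normalized variables, rounded to 1 decimal place and stored as numeric
toy[, per100k          := as.numeric(round(cases      / (population / 100000), 1))]
toy[, newper100k       := as.numeric(round(new_cases  / (population / 100000), 1))]
toy[, deathsper100k    := as.numeric(round(deaths     / (population / 100000), 1))]
toy[, newdeathsper100k := as.numeric(round(new_deaths / (population / 100000), 1))]

# Naive case fatality ratio on each date for each state
toy[, naive_CFR := deaths / cases]

# Rows for the most recent date only
toy_today <- toy[date == max(date)]
```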
Scatterplots using plot_ly()
Create a scatterplot using plot_ly() representing pop_density vs. various variables (e.g. cases, per100k, deaths, deathsper100k) for each state on the most recent date (cv_states_today)
Remove any outliers and replot.
Choose one plot. For this plot:
Add hoverinfo specifying the state name, cases per 100k, and deaths per 100k, similarly to how we did this in the lecture notes
Add layout information to title the chart and the axes
Enable hovermode = "compare"
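One way the hover text, layout titles, and hovermode could be combined is sketched here. The data frame `toy_today` and its values are hypothetical stand-ins for cv_states_today, and the chart and axis titles are placeholders. The hovermode value is taken from the lab text; the values plotly.js documents include "x", "y", and "closest", so one of those may be worth trying if "compare" has no visible effect in your plotly version.

```r
library(plotly)

# Hypothetical stand-in for cv_states_today; all values are made up
toy_today <- data.frame(
  state         = c("A", "B", "C"),
  pop_density   = c(50, 300, 1200),
  per100k       = c(31000.5, 28500.2, 33010.9),
  deathsper100k = c(310.4, 250.1, 402.7)
)

p <- plot_ly(
  toy_today,
  x = ~pop_density,
  y = ~deathsper100k,
  type = "scatter",
  mode = "markers",
  # Show only the custom text on hover
  hoverinfo = "text",
  text = ~paste0(
    state,
    "<br>Cases per 100k: ", per100k,
    "<br>Deaths per 100k: ", deathsper100k
  )
) %>%
  layout(
    title = "Deaths per 100k vs. population density",  # placeholder title
    xaxis = list(title = "Population density"),
    yaxis = list(title = "Deaths per 100k"),
    hovermode = "compare"  # value as given in the lab text
  )
```

Printing `p` in an interactive session or a knitted document renders the widget.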
ggplotly() and geom_smooth()
For pop_density vs. newdeathsper100k, create a chart with the same variables using ggplotly()
Explore the pattern between the two variables using geom_smooth()
Do you think pop_density correlates with newdeathsper100k?
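A small sketch of wrapping a ggplot2 chart with ggplotly() is shown below. The data frame `toy_today` and its values are hypothetical, and a linear smoother (method = "lm") is an assumption chosen to keep the toy example stable; the lab may intend the default smoother instead.

```r
library(ggplot2)
library(plotly)

# Hypothetical stand-in for cv_states_today; values are made up
toy_today <- data.frame(
  pop_density      = c(10, 50, 100, 200, 400, 800, 1200),
  newdeathsper100k = c(0.1, 0.3, 0.2, 0.4, 0.3, 0.6, 0.5)
)

# Static ggplot2 chart: points plus a fitted trend line
g <- ggplot(toy_today, aes(x = pop_density, y = newdeathsper100k)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # assumption: linear fit

# Convert the ggplot object into an interactive plotly widget
p <- ggplotly(g)
```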
Create a line chart of the naive_CFR for all states over time using plot_ly()
Inspect naive_CFR for the states that had an increase in September. How have they changed over time?
Create one more line chart, for Florida only, which shows new_cases and new_deaths together in one plot. Hint: use add_trace() or add_lines() to add each series
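Showing two series in one line chart could look roughly like the sketch below. The data frame `toy_fl` and its values are hypothetical stand-ins for the Florida subset. plotly's add_lines() is used to layer the series; this is a substitution, since add_layer() is not a function I know plotly to export.

```r
library(plotly)

# Hypothetical stand-in for the Florida rows of cv_states; values are made up
toy_fl <- data.frame(
  date       = as.Date("2023-03-01") + 0:4,
  new_cases  = c(1200, 1350, 900, 1100, 1000),
  new_deaths = c(12, 9, 15, 11, 8)
)

# One shared x axis, one line per variable
p <- plot_ly(toy_fl, x = ~date) %>%
  add_lines(y = ~new_cases,  name = "new_cases") %>%
  add_lines(y = ~new_deaths, name = "new_deaths") %>%
  layout(
    title = "Florida: new cases and new deaths",  # placeholder title
    yaxis = list(title = "Count")
  )
```

Because daily cases are usually much larger than daily deaths, the deaths line may look flat on a shared axis; a second y axis or a log scale are possible variations.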
Create a heatmap to visualize new_cases for each state on each date after January 1st, 2023
- Start by mapping selected features in the dataframe into a matrix using the tidyr package function pivot_wider(), naming the rows and columns, as done in the lecture notes
- Use plot_ly() to create a heatmap out of this matrix. Which states stand out?
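The reshape-then-plot steps can be sketched as follows, assuming a long table with one row per state and date. The data frame `toy` and its values are hypothetical; the lecture notes may name rows and columns differently.

```r
library(tidyr)
library(plotly)

# Hypothetical stand-in for cv_states; values are made up
toy <- data.frame(
  state     = rep(c("A", "B", "C"), each = 3),
  date      = rep(as.Date("2023-01-02") + 0:2, times = 3),
  new_cases = c(5, 8, 2, 0, 1, 4, 10, 12, 9)
)

# Keep dates after January 1st, 2023
toy <- toy[toy$date > as.Date("2023-01-01"), ]

# One row per date, one column per state
wide <- pivot_wider(
  toy[, c("state", "date", "new_cases")],
  names_from  = state,
  values_from = new_cases
)

# Convert to a numeric matrix: rows named by date, columns named by state
mat <- as.matrix(wide[, -1])
rownames(mat) <- as.character(wide$date)

# Columns of z map to x (states), rows of z map to y (dates)
p <- plot_ly(
  x = colnames(mat),
  y = rownames(mat),
  z = mat,
  type = "heatmap"
)
```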
The pattern of new_cases for each state over time becomes clearer when filtering to dates every two weeks.
# create heatmap
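Selecting dates every two weeks could be done with seq() on Date values, as in this sketch. The vector `toy_dates` is hypothetical; in the lab the same idea would apply to the date column before reshaping.

```r
# Hypothetical run of 42 consecutive daily dates
toy_dates <- as.Date("2023-01-02") + 0:41

# Every 14th day from the first date through the last date
keep_dates <- seq(min(toy_dates), max(toy_dates), by = "2 weeks")

# Subset to the retained dates
toy_biweekly <- toy_dates[toy_dates %in% keep_dates]
```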
Map of naive_CFR by state on March 15, 2023
pick.date = "2023-03-15"
# Create the map
Map of naive_CFR by state on most recent date
# Map for today's date
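A choropleth of naive_CFR by state could be sketched with plot_geo() as below. The data frame `toy_map` and its values are hypothetical, and the color scale and title are placeholders; in the lab the rows would come from filtering cv_states to the chosen date.

```r
library(plotly)

# Hypothetical naive_CFR values for three states, keyed by two-letter abbreviation
toy_map <- data.frame(
  abb       = c("CA", "TX", "NY"),
  naive_CFR = c(0.9, 1.1, 1.2)
)

# Restrict the map to the United States
geo_opts <- list(
  scope      = "usa",
  projection = list(type = "albers usa")
)

p <- plot_geo(toy_map, locationmode = "USA-states") %>%
  add_trace(
    type      = "choropleth",
    locations = ~abb,        # state abbreviations matched by locationmode
    z         = ~naive_CFR,  # value that drives the fill color
    text      = ~paste("naive CFR:", naive_CFR),
    colors    = "Purples"
  ) %>%
  layout(
    title = "naive_CFR by state (toy data)",  # placeholder title
    geo   = geo_opts
  )
```

The same pattern can be repeated for the most recent date by swapping the data frame passed to plot_geo().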