Due Friday, February 17, 2023, by midnight.
For this assignment, we will be analyzing data on alcohol consumption and life expectancy. The learning objectives are to conduct data wrangling and visualization keeping key questions in mind. We will also do a regression analysis.
Download the two datasets: life expectancy and alcohol consumption. These datasets contain male and female life expectancy and alcohol consumption per capita for several countries. Before merging, prepare both datasets as follows:
Put the life expectancy data in “tidy” format by creating a new column “Sex”. You may want to use the pivot_longer function from the tidyr package.
Filter the alcohol consumption data to exclude rows with data for “Both sexes”.
For convenience, you may rename any variables that have complicated names.
Merge these datasets by country name and year.
Create a summary table showing the mean and standard deviation of life expectancy and alcohol consumption by year and sex.
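The preparation and merge steps above can be sketched as follows. This is a minimal illustration on made-up toy data, not the actual downloads: the column names (Country, Year, Male, Female, and the long alcohol column name) are assumptions, so adjust them to match the real files. Note that once the life expectancy data has a Sex column, the join key in this sketch includes Sex in addition to country and year, so that male rows match male rows.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the two downloads. Column names and values are made up
# for illustration and will differ from the real files.
life_wide <- tibble(
  Country = c("Canada", "Canada", "Japan", "Japan"),
  Year    = c(2010, 2019, 2010, 2019),
  Male    = c(79.0, 80.4, 79.6, 81.5),
  Female  = c(83.4, 84.1, 86.3, 86.9)
)

alcohol <- tibble(
  Country = rep(c("Canada", "Japan"), each = 6),
  Year    = rep(rep(c(2010, 2019), each = 3), times = 2),
  Sex     = rep(c("Male", "Female", "Both sexes"), times = 4),
  `Alcohol total per capita (litres)` = c(14.1, 4.2, 9.1, 13.9, 4.3, 9.0,
                                           12.5, 3.6, 8.0, 12.0, 3.5, 7.7)
)

# Tidy format: one row per country, year, and sex
life_long <- life_wide %>%
  pivot_longer(cols = c(Male, Female),
               names_to = "Sex",
               values_to = "life_expectancy")

# Drop the "Both sexes" rows and shorten the complicated column name
alcohol_clean <- alcohol %>%
  filter(Sex != "Both sexes") %>%
  rename(alcohol = `Alcohol total per capita (litres)`)

# Merge; Sex is part of the key because both tables are now in long format
merged <- inner_join(life_long, alcohol_clean,
                     by = c("Country", "Year", "Sex"))
```

An inner join keeps only country-year-sex combinations present in both tables; a left or full join may be preferable if you want to inspect unmatched rows first.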
Create a new categorical variable named “consumption_level” using the alcohol total per capita variable. For females and males separately, calculate the quartiles of alcohol consumption. Categorize consumption level as low (0-q1), medium (q1-q3), and high (q3+). To make sure the variable is correctly coded, create a summary table that contains the minimum alcohol consumption, maximum alcohol consumption, and number of observations for each category.
The primary questions of interest are: 1. What is the association between life expectancy and alcohol consumption? 2. Does this association differ by sex? 3. How have life expectancy and alcohol consumption changed over time?
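One possible way to build the summary table and the consumption_level variable is sketched below. The data are simulated and the column names (Sex, Year, alcohol, life_expectancy) are assumptions standing in for the merged dataset. How to treat values exactly equal to q1 or q3 is not specified above; this sketch assigns ties at q1 to "low" and ties at q3 to "medium", which is only one reasonable choice.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for the merged data; all values are made up.
merged <- tibble(
  Sex     = rep(c("Female", "Male"), each = 40),
  Year    = rep(c(2000, 2010), times = 40),
  alcohol = c(runif(40, 0, 8), runif(40, 0, 20)),
  life_expectancy = c(rnorm(40, 78, 4), rnorm(40, 73, 4))
)

# Mean and standard deviation by year and sex
by_year_sex <- merged %>%
  group_by(Year, Sex) %>%
  summarise(
    mean_life = mean(life_expectancy, na.rm = TRUE),
    sd_life   = sd(life_expectancy, na.rm = TRUE),
    mean_alc  = mean(alcohol, na.rm = TRUE),
    sd_alc    = sd(alcohol, na.rm = TRUE),
    .groups = "drop"
  )

# Quartiles are computed within each sex, then used as cut points
merged <- merged %>%
  group_by(Sex) %>%
  mutate(
    q1 = quantile(alcohol, 0.25, na.rm = TRUE, names = FALSE),
    q3 = quantile(alcohol, 0.75, na.rm = TRUE, names = FALSE),
    consumption_level = case_when(
      alcohol <= q1 ~ "low",
      alcohol <= q3 ~ "medium",
      alcohol >  q3 ~ "high"
    ),
    consumption_level = factor(consumption_level,
                               levels = c("low", "medium", "high"))
  ) %>%
  ungroup()

# Sanity check: range and count per category, separately by sex
check <- merged %>%
  group_by(Sex, consumption_level) %>%
  summarise(min_alc = min(alcohol), max_alc = max(alcohol), n = n(),
            .groups = "drop")
```

If the coding is right, the maximum of each category should not exceed the minimum of the next category within the same sex, and roughly half of the observations should fall in "medium".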
Follow the EDA checklist from week 3 and the previous assignment. Be sure to focus on the key variables.
Visualization
Create the following figures and interpret them. Be sure to include easily understandable axes, titles, and legends.
Stacked histogram of alcohol consumption by sex. Use a color scheme other than the ggplot default.
Facet plot by year for 2000, 2010, and 2019 showing scatterplots, with regression lines, of life expectancy against alcohol consumption.
A linear model of life expectancy as a function of time, adjusted for sex. Compare the summary for Canada and a second country of your choice.
A barplot of male and female life expectancy for the 10 countries with the largest discrepancies in 2019.
A boxplot of life expectancy by alcohol consumption level and sex for the year 2019.
Choose a visualization to examine the association of life expectancy with alcohol consumption over time.
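As a starting point for the first two figures in the list above, here is a minimal ggplot2 sketch. It uses simulated data with assumed column names (Year, Sex, alcohol, life_expectancy); the simulated relationship is arbitrary and says nothing about the real data. The hex colors are just one example of a non-default palette.

```r
library(ggplot2)

set.seed(2)
# Simulated stand-in for the merged data; values and names are assumptions.
dat <- data.frame(
  Year = rep(c(2000, 2005, 2010, 2019), each = 60),
  Sex  = rep(c("Female", "Male"), times = 120)
)
dat$alcohol <- ifelse(dat$Sex == "Male",
                      runif(nrow(dat), 0, 20),
                      runif(nrow(dat), 0, 8))
dat$life_expectancy <- 72 + ifelse(dat$Sex == "Female", 5, 0) +
  rnorm(nrow(dat), sd = 3)

sex_colors <- c(Female = "#1b9e77", Male = "#d95f02")

# Stacked histogram: bars for each sex are stacked within each bin
p_hist <- ggplot(dat, aes(x = alcohol, fill = Sex)) +
  geom_histogram(bins = 20, position = "stack") +
  scale_fill_manual(values = sex_colors) +
  labs(title = "Alcohol consumption by sex",
       x = "Alcohol consumption per capita (litres)",
       y = "Number of observations",
       fill = "Sex")

# Scatterplots with per-sex linear fits, one panel per selected year
dat_years <- subset(dat, Year %in% c(2000, 2010, 2019))
p_facet <- ggplot(dat_years,
                  aes(x = alcohol, y = life_expectancy, color = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ Year) +
  scale_color_manual(values = sex_colors) +
  labs(title = "Life expectancy vs. alcohol consumption by year",
       x = "Alcohol consumption per capita (litres)",
       y = "Life expectancy (years)",
       color = "Sex")

print(p_hist)
print(p_facet)
```

Mapping color to Sex in the facet plot draws a separate regression line per sex, which also speaks to the second question of interest; drop the color mapping if a single pooled line per panel is wanted.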
Construct a multiple linear regression model to examine the association between alcohol consumption and life expectancy, adjusted for time and sex. First use time as a linear predictor variable, and then fit another model where you put a cubic regression spline on time. Provide summaries of your models, plots of the linear and non-linear associations, and interpretation of the linear and non-linear associations.
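The two models described above might be set up as in the sketch below. It assumes the mgcv package, where s(Year, bs = "cr") requests a cubic regression spline; the choice of k = 5 basis dimensions is arbitrary here. The data are simulated with assumed column names, and no alcohol effect is built into the simulation, so the output is not indicative of the real result.

```r
library(mgcv)

set.seed(3)
n <- 400
# Simulated stand-in for the merged data; values and names are assumptions.
dat <- data.frame(
  Year    = sample(2000:2019, n, replace = TRUE),
  Sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  alcohol = runif(n, 0, 15)
)
dat$life_expectancy <- 70 + 0.2 * (dat$Year - 2000) +
  4 * (dat$Sex == "Female") + rnorm(n, sd = 2)

# Model 1: time enters as a linear term
fit_lm <- lm(life_expectancy ~ alcohol + Year + Sex, data = dat)

# Model 2: time enters through a cubic regression spline
fit_gam <- gam(life_expectancy ~ alcohol + s(Year, bs = "cr", k = 5) + Sex,
               data = dat)

summary(fit_lm)
summary(fit_gam)

# Partial effect of Year from the spline model
plot(fit_gam, shade = TRUE)
```

In the gam summary, an effective degrees of freedom (edf) close to 1 for s(Year) suggests the time trend is roughly linear, while larger values point to curvature that the linear model would miss.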