Background
This part is based on a study of premature mortality in Great Britain between 2012 and 2014. The dataset from this study includes information on premature mortality for 378 local authorities in Great Britain from 2012 to 2014. Premature mortality is measured as the number of individuals that die before the age of 70 in a cohort of 100,000. In addition to total premature mortality, the dataset also includes a breakdown by gender and socioeconomic indicators such as income, education and employment for each local authority.
Dataset
You can run the following line of code in R and this will load the data directly from the course website:
pmdata <- read.csv("https://uclspp.github.io/datasets/data/pmgb2012_2014.csv")
Codebook
The codebook describes the variables in the dataset.
Variable Description
code Unique identifier for each local authority
country 1 = England, 2 = Scotland, 3 = Wales
pop_density 1 = low, 2 = medium, 3 = high
pmdeaths_total Number of premature deaths out of 100,000
pmdeaths_female Number of premature deaths among women, out of 100,000
pmdeaths_male Number of premature deaths among men, out of 100,000
mean_income Mean income in the local authority
edu_level3 Qualification: proportion of the population with A level
edu_level4 Qualification: proportion of the population with degree-level education or equivalent
2a. Descriptive Statistics
• Using the appropriate measures, report and interpret the central tendency and dispersion for the following variables:
– edu_level3
– edu_level4
– pop_density
2b. Visualization
• Produce a scatter plot of premature mortality (pmdeaths_total) on the y-axis and degree-level education (edu_level4) on the x-axis
• Provide an explanation of the substantive meaning of the graphs. What do they tell us about the association between premature mortality and levels of education in Great Britain?
• Produce a box plot that compares premature mortality in England, Scotland and Wales.
• What does the plot tell us about how premature mortality varies across the three countries?
2c. Difference in Means
• Calculate the mean difference between premature mortality among men and women in Great Britain.
• Conduct a t-test to establish whether the difference between the premature mortality of men and women is statistically significant at the 95% confidence level.
• Interpret the results of the t-test both statistically and substantively
• Interpret the confidence interval of the difference in means
2d. Linear Regression
• Estimate a linear regression model to analyse the relationship between mean income and premature mortality in each local authority. The dependent variable is pmdeaths_total and the independent variable is mean_income
• Present a table with the output of the regression model
• Interpret the main coefficient of interest (mean_income)
• Interpret the estimated intercept term of the regression model
• Interpret the R² term of the regression model
pmdata <- read.csv("https://uclspp.github.io/datasets/data/pmgb2012_2014.csv")
head(pmdata)
  code country pop_density pmdeaths_total pmdeaths_female pmdeaths_male mean_income edu_level3 edu_level4
1  416       2           2       20484.33        16368.25      24600.41       36950 0.09638724  0.3315938
2  417       2           1       15327.69        13122.84      17532.53       37008 0.10156433  0.2695638
3  267       1           2       14930.00        11442.00      18418.00       28537 0.11180530  0.2198936
4   92       1           1       16548.00        13522.00      19574.00       31561 0.11522675  0.2277338
5   98       1           1       15267.50        12273.00      18262.00       31396 0.11985204  0.2315130
6  423       2           1       17216.90        14495.01      19938.78       28226 0.10897409  0.2363351
2a)
In statistics, the most common measures of central tendency are the mean, the median and the mode.
The mean is the sum of all measurements divided by the number of observations in the data.
The median is the middle value that separates the higher half of the data from the lower half.
The mode is the most frequent value in the data.
edu_level3
> mean(pmdata$edu_level3,na.rm=TRUE)
[1] 0.1201406
> median(pmdata$edu_level3,na.rm=TRUE)
[1] 0.1190243
The mean of edu_level3 is 0.12 and the median is 0.119, i.e. on average about 12% of a local authority's population holds A levels.
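The question also asks for dispersion, which the output above does not report. A minimal sketch of the usual measures follows, using a short made-up vector: x is a hypothetical stand-in, and with the real data you would pass pmdata$edu_level3 or pmdata$edu_level4 instead.

```r
# Toy proportions standing in for pmdata$edu_level3
# (made-up values, not taken from the dataset)
x <- c(0.10, 0.12, 0.11, 0.13, 0.14, NA)

# Standard deviation and variance describe spread around the mean
sd_x  <- sd(x, na.rm = TRUE)
var_x <- var(x, na.rm = TRUE)

# Interquartile range and range describe spread around the median
iqr_x   <- IQR(x, na.rm = TRUE)
range_x <- range(x, na.rm = TRUE)

print(c(sd = sd_x, var = var_x, iqr = iqr_x))
print(range_x)
```

The standard deviation is usually reported alongside the mean, and the interquartile range alongside the median.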
edu_level4
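> mean(pmdata$edu_level4,na.rm=TRUE)
[1] 0.2679692
> median(pmdata$edu_level4,na.rm=TRUE)
[1] 0.2569474
The mean of edu_level4 is 0.26797 and the median is 0.25695, i.e. on average about 27% of a local authority's population has degree-level education or equivalent.
pop_density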
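pop_density is coded as ordered categories (1 = low, 2 = medium, 3 = high), so a mean is hard to interpret. The mode and median, together with a frequency table, are arguably the more appropriate summaries. A sketch with made-up codes follows: dens is a hypothetical stand-in for pmdata$pop_density.

```r
# Toy category codes standing in for pmdata$pop_density
# (made-up values, not taken from the dataset)
dens <- c(1, 1, 2, 3, 1, 2, 1, 3, 2, 1, 2)

# Counts and proportions per category show how the cases are spread out
freq <- table(dens)
prop <- prop.table(freq)

# Mode: the category with the highest count
mode_dens <- as.numeric(names(freq)[which.max(freq)])

# The median is meaningful here because the categories are ordered
median_dens <- median(dens, na.rm = TRUE)

print(freq)
print(round(prop, 2))
print(c(mode = mode_dens, median = median_dens))
```

For a categorical variable, the proportions per category play the role that the standard deviation plays for a continuous one.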
2b)
plot(pmdata$edu_level4, pmdata$pmdeaths_total, main="Scatterplot", xlab="edu_level4", ylab="pmdeaths_total", pch=19)
# Add fit line
abline(lm(pmdata$pmdeaths_total~pmdata$edu_level4), col="red") # regression line (y~x)
The scatterplot above shows an inverse relationship between the two variables (pmdeaths_total and edu_level4): as edu_level4 increases, pmdeaths_total decreases, and vice versa.
# Boxplot of pmdeaths_total by country
boxplot(pmdeaths_total~country, data=pmdata, main="Premature Mortality by Country", xlab="1 = England, 2 = Scotland, 3 = Wales", ylab="Number of premature deaths out of 100,000")
The boxplot tells us that median premature mortality is highest for Scotland and lowest for England.
2c)
We can subset inside mean() to get the mean for a single country, here country = 1 (England):
> mean(pmdata$pmdeaths_female[pmdata$country==1])
[1] 12322.2
Subtracting this from the corresponding mean for men gives the mean difference between men and women. For Great Britain as a whole, as the question asks, the same calculation applies without the country subset.
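The difference in means and the t-test for 2c can be sketched as below. The vectors male and female are made-up stand-ins for pmdata$pmdeaths_male and pmdata$pmdeaths_female, so the printed numbers say nothing about the real data. The sketch assumes R's default Welch two-sample test is what is wanted.

```r
# Toy rates standing in for pmdata$pmdeaths_male and pmdata$pmdeaths_female
# (made-up values, not taken from the dataset)
male   <- c(18400, 19600, 18300, 19900, 17500, 24600)
female <- c(11400, 13500, 12300, 14500, 13100, 16400)

# Mean difference over all rows, i.e. without subsetting by country
mean_diff <- mean(male, na.rm = TRUE) - mean(female, na.rm = TRUE)

# Welch two-sample t-test (R's default); conf.level = 0.95 gives a 95% interval
tt <- t.test(male, female, conf.level = 0.95)

print(mean_diff)
print(tt$p.value)   # compare against 0.05
print(tt$conf.int)  # interval for the difference in means (male minus female)
```

If the p-value is below 0.05 and the confidence interval excludes zero, the difference is statistically significant at the 95% level. Because both rates are measured for the same local authority, a paired test (paired = TRUE) may be the more defensible choice; which one the course expects is not stated here.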
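For 2d, the regression of pmdeaths_total on mean_income could be set up along the following lines. This is a sketch only: toy is a made-up data frame that borrows the column names from the codebook, and with the real data you would pass data = pmdata instead.

```r
# Toy data standing in for pmdata (made-up values, not taken from the dataset)
toy <- data.frame(
  mean_income    = c(28000, 29000, 31000, 32000, 35000, 37000),
  pmdeaths_total = c(17200, 16900, 15300, 15500, 14800, 14100)
)

# Simple linear regression: dependent variable ~ independent variable
fit <- lm(pmdeaths_total ~ mean_income, data = toy)

# Coefficient table: estimates, standard errors, t values and p values
print(summary(fit)$coefficients)

# Change in premature deaths associated with a one-unit rise in mean income
slope <- unname(coef(fit)["mean_income"])
# Predicted premature deaths when mean_income equals 0
intercept <- unname(coef(fit)["(Intercept)"])
# Share of the variance in pmdeaths_total explained by mean_income
r_squared <- summary(fit)$r.squared

print(c(slope = slope, intercept = intercept, r_squared = r_squared))
```

The intercept is an extrapolation to a mean income of zero, which lies far outside any observed value, so it usually has no substantive interpretation on its own.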