To find the dataset needed for this problem, you’ll first need to open the “swiss” dataset that is contained in R by running the following line:
> data('swiss')
Now you can rename the “swiss” dataset and use it to answer the question below. Name the data frame with your UT EID:
> my_variable <- swiss
This dataset contains socio-economic indicators for the French-speaking provinces of Switzerland in the year 1888. Among the variables, “Agriculture” is the percentage of the workforce that were farmers and “Education” is the percentage of the population that were formally educated.
> my_variable$Farmers[ ] <- 'More than half'
> my_variable$Farmers[ ] <- 'Less than half'
Note: Please replace ut1234 wherever it appears (multiple places) in the following code with your actual UT EID. (The name should start with a letter and contain no spaces.)
R code with comments (all statements starting with # are comments)
a) We will use a box plot and a histogram to describe the distribution of education
#get the dataset swiss
data('swiss')
#assign it to your UT EID. Please replace ut1234 with your actual UT EID
ut1234 <- swiss
par(mfrow=c(2,1))
#a) boxplot
boxplot(ut1234$Education,
        main="Distribution of Education",
        xlab="Education (% formally educated)",
        horizontal=TRUE)
#print the 5 number summary
summary(ut1234$Education)
#plot a histogram
hist(ut1234$Education,
     main="Distribution of Education",
     xlab="Education (% formally educated)")
#find the mean
xbar<-mean(ut1234$Education)
#get the sample standard deviation
s<-sd(ut1234$Education)
sprintf('Mean value of %%formally educated is %.2f%%',xbar)
sprintf('sample standard deviation of %%formally educated is %.2f%%',s)
#this produces a box plot and a histogram of Education, and prints the summary, mean and standard deviation
The box plot and the histogram show that the distribution of Education is right-skewed (it has a longer right tail). The median is 8%, so half of the provinces had at most 8% of the population formally educated. The mean is higher than the median, which is consistent with the right skew. The box plot also shows outliers, that is, values more than 1.5 times the IQR above Q3.
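The skewness and outlier claims can also be checked numerically. The following is a minimal sketch, assuming the built-in swiss data; the names edu, st, upper_fence and out are illustrative and not part of the original answer. It relies on boxplot.stats(), which applies the same 1.5 times IQR rule that boxplot() uses to draw outliers.

```r
# Sketch: numeric check of the skewness and outlier claims for Education
data("swiss")
edu <- swiss$Education
names(edu) <- rownames(swiss)   # keep province names attached to the values

# boxplot.stats() returns the five box plot statistics and the flagged outliers
bs <- boxplot.stats(edu)
st <- bs$stats                  # lower whisker, Q1 (hinge), median, Q3 (hinge), upper whisker
upper_fence <- st[4] + 1.5 * (st[4] - st[2])   # Q3 + 1.5 * IQR

# provinces drawn as individual points on the box plot
out <- edu[edu > upper_fence]
print(sort(out, decreasing = TRUE))

# a mean above the median is consistent with a right-skewed distribution
print(c(mean = mean(edu), median = median(edu)))
```

Printing the province names alongside the values makes it easier to see which provinces drive the long right tail.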
b & c) Create the variable and get the counts
R code:
#part b)
ut1234$Farmers[ut1234$Agriculture>=50] <- 'More than half'
ut1234$Farmers[ut1234$Agriculture<50 ] <- 'Less than half'
#part c) get the counts
table(ut1234$Farmers)
#this prints the number of provinces in each group
26 provinces had a majority of farmers in the workforce and 21 provinces did not have a majority of farmers in the workforce.
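As an alternative to the two indexed assignments, the grouping variable can be built in one step with ifelse(). This is only a sketch: df and counts are placeholder names chosen for illustration, and prop.table() is added to show the share of provinces in each group alongside the counts.

```r
# Sketch: build the Farmers variable in a single vectorized step
data("swiss")
df <- swiss   # 'df' is a placeholder name; use your UT EID in the assignment

# label each province by whether farmers are at least half of the workforce
df$Farmers <- ifelse(df$Agriculture >= 50, "More than half", "Less than half")

# counts per group, then the same counts as percentages of all provinces
counts <- table(df$Farmers)
print(counts)
print(round(100 * prop.table(counts), 1))
```

Because ifelse() fills every row at once, there is no risk of leaving a province unlabeled if one of the two conditions is mistyped.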
d) R code
#part d) compare percentages
par(mfrow=c(1,1))
boxplot(Education~Farmers, data=ut1234,
        main="Distribution of Education",
        ylab="Proportion of farmers in the workforce",
        xlab="Education (% formally educated)",
        horizontal=TRUE)
#this produces side-by-side box plots of Education for the two groups
The box plots show that the median percentage of the population that was formally educated is lower in provinces where farmers make up the majority of the workforce than in provinces where they make up less than half. The same holds for the group means, although a box plot does not display means directly.
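Since a box plot shows medians but not means, the comparison of centers can be confirmed with tapply(). This is a minimal sketch; farmers, group_mean and group_median are illustrative names, and the grouping is rebuilt here so the snippet runs on its own.

```r
# Sketch: compare the center of Education between the two groups
data("swiss")
farmers <- ifelse(swiss$Agriculture >= 50, "More than half", "Less than half")

# mean and median of Education within each group
group_mean <- tapply(swiss$Education, farmers, mean)
group_median <- tapply(swiss$Education, farmers, median)
print(round(rbind(mean = group_mean, median = group_median), 2))
```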
We can also see that the distribution of education is right-skewed for provinces where farmers were not the majority. For provinces where farmers were the majority, however, the distribution of education is far more symmetrical about the median.
We can further see that the distribution of education is more spread out (it has a higher variance and a higher IQR) in provinces where farmers make up less than half of the workforce than in provinces where they are the majority.
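The statements about spread and skewness can be backed up with group-wise summaries. The sketch below is illustrative only: farmers and spread are placeholder names, and the difference between mean and median is used as a rough indicator of skew rather than a formal skewness statistic.

```r
# Sketch: spread and a rough skew indicator of Education within each group
data("swiss")
farmers <- ifelse(swiss$Agriculture >= 50, "More than half", "Less than half")

# one column per group; rows are IQR, standard deviation, and mean minus median
spread <- sapply(split(swiss$Education, farmers), function(x) {
  c(IQR = IQR(x), sd = sd(x), mean_minus_median = mean(x) - median(x))
})
print(round(spread, 2))
```

A larger IQR and standard deviation indicate more spread, and a mean well above the median points to a longer right tail.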