Question
a The following six sales figures (X) were randomly sampled:
$28, $34, $40, $44, $52, $54, with ∑X = $252 and ∑X² = $11,096
i Determine the median and mean. What do these suggest about the distribution of the sales data?
ii What is the standard deviation of the sales data?
iii If a frequency distribution for the sales was constructed, what would be the mid-point and frequency and relative frequency for class “Sales $30 to under $40”?
iv Estimate the 1st quartile for the sales data.
b
You have obtained a data set for 2015 that contains all the workplace accidents that occurred at a state government department. The data includes: the gender of the worker involved, their education level (high school only, college, university), the cause of the accident (stairs, staplers, other), the cost to the department in dollars, the day of the week, and the number of stories in the building in which the accident occurred.
i Identify which variables are categorical and state whether they are nominal or ordinal.
ii Identify which variables are numerical and state whether they are discrete or continuous.
iii If you wished to use the data to predict the average cost to the department per accident for all state government departments in 2015, can the data be used for such a purpose? Why or why not? If so, how? If not, what additional data, or variables, would allow this?
iv If you wished to use the data to predict the chance of an accident occurring for Mondays in 2015 at the state government department, can the data be used for such a purpose? Why or why not? If so, how? If not, what additional data or variables, would allow this?
a) GIVEN:
The following six sales figures (X) were randomly sampled:
$28, $34, $40, $44, $52, $54, with ∑X = $252 and ∑X² = $11,096
Number of data points in sample (n) = 6
(i) MEAN:
The arithmetic mean is the sum of all of the data points divided by the number of data points.
Mean = ∑X / n = 252 / 6 = $42
MEDIAN:
The median is the middle point in a dataset: half of the data points are smaller than the median and half are larger.
To find the median, arrange the data in ascending order:
$28, $34, $40, $44, $52, $54
The number of data points is 6, which is even, so the median is the average of the two middle data points in the list.
Median = (40 + 44) / 2 = $42
Since the mean and median are equal ($42), the sales data appear to be roughly symmetric, with little or no skew. Equality of the mean and median suggests symmetry but does not by itself prove that the skewness is exactly zero.
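The mean and median above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original answer.

```python
from statistics import mean, median

# The six sampled sales figures, in dollars
sales = [28, 34, 40, 44, 52, 54]

sample_mean = mean(sales)      # sum of the values divided by n: 252 / 6
sample_median = median(sales)  # n is even, so this averages the two middle values (40 and 44)

print("mean:", sample_mean)
print("median:", sample_median)
```

Both values come out equal, which is what points to a roughly symmetric distribution.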
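(ii) STANDARD DEVIATION:
Because the data are a sample, the formula for the standard deviation is
s = √( ∑(X − X̄)² / (n − 1) )
where X̄ = 42 is the sample mean.
X | X − X̄ | (X − X̄)² |
28 | -14 | 196 |
34 | -8 | 64 |
40 | -2 | 4 |
44 | 2 | 4 |
52 | 10 | 100 |
54 | 12 | 144 |
Total | 0 | 512 |
The standard deviation is
s = √(512 / 5) = √102.4 ≈ $10.12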
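As a cross-check, the sum of squared deviations can be computed two ways: directly from the deviations in the table, and from the totals ∑X and ∑X² given in the question. The sketch below assumes the sample (n − 1) form of the standard deviation, since the figures were randomly sampled; the variable names are illustrative.

```python
from math import sqrt, isclose
from statistics import stdev

sales = [28, 34, 40, 44, 52, 54]
n = len(sales)
x_bar = sum(sales) / n  # sample mean, 42

# Sum of squared deviations, computed directly from the table columns
ss_direct = sum((x - x_bar) ** 2 for x in sales)

# Same quantity from the totals: sum(X^2) - (sum(X))^2 / n
ss_shortcut = sum(x * x for x in sales) - sum(sales) ** 2 / n

# Sample standard deviation (divide by n - 1, not n)
s = sqrt(ss_direct / (n - 1))

print("sum of squared deviations:", ss_direct, ss_shortcut)
print("sample standard deviation:", round(s, 2))
```

The standard library's `statistics.stdev` uses the same n − 1 divisor, so it should agree with `s`.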
(iii) FREQUENCY DISTRIBUTION OF SALES DATA:
CLASS | FREQUENCY (f) | RELATIVE FREQUENCY |
$20 to under $30 | 1 | 1/6 = 0.167 |
$30 to under $40 | 1 | 1/6 = 0.167 |
$40 to under $50 | 2 | 2/6 = 0.333 |
$50 to under $60 | 2 | 2/6 = 0.333 |
Thus the frequency for class “Sales $30 to under $40” is 1 (only $34 falls in it) and its relative frequency is 0.167.
The class midpoint is the sum of the lower and upper class limits divided by 2.
The lower limit for class “Sales $30 to under $40” is 30 and, because the class runs up to but not including $40, the upper limit is 40.
Thus the midpoint for class “Sales $30 to under $40” is (30 + 40) / 2 = $35.
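The frequency table can be rebuilt programmatically. The sketch below assumes each class is a half-open interval (lower limit included, upper limit excluded), which is how the wording “$30 to under $40” reads, and a class width of $10 starting at $20; the names `edges` and `table` are illustrative.

```python
sales = [28, 34, 40, 44, 52, 54]

# Assumed class boundaries; each class is [lower, upper)
edges = [20, 30, 40, 50, 60]

table = []
for lower, upper in zip(edges, edges[1:]):
    freq = sum(lower <= x < upper for x in sales)  # count values inside the class
    table.append({
        "class": f"${lower} to under ${upper}",
        "midpoint": (lower + upper) / 2,
        "frequency": freq,
        "relative_frequency": freq / len(sales),
    })

for row in table:
    print(row)
```

Under this convention a sale of exactly $40 would be counted in the “$40 to under $50” class, not in “$30 to under $40”.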
(iv) FIRST QUARTILE:
The first quartile, denoted by Q1, is the median of the lower half of the data set. This means that about 25% of the numbers in the data set lie below Q1 and about 75% lie above Q1.
The median of the sales data is $42, so the lower half of the data, below the median, is $28, $34, $40. The first quartile is the median of $28, $34, $40.
Since the number of data points in the lower half is 3, which is odd, the median is the middle data point in the list. Thus the first quartile is Q1 = $34.
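The "median of the lower half" rule used here can be sketched as follows. This is only one of several quartile conventions; interpolation-based methods, such as those used by many software packages, can give a slightly different value for small samples. The variable names are illustrative.

```python
from statistics import median

sales = sorted([28, 34, 40, 44, 52, 54])

# With an even number of points, the lower half is simply the first n/2 values
half = len(sales) // 2
lower_half = sales[:half]

# Q1 is the median of the lower half
q1 = median(lower_half)

print("lower half:", lower_half)
print("Q1:", q1)
```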
b) GIVEN:
The data includes: the gender of worker involved, their education level (high school only, college, university), the cause of the accident (stairs, staplers, other), the cost to the department in dollars, the day of the week, and number of stories in the building in which the accident occurred.
(i) CATEGORICAL VARIABLES:
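A categorical variable, also known as a qualitative variable, is one whose values are categories (labels) rather than measured quantities. Categorical variables can be further classified as nominal or ordinal; a dichotomous variable is one with exactly two categories.
NOMINAL:
Nominal variables are variables that have two or more categories, but which do not have an intrinsic order. In this data set, the gender of the worker and the cause of the accident (stairs, staplers, other) are nominal. The day of the week is usually also treated as nominal, although the days do follow a calendar order.
ORDINAL:
Ordinal variables are variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked. In this data set, education level (high school only, college, university) is ordinal.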
(ii) NUMERICAL VARIABLES:
The values of a numerical variable are numbers. They can be further classified into discrete and continuous variables.
DISCRETE:
A variable whose values are whole numbers (counts) is called discrete. For example, the number of items bought by a customer in a supermarket is discrete. In this data set, the number of stories in the building is discrete.
CONTINUOUS:
A variable that may take any value within some range is called continuous. For example, the time that the customer spends in the supermarket is continuous. In this data set, the cost to the department in dollars is treated as continuous.
(iii) If you wished to use the data to predict the average cost to the department per accident for all state government departments in 2015, can the data be used for such a purpose?
The data can be used to estimate the average cost per accident, but with confidence only for this department. We can fit a linear regression model with the cost to the department in dollars as the dependent (response) variable and the other variables (the gender of the worker involved, their education level, the cause of the accident, the day of the week, and the number of stories in the building in which the accident occurred) as independent variables. However, all of the records come from a single department, so the result generalizes to all state government departments only if this department is representative of them. Accident records from a random sample of the other departments in 2015, with the same variables, would allow a prediction for all departments.
(iv) If you wished to use the data to predict the chance of an accident occurring for Mondays in 2015 at the state government department, can the data be used for such a purpose?
Not with the given data alone. One approach is to create an additional categorical variable indicating whether the accident occurred on a Monday (YES or NO), use it as the dependent variable, and fit a logistic regression model with the remaining variables (the gender of the worker involved, their education level, the cause of the accident, the cost to the department in dollars, and the number of stories in the building in which the accident occurred) as independent variables. However, the data set contains only accidents, so this model estimates the probability that a recorded accident fell on a Monday, not the chance that an accident occurs on a Monday. Estimating that chance also requires exposure data, such as the date of each accident and the number of Mondays worked (or the number of employees at work each day) in 2015, so that Mondays with and without accidents can both be counted.