I have a statistics question:
An orange juice processing plant has three production lines. The production lines fill juice into 400 ml packages. The production manager of the plant would like to know if the production lines are all filling the packages with the same amount.
A sample of 25 packages is taken from each production line and the data is saved in juice.csv. The production manager would like to know:
The production manager already attempted to answer their question by applying a one-way ANOVA to the data. This resulted in a test where the assumption of normality failed. This could not be corrected by transforming the response variable. Use another method to try to answer the production manager's question. The production manager is only interested in whether a difference exists, not where the differences are.
Test at the 5% significance level.
I am not sure which test to use. I am using RStudio: what would the process be, and what code would I run?
Solution :
Answer : Use the Kruskal-Wallis rank sum test to test whether the three production lines fill the packages with the same amount of juice.
Explanation : The process for conducting the test in R is as follows:
The manager wants to know whether there is a significant difference in fill amount between the three production lines.
The file juice.csv contains a sample of 25 packages from each production line.
The normality assumption required for one-way ANOVA failed and could not be fixed by transforming the response, so a parametric test is not appropriate and a non-parametric method is needed.
The non-parametric alternative to one-way ANOVA is the Kruskal-Wallis rank sum test.
The kruskal.test function in R tests whether the groups come from the same distribution. It plays the same role as one-way ANOVA, but it works on the ranks of the observations rather than the raw values, so it does not require normality. Strictly speaking it compares the locations of the distributions (often read as medians when the groups have similar shapes), not the means.
The Kruskal-Wallis test is carried out in R as follows:
There are three production lines, so we have 3 samples of size 25 each.
Step 1 : Read the data file into R and store it in fruit_data
Code :
fruit_data <- read.csv("FileLocation/juice.csv", header = TRUE)  # FileLocation is the folder that holds the file
fruit_data  # print the data to check that it loaded correctly
Once the data is loaded into R we can run the Kruskal-Wallis test.
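Before running the test it can help to confirm that the file loaded as expected. The sketch below is only an illustration: it writes a small simulated file to a temporary location so that it runs on its own, and the column names group and response are assumptions, since the real layout of juice.csv is not shown in the question.

```r
# Simulated stand-in for juice.csv (the real column names may differ).
set.seed(123)
demo <- data.frame(
  group    = rep(c("Line1", "Line2", "Line3"), each = 25),
  response = round(rnorm(75, mean = 400, sd = 2), 1)
)
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

# Read the file back the same way the real file would be read.
fruit_data <- read.csv(tmp, header = TRUE)
fruit_data$group <- factor(fruit_data$group)  # treat the line label as a grouping factor

str(fruit_data)          # one numeric column, one factor column
table(fruit_data$group)  # should show 25 packages per production line

unlink(tmp)              # remove the temporary file
```

If table() does not show 25 rows for each of the three lines, the file or the column names need checking before going further.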
Step 2 : Compute the Kruskal-Wallis test
Code:
kruskal.test(response ~ group, data = fruit_data)
where,
response is the column holding the amount of juice (in ml) filled into each package
group is the column identifying the production line, so it has 3 groups
data is the data frame read from the juice.csv file provided.
Replace response and group with the actual column names in the file.
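As a rough illustration of the call in Step 2, here is a sketch that runs on simulated data instead of the real file. The column names group and response and the simulated values are assumptions made only so that the example is runnable.

```r
set.seed(1)
# Simulated, skewed fill amounts for three lines (not the real data).
fruit_data <- data.frame(
  group    = factor(rep(c("Line1", "Line2", "Line3"), each = 25)),
  response = c(399 + rexp(25), 399 + rexp(25), 400 + rexp(25))
)

result <- kruskal.test(response ~ group, data = fruit_data)
print(result)

result$statistic  # Kruskal-Wallis chi-squared
result$parameter  # degrees of freedom: number of groups - 1
result$p.value    # p-value to compare with 0.05
```

Storing the result in an object makes it easy to pull out the statistic and p-value afterwards instead of reading them off the printed output.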
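Step 3 : Interpret the result on the basis of the p-value obtained.
After running the code in Step 2 we get the Kruskal-Wallis chi-squared statistic, its degrees of freedom (number of groups - 1 = 2) and the p-value.
chi-squared is the value of the test statistic, and
the p-value is the probability of seeing a test statistic at least this large if the three lines really filled the packages identically. It is compared with the 5% significance level (0.05).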
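To see where the chi-squared value comes from, the statistic can be rebuilt by hand from the ranks. This is a sketch on simulated data (the values and names are assumptions), using the standard formula H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1) with the usual correction for ties, and checking it against kruskal.test.

```r
set.seed(42)
# Simulated fill amounts, not the real juice.csv data.
response <- 398 + c(rexp(25, 1), rexp(25, 1), rexp(25, 0.5))
group    <- factor(rep(c("Line1", "Line2", "Line3"), each = 25))

N         <- length(response)
r         <- rank(response)            # rank all 75 values together
rank_sums <- tapply(r, group, sum)     # R_j: sum of ranks per line
n_j       <- tapply(r, group, length)  # n_j: sample size per line

H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_j) - 3 * (N + 1)

# Tie correction (equals 1 when there are no tied values).
ties <- table(response)
H <- H / (1 - sum(ties^3 - ties) / (N^3 - N))

# p-value from the chi-squared distribution with (groups - 1) df.
p_value <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

builtin <- kruskal.test(response, group)
all.equal(unname(builtin$statistic), H)
all.equal(unname(builtin$p.value), p_value)
```

Because only the ranks enter the calculation, the skewness that broke the ANOVA normality assumption does not affect the validity of the test.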
If the p-value is less than 0.05 we reject the null hypothesis and conclude that at least one production line fills the packages with a different amount of juice from the others.
If the p-value is greater than 0.05 we fail to reject the null hypothesis: there is no statistically significant evidence that the three production lines fill the packages with different amounts of juice.
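The decision rule can also be written out in code. The helper decide() below is a hypothetical function made up for this illustration, and the data are simulated with assumed column names, so treat it as a sketch and not as part of the required answer.

```r
# Hypothetical helper: turns a p-value into the conclusion at level alpha.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "Reject H0: at least one production line fills differently."
  } else {
    "Fail to reject H0: no significant difference between the lines."
  }
}

set.seed(7)
# Simulated stand-in for the real data.
fruit_data <- data.frame(
  group    = factor(rep(c("Line1", "Line2", "Line3"), each = 25)),
  response = 399 + rexp(75)
)

result <- kruskal.test(response ~ group, data = fruit_data)
cat("p-value:", format(result$p.value, digits = 4), "\n")
cat(decide(result$p.value), "\n")
```

Since the manager only wants to know whether a difference exists, no post-hoc comparison is needed after this step.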
[ The stats package that ships with R is used here, so no extra packages need to be installed. ]
____________________________________________________________________