The table below describes the variables in the dataset.
Input variables (based on physicochemical tests) | |
Fixed acidity | Numeric |
Volatile acidity | Numeric |
Citric acid | Numeric |
Residual sugar | Numeric |
Chlorides | Numeric |
Free sulfur dioxide | Numeric |
Total sulfur dioxide | Numeric |
Density | Numeric |
pH | Numeric |
Sulphates | Numeric |
Alcohol | Numeric (%) |
Wine Type | red or white |
Output variable (based on sensory data) | |
Quality | Score between 0 and 10 in ordinal |
Data set
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | winetype | quality |
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | red | 5 |
7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | red | 5 |
7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | red | 5 |
11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | red | 6 |
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | red | 5 |
7.4 | 0.66 | 0 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | red | 5 |
7.9 | 0.6 | 0.06 | 1.6 | 0.069 | 15 | 59 | 0.9964 | 3.3 | 0.46 | 9.4 | red | 5 |
7.3 | 0.65 | 0 | 1.2 | 0.065 | 15 | 21 | 0.9946 | 3.39 | 0.47 | 10 | red | 7 |
7.8 | 0.58 | 0.02 | 2 | 0.073 | 9 | 18 | 0.9968 | 3.36 | 0.57 | 9.5 | red | 7 |
7.5 | 0.5 | 0.36 | 6.1 | 0.071 | 17 | 102 | 0.9978 | 3.35 | 0.8 | 10.5 | red | 5 |
Using R, build a linear regression model, a logistic regression model, and a classification model for predicting wine quality. No data partition is needed.
Which approach is best, regression or classification models? Why?
After entering the data in Excel, we can carry out the analysis in R.
The R code and its output are attached.
From the summary statistics, both models appear to fit adequately. However, the response variable is ordinal, so a classification model, i.e., a logistic regression model, is more appropriate.
A linear regression model can produce predicted values greater than 7 or less than 5, which fall outside the quality scores observed in the data, and such values cannot be interpreted.
A classification model, by contrast, treats the response as categorical and assigns each observation to one of the classes.
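
The comparison above can be illustrated with a minimal R sketch on the ten sample rows. This is not the attached analysis; it rests on several assumptions that are not in the original: only two predictors (alcohol and volatile acidity) are used because ten rows cannot support all eleven, `winetype` is dropped because it is "red" in every row, and the column names and the binary target `good` (quality of 6 or more) are hypothetical choices made so that a standard logistic regression applies.

```r
# Ten sample rows from the data set above, restricted to the columns used here.
# Column names with underscores are an assumption made for this sketch.
wine <- data.frame(
  volatile_acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Linear regression: quality is treated as a continuous number.
lin_fit  <- lm(quality ~ alcohol + volatile_acidity, data = wine)
lin_pred <- predict(lin_fit, newdata = wine)

# The fitted values are continuous, so they need not match any valid score.
print(round(lin_pred, 2))

# Logistic regression: quality is recoded as a binary class.
# The cut-off at 6 is an assumption for illustration only.
wine$good <- as.integer(wine$quality >= 6)
log_fit   <- glm(good ~ alcohol + volatile_acidity, data = wine, family = binomial)

# Predicted probabilities are turned into class labels with a 0.5 threshold.
log_prob  <- predict(log_fit, newdata = wine, type = "response")
log_class <- ifelse(log_prob >= 0.5, 1L, 0L)

# Confusion table of actual versus predicted class on the same rows
# (no data partition, as the question specifies).
print(table(actual = wine$good, predicted = log_class))
```

The linear model returns fractional scores that have to be rounded or clipped before they mean anything, while the logistic model always returns one of the defined classes. With only ten rows, both fits are illustrative at best. To keep all the ordinal levels rather than a binary split, an ordinal logistic model such as `polr` from the MASS package might be a closer match.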