Question 2 Consider the one-dimensional data set shown below.
x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0 | 9.5 |
y | - | - | + | + | + | - | - | + | - | - |
Classify the data point x = 5.0 according to its 1st, 3rd, 5th, and 9th nearest neighbors using a k-nearest neighbor classifier.
Question 3 Use the data set mushrooms.csv to develop supervised models. The data set contains two classes, namely edible and poisonous. Perform the following analysis on the data set.
1. Understand the distribution of classes in the data set using suitable plots.
2. Develop supervised models: a decision tree and k-nearest neighbors.
3. Identify the best k by cross-validation for the supervised models built in step 2.
4. Discuss the results achieved by each supervised model using the confusion matrix, sensitivity, specificity, accuracy, F1-score, and ROC curve.
5. Provide your opinion on why there is variation in performance across the models.
Answer 2:
1-nearest neighbor: + (the single nearest point is 4.9, at distance 0.1, labeled +)
3-nearest neighbors: − (4.9 [+], 5.2 [−], 5.3 [−]; majority vote −)
5-nearest neighbors: + (4.9 [+], 5.2 [−], 5.3 [−], 4.6 [+], plus 4.5 or 5.5 [both +, tied at distance 0.5]; majority vote +)
9-nearest neighbors: − (nine of the ten points; 4 of them are + and 5 are −; majority vote −)
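These labels can be verified with scikit-learn; a minimal sketch, assuming the ten points from the table above:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training points and labels from the table in Question 2.
X = np.array([0.5, 3.0, 4.5, 4.6, 4.9, 5.2, 5.3, 5.5, 7.0, 9.5]).reshape(-1, 1)
y = np.array(['-', '-', '+', '+', '+', '-', '-', '+', '-', '-'])

# Classify x = 5.0 for each requested value of k.
for k in (1, 3, 5, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict([[5.0]])[0])  # prints +, -, +, - in turn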
Answer 3:
(1)
Distribution of classes
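One way to visualize the class distribution (a sketch, assuming the label column of mushrooms.csv is named 'class', as in the commonly used version of this dataset):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('mushrooms.csv')

# Count and plot the two classes ('e' = edible, 'p' = poisonous).
counts = df['class'].value_counts()
print(counts)

counts.plot(kind='bar')
plt.xlabel('class')
plt.ylabel('count')
plt.title('Class distribution')
plt.show()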
In:
print(classification_report(y_test, y_pred))

Out:
             precision    recall  f1-score   support

          0       0.85      0.97      0.91      1257
          1       0.97      0.82      0.89      1181

avg / total       0.91      0.90      0.90      2438
(2)(i) Decision Tree Model

In:
from sklearn.tree import DecisionTreeClassifier as DT

classifier = DT(criterion='entropy', random_state=42)
classifier.fit(X_train, y_train)

Out:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=42, splitter='best')
(2)(ii) k-Nearest Neighbor Model

In:
from sklearn.neighbors import KNeighborsClassifier as KNN

classifier = KNN()
classifier.fit(X_train, y_train)

Out:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                     weights='uniform')
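(3) Best k by cross-validation

A minimal sketch of the search step 3 asks for, assuming the X_train and y_train from the cells above and using scikit-learn's GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 10-fold cross-validation over candidate values of k
# (X_train and y_train are assumed from the preprocessing cells above).
param_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, y_train)

print('best k:', grid.best_params_['n_neighbors'])
print('cross-validated accuracy:', grid.best_score_)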
(4)
Decision Tree Results
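The result cells below call print_score, a helper defined elsewhere in the notebook rather than provided by scikit-learn; a minimal sketch of such a helper, consistent with the printed output (an assumed reconstruction, not the notebook's actual definition):

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    # Report metrics on the training split when train=True,
    # otherwise on the held-out test split.
    X, y = (X_train, y_train) if train else (X_test, y_test)
    y_pred = clf.predict(X)
    print('Train results:' if train else 'Test results:')
    print('Accuracy Score: {:.4f}'.format(accuracy_score(y, y_pred)))
    print('Classification Report:')
    print(classification_report(y, y_pred))
    print('Confusion Matrix:')
    print(confusion_matrix(y, y_pred))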
In:
print_score(classifier, X_train, y_train, X_test, y_test, train=False)

Test results:
Accuracy Score: 0.9007

Classification Report:
             precision    recall  f1-score   support

          0       0.90      0.91      0.90      1257
          1       0.91      0.89      0.90      1181

avg / total       0.90      0.90      0.90      2438

Confusion Matrix:
[[1147  110]
 [ 132 1049]]

Taking class 1 as the positive class, the confusion matrix gives sensitivity = 1049/1181 ≈ 0.89 and specificity = 1147/1257 ≈ 0.91.
K-NN Test Results
In:
print_score(classifier, X_train, y_train, X_test, y_test, train=False)

Test results:
Accuracy Score: 0.9307

Classification Report:
             precision    recall  f1-score   support

          0       0.91      0.96      0.93      1257
          1       0.96      0.90      0.93      1181

avg / total       0.93      0.93      0.93      2438

Confusion Matrix:
[[1211   46]
 [ 123 1058]]

Taking class 1 as the positive class again: sensitivity = 1058/1181 ≈ 0.90 and specificity = 1211/1257 ≈ 0.96.
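Step 4 also asks for ROC curves; a sketch, assuming the two fitted models are kept in separate variables dt and knn (hypothetical names; the cells above reuse the single name classifier):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# dt and knn are assumed to be the fitted classifiers from step 2.
for name, model in [('Decision Tree', dt), ('k-NN', knn)]:
    scores = model.predict_proba(X_test)[:, 1]  # score for class 1
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label='{} (AUC = {:.3f})'.format(name, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()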
(5) Perhaps the most common reason for such variation is overfitting the training data: a model, a set of hyperparameters, a particular view of the data, or some combination of these happens to fit the training set well without generalizing equally well to unseen data. Here the decision tree was grown to full depth (max_depth=None), which lets it memorize individual training examples, while k-NN with k = 5 averages over the five nearest neighbors and so smooths out local noise; this difference in inductive bias is consistent with k-NN's higher test accuracy (0.9307 vs. 0.9007).