From the MNIST dataset introduced in class, write code in Python that
a) Splits the 42000 training images into a training set (50% of all the data) and a test set (the rest). The labels should also be split accordingly. (PLEASE ONLY SOLVE 2 & 3)
2) Basically repeat Part 1, but now use 80% of the images for training and the other 20% for testing. Report scores. [10 points]
3) Use the SVM model from part 2 to print out all the 1s in your test set that the system predicted to be 7s (a few will do if there are too many).
Code and output attached.
Code tested with
Note:
Part 2) and Part 3) combined code
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import train_test_split
print('Downloading dataset')
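# as_frame=False returns NumPy arrays (not a pandas DataFrame), so reshape and
# row indexing below work; the labels in Y are strings such as '1' and '7'.
X, Y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)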
print('Download complete')
# The classifier expects a [samples, features] matrix. Each MNIST image is
# already stored as a row of 784 pixels; the reshape below guards against
# multi-dimensional input.
number_of_samples = len(X)
X = X.reshape((number_of_samples, -1))
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.2,  # 20% test set, remaining 80% train set
    shuffle=False
)
# Create an SVM classifier model
model = SVC()
print('Training model')
# Train the model
model.fit(X_train, y_train)
print('Training complete')
# Get predictions on the test set
predicted = model.predict(X_test)
print("Classification report for SVM model:\n", metrics.classification_report(y_test, predicted))
print("1s in the test set that the model predicted to be 7s (showing at most 4):")
test_images_and_truth_and_predictions = list(zip(X_test, y_test, predicted))
# Keep the images whose true label is 1 but which the model predicted as 7.
# The labels are strings, so compare via str() rather than against integers.
filtered = [x for x in test_images_and_truth_and_predictions if str(x[1]) == '1' and str(x[2]) == '7']
if len(filtered) == 0:
    print('No such prediction found!')
else:
    # Plot the predictions; squeeze=False keeps axes a 2-D array
    # even when there is only one image to show.
    _, axes = plt.subplots(1, len(filtered[:4]), squeeze=False)
    for ax, (test_image, truth, prediction) in zip(axes[0], filtered[:4]):
        ax.set_axis_off()
        # Each row holds 784 pixels; reshape to 28x28 for display.
        ax.imshow(test_image.reshape(28, 28), cmap=plt.cm.gray_r, interpolation='nearest')
        ax.set_title('Prediction: {}'.format(prediction))
    plt.show()
Output
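As the classification report shows, accuracy is 98%, and in this run the filter found no test images with true label 1 that the model predicted as 7. Note that fetch_openml returns the MNIST labels as strings ('1', '7'), so the filter only matches when it compares against strings; a comparison against the integers 1 and 7 never matches. You can try other SVM configurations (different kernels and parameters, a different train/test split, and so on) and check whether any such predictions appear.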
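When comparing configurations, it can help to have a check for "true 1, predicted 7" that does not depend on whether the labels are integers or strings. The following is a minimal sketch of such a check in plain Python. The helper names confusion_counts and misclassified_indices are hypothetical (they are not part of scikit-learn), and the labels are hand-made for illustration, not results from the SVM above.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) label pairs; labels are compared as strings."""
    return Counter((str(t), str(p)) for t, p in zip(y_true, y_pred))


def misclassified_indices(y_true, y_pred, true_label, predicted_label):
    """Indices of samples with the given true label and the given prediction."""
    return [
        i
        for i, (t, p) in enumerate(zip(y_true, y_pred))
        if str(t) == str(true_label) and str(p) == str(predicted_label)
    ]


# Hand-made labels for illustration only (not output of the model above).
y_true = ["1", "7", "1", "3", "1"]
y_pred = ["1", "7", "7", "3", "7"]

# Number of 1s predicted as 7s, and where they sit in the test set.
print(confusion_counts(y_true, y_pred)[("1", "7")])
print(misclassified_indices(y_true, y_pred, 1, 7))
```

With the real data, y_test and predicted would be passed in place of the hand-made lists, and the returned indices could be used to select rows of X_test for plotting.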