Question: Programming question on Dimension Reduction
In this question, you are asked to run Singular Value Decomposition (SVD) on the Fashion-MNIST data set, interpret the output, and train generative classifiers for multinomial classification over 10 classes. You can find more details on the Fashion-MNIST data set on the original GitHub page or on Kaggle.
Kaggle: https://www.kaggle.com/zalando-research/fashionmnist
GitHub: https://github.com/zalandoresearch/fashion-mnist
Tasks:
Load the training and test data sets from fashion-mnist_train.csv and fashion-mnist_test.csv. Each row is a vector of dimension 784 with values between 0 (black) and 255 (white) on the gray color scale.
Use SVD to reduce the number of dimensions of the training data set so that it explains just above 90% of the total variance. Remember to scale the data before performing SVD. Report how many components you select and their variance ratios.
Train generative classifiers (Naive Bayes and KNN) and a discriminative classifier (multinomial logistic regression) on both the training data set after SVD and the original data set (without dimension reduction). Fine-tune the hyper-parameters, e.g. the learning rate in MLR and the k value in KNN, to achieve the best performance on a validation set split from the training set. Write a brief description comparing the performance of these classifiers in terms of accuracy on the test set.
Guidelines:
In this homework, you are allowed to use scikit-learn's implementations of multinomial logistic regression, Naive Bayes, and KNN directly.
You are allowed to use a library implementation of SVD. For Python users, we recommend scikit-learn's TruncatedSVD implementation; a minimal sketch of the SVD and classifier steps follows.
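The sketch below is a rough illustration of the workflow described in the tasks, not a graded solution. It scales the data, keeps the smallest number of SVD components whose cumulative explained variance just exceeds 90%, and compares the three required classifiers on the test set. The file paths, the 10% validation split, the candidate values of k, and the logistic-regression settings are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the data (the paths are assumptions).
train = pd.read_csv('fashion-mnist_train.csv')
test = pd.read_csv('fashion-mnist_test.csv')
X_train, y_train = train.iloc[:, 1:].values, train.iloc[:, 0].values
X_test, y_test = test.iloc[:, 1:].values, test.iloc[:, 0].values

# Scale the data before SVD, as the task requires.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit SVD with a generous number of components, then keep the smallest number
# whose cumulative explained variance ratio just exceeds 90%.
svd = TruncatedSVD(n_components=400, random_state=0)
svd.fit(X_train_s)
cumvar = np.cumsum(svd.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.90) + 1)
print('Components selected:', n_components)
print('Variance ratios:', svd.explained_variance_ratio_[:n_components])

# Refit with the selected number of components and project both sets.
svd = TruncatedSVD(n_components=n_components, random_state=0)
X_train_svd = svd.fit_transform(X_train_s)
X_test_svd = svd.transform(X_test_s)

# Hold out a validation set from the training data for hyper-parameter tuning.
X_tr, X_val, y_tr, y_val = train_test_split(X_train_svd, y_train, test_size=0.1, random_state=0)

# Tune k for KNN on the validation set (the candidate values are assumptions).
best_k, best_acc = None, 0.0
for k in [3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    acc = accuracy_score(y_val, knn.predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Fit the three classifiers on the reduced training data and compare test accuracy.
models = {
    'Naive Bayes': GaussianNB(),
    'KNN (k=%d)' % best_k: KNeighborsClassifier(n_neighbors=best_k),
    'Multinomial LR': LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000),
}
for name, clf in models.items():
    clf.fit(X_train_svd, y_train)
    print(name, 'test accuracy:', accuracy_score(y_test, clf.predict(X_test_svd)))
# Repeating the same loop on X_train / X_test (without SVD) gives the second half of the comparison.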
Fashion-MNIST
1) Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples.
2) Each example is a 28x28 grayscale image, associated with a label from 10 classes.
3) Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms.
4) It shares the same image size and structure of training and testing splits.
1) Exploring The Dataset
1.1: Importing Libraries
import warnings
warnings.filterwarnings('ignore')

# Handle table-like data and matrices :
import numpy as np
import pandas as pd
import math
import itertools

# Modelling Algorithms :
# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

# Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, ElasticNet
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# Modelling Helpers :
from sklearn.preprocessing import Imputer, Normalizer, scale
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Preprocessing :
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Imputer, LabelEncoder
# Evaluation metrics :
# Regression
from sklearn.metrics import mean_squared_log_error, mean_squared_error, r2_score, mean_absolute_error
# Classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Deep Learning Libraries
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from keras.optimizers import Adam, SGD, Adagrad, Adadelta, RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau, LearningRateScheduler
from keras.utils import to_categorical

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import missingno as msno

# Configure visualisations
%matplotlib inline
mpl.style.use('ggplot')
plt.style.use('fivethirtyeight')
sns.set(context="notebook", palette="dark", style='whitegrid', color_codes=True)
# Center all plots
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""");

# Make visualizations better
params = {
    'axes.labelsize': "large",
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25, 7]
}
plt.rcParams.update(params)
1.2: Extract Dataset
train = pd.read_csv('../input/fashion-mnist_train.csv')
test = pd.read_csv('../input/fashion-mnist_test.csv')
df = train.copy()
df_test = test.copy()
df.head()
1.4: Examine Dimensions
print('Train: ', df.shape)
print('Test: ', df_test.shape)
1.5: Examine NaN Values
# Train
df.isnull().any().sum()
# Test
df_test.isnull().any().sum()
2) Visualizing the Dataset
2.1: Plotting Random Images
# Mapping Classes
clothing = {0 : 'T-shirt/top',
            1 : 'Trouser',
            2 : 'Pullover',
            3 : 'Dress',
            4 : 'Coat',
            5 : 'Sandal',
            6 : 'Shirt',
            7 : 'Sneaker',
            8 : 'Bag',
            9 : 'Ankle boot'}
fig, axes = plt.subplots(4, 4, figsize=(15, 15))
for row in axes:
    for axe in row:
        index = np.random.randint(60000)
        img = df.drop('label', axis=1).values[index].reshape(28, 28)
        cloths = df['label'][index]
        axe.imshow(img, cmap='gray')
        axe.set_title(clothing[cloths])
        axe.set_axis_off()
2.2: Distribution of Labels
df['label'].value_counts()
sns.factorplot(x='label', data=df, kind='count', size=3, aspect=1.5)
3) Data PreProcessing
3.1: Setting Random Seeds
# Setting random seeds for reproducibility.
seed = 66
np.random.seed(seed)
3.2: Splitting Data into Train and Validation Set
Now we split the training data into a train set and a validation set. The train set is used for training the model, and the validation set is used for evaluating the model's performance.
This is achieved using the train_test_split method of the scikit-learn library.
X = train.iloc[:, 1:]
Y = train.iloc[:, 0]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=seed)
3.3: Reshaping the Images
# reshape(examples, height, width, channels)
x_train = x_train.values.reshape((-1, 28, 28, 1))
x_test = x_test.values.reshape((-1, 28, 28, 1))
df_test.drop('label', axis=1, inplace=True)
df_test = df_test.values.reshape((-1, 28, 28, 1))
3.4 : Normalization
# Cast the image arrays from int to float before scaling: the division produces
# floating-point values, and with an integer dtype they would be truncated to zero.
x_train = x_train.astype("float32")/255
x_test = x_test.astype("float32")/255
df_test = df_test.astype("float32")/255
3.5: One Hot Encoding
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
print(y_train.shape)
print(y_test.shape)
4) Training a Convolutional Neural Network
4.1: Building a ConvNet
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu', strides=1,
                 padding='same', data_format='channels_last', input_shape=(28, 28, 1)))
model.add(BatchNormalization())
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu', strides=1,
                 padding='same', data_format='channels_last'))
model.add(BatchNormalization())
model.add(Dropout(0.25))
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu', strides=1,
                 padding='same', data_format='channels_last'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu', strides=1,
                 padding='same', data_format='channels_last'))
model.add(BatchNormalization())
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
4.2: Compiling the Model
# Optimizer
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999)
# Compiling the model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
4.4: Learning Rate Decay
reduce_lr = LearningRateScheduler(lambda x: 1e-3 * 0.9 ** x)
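With this schedule the learning rate at epoch t is 1e-3 * 0.9**t: roughly 1e-3 at epoch 0, about 3.5e-4 at epoch 10, and about 1.6e-5 by epoch 39, so the rate decays smoothly over the 40 epochs used below.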
4.5: Data Augmentation
datagen = ImageDataGenerator(
    rotation_range=8,         # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,           # randomly zoom images
    shear_range=0.3,          # shear angle in counter-clockwise direction in degrees
    width_shift_range=0.08,   # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.08,  # randomly shift images vertically (fraction of total height)
    vertical_flip=True)       # randomly flip images vertically
datagen.fit(x_train)
4.6: Fitting the Model
batch_size = 128
epochs = 40
# Fit the model. The original cell is truncated after the epochs argument; the
# steps_per_epoch, validation_data, and callbacks arguments below are assumed,
# since validation metrics are plotted later and reduce_lr is defined above.
history = model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                              epochs=epochs,
                              steps_per_epoch=x_train.shape[0] // batch_size,
                              validation_data=(x_test, y_test),
                              callbacks=[reduce_lr])
5) Evaluating the Model
score = model.evaluate(x_test, y_test)
print('Loss: {:.4f}'.format(score[0]))
print('Accuracy: {:.4f}'.format(score[1]))
5.1: Plotting the Training and Validation Curves
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title("Model Loss")
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(['Train', 'Test'])
plt.show()
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title("Model Accuracy")
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Test'])
plt.show()
6) Confusion Matrix
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Predict class probabilities for the validation set
Y_pred = model.predict(x_test)
# Convert the predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred, axis=1)
# Convert the one-hot validation labels back to class indices
Y_true = np.argmax(y_test, axis=1)
# Compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
# Plot the confusion matrix
plot_confusion_matrix(confusion_mtx,
                      classes=['T-shirt/Top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot'])
7) Visualization of Predicted Classes
7.1: Correctly Predicted Classes
correct = []
for i in range(len(y_test)):
    if Y_pred_classes[i] == Y_true[i]:
        correct.append(i)
    if len(correct) == 4:
        break
fig, ax = plt.subplots(2, 2, figsize=(12, 6))
fig.set_size_inches(10, 10)
for axe, idx in zip(ax.flat, correct):
    axe.imshow(x_test[idx].reshape(28, 28), cmap='gray')
    axe.set_title("Predicted Label : " + str(clothing[Y_pred_classes[idx]]) + "\n" +
                  "Actual Label : " + str(clothing[Y_true[idx]]))
7.2: Incorrectly Predicted Classes
incorrect = []
for i in range(len(y_test)):
    if Y_pred_classes[i] != Y_true[i]:
        incorrect.append(i)
    if len(incorrect) == 4:
        break
fig, ax = plt.subplots(2, 2, figsize=(12, 6))
fig.set_size_inches(10, 10)
for axe, idx in zip(ax.flat, incorrect):
    axe.imshow(x_test[idx].reshape(28, 28), cmap='gray')
    axe.set_title("Predicted Label : " + str(clothing[Y_pred_classes[idx]]) + "\n" +
                  "Actual Label : " + str(clothing[Y_true[idx]]))
8) Classification Report
classes = ['T-shirt/Top', 'Trouser', 'Pullover', 'Dress', 'Coat',
           'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot']
print(classification_report(Y_true, Y_pred_classes, target_names=classes))
9) Predicting on the Test Data
X = df_test
Y = to_categorical(test.iloc[:, 0])
score = model.evaluate(X, Y)
print("Loss: {:.4f}".format(score[0]))
print("Accuracy: {:.4f}".format(score[1]))