In: Computer Science
Question: Suppose you are designing a machine learning system to determine if transfer course credit should be awarded to incoming transfer students and for what class. Describe the type of data you believe you will need to collect to design and train this system. What evaluation metrics will you use? What experimental design or special considerations need to be considered when designing this system?
How to efficiently design a machine learning system
Implement a data pipeline as quickly as possible
Diagnose high bias and/or high variance and act accordingly
Manually analyze misclassified records and look for patterns
Continuously test and learn using a selected evaluation metric
1. Implement a data pipeline as quickly as possible
Your data pipeline should execute the following steps:
Remove undesired values (outliers)
Fill or drop null values
Normalize the numerical features
Encode the categorical features
Split the data into 3 sets: train (70%) / cross-validation (15%) / test
(15%) (set sizes for non-big-data applications)
Fit and predict using your favorite model
Evaluate model performance on the train and cross-validation sets using a
metric of your choice (F1, precision, recall, MAE, etc.)
Andrew's advice here is to write the code corresponding to each of the
steps above as quickly as possible, without worrying too much about the
first two steps. They can quickly become time consuming; it is better to
make strong assumptions in the first implementation and iterate on them
later on.
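Below is a minimal sketch of what such a first-pass pipeline could look like with pandas and scikit-learn. The file name and column names (transfer_records.csv, gpa, credit_hours, sending_institution, course_title, credit_awarded) are placeholders for whatever data you actually collect, and the outlier filter is deliberately a strong assumption you would revisit later.

```python
# First-pass pipeline sketch (hypothetical file and column names; swap in
# your real features and target).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

df = pd.read_csv("transfer_records.csv")                # hypothetical file
numeric = ["gpa", "credit_hours"]                        # hypothetical numeric features
categorical = ["sending_institution", "course_title"]    # hypothetical categorical features
target = "credit_awarded"                                # hypothetical binary target

# Crude outlier filter: keep GPA values in a plausible range (strong assumption).
df = df[df["gpa"].between(0.0, 4.0)]

X, y = df[numeric + categorical], df[target]

# 70 / 15 / 15 split: train, cross-validation, test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# Fill null values, normalize numerical features, encode categorical features.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

# Fit a simple model and evaluate on the train and cross-validation sets.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)
print("train F1:", f1_score(y_train, model.predict(X_train)))
print("cv    F1:", f1_score(y_cv, model.predict(X_cv)))
```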
2. Diagnose high bias and/or high variance
There are many ways of diagnosing bias and/or variance. Andrew proposes
two of them:
Learning curves
Learning curves are defined as the evolution of the cost over the number
of iterations of gradient descent, plotted for both the training and the
cross-validation set.
Sadly, they are by definition only relevant to algorithms that use
gradient descent or a variant of it to optimize their parameters.
By looking at them you can quickly diagnose high bias vs. high variance.
The following image speaks for itself.
[Figure: learning curves illustrating high bias vs. high variance]
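As a minimal sketch of how such curves might be produced, here is a plain gradient-descent loop for linear regression on synthetic data that records the cost on both sets at each iteration; the data, learning rate, and iteration count are all illustrative.

```python
# Learning-curve sketch: cost vs. gradient-descent iterations on the
# training and cross-validation sets (plain linear regression, synthetic data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = X_train @ true_w + rng.normal(scale=0.3, size=200)
X_cv = rng.normal(size=(60, 3))
y_cv = X_cv @ true_w + rng.normal(scale=0.3, size=60)

def cost(X, y, w):
    # Mean squared error divided by 2 (the usual gradient-descent cost).
    return np.mean((X @ w - y) ** 2) / 2

w, lr, iters = np.zeros(3), 0.05, 200
train_curve, cv_curve = [], []
for _ in range(iters):
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad
    train_curve.append(cost(X_train, y_train, w))
    cv_curve.append(cost(X_cv, y_cv, w))

plt.plot(train_curve, label="train cost")
plt.plot(cv_curve, label="cross-validation cost")
plt.xlabel("gradient descent iteration")
plt.ylabel("cost")
plt.legend()
plt.show()
```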
Comparing cross-validation accuracy, train accuracy and human
performance
Bayes error: the optimal (generally unreachable) error rate for a specific
problem, often approximated using the best available human
performance.
High variance: train error is quite close to the Bayes error, and the
cross-validation error is quite a bit worse than both.
High bias: train error is quite close to the cross-validation error, and
both are quite a bit worse than the Bayes error.
High bias and high variance: train error is quite a bit better than the
cross-validation error, and both are quite a bit worse than the Bayes
error.
I have used the term “quite” to insist on the fact that there is no rule
of thumb defining how big or small the differences between the
cross-validation error, the train error, and the Bayes error should be
for either of those cases.
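To make the logic of the comparison concrete, here is a rough sketch of the diagnostic as code; the gap threshold is an arbitrary illustration, since in practice this judgment depends on the problem.

```python
# Rough bias/variance diagnostic from error levels. The `gap` threshold is
# purely illustrative; there is no universal rule of thumb.
def diagnose(train_err, cv_err, bayes_err, gap=0.02):
    high_bias = (train_err - bayes_err) > gap     # train error far from the best achievable error
    high_variance = (cv_err - train_err) > gap    # cross-validation error much worse than train error
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "neither (close to the Bayes error on both sets)"

# Example: human-level error ~1%, train error 1.5%, CV error 8% -> high variance.
print(diagnose(train_err=0.015, cv_err=0.08, bayes_err=0.01))
```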
Taking action based on the diagnostic
The actions you can take based on the bias/variance diagnostic differ
from one model to another.
In this article I will only present the ones for Linear and Logistic
Regression and Neural Networks, but you can find the corresponding
actions for tree-based models, KNN, and SVM with a quick Google search.
The key insight here is that you should diagnose the type of problem you
have (high bias or high variance) as quickly as possible.
High bias:
Increase the number of gradient descent iterations (all)
Add polynomial features (Linear & Logistic Regression)
Feature engineering (all)
Increase the number of layers / number of units per layer (Neural
Network)
High variance:
Add regularization: L1 norm (all), dropout regularization (Neural
Network)
Add more data (all)
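Two of these remedies can be sketched directly with scikit-learn, for instance on top of the pipeline from step 1; the degree and regularization strength below are arbitrary starting points, not recommendations.

```python
# High-bias remedy: add polynomial features so a linear model can fit richer patterns.
# High-variance remedy: add L1 regularization to shrink the weights.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Degree-2 polynomial expansion of the (already numeric) features.
high_bias_fix = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
)

# L1-penalized logistic regression (smaller C = stronger regularization).
high_variance_fix = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)

# Usage on any numeric X and binary y (e.g. the train split from step 1):
# high_bias_fix.fit(X_train, y_train); high_variance_fix.fit(X_train, y_train)
```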
3. Error Analysis
Error analysis consists of collecting, from the test set, a random sample
of misclassified records in the case of a classification problem, or of
records for which the prediction error was high in the case of a
regression problem. You should then analyze the distribution of that
sample across various categories.
[Table: error analysis of a sample of misclassified images for a cat detector]
In the error analysis table above, you can see a practical example of the
method for a cat detector algorithm.
The main insight that can be drawn from that table is that 61% of the
misclassified records were blurry images and 43% were images of great
cats. Based on those results, spending some time improving the
algorithm's performance on Great Cat and blurry images seems
worthwhile.
The dataset may or may not contain detailed information about its
records. That is why manually looking at the records may help you
create categories based on your observations.
In the example above, it is only by manually looking at and categorizing
the images that the insights on how to improve performance were
discovered.
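Once the manual review is done, the tally itself is trivial. Here is a small sketch assuming you record your hand-made category labels (the column names below are hypothetical) in a DataFrame with one row per reviewed misclassified record.

```python
# Error-analysis tally sketch: one row per manually reviewed misclassified
# image, one boolean column per hand-made category (rows can overlap).
import pandas as pd

reviewed = pd.DataFrame({
    "blurry":    [True, True, False, True, False],
    "great_cat": [False, True, True, False, True],
    "dog":       [False, False, False, True, False],
})

# Share of the misclassified sample falling in each category, in percent.
print((reviewed.mean() * 100).round(1).sort_values(ascending=False))
```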
Continuously test and learn using your evaluation metric
Throughout the second and third steps, use the evaluation setup built in
step 1 to track the improvement of your algorithm's performance.
You should also use this setup to test different hyperparameters and
models, and to compare different methods for filling null values and
filtering out outliers.
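As a sketch of that test-and-learn loop, the snippet below compares a few candidate models on the cross-validation set with F1, reusing the preprocess transformer and the X_train, y_train, X_cv, y_cv split from the step 1 sketch; the candidates and their hyperparameters are arbitrary examples, and the test set stays untouched until the very end.

```python
# Compare candidate models/hyperparameters on the same cross-validation split,
# using the metric chosen in step 1 (here F1).
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

candidates = {
    "logreg C=1.0":  LogisticRegression(max_iter=1000, C=1.0),
    "logreg C=0.1":  LogisticRegression(max_iter=1000, C=0.1),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, clf in candidates.items():
    # `preprocess`, X_train, y_train, X_cv, y_cv come from the step 1 sketch.
    model = Pipeline([("prep", preprocess), ("clf", clf)])
    model.fit(X_train, y_train)
    print(name, "cv F1:", round(f1_score(y_cv, model.predict(X_cv)), 3))
```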