In: Computer Science
1. Define the classification problem
2. What is the main difference between Simple Matching Coefficient (SMC) Similarity and Jaccard Similarity?
3. Explain in your own words how the Decision Tree Classifier works.
4. Explain in your own words how the SVM Classifier works.
1. Define the classification problem.
---> The classification problem arises because, for many real-world objects and systems, coming up with an iron-clad classification scheme (to determine whether an object is a member of a set or not, or which of several sets it belongs to) is difficult.
---> Classification is a central topic in machine learning that has to do with teaching machines how to group together data by particular criteria.
---> Classification is the process where computers group data together based on predetermined characteristics; this is called supervised learning.
---> There is an unsupervised version of classification, called clustering, in which computers find shared characteristics by which to group data when categories are not specified.
---> A common example of classification comes with detecting spam emails.
---> To write a program to filter out spam emails, a computer programmer can train a machine learning algorithm with a set of spam-like emails labelled as spam and regular emails labelled as not-spam.
---> The idea is to make an algorithm that can learn characteristics of spam emails from this training set so that it can filter out spam emails when it encounters new emails.
---> Classification is an important tool in today’s world, where big data is used to make all kinds of decisions in government, economics, medicine, and more.
---> Researchers have access to huge amounts of data, and classification is one tool that helps them to make sense of the data and find patterns.
---> While classification in machine learning requires the use of (sometimes) complex algorithms, classification is something that humans do naturally every day.
---> Classification is simply grouping things together according to similar features and attributes.
---> When you go to a grocery store, you can fairly accurately group the foods by food group (grains, fruit, vegetables, meat, etc.). In machine learning, classification is all about teaching computers to do the same.
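The spam-filter idea described above can be sketched as a toy supervised classifier. The training emails, their words, and the word-overlap scoring rule below are all invented for illustration; real spam filters use far richer features and models.

```python
# Toy supervised classification: label a new email as spam / not-spam
# by word overlap with a small labeled training set (invented data).

training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "not-spam"),
    ("lunch with the project team", "not-spam"),
]

# Collect the set of words seen under each label.
vocab = {}
for text, label in training:
    vocab.setdefault(label, set()).update(text.split())

def classify(email):
    """Pick the label whose training vocabulary overlaps the email most."""
    words = set(email.split())
    return max(vocab, key=lambda label: len(words & vocab[label]))

print(classify("claim your free prize"))
print(classify("monday team meeting"))
```

The "learning" here is just memorizing word sets per label, but it mirrors the structure of the spam example: labeled training data in, a decision rule out.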
2. What is the main difference between Simple Matching Coefficient (SMC) Similarity and Jaccard Similarity?
Simple matching coefficient:
--------------------------------------
The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets.
Given two objects, A and B, each with n binary attributes, SMC is defined as:
SMC = (number of matching attributes) / (number of attributes)
    = (M00 + M11) / (M00 + M01 + M10 + M11)
where:
M11 is the total number of attributes where A and B both have a value of 1.
M01 is the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M10 is the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
M00 is the total number of attributes where A and B both have a value of 0.
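Following the definition above, here is a minimal sketch computing the SMC for two equal-length binary attribute vectors (the example vectors are made up for illustration):

```python
def smc(a, b):
    """Simple matching coefficient of two equal-length binary vectors."""
    assert len(a) == len(b)
    # Positions where the vectors agree account for both M00 and M11.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 1, 0, 0, 0]
print(smc(a, b))  # 4 matching positions out of 6
```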
Jaccard index:
-------------------
The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally given the French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets.
---> The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)
If A and B are both empty, define J(A, B) = 1.
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:
d_J(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|
Difference between the SMC and the Jaccard index:
-----------------------------------------------------
---> The SMC is very similar to the more popular Jaccard index.
---> The main difference is that the SMC has the term M00 in its numerator and denominator, whereas the Jaccard index does not.
---> Thus, the SMC counts both mutual presences (when an attribute is present in both sets) and mutual absences (when an attribute is absent in both sets) as matches and compares them to the total number of attributes in the universe, whereas the Jaccard index only counts mutual presence as matches and compares it to the number of attributes that have been chosen by at least one of the two sets.
---> In market basket analysis, for example, the baskets of two consumers whom we wish to compare might only contain a small fraction of all the available products in the store, so the SMC will usually return very high values of similarity even when the baskets bear very little resemblance, thus making the Jaccard index a more appropriate measure of similarity in that context.
---> For example, consider a supermarket with 1000 products and two customers. The basket of the first customer contains salt and pepper and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard index would be 1/3, but the similarity becomes 0.998 using the SMC.
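The supermarket numbers above can be checked directly, using exactly the universe size and baskets from the example:

```python
# Two customers in a store with 1000 products (the example above).
n_products = 1000
basket1 = {"salt", "pepper"}
basket2 = {"salt", "sugar"}

# Jaccard: only products present in at least one basket matter.
jaccard = len(basket1 & basket2) / len(basket1 | basket2)

# SMC: mutual absences (the ~997 products in neither basket) also count.
m11 = len(basket1 & basket2)               # present in both baskets
m00 = n_products - len(basket1 | basket2)  # absent from both baskets
smc = (m00 + m11) / n_products

print(jaccard)  # 1/3
print(smc)      # 0.998
```

The huge gap between 1/3 and 0.998 is entirely due to the 997 mutual absences that the SMC counts as matches.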
3. Explain in your own words how the Decision Tree Classifier works.
---> A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
---> It is one way to display an algorithm that only contains conditional control statements.
---> Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.
---> The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes. And the decision nodes are where the data is split.
Decision rules:
-------------------
---> The decision tree can be linearized into decision rules, where the outcome is the contents of the leaf node, and the conditions along the path form a conjunction in the if clause. In general, the rules have the form: if condition1 and condition2 and condition3 then outcome.
---> Decision rules can be generated by constructing association rules with the target variable on the right. They can also denote temporal or causal relations.
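As a sketch of how decision nodes, leaves, and linearized rules fit together, here is a tiny hand-written decision tree (the features and the threshold are invented, not learned from data), followed by its equivalent if-then rules:

```python
# A tiny hand-written decision tree for "go outside?".
# Decision nodes test a feature; leaves hold the final outcome.

def tree_predict(weather):
    if weather["raining"]:            # decision node 1
        return "stay in"              # leaf
    if weather["temperature"] < 5:    # decision node 2
        return "stay in"              # leaf
    return "go out"                   # leaf

# The same tree linearized into decision rules, one per root-to-leaf path:
#   if raining                              then stay in
#   if not raining and temperature < 5      then stay in
#   if not raining and temperature >= 5     then go out

print(tree_predict({"raining": False, "temperature": 20}))
```

A learned tree differs only in that an algorithm (e.g. one based on information gain) chooses which feature and threshold to split on at each decision node.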
Advantages and disadvantages:
---------------------------------------------
Among decision support tools, decision trees (and influence diagrams) have several advantages. Decision trees:
1. Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
2. Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
3. Help determine worst, best and expected values for different scenarios.
4. Use a white box model. If a given result is provided by a model, the explanation for that result can be traced through simple Boolean logic.
5. Can be combined with other decision techniques.
Disadvantages of decision trees:
1. They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
2. They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
3. For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.
4. Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
4. Explain in your own words how the SVM Classifier works.
---> SVM is a supervised machine learning algorithm which can be used for classification or regression problems.
---> It uses a technique called the kernel trick to transform your data and then based on these transformations it finds an optimal boundary between the possible outputs.
---> Simply put, it does some extremely complex data transformations, then figures out how to separate your data based on the labels or outputs you've defined.
---> Support Vector Machines (SVMs) represent the cutting edge of ranking algorithms and have been receiving special attention from the international scientific community.
---> Many successful applications, based on SVMs, can be found in different domains of knowledge, such as in text categorization, digital image analysis, character recognition and bioinformatics.
---> SVMs are a relatively new approach compared to other supervised classification techniques; they are based on statistical learning theory developed by the Russian scientist Vladimir Naumovich Vapnik back in 1962, and since then his original ideas have been perfected by a series of new techniques and algorithms.
So what makes it so great?
---> Non-linear SVM means that the boundary that the algorithm calculates doesn't have to be a straight line.
---> The benefit is that you can capture much more complex relationships between your data points without having to perform difficult transformations on your own. The downside is that the training time is much longer, as it's much more computationally intensive.
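The kernel idea above can be illustrated without any SVM library. In the toy 1-D setup below (the points and the feature map are hand-picked for illustration), class A sits at x = -1 and x = 1 and class B at x = 0, so no single threshold on x separates them; after mapping x to (x, x²), the straight line x² = 0.5 does.

```python
# Class A at x = -1 and x = 1, class B at x = 0: not separable by any
# single threshold on x. The feature map x -> (x, x**2) fixes that.

points = [(-1.0, "A"), (0.0, "B"), (1.0, "A")]

def feature_map(x):
    return (x, x * x)

def classify(x):
    # In the mapped 2-D space, the horizontal line x2 = 0.5 is a
    # linear boundary that separates the two classes.
    _, x2 = feature_map(x)
    return "A" if x2 > 0.5 else "B"

assert all(classify(x) == label for x, label in points)
print("separable after mapping")
```

A real kernel SVM never computes the mapped coordinates explicitly; the kernel trick evaluates inner products in the mapped space directly, which is what makes very high-dimensional (even infinite-dimensional) maps affordable.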
---> Support vector machines are computational algorithms that construct a hyperplane or a set of hyperplanes in a high or infinite dimensional space.
---> SVMs can be used for classification, regression, or other tasks. Intuitively, a separation between two linearly separable classes is achieved by any hyperplane that misclassifies no data points of either class; that is, all points belonging to class A are labeled as +1, for example, and all points belonging to class B are labeled as -1.
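A full maximum-margin SVM solver is beyond a short sketch, but the +1/-1 labeling and the idea of a separating hyperplane w·x + b = 0 can be shown with the simpler perceptron update rule. Note the hedge: this finds *a* separating hyperplane for linearly separable data, not the maximum-margin one an SVM would choose; the 2-D points are invented.

```python
# Find a separating hyperplane w.x + b = 0 for two linearly separable
# classes labeled +1 and -1, using the perceptron rule (illustration
# only; an SVM would pick the maximum-margin hyperplane instead).

data = [((2.0, 2.0), +1), ((3.0, 1.0), +1),      # class A, labeled +1
        ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]   # class B, labeled -1

w, b = [0.0, 0.0], 0.0
for _ in range(100):  # repeat passes until a pass makes no mistakes
    mistakes = 0
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified point
            w[0] += y * x1                         # nudge the hyperplane
            w[1] += y * x2
            b += y
            mistakes += 1
    if mistakes == 0:
        break

def predict(x1, x2):
    return +1 if w[0] * x1 + w[1] * x2 + b > 0 else -1

assert all(predict(*x) == y for x, y in data)
print("hyperplane:", w, b)
```

The quantity y * (w·x + b) being positive for every training point is exactly the "no misclassification on either class" condition from the text; an SVM additionally maximizes the distance from the hyperplane to the nearest points of each class.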