Describe the difference between classification, clustering, and association rules. Be specific and provide details.
Answer:
Classification:
Classification is the process of learning a model that distinguishes between predetermined classes of data. It is a two-step process, comprising a learning step and a classification step. In the learning step, a classification model is constructed from training data whose class labels are known; in the classification step, the constructed model is used to predict the class labels for given data.
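The two steps can be sketched with a very small nearest-centroid classifier. This is only an illustration under stated assumptions: the function names `learn` and `classify`, the features (height, weight), and the labels are all hypothetical, and nearest-centroid is just one of many possible classification models.

```python
from collections import defaultdict
from math import dist


def learn(training_points, training_labels):
    """Learning step: build a model (one mean point per class) from labeled data."""
    grouped = defaultdict(list)
    for point, label in zip(training_points, training_labels):
        grouped[label].append(point)
    # The "model" here is simply the centroid of each predefined class.
    return {
        label: tuple(sum(coords) / len(points) for coords in zip(*points))
        for label, points in grouped.items()
    }


def classify(model, point):
    """Classification step: predict the class whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], point))


# Hypothetical labeled training data: (height_cm, weight_kg) with a class label.
training_points = [(150, 50), (155, 55), (180, 85), (185, 90)]
training_labels = ["small", "small", "large", "large"]

model = learn(training_points, training_labels)
prediction = classify(model, (178, 80))
print(prediction)
```

The class labels are supplied during the learning step, which is what makes this a supervised method.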
Clustering:
Clustering is a technique for organising a set of data into groups (clusters) such that objects inside the same cluster are highly similar to each other and objects in different clusters are dissimilar. Here the clusters can be considered disjoint. The main goal of clustering is to divide the whole data set into multiple clusters. Unlike classification, the class labels of the objects are not known beforehand, so clustering is a form of unsupervised learning.
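As a rough sketch of this idea, the snippet below runs a minimal k-means loop on unlabeled points. k-means is only one clustering algorithm among many, and the function name `k_means`, the sample points, and the starting centroids are assumptions made for illustration.

```python
from math import dist


def k_means(points, initial_centroids, iterations=10):
    """Group unlabeled points into len(initial_centroids) disjoint clusters."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        # Assignment: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical unlabeled data; note that no class labels are supplied.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = k_means(points, initial_centroids=[(0, 0), (10, 10)])
print(clusters)
```

Each point ends up in exactly one cluster, so the resulting groups are disjoint, and the grouping is driven only by distances between the points themselves.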
Association:
Association rules are if-then statements that express how likely data items are to occur together within large data sets in various types of databases. Association rule mining has a number of applications and is widely used to discover sales correlations in transactional data and patterns in medical data sets.
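To make the "if-then" reading concrete, here is a small sketch that scores a single rule, "if bread then butter", using two commonly used measures: support and confidence. The transactions and the helper names `support` and `confidence` are hypothetical; a real rule miner such as Apriori would also search for candidate rules rather than score one chosen by hand.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data: each set is one transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(f"if bread then butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here bread and butter appear together in 2 of 4 transactions, and butter appears in 2 of the 3 transactions that contain bread.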
Difference Between Classification and Clustering:
Classification and clustering are two learning methods that organise objects into groups based on one or more features. The processes appear similar, but in the context of data mining they differ. The primary difference is that classification is a supervised learning technique, in which predefined labels are assigned to instances according to their properties, whereas clustering is an unsupervised learning technique, in which similar instances are grouped based on their features or properties.
When the system is trained on tuples whose class labels are known and is then tested on new data, this is known as supervised learning. Unsupervised learning, on the other hand, does not use labeled training data: the class labels of the samples are not known in advance.
Basis for comparison | Classification | Clustering |
---|---|---|
Basic | The model assigns each data item to one of several predefined classes. | The data items are mapped into one of multiple clusters, where the arrangement relies on the similarities between them. |
Involved in | Supervised learning | Unsupervised learning |
Labeled training sample | Provided | Not provided |
The difference between clustering and association:
By definition, clustering groups a set of objects in such a manner that objects in the same group are more similar to each other than to objects belonging to other groups.
Association rule mining, by contrast, is about finding associations among items within large commercial databases.
Now, let's take an example. Suppose we have data on trips and corresponding product purchases, with one row per trip and one column per product, where “1” means purchase and “0” means no purchase.
Now, let’s ask ourselves 2 business questions:
i) Which trips have similar product purchases?
ii) Which products could be grouped together?
Question (i) would be answered by clustering – where we will look at similarities between trips (ti, tj) based on purchased product dimensions.
Question (ii) would be answered by association rules – where we will look at co-occurrences of products (Pi, Pj) within trips, and association rules will be derived based on popular metrics such as support, confidence, and lift.
So both clustering and association rule mining (ARM) belong to the field of unsupervised machine learning. Clustering is about the data points; ARM is about finding relationships between the attributes of those data points.
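The contrast between the two questions can be sketched on a small 0/1 purchase matrix. The matrix below is made up purely for illustration, as are the names `jaccard` and `lift`. Jaccard similarity stands in for whatever similarity measure a clustering algorithm would use, and lift is one of the rule metrics mentioned above. The sketch assumes every product is purchased at least once, otherwise the lift would divide by zero.

```python
# Hypothetical 0/1 purchase matrix: rows are trips, columns are products.
products = ["P1", "P2", "P3"]
trips = {
    "t1": [1, 1, 0],
    "t2": [1, 1, 0],
    "t3": [0, 0, 1],
    "t4": [1, 0, 1],
}


def jaccard(row_a, row_b):
    """Similarity of two trips: shared purchases / purchases in either trip."""
    both = sum(a and b for a, b in zip(row_a, row_b))
    either = sum(a or b for a, b in zip(row_a, row_b))
    return both / either if either else 0.0


def lift(column_a, column_b):
    """Co-occurrence of two products relative to what independence would predict."""
    n = len(column_a)
    p_a = sum(column_a) / n
    p_b = sum(column_b) / n
    p_ab = sum(a and b for a, b in zip(column_a, column_b)) / n
    return p_ab / (p_a * p_b)


# Question (i), clustering view: compare rows (trips).
trip_similarity = jaccard(trips["t1"], trips["t2"])

# Question (ii), association view: compare columns (products).
columns = {name: [row[i] for row in trips.values()] for i, name in enumerate(products)}
product_lift = lift(columns["P1"], columns["P2"])

print(trip_similarity, product_lift)
```

The same matrix feeds both computations: clustering compares its rows, while association rule mining compares its columns. A lift above 1 suggests that two products are bought together more often than independent purchases would predict.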