Find an example of the use of a) cluster analysis and b) classification in research or business literature. Preferably, those examples should be taken from the same domain. Analyze the problems that were solved with those two methods and the conclusions that were made. Address the selection of the particular algorithms if it is provided. Draw a more general conclusion about how to decide which method to use in similar cases.
Cluster analysis and classification are both machine learning methods, but they differ in several ways.
1. Classification assigns data to predefined class labels, whereas in clustering there are no predefined class labels. In simple words, classification is supervised learning (the data has a dependent variable) and cluster analysis is unsupervised learning (the data has no dependent variable).
2. Classification uses algorithms such as decision trees and Bayesian classifiers, whereas cluster analysis uses algorithms such as K-means.
3. Classification relies on prior knowledge of the classes, but cluster analysis has no prior knowledge of the classes.
Example of classification:
Suppose an electricity company wants to classify its consumption data based on each person's electricity consumption. It has three predefined classes:
1) Abnormal (records of customers who have consumed much more electricity than usual)
2) Subnormal (customers who have consumed very little electricity, or for whom a wrong bill was generated by mistake)
3) Normal (average consumption)
So with the classification method, the data is classified into the predefined classes (abnormal, subnormal and normal), and the trained model can also predict which class new data will fall into.
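As a rough illustration of the classification case, here is a minimal sketch in Python. It uses a nearest-centroid classifier as a simple stand-in for the decision trees or Bayesian classifiers mentioned above. The training values, their labels, and the function names `fit_centroids` and `predict` are all hypothetical and invented for this example; they do not come from any real company data.

```python
from statistics import mean

# Hypothetical labelled training data: (consumed units, class label).
# These values and labels are invented for illustration only.
training = [
    (5, "subnormal"), (12, "subnormal"), (20, "subnormal"),
    (110, "normal"), (150, "normal"), (180, "normal"),
    (520, "abnormal"), (650, "abnormal"), (710, "abnormal"),
]

def fit_centroids(samples):
    """Return the mean consumption of each predefined class."""
    by_label = {}
    for units, label in samples:
        by_label.setdefault(label, []).append(units)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, units):
    """Assign a new record to the class whose mean is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - units))

centroids = fit_centroids(training)
# Predict the class of three new, unlabelled consumption readings.
new_labels = [predict(centroids, u) for u in (8, 140, 600)]
print(new_labels)
```

The key point is that the class labels exist before the model is built: the model only learns how to map consumption onto classes that someone has already defined.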
Example of cluster analysis:
Suppose an electricity company wants to cluster its data, meaning the company wants to form homogeneous groups of customers based on electricity consumption (there are no predefined classes).
For this type of analysis we use cluster analysis methods.
Conclusion: The choice of method depends on the data. In simple words, if the data has a dependent variable (predefined class labels), we can use a classification method, and if the data has no dependent variable (no predefined class labels), we can use cluster analysis.
Let’s take an example: we have data from an electricity company, and there are no predefined class labels.
ACCOUNT_ID | DEMAND | NET_PAYABLE_AMOUNT | COLLECTION_AMOUNT | CONSUMED_UNITS
5368742000 | 3438 | 3438 | 3437 | 349
7582052000 | 1267 | 1267 | 1267 | 118
2483152000 | 252 | 2525 | 1575 | 42
4061696749 | 655 | 1049 | 400 | 122
5480352000 | 265 | 555 | 290 | 53
9122742000 | 6762 | 6762 | 11800 | 708
8610052000 | 460 | 1298 | 802 | 90
5100441549 | 4903 | 9571 | 9571 | 533
682052000 | 1660 | 1660 | 1660 | 161
3138942000 | 1586 | 5138 | 5136 | 170
Here the consumers are clustered into homogeneous groups based on their data, and we use the K-means clustering algorithm because no predefined classes are available.
We cannot apply the classification method here, because the data set has no predefined categories such as "abnormal", "subnormal" and "normal". So, based on the data, we can apply a cluster analysis algorithm.
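To make the clustering step concrete, here is a minimal sketch of K-means in plain Python, applied only to the CONSUMED_UNITS values listed above. Several things here are assumptions made for illustration: the choice of k = 3, the use of a single column instead of all of them, the deterministic starting centroids, and the function name `kmeans_1d`. A real analysis would probably scale and combine several columns and try more than one value of k.

```python
# CONSUMED_UNITS column from the data above, in row order.
consumed_units = [349, 118, 42, 122, 53, 708, 90, 533, 161, 170]

def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one-dimensional data; requires k >= 2."""
    ordered = sorted(values)
    # Deterministic start (an assumption): k values spread evenly over the sorted data.
    centroids = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels, centroids

labels, centroids = kmeans_1d(consumed_units, k=3)
for units, label in zip(consumed_units, labels):
    print(units, "-> cluster", label)
```

The cluster numbers that come out are only group identifiers. Unlike the classes in the classification example, they carry no meaning such as "normal" or "abnormal" until an analyst inspects the groups and interprets them.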