What is unsupervised learning, what difficulties are involved in unsupervised learning, and can you name a few unsupervised algorithms?
What is PCA? When should PCA be used? [Please explain with one or two examples]
How does PCA work? [Please write at least 5 sentences]
What are the different methods by which you can compute PCA? Does every method yield the same result at the end?
What are the advantages and disadvantages of PCA? [Explain with an example]
What is clustering? Explain how the K-means clustering algorithm works.
What are the advantages and disadvantages of the clustering algorithms discussed in our class (K-means, hierarchical)?
Which clustering algorithm is better, K-means or hierarchical clustering? Explain with a proper example which algorithm is better.
1. Unsupervised learning is a technique that uses unlabeled data to train the model. Unlike supervised learning, the model cannot associate feature values with a known target. The amount of unlabeled data in the real world is far greater than the amount of labeled data. The following are some of the difficulties in unsupervised learning:
- The time complexity is often greater than that of most supervised learning methods.
- The number of clusters that the model will form is unknown prior to training.
- Data pre-processing is often difficult because labels are unavailable.
- The model might find patterns in the data that are not useful.
A few unsupervised algorithms are K-means, hierarchical clustering, and fuzzy C-means.
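As a rough illustration of one of these algorithms, below is a minimal K-means sketch in Python (assuming scikit-learn is installed; the unlabeled data is synthetic, generated with make_blobs). Note that the number of clusters has to be chosen up front, which is exactly one of the difficulties listed above.

```python
# Minimal K-means sketch on unlabeled synthetic data (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # true labels are discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)     # n_clusters must be supplied
labels = kmeans.fit_predict(X)                                # cluster index for each sample
print(kmeans.cluster_centers_)                                # learned cluster centers
```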
2. Real-world datasets often have thousands of features (dimensions). This can hurt the performance and accuracy of some models because not all the features in the dataset are useful to the model. Such features can be removed during data pre-processing using techniques like PCA, LDA, and GDA. Apart from improving the performance of the model, reducing the number of features also helps with data visualization.
Example: the popular MNIST dataset contains images of handwritten digits, and each pixel in an image is treated as a feature.
Consider an image of the digit "2" from the MNIST dataset. All the digits are centered, so the white pixels between the border and the handwritten digit are not useful. In these situations, dimensionality reduction can be used to remove the unwanted features.
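As a rough sketch of this use case (assuming scikit-learn; its small 8x8 load_digits dataset is used here only as a stand-in for the full 28x28 MNIST images), PCA can compress the 64 pixel features down to 2 components, for example for a 2-D scatter-plot visualization:

```python
# Reduce the 64 pixel features of the digits dataset to 2 components (assumes scikit-learn).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)        # X has shape (1797, 64)
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                          # (1797, 2): 64 pixel features -> 2 components
```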
3. PCA is the most popular dimensionality reduction technique. PCA first finds the axis that accounts for the largest amount of variance in the data, then a second axis, orthogonal to the first, that accounts for the largest remaining variance, and so on. The data is then projected onto the hyperplane spanned by the highest-variance axes, so the directions with little variance are removed. For example, consider a 2-D dataset projected onto three candidate axes (a solid, a dotted, and a dashed axis).
The maximum variance lies along the solid axis, followed by the dotted axis and then the dashed axis. Thus the solid and dotted axes can be preserved and the dashed axis can be removed.
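A minimal from-scratch sketch of this idea, assuming only NumPy and a small synthetic 2-D dataset: the data is centered, the covariance matrix is computed, and the eigenvector with the largest eigenvalue (the direction of maximum variance) is kept while the low-variance direction is dropped.

```python
# From-scratch PCA sketch via eigendecomposition of the covariance matrix (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.5], [0.5, 0.3]])  # correlated 2-D data

X_centered = X - X.mean(axis=0)                 # center the data
cov = np.cov(X_centered, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]               # sort axes by decreasing variance
components = eigvecs[:, order[:1]]              # keep the highest-variance axis
X_reduced = X_centered @ components             # project the 2-D data onto 1-D
print(eigvals[order])                           # variance along each principal axis
```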
5. Advantages of PCA: it improves model performance, removes correlated features, and improves visualization.
Disadvantages: the retained features are transformed into principal components, which can be difficult to interpret in some cases. The data also has to be scaled and standardized before applying PCA, as sketched below.
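To illustrate the scaling caveat, here is a minimal sketch (assuming scikit-learn; its bundled wine dataset is used only as an example) that standardizes the features to zero mean and unit variance before applying PCA, so that features with large raw units do not dominate the variance.

```python
# Standardize features before PCA so no single feature dominates (assumes scikit-learn).
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)                              # 178 samples, 13 features
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)                                         # (178, 2)
```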