# 3) Problem 3.5 from the textbook asks you to do and interpret a principal components analysis on the given correlation matrix, which can be entered into R with the following code:
my.cor.mat <- matrix(c(
  1,    .402, .396, .301, .305, .339, .340,
  .402, 1,    .618, .150, .135, .206, .183,
  .396, .618, 1,    .321, .289, .363, .345,
  .301, .150, .321, 1,    .846, .759, .661,
  .305, .135, .289, .846, 1,    .797, .800,
  .339, .206, .363, .759, .797, 1,    .736,
  .340, .183, .345, .661, .800, .736, 1
), ncol = 7, nrow = 7, byrow = TRUE)
As mentioned in the book, the 7 variables are 'head length', 'head breadth', 'face breadth', 'left finger length', 'left forearm length', 'left foot length', 'height'.
Obtain the principal components (including choosing an appropriate number of PCs). Also make an attempt to interpret your PCs.
INTERPRETATION:
The Proportion of Variance row of the above output gives the share of the total variance (information) captured by each component. From that row, the first principal component explains 75% of the total variance, with a reported eigenvalue of 2.29, and the second principal component explains 15%, with a reported eigenvalue of 1.04. From the Cumulative Proportion row, the first two principal components together explain 91% of the total variance in the given data.

One way to choose the appropriate number of principal components is the eigenvalue criterion. An eigenvalue indicates how well a component summarises the data. For standardized variables, an eigenvalue of 1.0 means that the component carries the same amount of information as a single variable, so components with eigenvalues greater than 1 are retained. Here the first two components meet that criterion.
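As a cross-check on the figures quoted above, here is a minimal sketch that runs the PCA directly on the correlation matrix with base R's `eigen()`. It assumes `my.cor.mat` is decomposed as-is rather than passed to a function that expects raw observations; the names `eig`, `prop.var`, `cum.var`, `pca.summary` and `n.keep` are illustrative and do not come from the textbook. Two facts help when reading the result: the eigenvalues of a 7-variable correlation matrix sum to 7, and because every entry of this matrix is non-negative, the largest eigenvalue cannot exceed the largest row sum (4.2), so the first component can account for at most 60% of the total variance. If the printed proportions differ from those quoted above, it is worth checking how the original output was produced.

```r
# Correlation matrix from the problem statement (7 body measurements).
my.cor.mat <- matrix(c(
  1,    .402, .396, .301, .305, .339, .340,
  .402, 1,    .618, .150, .135, .206, .183,
  .396, .618, 1,    .321, .289, .363, .345,
  .301, .150, .321, 1,    .846, .759, .661,
  .305, .135, .289, .846, 1,    .797, .800,
  .339, .206, .363, .759, .797, 1,    .736,
  .340, .183, .345, .661, .800, .736, 1
), ncol = 7, nrow = 7, byrow = TRUE)

# Eigen-decomposition of the correlation matrix itself.
eig <- eigen(my.cor.mat, symmetric = TRUE)
eigenvalues <- eig$values  # returned in decreasing order

# Proportion and cumulative proportion of variance per component.
# The eigenvalues of a correlation matrix sum to the number of variables.
prop.var <- eigenvalues / sum(eigenvalues)
cum.var  <- cumsum(prop.var)

pca.summary <- data.frame(
  component  = paste0("PC", seq_along(eigenvalues)),
  eigenvalue = round(eigenvalues, 3),
  proportion = round(prop.var, 3),
  cumulative = round(cum.var, 3)
)
print(pca.summary)

# Eigenvalue-greater-than-one rule: count the components that qualify.
n.keep <- sum(eigenvalues > 1)
cat("Components with eigenvalue > 1:", n.keep, "\n")
```

The `proportion` and `cumulative` columns play the same role as the Proportion of Variance and Cumulative Proportion rows discussed above.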
SCREE PLOT:
The second method for choosing the number of components is the scree plot. A scree plot displays the variation explained by each component in a principal component analysis, which helps identify how many components are needed to summarise the data.
The following scree plot shows the eigenvalues on the vertical axis, ordered from largest to smallest, and the component number on the horizontal axis.
By the elbow rule, the components to retain are those up to the elbow point, after which the curve flattens.
In the scree plot there is little difference in the variance explained beyond the second component (component 2 sits at the elbow point), so the first two components account for most of the variance explained. Combining the scree plot with the eigenvalue criterion (eigenvalue greater than one), we retain the first two principal components, which together capture 91% of the total information.
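The scree plot described above can be reproduced from the correlation matrix alone. The following is a minimal sketch in base R; the helper `draw.scree()` and the vector `drops` are hypothetical names introduced for illustration, and the dashed line at 1 is simply a visual aid for the eigenvalue-greater-than-one criterion.

```r
# Correlation matrix from the problem statement.
my.cor.mat <- matrix(c(
  1,    .402, .396, .301, .305, .339, .340,
  .402, 1,    .618, .150, .135, .206, .183,
  .396, .618, 1,    .321, .289, .363, .345,
  .301, .150, .321, 1,    .846, .759, .661,
  .305, .135, .289, .846, 1,    .797, .800,
  .339, .206, .363, .759, .797, 1,    .736,
  .340, .183, .345, .661, .800, .736, 1
), ncol = 7, nrow = 7, byrow = TRUE)

eigenvalues <- eigen(my.cor.mat, symmetric = TRUE)$values

# Drop in eigenvalue from each component to the next; the elbow is
# roughly where these drops become small.
drops <- -diff(eigenvalues)
print(round(drops, 3))

# Hypothetical helper: eigenvalues against component number.
draw.scree <- function(values) {
  plot(seq_along(values), values, type = "b",
       xlab = "Principal component", ylab = "Eigenvalue",
       main = "Scree plot")
  abline(h = 1, lty = 2)  # reference line for the eigenvalue > 1 rule
}

# Draw only in an interactive session so the script also runs in batch mode.
if (interactive()) draw.scree(eigenvalues)
```

Printing `drops` alongside the plot gives a numeric view of where the curve flattens, which can help when the elbow is hard to judge by eye.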
BIPLOT:
The principal component biplot is based on the first two principal components, which carry 91% of the information contained in the correlation matrix. The biplot indicates relationships between variables through the angles between their vectors: some variables are positively correlated, others are negatively correlated or not correlated at all. The relationship between an object vector (drawn as a red line) and a variable vector is positive if the angle between them is acute and negative if it is obtuse.

In the biplot, the vector for variable 1 (head length) forms a very wide angle, more than 120°, with the vectors for variable 2 (head breadth) and variable 3 (face breadth), so the displayed relationship between variable 1 and variables 2 and 3 is negative. This reading comes from the two-dimensional display only; the corresponding entries of the correlation matrix (.402 and .396) are positive. Furthermore, the PC1 loadings are negative here, so PC1 separates individuals with high values of variables 2 and 3 and low values of variable 1 from individuals with high values of variable 1 and low values of variables 2 and 3.

By contrast, the angles among variables 1, 3, 4, 5, 6 and 7 are acute, indicating positive relationships. PC1 has positive loadings for variable 4 (left finger length), variable 5 (left forearm length), variable 6 (left foot length) and variable 7 (height), which are positively correlated with one another. PC2 has a positive loading for variable 1 (head length) and negative loadings for the other six variables.
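To check the signs and angles read off the biplot, the variable vectors can be computed from the correlation matrix. The sketch below is a rough illustration in base R. It assumes the variable vectors are the first two eigenvectors scaled by the square roots of their eigenvalues; `var.names`, `loadings2` and `angle.deg()` are hypothetical names. A correlation matrix alone carries no individual observations, so only the variable vectors can be reconstructed, not the object points. Two properties hold for any matrix whose entries are all positive, as here: the first eigenvector has entries of a single sign, and the second, being orthogonal to it, must mix positive and negative entries.

```r
# Correlation matrix from the problem statement, with variable labels.
my.cor.mat <- matrix(c(
  1,    .402, .396, .301, .305, .339, .340,
  .402, 1,    .618, .150, .135, .206, .183,
  .396, .618, 1,    .321, .289, .363, .345,
  .301, .150, .321, 1,    .846, .759, .661,
  .305, .135, .289, .846, 1,    .797, .800,
  .339, .206, .363, .759, .797, 1,    .736,
  .340, .183, .345, .661, .800, .736, 1
), ncol = 7, nrow = 7, byrow = TRUE)
var.names <- c("head length", "head breadth", "face breadth",
               "left finger length", "left forearm length",
               "left foot length", "height")

eig <- eigen(my.cor.mat, symmetric = TRUE)

# Eigenvectors are defined only up to sign. Flip each column so its
# entries sum to a non-negative value, making the printed signs stable.
vecs  <- eig$vectors[, 1:2]
signs <- ifelse(colSums(vecs) < 0, -1, 1)
vecs  <- sweep(vecs, 2, signs, "*")

# Scale by sqrt(eigenvalue): correlations of each variable with PC1, PC2.
loadings2 <- sweep(vecs, 2, sqrt(eig$values[1:2]), "*")
dimnames(loadings2) <- list(var.names, c("PC1", "PC2"))
print(round(loadings2, 3))

# Angle in degrees between two variable vectors in the PC1-PC2 plane.
angle.deg <- function(i, j) {
  a <- loadings2[i, ]
  b <- loadings2[j, ]
  cosine <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))
  acos(max(-1, min(1, cosine))) * 180 / pi
}
angles <- outer(1:7, 1:7, Vectorize(angle.deg))
dimnames(angles) <- list(var.names, var.names)
print(round(angles))
```

Angles below 90 in the `angles` table correspond to acute angles (positive displayed relationships) and angles above 90 to obtuse ones. Because only two components are used, these angles approximate the correlations rather than reproduce them.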