In: Statistics and Probability
What are the principal aspects of data that need to be examined when using multivariate analysis?
I. OVERVIEW
Multivariate analysis in statistics is devoted to the summarization, representation, and interpretation of data when more than one characteristic of each sample unit is measured. Almost all data-collection processes yield multivariate data. The medical diagnostician examines pulse rate, blood pressure, hemoglobin, temperature, and so forth; the educator observes for individuals such quantities as intelligence scores, quantitative aptitudes, and class grades; the economist may consider, at various points in time, indexes and measures such as per capita personal income, the gross national product, employment, and the Dow-Jones average. Problems using these data are multivariate because inevitably the measures are interrelated and because investigations involve inquiry into the nature of such interrelationships and their uses in prediction, estimation, and methods of classification. Thus, multivariate analysis deals with samples in which for each unit examined there are observations on two or more stochastically related measurements. Most of multivariate analysis deals with estimation, confidence sets, and hypothesis testing for means, variances, covariances, correlation coefficients, and related, more complex population characteristics.
Only a sketch of the history of multivariate analysis is given here. The procedures of multivariate analysis that have been studied most are based on the multivariate normal distribution discussed below.
Robert Adrain considered the bivariate normal distribution early in the nineteenth century, and Francis Galton understood the nature of correlation near the end of that century. Karl Pearson made important contributions to correlation, including multiple correlation, and to regression analysis early in the present century. G. U. Yule and others considered measures of association in contingency tables, and thus began multivariate developments for counted data. The pioneering work of “Student” (W. S. Gosset) on small-sample distributions led to R. A. Fisher’s distributions of simple and multiple correlation coefficients. J. Wishart derived the joint distribution of sample variances and covariances for small multivariate normal samples. Harold Hotelling generalized the Student t-statistic and t-distribution for the multivariate problem. S. S. Wilks provided procedures for additional tests of hypotheses on means, variances, and covariances. Classification problems were given initial consideration by Pearson, Fisher, and P. C. Mahalanobis through measures of racial likeness, generalized distance, and discriminant functions, with some results similar to the work of Hotelling. Both Hotelling and Maurice Bartlett made initial studies of canonical correlations, intercorrelations between two sets of variates. More recent research by S. N. Roy, P. L. Hsu, Meyer Girshick, D. N. Nanda, and others has dealt with the distributions of certain characteristic roots and vectors as they relate to multivariate problems, notably to canonical correlations and multivariate analysis of variance. Much attention has also been given to the reduction of multivariate data and its interpretation through many papers on factor analysis and principal components. [For further discussion of the history of these special areas of multivariate analysis and of their present-day applications, see Counted Data; Distributions, Statistical, article on Special Continuous Distributions; Factor Analysis; Multivariate Analysis, articles on Correlation and Classification and Discrimination; Statistics, Descriptive, article on Association; and the biographies of Fisher, R. A.; Galton; Girshick; Gosset; Pearson; Wilks; Yule.]
Basic multivariate distributions
Scientific progress is made through the development of more and more precise and realistic representations of natural phenomena. Thus, science, and to an increasing extent social science, uses mathematics and mathematical models for improved understanding, such mathematical models being subject to adoption or rejection on the basis of observation [see Models, Mathematical]. In particular, stochastic models become necessary as the inherent variability in nature becomes understood.
The multivariate normal distribution provides the stochastic model on which the main theory of multivariate analysis is based. The model has sufficient generality to represent adequately many experimental and observational situations while retaining relative simplicity of mathematical structure. The possibility of applying the model to transforms of observations increases its scope [see Statistical Analysis, Special Problems Of, article on Transformations Of Data]. The large-sample theory of probability and the multivariate central limit theorem add importance to the study of the multivariate normal distribution as it relates to derived distributions. Inquiry and judgment about the use of any model must be the responsibility of the investigator, perhaps in consultation with a statistician. There is still a great deal to be learned about the sensitivity of multivariate procedures to departures from the assumption of multivariate normality. [See Errors, article on Effects Of Errors In Statistical Assumptions.]
The multivariate normal distribution
Suppose that the characteristics or variates to be measured on each element of a sample from a population, conceptual or real, obey the probability law described through the multivariate normal probability density function. If these variates are p in number and are designated by X1, …, Xp, the multivariate normal density contains p parameters, or population characteristics, μ1, …, μp, representing, respectively, the means or expected values of the variates, and parameters σij, i, j = 1, …, p, with σji = σij, representing variances and covariances of the variates. Here σii is the variance of Xi (corresponding to the variance σ² of a variate X in the univariate case) and σij = σji is the covariance of Xi and Xj. The correlation coefficient between Xi and Xj is

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}.$$
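As a concrete numerical illustration (added here; not part of the original article), the following Python sketch computes each correlation coefficient from a hypothetical covariance matrix Σ using the formula just given.

```python
import numpy as np

# Hypothetical covariance matrix Sigma for p = 3 variates (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 9.0, 1.5],
                  [0.5, 1.5, 1.0]])

# rho_ij = sigma_ij / sqrt(sigma_ii * sigma_jj)
sd = np.sqrt(np.diag(Sigma))      # standard deviations, sqrt(sigma_ii)
rho = Sigma / np.outer(sd, sd)    # elementwise division yields the correlation matrix

print(rho)  # unit diagonal; off-diagonal entries lie between -1 and 1
```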
The multivariate normal probability density function provides the probability density for the variates X1, …, Xp at each point x1, …, xp in the sample or observation space. Its specific mathematical form is

$$f(x_1, \ldots, x_p) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x - \mu)'\,\Sigma^{-1}(x - \mu)\right],$$

$-\infty < x_i < \infty$, $i = 1, \ldots, p$. [For the explicit form of this density in the bivariate case (p = 2), see Multivariate Analysis, article on Correlation.]
(Vector and matrix notation and an understanding of elementary aspects of matrix algebra are important for any real understanding or application of multivariate analysis. Thus, x′ is the vector (x1, …, xp), μ′ is the vector (μ1, …, μp), and (x − μ)′ is the vector (x1 − μ1, …, xp − μp). Also, Σ is the p × p symmetric matrix with elements σij, Σ = [σij]; |Σ| is the determinant of Σ; and Σ⁻¹ is its inverse. The prime indicates “transpose,” and thus (x − μ)′ is the transpose of (x − μ), a column vector.)
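The following Python sketch (an illustration added here, not part of the original article; NumPy and SciPy are assumed, and all numerical values are hypothetical) evaluates the p-variate normal density directly from the formula above and checks the result against a library implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

p = 2
mu = np.array([1.0, -1.0])            # mean vector (mu_1, mu_2)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # covariance matrix [sigma_ij]
x = np.array([0.5, 0.0])              # point at which to evaluate the density

# f(x) = (2 pi)^(-p/2) |Sigma|^(-1/2) exp(-(1/2)(x - mu)' Sigma^{-1} (x - mu))
d = x - mu
quad = d @ np.linalg.inv(Sigma) @ d   # the quadratic form (x - mu)' Sigma^{-1} (x - mu)
f = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

# The same density from SciPy; the two values agree.
assert np.isclose(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
print(f)
```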
Comparison of f(x1, …, xp) with f(x), the univariate normal probability density function, may assist understanding; for a univariate normal variate X with mean μ and variance σ²,

$$f(x) = (2\pi\sigma^2)^{-1/2} \exp\!\left[-\frac{(x - \mu)^2}{2\sigma^2}\right],$$

where $-\infty < x < \infty$.
The multivariate normal density may be characterized in various ways. One direct method begins with p independent, univariate normal variates, U1, …, Up, each with zero mean and unit variance. From the independence assumption, their joint density is the product

$$f(u_1, \ldots, u_p) = \prod_{i=1}^{p} (2\pi)^{-1/2} e^{-u_i^2/2} = (2\pi)^{-p/2} \exp\!\left(-\tfrac{1}{2}\sum_{i=1}^{p} u_i^2\right),$$

a very special case of the multivariate normal probability density function (with μ = 0 and Σ = I, the identity matrix). If variates X1, …, Xp are linearly related to U1, …, Up so that X = AU + μ in matrix notation, with X, U, and μ being column vectors and A being a p × p nonsingular matrix of constants aij, then

$$X_i = a_{i1} U_1 + \cdots + a_{ip} U_p + \mu_i, \qquad i = 1, \ldots, p.$$
Clearly, the mean of Xi is E(Xi) = μi, where μi is a known constant and E represents “expectation.” The variance of Xi is

$$\operatorname{var}(X_i) = E(X_i - \mu_i)^2 = \sum_{k=1}^{p} a_{ik}^2 = \sigma_{ii},$$

and the covariance of Xi and Xj, i ≠ j, is

$$\operatorname{cov}(X_i, X_j) = E\!\left[(X_i - \mu_i)(X_j - \mu_j)\right] = \sum_{k=1}^{p} a_{ik} a_{jk} = \sigma_{ij};$$

in matrix form, Σ = AA′.
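A simulation sketch (an added illustration; the matrix A and vector μ are hypothetical) can verify these moment formulas: draw many independent standard normal vectors U, form X = AU + μ, and compare the sample mean and sample covariance of X with μ and AA′.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonsingular A and mean vector mu for p = 3.
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 2.0, 0.0],
              [0.3, -1.0, 1.5]])
mu = np.array([1.0, -2.0, 0.5])

n = 200_000
U = rng.standard_normal((n, 3))   # rows are independent U vectors: zero mean, unit variance
X = U @ A.T + mu                  # each row is X = A U + mu

print(np.mean(X, axis=0))         # close to mu: E(X_i) = mu_i
print(np.cov(X, rowvar=False))    # close to Sigma = A A', i.e., sigma_ij = sum_k a_ik a_jk
print(A @ A.T)
```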
Standard density function manipulations then yield the joint density function of X1, …, Xp as that already given as the general p-variate normal density. If the matrix A is singular, the results for E(Xi), var(Xi), and cov(Xi, Xj) still hold, and X1, …, Xp are said to have a singular multivariate normal distribution; although the joint density function cannot be written, the concept is useful.
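To illustrate the singular case, the sketch below (an added illustration with hypothetical matrices; it relies on SciPy's allow_singular option) builds a rank-deficient covariance matrix, for which no density exists, yet the distribution is still well defined and samples can be drawn from it.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical singular case: X2 = 2 * X1 exactly, so Sigma = A A' has rank 1.
A = np.array([[1.0, 0.0],
              [2.0, 0.0]])        # singular A (second column is zero)
Sigma = A @ A.T                   # [[1, 2], [2, 4]], determinant 0
print(np.linalg.det(Sigma))       # 0: the joint density cannot be written

# The distribution itself still makes sense; SciPy can sample from it.
dist = multivariate_normal(mean=[0.0, 0.0], cov=Sigma, allow_singular=True)
sample = dist.rvs(size=5, random_state=0)
print(sample[:, 1] / sample[:, 0])  # every draw satisfies X2 = 2 * X1 (each ratio is 2)
```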
A second characterization of the p-variate normal distribution is the following: X1, …, Xp have a p-variate normal distribution if and only if $\sum_{i=1}^{p} a_i X_i$ is univariate normal for all choices of the coefficients $a_i$, that is, if and only if all linear combinations of the Xi are univariate normal.
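One direction of this characterization can be checked numerically for a fixed coefficient vector a: if X is p-variate normal with mean μ and covariance Σ, then a′X should be univariate normal with mean a′μ and variance a′Σa. A short sketch (an added illustration with hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a = np.array([1.0, -2.0, 0.5])     # an arbitrary coefficient vector

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ a                          # the linear combination a'X for each draw

print(np.mean(Y), a @ mu)          # sample mean vs. a' mu
print(np.var(Y), a @ Sigma @ a)    # sample variance vs. a' Sigma a
```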
The multivariate normal cumulative distribution function represents the probability of the joint occurrence of the events X1 ≤ x1, …, Xp ≤ xp and may be written

$$F(x_1, \ldots, x_p) = \int_{-\infty}^{x_1} \!\cdots\! \int_{-\infty}^{x_p} f(t_1, \ldots, t_p)\, dt_p \cdots dt_1,$$

indicating that probabilities that observations fall into regions of the p-dimensional variate space may be obtained by integration. Tables of F(x1, …, xp) are available for p = 2, 3 (see Greenwood & Hartley 1962).
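Today such probabilities are usually obtained by numerical integration rather than from tables. A minimal sketch (an added illustration; SciPy's multivariate_normal.cdf is assumed available, and the numbers are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# F(x1, x2) = P(X1 <= x1, X2 <= x2), computed by numerical integration.
F = multivariate_normal(mean=mu, cov=Sigma).cdf(np.array([1.0, 0.0]))
print(F)  # a probability between 0 and 1
```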