Reliability of principal component analysis

The purpose of this page is to demonstrate the internal consistency of the methodology developed for a large variety of methods and problems; it is therefore admittedly rather long and detailed.
Principal component analysis (PCA) is generally considered to be the workhorse of multivariate data analysis, since so many methods are merely variations on the same basic theme. Well-known examples are:
 soft independent modelling of class analogy (SIMCA), regularized discriminant analysis (RDA) and discriminant analysis with shrunken covariances (DASCO) for multivariate classification,
 principal component scores-based multivariate statistical process control (MSPC),
 modelling of multiplicative terms in analysis of variance (ANOVA),
 generalized rank annihilation method (GRAM) and iterative target transformation factor analysis (ITTFA) for curve resolution and optional second-order bilinear calibration, and
 principal component regression (PCR) and partial least squares (PLS) regression for multivariate calibration.
This page is organized as follows:
Setting the stage
PCA generates eigenvalues and eigenvectors for a cross-product of a data matrix X. This cross-product can be calculated in covariance or in correlation form and either with or without extracting the mean. Focus can be on the column (Q-mode) or row (R-mode) space. The Q-mode eigenvectors (of XX^{T}) are named scores while the R-mode eigenvectors (of X^{T}X) are named loadings.
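These relations are easy to make concrete. The following sketch (Python with NumPy; the data and variable names are invented for illustration) computes both modes from a single singular value decomposition of a mean-centered matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))     # data matrix: 100 objects x 40 variables
Xc = X - X.mean(axis=0)            # optional column mean-centering

# A single SVD, Xc = U S V^T, delivers both modes at once
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

eigvals = s**2        # eigenvalues of both Xc^T Xc (R-mode) and Xc Xc^T (Q-mode)
scores = U * s        # Q-mode side: scores T = U S
loadings = Vt.T       # R-mode side: loadings V, columns of unit length

# The scores and loadings reproduce the data exactly: Xc = T V^T
print(np.allclose(scores @ loadings.T, Xc))   # True
```

Working from one SVD avoids forming either cross-product explicitly, which is also numerically preferable.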
Traditionally, there has been a great deal of interest in determining the reliability of the eigenvalues and eigenvectors resulting from a PCA. Two basic approaches can be distinguished, namely analytical expressions and resampling. Analytical expressions often provide useful insight with respect to the major sources of error, but their validity in practice may be limited owing to the underlying assumptions. By contrast, one makes fewer assumptions about the data generating process with resampling.
Two early chemometrics papers that compare both approaches in an instructive way are:
 E.R. Malinowski
Theory of error applied to factor loadings resulting from combination target factor analysis
Analytica Chimica Acta, 122 (1980) 327-330
 B.A. Roscoe and P.K. Hopke
Error estimates for factor loadings and scores obtained with target transformation factor analysis
Analytica Chimica Acta, 132 (1981) 89-97
Although resampling certainly has the higher black-box content, it appears to be the preferred approach in chemometrics practice.
A major purpose of this page is to show that correct resampling of PCA-based models is rather complex. To this end, we will introduce two basic models and exploit the insight that can be gained from analytical expressions derived under these models. In particular, it is shown how to account for two major sources of error, i.e., sampling and measurement, as summarized in the following table:

Table PCA 1: Sources of error and potential approaches to account for them.


It is stressed that a comparison of the results of analytical expressions and resampling can be very useful for the testing of computer programs, even when the choice of approach has been made. One may think here of a situation where resampling will definitely be employed because the assumptions underlying analytical expressions are too restrictive for the application at hand. Likewise, the adequacy of analytical expressions should always be determined using resampling.

Two (or three) basic models for the data matrix entering PCA

Top

Model I is the one extensively treated in multivariate analysis, a rather specialized branch of statistics, namely:
The N rows of the data matrix X are K-dimensional observations, drawn from some distribution with mean μ and covariance matrix Σ. Note that measurement errors are not specified. One is only concerned about the uncertainty in estimated population characteristics that can be attributed to the limited sample size (N). This model is suitable for the description of data encountered in various fields. A typical example is presented in:
 A.R. Gibson, A.J. Baker and A. Moeed
Morphometric variation in introduced populations of the common myna (Acridotheres tristis): an application of the jackknife to principal component analysis
Systematic Zoology, 33 (1984) 408-421
Size measurements are often very precise. Consequently, it is reasonable to attribute the variation among the rows of X entirely to biological origin.
Under this model one might, for example, be interested in the reliability of the loadings, i.e., the slopes of the axes in the PC space. To illustrate the variability of interest under Model I, simple data sets are generated without noise:

By contrast, model II, which is often relevant in chemometrics, is as follows:
The data matrix X is a sum of outer products of score and loading vectors t and p to which (measurement) noise E is added. Under model II, the data matrix X is of reduced rank in the absence of noise, see:
 A.J. Burnham, J.F. MacGregor and R. Viveros
Latent variable multivariate regression modeling
Chemometrics and Intelligent Laboratory Systems, 48 (1999) 167-180
The main variations in the data are caused by a relatively small number of latent variables, which leads to a number of exact dependencies. With A latent variables, there would be N − A or K − A exact dependencies, whichever is smaller. Under model I, X could be of full rank. Then PCA is an efficient data compression technique if the variables are correlated.
For simplicity, the noise is assumed to be independently and identically distributed (iid) with variance σ². This assumption is not unreasonable because PCA is a least-squares (LS) method and LS methods behave optimally only when the noise is iid. The assumption is equivalent to requiring that the data have been appropriately pretreated.
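Data following the two models can be mimicked in a few lines. This is a sketch in Python with NumPy; the dimensions, covariance and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, A = 25, 10, 2          # objects, variables, latent variables

# Model I: rows are independent draws from a K-variate distribution;
# no measurement noise is specified.
Sigma = np.diag(np.linspace(5.0, 0.5, K))
X_I = rng.multivariate_normal(np.zeros(K), Sigma, size=N)

# Model II: X = T P^T + E, a rank-A signal plus iid measurement noise.
T = rng.normal(size=(N, A))                  # scores
P = rng.normal(size=(K, A))                  # loadings
sigma = 0.05                                 # noise standard deviation
X_II = T @ P.T + sigma * rng.normal(size=(N, K))

# Without noise, X_II has exactly A nonzero singular values:
s = np.linalg.svd(T @ P.T, compute_uv=False)
print(np.sum(s > 1e-10))                     # prints 2, i.e. A
```

The rank check makes the 'exact dependencies' under model II tangible: without noise, only A singular values are nonzero.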
It is important to note that the iid assumption leads to relatively simple expressions for bias and variance. More complicated expressions that allow for correlated and heteroscedastic noise are often available. In fact, the 'iid expressions' are usually obtained by simplifying the more general ones. An example of this will be given below for scores and loadings in the context of multivariate calibration. Otherwise, the general expressions are considered out of the current scope.
This bilinear model describes two important types of data in chemometrics. The first one is the data generated by a hyphenated instrument (e.g. chromatography × spectroscopy) or excitation-emission fluorescence. For this type of data, a score plot clearly shows that the rows of X cannot be considered as a random sample from some multivariate distribution:


Figure PCA 2: Score plot corresponding to the simulated HPLC-UV data matrix of Figure TAC 3 (100 spectra of 40 wavelengths). Subsequent rows in X form a trajectory in elution time, which would certainly not be the case if they constituted a random sample.

Model II is also appropriate to describe the training set data for multivariate calibration. A typical example is the calibration of near-infrared (NIR) spectra for the prediction of methyl tert-butyl ether (MTBE), ethanol (EtOH) and water (H2O) in gasoline, see:
 N.M. Faber, D.L. Duewer, S.J. Choquette, T.L. Green and S.N. Chesler
Characterizing the uncertainty in nearinfrared spectroscopic prediction of mixedoxygenate concentrations in gasoline: samplespecific prediction intervals
Analytical Chemistry, 70 (1998) 2972-2982; 70 (1998) 4877

Specifying the model in terms of signal and noise is an essential part of data analysis since the purpose of uncertainty estimation is to account for all relevant sources of uncertainty. From that perspective, ignoring measurement noise under model I is certainly restrictive in practice. Model III is therefore introduced here as model I with independent noise added, as in model II.
The effects of limited sample size and measurement noise are illustrated below using the data sets depicted in Figures PCA 1 (simulated: 25×2) and PCA 3 (real NIR: 40×391).

Expressionbased approach to uncertainty estimation

Top

In principle, there are several ways to derive expressions for uncertainty estimates. Obviously, the most rigorous approach is to derive the exact distribution for the relevant quantities. Bias and variance would then follow from the first (noncentral) and second (central) moments. However, this is seldom feasible for complex models such as PCA and methods based on it. A Bayesian approach could be considered, but deriving the posterior distribution is often intractable too for such complex models.
A rather convenient approach is to approximate the quantity of interest using a Taylor series expansion and keeping only the leading terms. The required algebraic manipulations are, for example, covered in the excellent text:
 J.R. Magnus and H. Neudecker
Matrix Differential Calculus with Applications in Statistics and Econometrics
John Wiley, Chichester, 2002
It suffices to keep the firstorder term to approximate variance. To obtain a nonzero estimate for bias, however, one must include at least the secondorder term to account for curvature. Keeping only the first term is known as local linearization or the delta method:
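A small numerical sketch (NumPy; the function and settings are invented) illustrates both points for the scalar case f(x) = x² with x ~ N(μ, σ²): the first-order term gives var ≈ (f′(μ))²σ², while the curvature term ½f″(μ)σ² is needed to reproduce the bias E[f(x)] − f(μ) = σ²:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 3.0, 0.1
f = lambda v: v**2

# First-order (delta method) term: variance of the linearized function
var_delta = (2 * mu)**2 * sigma**2        # f'(mu)^2 sigma^2

# Second-order term: bias due to curvature, (1/2) f''(mu) sigma^2
bias_delta = 0.5 * 2 * sigma**2           # equals sigma^2 here

# Monte Carlo check
x = rng.normal(mu, sigma, size=1_000_000)
print(var_delta, np.var(f(x)))            # nearly equal
print(bias_delta, np.mean(f(x)) - mu**2)  # nearly equal
```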

In the following, focus is on the eigenvalues to illustrate the essential difference between the effect of a limited sample size under model I and measurement noise under model II, respectively. Expressions are available for the eigenvectors too, but they are not considered for the moment since they share the relevant features. Expressions for the standard errors in the scores and loadings under model II will be given in a separate section below.
For model I, approximate bias and variance are derived using a truncated Taylor series expansion, and assuming multivariate normality, in:
 M.A. Girshick
On the sampling theory of determinantal equations
Annals of Mathematical Statistics, 10 (1939) 203-224
 D.N. Lawley
Tests of significance for the latent roots of covariance and correlation matrices
Biometrika, 43 (1956) 128-136
The results are:

bias(l_{a}) ≈ (l_{a}/N) Σ_{b≠a} l_{b}/(l_{a} − l_{b})   [3]

var(l_{a}) ≈ 2l_{a}^{2}/(N − 1)   [4]

where l denotes an eigenvalue, for which the subscript refers to the PC number, N is the number of objects, and A is the number of significant PCs.
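The first-order sampling results are easily verified by simulation. The sketch below (NumPy; settings invented) draws repeated samples of N = 25 bivariate normal observations and compares the Monte Carlo standard deviation of the largest sample eigenvalue with the classical first-order value l_{1}(2/(N − 1))^{½}:

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 25, 20_000
lam = np.array([4.0, 1.0])      # population eigenvalues (diagonal covariance)

l1 = np.empty(reps)
for i in range(reps):
    X = rng.normal(size=(N, 2)) * np.sqrt(lam)   # N draws with covariance diag(lam)
    S = np.cov(X, rowvar=False)                  # sample covariance (divisor N-1)
    l1[i] = np.linalg.eigvalsh(S)[-1]            # largest sample eigenvalue

sd_mc = l1.std()
sd_eq4 = lam[0] * np.sqrt(2.0 / (N - 1))         # first-order standard error
print(sd_mc, sd_eq4)                             # agree to within a few percent
```

The simulation also exhibits the small positive sampling bias of the dominant eigenvalue, which vanishes as N grows.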
It is noted that these expressions are valid for covariancebased PCA. Lawley has also addressed correlationbased PCA. This line of research has undergone a major revival by the work of Ogasawara, see:
 H. Ogasawara
Standard errors of the principal component loadings for unstandardized and standardized variables
British Journal of Mathematical and Statistical Psychology, 53 (2000) 155-174
 H. Ogasawara
Concise formulas for the standard errors of component loading estimates
Psychometrika, 67 (2002) 289-297
 H. Ogasawara
Asymptotic biases of the unrotated/rotated solutions in principal component analysis
British Journal of Mathematical and Statistical Psychology, 57 (2004) 353-376
Although the expressions are much more complicated, they do not differ with respect to the relevant features. Consequently, we will focus on the simpler expressions.
For model II, approximate bias and variance are derived using the same firstorder approach in:
 L.A. Goodman and S.J. Haberman
The analysis of nonadditivity in twoway analysis of variance
Journal of the American Statistical Association, 85 (1990) 139-145
 N.M. Faber, M.J. Meinders, P. Geladi, M. Sjöström, L.M.C. Buydens and G. Kateman
Random error bias in principal component analysis. I: Derivation of theoretical predictions
Analytica Chimica Acta, 304 (1995) 257-271
 N.M. Faber, L.M.C. Buydens and G. Kateman
Standard errors in the eigenvalues of a crossproduct matrix: theory and applications
Journal of Chemometrics, 7 (1993) 495-526
The results are:
where l denotes an eigenvalue, for which the subscript refers to the PC number, N is the number of objects, K is the number of variables, A is the number of significant PCs, and σ² is the variance of the noise in X.
These expressions are obtained using a so-called small sample approach, i.e., by keeping N fixed and letting σ → 0. They are valid for both covariance- and correlation-based PCA and do not require a normal distribution for the errors in X. Large sample results, i.e., derived by keeping σ fixed and letting N → ∞, are presented in:
 L.J. Gleser
Estimation in a multivariate "errors in variables" regression model: large sample results
Annals of Statistics, 9 (1981) 24-44
The essential difference between the two models is apparent from these two pairs of expressions. Under model I, the size of the sample (N), with its interpretation in terms of degrees of freedom (N − 1), explicitly determines the uncertainty in model results. By contrast, the uncertainty in model results explicitly scales up with the size of the measurement noise (σ) under model II. For both models, the relative uncertainty estimates further depend implicitly on N through the eigenvalues themselves.
In summary, Equations [3,4] and [5,6] concisely reflect the limitations of the sampling and the measurement system, respectively. They illustrate the principle that expressions may provide insight that is required for tuning an analytical procedure to meet specific needs. For model III, one would need to combine these expressions.

Resamplingbased approach to uncertainty estimation

Top

The most dependable route to uncertainty estimates is to repeat the experiment under relevant conditions, i.e., under either repeatability or reproducibility conditions. With resampling, one essentially attempts to simulate this time-consuming process through numerical experiments. Popular methods are the jackknife and the bootstrap. It is important to note that the bootstrap can be performed in two modes, namely parametric and nonparametric, which are also known as conditional and unconditional, respectively. In the nonparametric mode, one draws rows of X with replacement, whereas in the parametric mode, one constructs a (parametric) model and resamples the residuals instead. The adjective 'conditional' derives from the residuals being calculated conditional on a model. The jackknife is similar to the nonparametric bootstrap.
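For the eigenvalues of a PCA, the jackknife and the nonparametric bootstrap can be sketched in a few lines (NumPy; the data set and settings are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 25
X = rng.normal(size=(N, 2)) * np.sqrt([4.0, 1.0])   # invented 25 x 2 data set

def top_eig(data):
    return np.linalg.eigvalsh(np.cov(data, rowvar=False))[-1]

# Nonparametric (unconditional) bootstrap: draw rows with replacement
B = 1000
boot = np.array([top_eig(X[rng.integers(0, N, N)]) for _ in range(B)])
se_boot = boot.std(ddof=1)

# Jackknife: leave out one row at a time
jack = np.array([top_eig(np.delete(X, i, axis=0)) for i in range(N)])
se_jack = np.sqrt((N - 1) / N * np.sum((jack - jack.mean())**2))

print(se_boot, se_jack)      # two estimates of the same standard error
```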
Good general introductions to these methods can be found in:
 B. Efron and R.J. Tibshirani
An Introduction to the Bootstrap
Chapman & Hall, New York, 1993
 R. Wehrens, H. Putter and L.M.C. Buydens
The bootstrap: a tutorial
Chemometrics and Intelligent Laboratory Systems, 54 (2000) 35-52
For an excellent treatment of the parametric bootstrap connected with models I and II, see:
 L. Milan and J. Whittaker
Application of the parametric bootstrap to models that incorporate a singular value decomposition
Applied Statistics, 44 (1995) 31-49
Milan and Whittaker explain in detail how to cope with specific aspects of the singular value decomposition (SVD), which is fully equivalent to PCA. The parameters of the SVD consist of singular values on the diagonal of S (positive square roots of the eigenvalues), normalized left singular vectors in U (Q-mode eigenvectors) and normalized right singular vectors in V (R-mode eigenvectors).
To identify these parameters, orthogonality constraints are applied. Since these constraints depend on the data matrix at hand, spurious variability is introduced when resampling. Milan and Whittaker argue that this nuisance variability can be minimized by rotating the resampling results towards a common reference, for which the parameters from the original SVD constitute a natural choice. They propose to use an orthogonal Procrustes rotation, because it additionally takes care of a reordering of the singular values ('factor switching') or a sign reflection of the singular vectors, while preserving the optimal LS fit of the data.
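A minimal sketch of such an alignment step (NumPy; the function name is ours, not Milan and Whittaker's): the rotation that maps a resampled loading matrix as closely as possible onto the reference follows from the SVD of their cross-product, and it automatically undoes sign flips and factor switching:

```python
import numpy as np

def procrustes_rotate(V_boot, V_ref):
    """Orthogonal rotation R minimizing ||V_boot @ R - V_ref||_F;
    undoes sign reflections and factor switching in one step."""
    U, _, Wt = np.linalg.svd(V_boot.T @ V_ref)
    return V_boot @ (U @ Wt)

rng = np.random.default_rng(5)
V_ref = np.linalg.qr(rng.normal(size=(10, 3)))[0]   # reference loadings (orthonormal)

# A 'resampled' solution: same subspace, but columns switched and one sign flipped
V_boot = V_ref[:, [1, 0, 2]] * np.array([1.0, -1.0, 1.0])

V_aligned = procrustes_rotate(V_boot, V_ref)
print(np.linalg.norm(V_aligned - V_ref))            # ~0: nuisance variability removed
```

Because the rotation is orthogonal, the LS fit of the data by the resampled model is preserved exactly.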
The following plots serve to illustrate the working of the jackknife, nonparametric (unconditional) bootstrap and parametric (conditional) bootstrap for a simple univariate model:


Figure PCA 5: Resampling for a zero-intercept least-squares straight-line fit of single x versus single y. (a) Original data points (), model (), fitted points () and fit residuals (). (b) Model () for objects {1,2,4,5} selected by the jackknife (). (c) Model () for objects {4,4,4,5,2} drawn with replacement by the nonparametric bootstrap (). (d) Original fitted points () and model () for objects () obtained by adding fit residuals {3,2,3,1,4}, which are drawn with replacement by the parametric bootstrap (), to the original fitted points.
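In code, the three schemes of Figure PCA 5 differ only in what is resampled; a sketch for the zero-intercept straight-line fit (NumPy; the data are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.arange(1.0, 6.0)                      # five objects
y = 2.0 * x + rng.normal(0, 0.5, size=5)

slope = lambda xv, yv: (xv @ yv) / (xv @ xv)   # zero-intercept LS slope
b = slope(x, y)
fit, resid = b * x, y - b * x

# Jackknife: leave out one object at a time
b_jack = [slope(np.delete(x, i), np.delete(y, i)) for i in range(5)]

B = 2000
b_npar, b_par = [], []
for _ in range(B):
    i = rng.integers(0, 5, 5)
    b_npar.append(slope(x[i], y[i]))           # nonparametric: (x, y) pairs
    r = resid[rng.integers(0, 5, 5)]
    b_par.append(slope(x, fit + r))            # parametric: fit residuals

print(np.std(b_npar), np.std(b_par))           # two standard-error estimates
```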

To this repertoire of popular resampling methods one should add the noise addition method. This method is similar to the parametric bootstrap in that it works directly with the (measurement) noise in the data. However, noise addition is more flexible because it can handle correlated and heteroscedastic noise; the parametric bootstrap requires the noise to be iid.
Variance estimation using the noise addition method is explained in:
 W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling
Numerical Recipes: The Art of Scientific Computing
Cambridge University Press, Cambridge, UK, 1988, §14.5
When used for bias estimation, it is named simulation extrapolation (SIMEX), see:
 R.J. Carroll, D. Ruppert and L.A. Stefanski
Measurement Error in Nonlinear Models
Chapman & Hall, London, UK, 1995, Chapter 4
A bias estimate is obtained by adding noise at several levels and extrapolating back to the target level where no noise is present. This bias estimate can then be used to obtain an improved estimate:


Figure PCA 6: Bias estimation and correction of the 'naive' estimate using SIMEX.

This process is equivalent to the well-known method of standard additions in analytical chemistry.
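A minimal numerical sketch of the SIMEX idea for the dominant eigenvalue under a rank-one version of model II (NumPy; all settings are invented). Extra noise with variance λσ² is added on top of the noise already present, the eigenvalue is averaged over replicates at each level, and a straight line in the total noise variance is extrapolated back to zero; averaging over many data sets shows the bias removal:

```python
import numpy as np

rng = np.random.default_rng(6)
N, K, sigma = 20, 10, 0.5
t, p = np.ones((N, 1)), np.ones((K, 1))   # fixed rank-one signal

def top_eig(M):
    return np.linalg.svd(M, compute_uv=False)[0]**2

l_true = top_eig(t @ p.T)                 # noise-free target (here N*K = 200)
lams = np.array([0.0, 1.0, 2.0])          # added-noise variance in units of sigma^2

naive, simex = [], []
for _ in range(400):                      # average over many data sets
    X = t @ p.T + sigma * rng.normal(size=(N, K))
    levels = []
    for lam in lams:
        reps = [top_eig(X + np.sqrt(lam) * sigma * rng.normal(size=(N, K)))
                for _ in range(10)]
        levels.append(np.mean(reps))
    naive.append(levels[0])
    # the eigenvalue bias grows linearly with the total noise variance (1 + lam),
    # so fit a straight line and extrapolate back to zero total noise
    coef = np.polyfit(1.0 + lams, levels, 1)
    simex.append(np.polyval(coef, 0.0))

print(np.mean(naive) - l_true, np.mean(simex) - l_true)   # bias before/after SIMEX
```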
It is emphasized that SIMEX can be used to correct for the type of bias exemplified by Equation [5], i.e., explicitly dependent on σ². The jackknife can be used as a bias-correcting device too, but the type of bias handled by the jackknife is the one exemplified by Equation [3], i.e., it is explicitly dependent on N.
It is easily verified that the jackknife and nonparametric bootstrap assess the effect of the limited sample size (N). Generating noiseless data sets as shown in Figure PCA 1 leads to the following estimates of standard errors in the eigenvalues:


Figure PCA 7: Comparison of jackknife and nonparametric bootstrap with Equation [4] for 1000 noiseless 25×2 data matrices. The bootstrap results are based on 1000 draws of rows with replacement. The symbols are explained in the text.

It follows that the jackknife and nonparametric bootstrap are highly suited for assessing the effect of limited sample size under model I in cases where Equation [4] cannot be used owing to the restrictive distributional assumption of multivariate normality.
The histograms can be summarized as follows:

Table PCA 2: Estimated standard errors in the eigenvalues for 1000 noiseless 25×2 data matrices.


The second column is the standard deviation in the eigenvalues obtained for the 1000 repetitions. Being based on such a large number of (assumed) independent realizations, it can be regarded as highly accurate. It is observed that Equation [4] performs slightly better than the resampling alternatives. First, the average is closer to the target values in column two. This observation is, however, somewhat misleading. In general, one expects resampling results to be (slightly) more accurate because fewer approximations or assumptions are made. Second, the expression-based results tend to be less variable. This favorable aspect has been reported in the context of multivariate calibration in:
 M.C. Denham
Choosing the number of factors in partial least squares regression: estimating and minimizing the mean squared error of prediction
Journal of Chemometrics, 14 (2000) 351-361
Interestingly, the 29% spread in the expression-based results is very close to what can be expected from the form of Equation [4]. It is observed that this expression must be evaluated using the estimate for the associated eigenvalue. It can be inferred from Equation [4] that plugging in this estimate introduces a relative uncertainty of a factor (2/(N−1))^{½} = (1/12)^{½} = 0.29. This factor turns out to be identical to the rule of thumb that holds for the common estimate of variance, which is indeed favorable. It is also consistent with the classical interpretation of the eigenvalues as a fraction of the 'explained sample variance'. However, since this observation depends entirely on Equation [4], one might wonder how to interpret the eigenvalues when Equation [6] is appropriate.
Finally, it is reiterated that this kind of comparison is believed to be useful for the testing of computer programs.

Consequences of incorrect resampling

Top

The form of Equations [3]-[6] implies that uncertainty estimates can differ greatly, depending on the model assumed. It follows that resampling should be used with caution. The potential effect of incorrect resampling was determined using Monte Carlo simulations for the standard error in the eigenvalues of PCA in:
 N.M. Faber, L.M.C. Buydens and G. Kateman
Standard errors in the eigenvalues of a crossproduct matrix: theory and applications
Journal of Chemometrics, 7 (1993) 495-526
Monte Carlo simulations are ideal for testing the adequacy of an uncertainty assessment because the desired outcome is known in advance. For data that were artificially generated according to model II, the jackknife and nonparametric bootstrap overestimated the standard error of the dominant eigenvalue by more than a factor 30.
The consequences of incorrect resampling may be illustrated using the NIST data set. For this data set, it was determined that the measurement noise in the spectra (X) was negligible. In other words, the uncertainty in the calibration results (regression coefficients vector, predictions) was mainly affected by the noise in the predictands (y). The following results are obtained:

Table PCA 3: Estimated standard errors in the eigenvalues of the NIST training set spectra (40×391). All numbers are multiplied by 10^{2}. The symbols are explained in the text.


The standard errors in the third column are obtained using a highly pessimistic estimate of the noise, namely 10 times the average residual for the 5-PC model, while optimal prediction requires up to 12 PCs for ethanol. Still, the estimated standard error is negligible compared to the associated eigenvalue, which is correct. By contrast, Equation [4] and nonparametric resampling, which are consistent with each other, further overestimate the already pessimistic estimate by factors of 30 and 8, respectively. (The bootstrap results are based on 1000 draws of rows with replacement.)

Multivariate classification

Top

The main purpose of classification is to assign objects to one of several groups or classes. The classes are seen as separate populations and the distinction is quantified in terms of population characteristics. Hence the mean and covariance matrix for each class play a key role. This is also reflected in the development of new classification methods. For a nice example, see:
 I.E. Frank
DASCO  a new classification method
Chemometrics and Intelligent Laboratory Systems, 4 (1988) 215-222
Clearly, model I or III applies so that the jackknife and nonparametric bootstrap are suitable resampling strategies, perhaps supplemented with noise addition in a nested setup.
For an impressive series of papers concerning implementation of the nonparametric bootstrap, see:
 R.A. Lodder and G.M. Hieftje
Quantile BEAST attacks the falsesample problem in nearinfrared reflectance analysis
Applied Spectroscopy, 42 (1988) 1351-1365
 R.A. Lodder and G.M. Hieftje
Detection of subpopulations in nearinfrared reflectance analysis
Applied Spectroscopy, 42 (1988) 1500-1512
 R.A. Lodder and G.M. Hieftje
Quantile analysis: a method for characterizing data distributions
Applied Spectroscopy, 42 (1988) 1512-1520
For a comparison of two parametric bootstrap methods, see:
 B.M. Smith and P.J. Gemperline
Bootstrap methods for assessing the performance of nearinfrared pattern classification techniques
Journal of Chemometrics, 16 (2002) 241-246
One of the two parametric methods, namely resampling from residuals with replacement, performed poorly. It is likely that this method failed because it is intended to work under model II. In other words, it assesses a source of variability that may be irrelevant here.

Multivariate statistical process control

Top

Determining whether the process is out of control is similar to multivariate classification. Consequently, model I or III applies. So-called contribution plots are useful for identifying the process variables that are responsible for an out-of-control situation. Attaching confidence limits to the contributions makes their interpretation less subjective. For an application of the nonparametric bootstrap to this problem, see:
 A.K. Conlin, E.B. Martin and A.J. Morris
Confidence limits for contribution plots
Journal of Chemometrics, 14 (2000) 725-736
Some criticism is warranted:
 The authors conclude with "More recent work [15] derives estimates for s_{p} using a different approach; however, this is only valid for covariance-based PCA, so while it may be an improvement on the approach of Jolliffe [7], it would still not be applicable for industrial situations." However, Reference [15] covers correlation-based PCA as well, but deals with Model II. It is reiterated that uncertainty assessment can be fundamentally different, depending on the task to be performed. Model II is, for example, well-suited for multivariate calibration, see below.
 The reported confidence limits will not provide nominal coverage, simply because they have not been corrected for the multiple testing character of the procedure (e.g. using Bonferroni).

Analysis of variance

Top

Model II applies. For pertinent references and a rigorous derivation of Equations [5] and [6], see:
 L.A. Goodman and S.J. Haberman
The analysis of nonadditivity in twoway analysis of variance
Journal of the American Statistical Association, 85 (1990) 139-145
Parametric resampling should be used.

Secondorder bilinear calibration: GRAM

Top

GRAM is a method for curve resolution and calibration that requires only a single calibration sample. The data matrices of the calibration and prediction sample should follow model II (see derivation, 229 kB), hence the parametric bootstrap and the noise addition method are appropriate resampling strategies.
GRAM is only one of many competing methods that can be applied to this particular data setup. However, GRAM stands out for two reasons. First, its error analysis has been extensively studied, see below. Second, GRAM has been reported to compare favorably with alternating least squares (ALS), alternating trilinear decomposition (ATLD), alternating coupled vectors resolution (ACOVER), alternating slicewise diagonalization (ASD), alternating coupled matrices resolution (ACOMAR), self-weighted alternating trilinear decomposition (SWATLD) and pseudo alternating least squares (PALS), see:
 N.M. Faber
Towards a rehabilitation of the generalized rank annihilation method (GRAM)
Analytical and Bioanalytical Chemistry, 372 (2002) 683-687
For a Monte Carlo study of the sensitivity of the GRAM solution to measurement noise, see:
 K.S. Booksh and B.R. Kowalski
Error analysis of the generalized rank annihilation method
Journal of Chemometrics, 8 (1994) 45-63
For the derivation of variance expressions that are consistent with Equation [6], see:
 N.M. Faber, A. Lorber and B.R. Kowalski
Generalized rank annihilation method: standard errors in the estimated eigenvalues if the instrumental errors are heteroscedastic and correlated
Journal of Chemometrics, 11 (1997) 95-109
For a comparison of various strategies applied to variance estimation, see:
 N.M. Faber
Quantifying the effect of measurement errors on the uncertainty in bilinear model predictions: a small simulation study
Analytica Chimica Acta, 439 (2001) 193-201
For successful bias estimation using the noise addition method in SIMEX mode, see:
 N.M. Faber, J. Ferré and R. Boqué
Iteratively reweighted generalized rank annihilation method. 1. Improved handling of prediction bias
Chemometrics and Intelligent Laboratory Systems, 55 (2001) 67-90
The question of correct resampling in connection with GRAM has been discussed in:
 N.M. Faber
Comment on a recently proposed resampling method
Journal of Chemometrics, 15 (2001) 169-188
 A.K. Smilde
Reply to 'Comment on a recently proposed resampling method'
Journal of Chemometrics, 15 (2001) 189-192
Many of the conclusions derived for GRAM should hold equally well for the alternative methods.

Multivariate calibration: regression coefficients

Top

For multiple linear regression (MLR) using ordinary least squares (OLS), expressions are available for the standard error in the regression coefficients when the (full-rank) predictor matrix (X) is corrupted by noise, see:
 S.D. Hodges and P.G. Moore
Data uncertainties and least squares regression
Applied Statistics, 21 (1972) 185-195
 R.B. Davies and B. Hutton
The effect of errors in the independent variables in linear regression
Biometrika, 62 (1975) 383-391
Assuming iid noise in the predictands y and the predictors X, an approximate covariance matrix for the estimated regression coefficients vector follows as

V(b) ≈ (σ_{y}^{2} + ‖b‖^{2}σ_{X}^{2})(X^{T}X)^{-1}   [7]

where b is the regression coefficients vector, σ_{y}^{2} and σ_{X}^{2} denote the variance of the noise in y and X, respectively, and ‖·‖ signifies the Euclidean norm of a vector, i.e., its length. This approximation, which becomes exact when X is error-free, illustrates that the effect of the measurement noise should be assessed. Consequently, the parametric bootstrap and the noise addition method should perform best.
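An approximation of this type, cov(b) ≈ (σ_y² + ‖b‖²σ_X²)(X^T X)^{-1}, can be checked against a direct simulation; this sketch (NumPy; all settings invented) evaluates the expression at the error-free predictors and compares with Monte Carlo standard errors:

```python
import numpy as np

rng = np.random.default_rng(8)
N, K = 50, 3
sig_y, sig_x = 0.2, 0.05

X0 = rng.normal(size=(N, K))              # error-free predictors
b_true = np.array([1.0, -2.0, 0.5])

# Expression-based standard errors:
# cov(b) ~ (sig_y^2 + ||b||^2 sig_x^2) (X0^T X0)^{-1}
cov_approx = (sig_y**2 + (b_true @ b_true) * sig_x**2) * np.linalg.inv(X0.T @ X0)
se_approx = np.sqrt(np.diag(cov_approx))

# Monte Carlo: noise in both y and X
bs = []
for _ in range(5000):
    X = X0 + sig_x * rng.normal(size=(N, K))
    y = X0 @ b_true + sig_y * rng.normal(size=N)
    bs.append(np.linalg.lstsq(X, y, rcond=None)[0])
se_mc = np.std(bs, axis=0)
print(se_mc, se_approx)                   # should agree closely
```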
Generally accepted analogues to Equation [7] do not exist for the calibration of rank-deficient X using methods like PCR and PLS. Various resampling schemes have been compared with a simple expression in:
 N.M. Faber
Uncertainty estimation for multivariate regression coefficients
Chemometrics and Intelligent Laboratory Systems, 64 (2002) 169-179
For the PLS calibration of the NIST data set, parametric resampling indeed appears to work well, whereas nonparametric resampling overestimates the standard error in the regression coefficients by a factor of five or more:


Figure PCA 8: Standard errors in PLS coefficients (×10^{2}) for methyl tert-butyl ether (MTBE), ethanol (EtOH) and water (H2O): expression-based versus the ones obtained by the nonparametric (top) and parametric (bottom) bootstrap.

The adequacy of the expression and parametric bootstrap are visually confirmed by the smoothness of the regression coefficient vector, for example, for water (H2O):


Figure PCA 9: PLS coefficients for H2O.

From the standard error estimates, one may determine for each coefficient whether its 90% confidence interval includes zero. The nonparametric bootstrap suggests that 142 (out of 391) coefficients are insignificant versus 28 for the parametric bootstrap. Only the latter number seems to be reasonable. On the basis of 142 insignificant coefficients, one would expect many more zero crossings than are actually observed. This is further illustrated by plotting only the insignificant coefficients for a limited region:

Several researchers have independently proposed to use the standard errors in the regression coefficients to eliminate uninformative variables, see:
 V. Centner, D.L. Massart, O.E. De Noord, S. De Jong, B.M. Vandeginste and C. Sterna
Elimination of uninformative variables for multivariate calibration
Analytical Chemistry, 68 (1996) 3851-3858
 R. Wehrens and W.E. Van der Linden
Bootstrapping principal component regression models
Journal of Chemometrics, 11 (1997) 157-171
 H. Martens and M. Martens
Modified Jackknife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR)
Food Quality and Preference, 11 (2000) 5-16
It should be noted that this procedure amounts to performing multiple t-tests. Consequently, some sort of correction must be applied to obtain the intended overall significance level, e.g. a Bonferroni correction. It is well known that this limits the number of variables that can be tested. Moreover, the correlation among variables is entirely neglected when performing separate t-tests. A simple alternative is to calculate a joint confidence region for a set of variables using a standard F-test, as suggested in:
 N.M. Faber and B.R. Kowalski
Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares
Journal of Chemometrics, 11 (1997) 181-238
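For illustration, the Bonferroni-corrected testing procedure can be sketched as follows. This is a minimal Python sketch with hypothetical numbers; in practice one would substitute the actual coefficients and bootstrap standard errors, and possibly a t quantile with the appropriate degrees of freedom instead of the normal quantile used here.

```python
# Flag regression coefficients whose Bonferroni-corrected confidence
# interval includes zero (hypothetical coefficients and standard errors).
from statistics import NormalDist

def insignificant(coefs, ses, alpha=0.10):
    """Return True for coefficients that are not significant at overall
    level alpha, using a Bonferroni correction over len(coefs) z-tests."""
    k = len(coefs)
    # per-test level alpha/k -> two-sided critical value
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * k))
    return [abs(b) <= z_crit * se for b, se in zip(coefs, ses)]

flags = insignificant([0.02, -1.50, 0.30], [0.10, 0.20, 0.05], alpha=0.10)
print(flags)  # [True, False, False]
```

Note how the critical value grows with the number of coefficients tested, which is exactly why the correction becomes conservative for a 391-element coefficient vector.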

Multivariate calibration: scores and loadings


PCA is often used for data visualization (exploratory data analysis). Score plots are inspected to see how samples are distributed over the calibrated range (trends, clusters, outliers), while loading plots may reveal relationships among the variables. Unfortunately, score plots are not without problems because their correct interpretation depends critically on the scaling used, see:
 P. Geladi, M. Manley and T. Lestander
Scatter plotting in multivariate data analysis
Journal of Chemometrics, 17 (2003) 503-511
N.B. Scaling problems do not arise in loading plots since the loadings are all normalized to unit length.
The default option in many software packages is to evenly fill out the score plot. This certainly leads to plots that are visually attractive. However, the relative position of the points can be misleading.
Figure PCA 3 is an example of a score plot obtained with the default scaling. Both axes have the same length, but close examination shows that the ranges are 0.19 and 0.14, respectively. In other words, the spacing of the tick marks along the two axes is not identical, although it should be. However, the distortion of the original X-space is moderate in this case. Geladi et al. give an extreme example, where the second PC is blown up when applying the default scaling.
A correct interpretation of distances is still possible when attaching uncertainties to the points (e.g. confidence ellipses), because the uncertainty follows the scaling: scaling is a linear operation.
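This can be made concrete with a minimal numpy sketch (toy numbers, not the data of the figures): stretching the score axes is a linear operation t → Dt, so an attached covariance matrix transforms as D·cov·D^{T}, and confidence ellipses drawn in the stretched plot remain correct.

```python
# Toy demonstration: uncertainty follows a linear rescaling of the scores.
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=(1000, 2))            # toy two-dimensional "scores"
D = np.diag([2.0, 0.5])                   # unequal stretching of the axes
cov_orig = np.cov(t, rowvar=False)
cov_scaled = np.cov(t @ D, rowvar=False)  # covariance after stretching
# the ellipse transforms exactly with the scaling: cov -> D cov D^T
print(np.allclose(cov_scaled, D @ cov_orig @ D.T))  # True
```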
The analogy with the regression coefficients implies that the effect of (measurement) noise should be assessed so that parametric resampling is appropriate here.
Approximate bias and variance expressions can be obtained from Theorem 4 in:
 L.A. Goodman and S.J. Haberman
The analysis of nonadditivity in two-way analysis of variance
Journal of the American Statistical Association, 85 (1990) 139-145
In the following, we will restrict ourselves to variances. Assuming iid noise in X, approximate covariance matrices for the scores and loadings follow as:

cov(t_{a}) ≈ σ² ( I_{N} + Σ_{b=1,b≠a}^{A} [(3l_{a} − l_{b})/(l_{a} − l_{b})²] t_{b}t_{b}^{T} )    [8]

cov(p_{a}) ≈ σ² ( Σ_{b=1,b≠a}^{A} [(l_{a} + l_{b})/(l_{a} − l_{b})²] p_{b}p_{b}^{T} + l_{a}^{−1}(I_{K} − Σ_{b=1}^{A} p_{b}p_{b}^{T}) )    [9]

where t_{a}, p_{a} and l_{a} denote a denormalized score vector, normalized loading vector and eigenvalue, respectively, for which the subscript refers to the PC number, A is the number of significant PCs, σ² is the variance of the noise in X, and I_{N} and I_{K} denote the N×N and K×K identity matrices, N and K being the numbers of rows and columns of X.
It is important to note that Equations [8] and [9] are obtained without additional approximations by simplifying the fully general expressions that accommodate correlated and heteroscedastic noise (not reproduced here). In the general expressions, t, p, u and s denote a denormalized score vector, normalized loading vector, normalized score vector and singular value, respectively, for which the subscript refers to the PC number; A is the number of significant PCs; N and K are the numbers of rows and columns of X; and Σ_{r} and Σ_{c} denote the covariance matrices of the noise for the rows and columns of X.
It is observed that correlated noise leads to correlation between the scores and loadings associated with different PCs.
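As a sketch, the simplified iid expression for the score covariance can be implemented directly. The algebraic form coded below is our first-order perturbation reading of Equation [8] and should be checked against the original derivation; the toy check at the end uses two orthogonal score vectors with eigenvalues 4 and 1.

```python
# Approximate covariance of a denormalized score vector under iid noise,
# following our reading of Equation [8].
import numpy as np

def score_cov(a, T, l, sigma2):
    """Covariance of the a-th (0-based) denormalized score vector.
    T holds the A significant score vectors as columns, l the matching
    eigenvalues, sigma2 the variance of the iid noise in X."""
    N = T.shape[0]
    C = sigma2 * np.eye(N)                 # iid noise floor
    for b in range(T.shape[1]):
        if b != a:                          # cross-terms with the other PCs
            w = (3.0 * l[a] - l[b]) / (l[a] - l[b]) ** 2
            C += sigma2 * w * np.outer(T[:, b], T[:, b])
    return C

# toy check: orthogonal scores with eigenvalues 4 and 1, sigma2 = 1
T = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
C = score_cov(0, T, [4.0, 1.0], 1.0)
```

Along the direction of the second score vector, the resulting variance equals l_{1}(l_{1} + l_{2})/(l_{1} − l_{2})² = 20/9, illustrating that closely spaced eigenvalues inflate the uncertainty.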
The adequacy of the simplified expressions [8] and [9] is tested by adding iid noise with standard deviation 0.2 to the data matrices of Figure PCA 1. This noise level amounts to a signal-to-noise ratio of approximately 6. The spread in the scores from 1000 repetitions leads to the following estimate of the covariance matrices:


Figure PCA 11: Covariance matrices for scores 1 and 2 of noisy 25×2 data matrices. The scores are normalized to illustrate that PC 1 is more stable than PC 2: the vertical scale is 10^{−4} for PC 1 versus 10^{−3} for PC 2.

It is apparent from these plots that the components of the score vectors are indeed uncorrelated when the noise is uncorrelated. Similar results are obtained for the loadings (not shown).
Plotting the square root of the diagonal of the covariance matrices yields:


Figure PCA 12: Comparison of Monte Carlo results and expression-based estimates of standard errors in the scores of noisy 25×2 data matrices.

The agreement appears to be good enough to allow one to construct confidence ellipses.
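The noise-addition experiment behind these figures can be sketched as follows. This Python/numpy sketch uses an illustrative rank-2 stand-in for the data of Figure PCA 1, with the same matrix size (25×2), noise level (0.2) and number of repetitions (1000).

```python
# Monte Carlo estimate of the spread in the normalized scores of a noisy
# low-rank matrix (illustrative stand-in, not the data of Figure PCA 1).
import numpy as np

rng = np.random.default_rng(1)
X0 = (np.outer(np.linspace(-1.0, 1.0, 25), [3.0, 1.0])
      + np.outer(np.sin(np.linspace(0.0, 3.0, 25)), [0.5, -1.5]))
U0, s0, _ = np.linalg.svd(X0, full_matrices=False)

reps = []
for _ in range(1000):
    U, s, _ = np.linalg.svd(X0 + 0.2 * rng.normal(size=X0.shape),
                            full_matrices=False)
    signs = np.sign(np.sum(U * U0, axis=0))  # align signs with noise-free PCs
    reps.append(U * signs)                   # normalized scores

se = np.std(reps, axis=0)  # elementwise standard errors over the repetitions
print(se[:, 0].mean() < se[:, 1].mean())     # PC 1 is more stable than PC 2
```

The sign alignment step is essential: singular vectors are only determined up to sign, and without it the Monte Carlo spread would be grossly overestimated.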
This conjecture is tested for scores and loadings simultaneously as follows. The scores and loadings are scaled symmetrically and displayed together in a biplot. In Matlab, this symmetric scaling is realized as:
[U,S,V] = svd(X,0); T = U * sqrt(S); P = V * sqrt(S);
A little thought shows that this scaling leads to standard errors in scores and loadings that are both inversely proportional to the square root of the associated singular value.
It makes a lot of sense to scale the biplot using the command:
axis('equal');
Then the range of the points should be proportional to the square root of the associated singular value.
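This proportionality is easy to verify numerically. The following numpy sketch (an arbitrary random matrix, with numpy's svd standing in for the Matlab call above) confirms that after the symmetric scaling each score and loading vector has length equal to the square root of its singular value.

```python
# Check the symmetric biplot scaling T = U*sqrt(S), P = V*sqrt(S):
# every scaled score/loading vector has length sqrt(s_a).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 4))                 # arbitrary example matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * np.sqrt(s)                           # symmetrically scaled scores
P = Vt.T * np.sqrt(s)                        # symmetrically scaled loadings

print(np.allclose(np.linalg.norm(T, axis=0), np.sqrt(s)))  # True
print(np.allclose(np.linalg.norm(P, axis=0), np.sqrt(s)))  # True
```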
These considerations lead to a very clear interpretation of decreasing stability with increasing PC number, as illustrated by the following biplot:

The plot is nicely filled out, while the distortion (of the original X-space) caused by the particular scaling is compensated for by the confidence ellipses, which follow the scaling.

Conclusions


Different resampling methods may yield widely different results. For example, the uncertainty estimates provided by the jackknife and the nonparametric bootstrap often bear little resemblance to the effect of measurement noise, which is correctly assessed using the noise-addition method or the parametric bootstrap. The latter procedures lead to uncertainty estimates that are often much smaller, but more realistic. Correct resampling should lead to (slightly) more accurate results than the application of a correct analytical expression; however, the results of expressions tend to be less variable. Moreover, expressions yield the insight that is required to optimize an analytical procedure to meet specific needs.

References & further information


Open a list of references
For further information, please contact Klaas Faber:

