Reliability of principal component analysis

The purpose of this page is to demonstrate the internal consistency of the methodology developed for a large variety of methods and problems; it is therefore admittedly rather long and detailed.

Principal component analysis (PCA) is generally considered to be the workhorse of multivariate data analysis, since so many methods are merely variations on the same basic theme. Well-known examples are:

  • soft independent modelling of class analogy (SIMCA), regularized discriminant analysis (RDA) and discriminant analysis with shrunken covariances (DASCO) for multivariate classification,
  • principal component scores-based multivariate statistical process control (MSPC),
  • modelling of multiplicative terms in analysis of variance (ANOVA),
  • generalized rank annihilation method (GRAM) and iterative target transformation factor analysis (ITTFA) for curve resolution and optional second-order bilinear calibration, and
  • principal component regression (PCR) and partial least squares (PLS) regression for multivariate calibration.


This page is organized as follows:

  • Setting the stage
  • Two (or three) basic models for the data matrix entering PCA
  • Expression-based approach to uncertainty estimation
  • Resampling-based approach to uncertainty estimation
  • Consequences of incorrect resampling
  • Multivariate classification
  • Multivariate statistical process control
  • Analysis of variance
  • Second-order bilinear calibration: GRAM
  • Multivariate calibration: regression coefficients
  • Multivariate calibration: scores and loadings
  • Conclusions
  • References & further information

Setting the stage

PCA generates eigenvalues and eigenvectors for a cross-product of a data matrix X. This cross-product can be calculated in covariance or in correlation form, either with or without extracting the mean. The focus can be on the column (Q-mode) or row (R-mode) space. The Q-mode eigenvectors (of XX^T) are named scores, while the R-mode eigenvectors (of X^TX) are named loadings.
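To make this bookkeeping concrete, here is a minimal MATLAB sketch (the toy data and dimensions are arbitrary assumptions) showing how both modes follow from a single singular value decomposition:

% R-mode versus Q-mode PCA and their link through the SVD.
N = 50; K = 5;                         % arbitrary toy dimensions
X  = randn(N,K);
Xc = X - repmat(mean(X,1),N,1);        % extract the mean (covariance form)

[V,Lr] = eig(Xc'*Xc);                  % R-mode: eigenvectors of X'X are the loadings
[U,Lq] = eig(Xc*Xc');                  % Q-mode: eigenvectors of XX' give the scores

[Us,S,Vs] = svd(Xc,0);                 % one SVD delivers both sets at once
T = Us*S;                              % denormalized scores
% The nonzero eigenvalues of both cross-products equal the squared singular values:
disp([sort(diag(Lr),'descend') diag(S).^2]);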

Traditionally, there has been a great deal of interest in determining the reliability of the eigenvalues and eigenvectors resulting from a PCA. Two basic approaches can be distinguished, namely analytical expressions and resampling. Analytical expressions often provide useful insight into the major sources of error, but their validity in practice may be limited owing to the underlying assumptions. By contrast, resampling makes fewer assumptions about the data-generating process.

Two early chemometrics papers that compare both approaches in an instructive way are:

  • E.R. Malinowski
    Theory of error applied to factor loadings resulting from combination target factor analysis
    Analytica Chimica Acta, 122 (1980) 327-330
  • B.A. Roscoe and P.K. Hopke
    Error estimates for factor loadings and scores obtained with target transformation factor analysis
    Analytica Chimica Acta, 132 (1981) 89-97


Although resampling certainly has the higher black-box content, it appears to be the preferred approach in chemometrics practice.

A major purpose of this page is to show that correct resampling of PCA-based models is rather complex. To this end, we will introduce two basic models and exploit the insight that can be gained from analytical expressions derived under these models. In particular, it is shown how to account for two major sources of error, i.e., sampling and measurement, as summarized in the following table:

Table PCA 1: Sources of error and potential approaches to account for them.

  Source of error          Analytical expressions     Resampling
  sampling (model I)       Equations [3] and [4]      jackknife, non-parametric bootstrap
  measurement (model II)   Equations [5] and [6]      parametric bootstrap, noise addition


It is stressed that a comparison of the results of analytical expressions and resampling can be very useful for the testing of computer programs, even after the choice of approach has been made. One may think here of a situation where resampling will definitely be employed, because the assumptions underlying the analytical expressions are too restrictive for the application at hand. Conversely, the adequacy of analytical expressions should always be verified using resampling.


Two (or three) basic models for the data matrix entering PCA



Model I is the one extensively treated in multivariate analysis, a rather specialized branch of statistics:

\mathbf{x}_n \sim (\boldsymbol{\mu},\, \boldsymbol{\Sigma}), \quad n = 1, \ldots, N \qquad [1]

The N rows of the data matrix X are K-dimensional observations, drawn from some distribution with mean μ and covariance matrix Σ. Note that measurement errors are not specified: one is only concerned with the uncertainty in the estimated population characteristics that can be attributed to the limited sample size (N). This model is suitable for describing data encountered in various fields. A typical example is presented in:

  • A.R. Gibson, A.J. Baker and A. Moeed
    Morphometric variation in introduced populations of the common myna (Acridotheres tristis): an application of the jackknife to principal component analysis
    Systematic Zoology, 33 (1984) 408-421


Size measurements are often very precise. Consequently, it is reasonable to attribute the variation among the rows of X entirely to biological origin.

Under this model one might, for example, be interested in the reliability of the loadings, i.e., the slopes of the axes in the PC space. To illustrate the variability of interest under model I, simple data sets are generated without noise:

PCA1.gif

Figure PCA 1: Simulated 25×2 data sets. The rows of X follow a bivariate normal distribution with mean μ = 0 and a fixed covariance matrix Σ. The uncertainty in the slope of PC 1 is estimated using the non-parametric bootstrap, which is explained below. The numbers in square brackets denote the number of outliers (out of 1000 bootstrap samples) encountered during resampling. Without trimming the edges of the resampling distribution one would, for example, obtain the estimate 3 for panel (d), which is excessively high.


By contrast, model II, which is often relevant in chemometrics, is as follows:

\mathbf{X} = \sum_{a=1}^{A} \mathbf{t}_a \mathbf{p}_a^{\mathrm{T}} + \mathbf{E} \qquad [2]

The data matrix X is a sum of outer products of score vectors t and loading vectors p, to which (measurement) noise E is added. Under model II, the data matrix X is of reduced rank in the absence of noise, see:

  • A.J. Burnham, J.F. MacGregor and R. Viveros
    Latent variable multivariate regression modeling
    Chemometrics and Intelligent Laboratory Systems, 48 (1999) 167-180


The main variations in the data are caused by a relatively small number of latent variables, which leads to a number of exact linear dependencies: with A latent variables, there are N−A or K−A exact dependencies, whichever is smaller. Under model I, by contrast, X will generally be of full rank; PCA is then an efficient data compression technique if the variables are correlated.

For simplicity, the noise is assumed to be independent and identically distributed (iid) with variance σ_X². This assumption is not unreasonable, because PCA is a least-squares (LS) method, and LS methods behave optimally only when the noise is iid. The assumption is equivalent to requiring that the data have been appropriately pre-treated.
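As an aside, a minimal MATLAB sketch of generating data according to model II (all dimensions and the noise level are arbitrary assumptions for illustration):

% Model II: X is a sum of A outer products of scores and loadings plus iid noise.
N = 100; K = 40; A = 2; sigmaX = 0.01;
T = randn(N,A);                        % score vectors t_a as columns
[P,R] = qr(randn(K,A),0);              % orthonormal loading vectors p_a as columns
X0 = T*P';                             % noiseless part: rank A by construction
X  = X0 + sigmaX*randn(N,K);           % add iid measurement noise E

disp(rank(X0));                        % equals A: exact linear dependencies
s = svd(X); disp(s(1:6)');             % A dominant singular values, the rest noise-sized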

It is important to note that the iid assumption leads to relatively simple expressions for bias and variance. More complicated expressions that allow for correlated and heteroscedastic noise are often available; in fact, the 'iid expressions' are usually obtained by simplifying the more general ones. An example of this will be given below for the scores and loadings in the context of multivariate calibration. Otherwise, the general expressions are considered to be outside the current scope.

This bilinear model describes two important types of data in chemometrics. The first is the data generated by a hyphenated instrument (e.g. chromatography × spectroscopy) or by excitation-emission fluorescence. For this type of data, a score plot clearly shows that the rows of X cannot be considered a random sample from some multivariate distribution:

PCA2.gif

Figure PCA 2: Score plot corresponding to the simulated HPLC-UV data matrix of Figure TAC 3 (100 spectra of 40 wavelengths). Subsequent rows of X form a trajectory in elution time, which would certainly not be the case if they constituted a random sample.


Model II is also appropriate for describing the training set data for multivariate calibration. A typical example is the calibration of near-infrared (NIR) spectra for the prediction of methyl tert-butyl ether (MTBE), ethanol (EtOH) and water (H2O) in gasoline, see:

  • N.M. Faber, D.L. Duewer, S.J. Choquette, T.L. Green and S.N. Chesler
    Characterizing the uncertainty in near-infrared spectroscopic prediction of mixed-oxygenate concentrations in gasoline: sample-specific prediction intervals
    Analytical Chemistry, 70 (1998) 2972-2982; 70 (1998) 4877


PCA3.gif

Figure PCA 3: Position of the samples in the plane spanned by the first two principal components: training set (cyan circles) and test set (yellow asterisks). With 97% explained variance, this two-dimensional plot gives an excellent overview of the data. The test sample in the upper right corner is clearly identified as an outlier. It is important to note that the appearance of score plots depends heavily on the scaling used, which is not a trivial matter.


Specifying the model in terms of signal and noise is an essential part of data analysis, since the purpose of uncertainty estimation is to account for all relevant sources of uncertainty. From that perspective, ignoring measurement noise under model I is certainly restrictive in practice. Model III is therefore introduced here as model I with independent noise added, as in model II.

The effects of limited sample size and measurement noise are illustrated below using the data sets depicted in Figures PCA 1 (simulated: 25×2) and PCA 3 (real NIR: 40×391).


Expression-based approach to uncertainty estimation



In principle, there are several ways to derive expressions for uncertainty estimates. The most rigorous approach is to derive the exact distribution of the relevant quantities; bias and variance would then follow from the first (non-central) and second (central) moments. However, this is seldom feasible for complex models such as PCA and the methods based on it. A Bayesian approach could be considered, but deriving the posterior distribution is often intractable for such complex models too.

A rather convenient approach is to approximate the quantity of interest using a Taylor series expansion, keeping only the leading terms. The required algebraic manipulations are covered in, for example, the excellent text:

  • J.R. Magnus and H. Neudecker
    Matrix Differential Calculus with Applications in Statistics and Econometrics
    John Wiley, Chichester, 2002


It suffices to keep the first-order term to approximate the variance. To obtain a non-zero estimate of the bias, however, one must include at least the second-order term to account for curvature. Keeping only the first-order term is known as local linearization or the delta method:

PCA4.gif

Figure PCA 4: Local linearization of the function Z = X². The approximate variance follows as σ_Z² ≈ (dZ/dX)² σ_X². This approximation can be quite adequate if the derivative dZ/dX is sufficiently constant over the region where the probability density function (pdf) of X, p(X), is appreciable. Note that for this simple function one need not make a restrictive assumption about the pdf of X; the normal distribution in the plot serves for illustration only. It is only required that X has a finite variance. Moreover, the central limit theorem (CLT) can often be invoked to motivate a normal distribution for the linearized quantity, e.g., a multivariate prediction. It is observed that ignoring curvature leads to zero bias: the first-order approximation of the mean of Z is the square of the mean of X. Jensen's inequality implies a positive bias for the convex function Z.
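A minimal MATLAB sketch that checks both statements for Z = X² (the mean, standard deviation and sample size are arbitrary assumptions):

% Delta method for Z = X^2: first order gives the variance, second order the bias.
muX = 5; sigmaX = 0.2; M = 1e5;
x = muX + sigmaX*randn(M,1);
z = x.^2;
fprintf('variance: Monte Carlo %.4f, delta method %.4f\n', var(z), (2*muX*sigmaX)^2);
fprintf('bias    : Monte Carlo %.4f, second order %.4f\n', mean(z)-muX^2, sigmaX^2);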


In the following, the focus is on the eigenvalues, to illustrate the essential difference between the effect of a limited sample size under model I and that of measurement noise under model II. Expressions are available for the eigenvectors too, but they are not considered for the moment, since they share the relevant features. Expressions for the standard errors in the scores and loadings under model II will be given in a separate section below.

For model I, approximate bias and variance are derived using a truncated Taylor series expansion, and assuming multivariate normality, in:

  • M.A. Girshick
    On the sampling theory of determinantal equations
    Annals of Mathematical Statistics, 10 (1939) 203-224
  • D.N. Lawley
    Tests of significance for the latent roots of covariance and correlation matrices
    Biometrika, 43 (1956) 128-136


The results are:

E(l_a) \approx \lambda_a + \frac{\lambda_a}{N-1} \sum_{b=1,\, b \neq a}^{A} \frac{\lambda_b}{\lambda_a - \lambda_b} \qquad [3]

\sigma^2(l_a) \approx \frac{2 \lambda_a^2}{N-1} \qquad [4]

where l denotes an eigenvalue estimate and λ its population counterpart, the subscript refers to the PC number, N is the number of objects, and A is the number of significant PCs.
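As an illustration, a minimal MATLAB sketch that compares Equation [4] with a Monte Carlo experiment (the population covariance matrix is an assumption chosen for illustration):

% Monte Carlo check of Equation [4]: se(l_a) ~ lambda_a*sqrt(2/(N-1)) under model I.
N = 25; M = 1000;
Sigma = [1 0.8; 0.8 1]; C = chol(Sigma);          % assumed population covariance
l = zeros(M,2);
for m = 1:M
    X  = randn(N,2)*C;                            % one noiseless model I sample
    Xc = X - repmat(mean(X,1),N,1);
    l(m,:) = sort(eig(Xc'*Xc/(N-1)),'descend')';  % sample covariance eigenvalues
end
lam = sort(eig(Sigma),'descend')';                % population eigenvalues
disp([std(l); lam*sqrt(2/(N-1))]);                % Monte Carlo versus Equation [4]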

It is noted that Equations [3] and [4] are valid for covariance-based PCA. Lawley has also addressed correlation-based PCA. This line of research has undergone a major revival through the work of Ogasawara, see:

  • H. Ogasawara
    Standard errors of the principal component loadings for unstandardized and standardized variables
    British Journal of Mathematical and Statistical Psychology, 53 (2000) 155-174
  • H. Ogasawara
    Concise formulas for the standard errors of component loading estimates
    Psychometrika, 67 (2002) 289-297
  • H. Ogasawara
    Asymptotic biases of the unrotated/rotated solutions in principal component analysis
    British Journal of Mathematical and Statistical Psychology, 57 (2004) 353-376


Although those expressions are much more complicated, they do not differ with respect to the relevant features. Consequently, we will focus on the simpler expressions.

For model II, approximate bias and variance are derived using the same first-order approach in:

  • L.A. Goodman and S.J. Haberman
    The analysis of nonadditivity in two-way analysis of variance
    Journal of the American Statistical Association, 85 (1990) 139-145
  • N.M. Faber, M.J. Meinders, P. Geladi, M. Sjöström, L.M.C. Buydens and G. Kateman
    Random error bias in principal component analysis. I: Derivation of theoretical predictions
    Analytica Chimica Acta, 304 (1995) 257-271
  • N.M. Faber, L.M.C. Buydens and G. Kateman
    Standard errors in the eigenvalues of a cross-product matrix: theory and applications
    Journal of Chemometrics, 7 (1993) 495-526


The results are:

E(l_a) \approx \lambda_a + (N + K - 2A)\, \sigma_X^2 \qquad [5]

\sigma^2(l_a) \approx 4\, \lambda_a\, \sigma_X^2 \qquad [6]

where l denotes an eigenvalue estimate and λ its noiseless counterpart, the subscript refers to the PC number, N is the number of objects, K is the number of variables, A is the number of significant PCs, and σ_X² is the variance of the noise in X.
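Analogously, a minimal MATLAB sketch of a Monte Carlo check of Equation [6] (the signal, dimensions and noise level are arbitrary assumptions; the eigenvalues are those of the raw cross-product X^TX):

% Monte Carlo check of Equation [6]: se(l_a) ~ 2*sqrt(lambda_a)*sigma_X under model II.
N = 25; K = 10; A = 2; sigmaX = 0.05; M = 1000;
T0 = randn(N,A); [P0,R] = qr(randn(K,A),0);
X0 = T0*P0';                                  % fixed noiseless rank-A signal
lam = sort(eig(X0'*X0),'descend');            % noiseless eigenvalues
l = zeros(M,A);
for m = 1:M
    s = svd(X0 + sigmaX*randn(N,K));          % one noisy realization
    l(m,:) = s(1:A)'.^2;                      % first A eigenvalues of X'X
end
disp([std(l); 2*sqrt(lam(1:A))'*sigmaX]);     % Monte Carlo versus Equation [6]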

Equations [5] and [6] are obtained using a so-called small-sample approach, i.e., by keeping N fixed and letting σ_X² → 0. They are valid for both covariance- and correlation-based PCA and do not require a normal distribution for the errors in X. Large-sample results, i.e., derived by keeping σ_X² fixed and letting N → ∞, are presented in:

  • L.J. Gleser
    Estimation in a multivariate "errors in variables" regression model: large sample results
    Annals of Statistics, 9 (1981) 24-44


The essential difference between the two models is apparent from these two pairs of expressions. Under model I, the size of the sample (N), with its interpretation in terms of degrees of freedom (N−1), explicitly determines the uncertainty in the model results. By contrast, under model II the uncertainty in the model results explicitly scales with the size of the measurement noise (σ_X²). For both models, the relative uncertainty estimates further depend implicitly on N through the eigenvalues themselves.

In summary, Equations [3,4] and [5,6] concisely reflect the limitations of the sampling and of the measurement system, respectively. They illustrate the principle that expressions may provide the insight required for tuning an analytical procedure to meet specific needs. For model III, one would need to combine these expressions.


Resampling-based approach to uncertainty estimation



The most dependable route to uncertainty estimates is to repeat the experiment under relevant conditions, i.e., under either repeatability or reproducibility conditions. With resampling, one essentially attempts to simulate this time-consuming process through numerical experiments. Popular methods are the jackknife and the bootstrap. It is important to note that the bootstrap can be performed in two modes, namely parametric and non-parametric, which are also known as conditional and unconditional, respectively. In the non-parametric mode, one draws rows of X with replacement, whereas in the parametric mode, one constructs a (parametric) model and resamples the residuals instead. The adjective 'conditional' derives from the residuals being calculated conditional on a model. The jackknife is similar to the non-parametric bootstrap.
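The difference between the two bootstrap modes is perhaps best conveyed by a minimal MATLAB sketch for the largest eigenvalue (the toy data and the element-wise residual resampling are assumptions for illustration):

% The two bootstrap modes for the largest eigenvalue of X'X.
N = 25; K = 10; A = 2; B = 1000;
X = randn(N,A)*randn(A,K) + 0.05*randn(N,K);   % toy data of approximate rank A

[U,S,V] = svd(X,0);
Xhat = U(:,1:A)*S(1:A,1:A)*V(:,1:A)';          % fitted A-component model
R    = X - Xhat;                               % fit residuals

l_np = zeros(B,1); l_p = zeros(B,1);
for b = 1:B
    idx = ceil(N*rand(N,1));                   % rows drawn with replacement
    l_np(b) = max(svd(X(idx,:)))^2;            % non-parametric (unconditional) mode
    Rb  = R(ceil(numel(R)*rand(N,K)));         % residuals drawn with replacement
    l_p(b)  = max(svd(Xhat + Rb))^2;           % parametric (conditional) mode
end
fprintf('se(l_1): non-parametric %.3f, parametric %.3f\n', std(l_np), std(l_p));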

Good general introductions to these methods can be found in:

  • B. Efron and R.J. Tibshirani
    An Introduction to the Bootstrap
    Chapman & Hall, New York, 1993
  • R. Wehrens, H. Putter and L.M.C. Buydens
    The bootstrap: a tutorial
    Chemometrics and Intelligent Laboratory Systems, 54 (2000) 35-52


For an excellent treatment of the parametric bootstrap connected with models I and II, see:

  • L. Milan and J. Whittaker
    Application of the parametric bootstrap to models that incorporate a singular value decomposition
    Applied Statistics, 44 (1995) 31-49


Milan and Whittaker explain in detail how to cope with specific aspects of the singular value decomposition (SVD), which is fully equivalent to PCA. The parameters of the SVD are the singular values on the diagonal of S (the positive square roots of the eigenvalues), the normalized left singular vectors in U (the Q-mode eigenvectors) and the normalized right singular vectors in V (the R-mode eigenvectors).

To identify these parameters, orthogonality constraints are applied. Since these constraints depend on the data matrix at hand, spurious variability is introduced when resampling. Milan and Whittaker argue that this nuisance variability can be minimized by rotating the resampling results towards a common reference, for which the parameters of the original SVD constitute a natural choice. They propose an orthogonal Procrustes rotation, because it additionally takes care of reordering of the singular values ('factor switching') and sign reflections of the singular vectors, while preserving the optimal LS fit to the data.
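A minimal MATLAB sketch of this Procrustes step (the reference loadings and the perturbations are placeholders for illustration):

% Orthogonal Procrustes rotation of resampled loadings Vb towards the original V.
A = 3;
V  = orth(randn(20,A));                  % reference loadings from the original SVD
Vb = V(:,[2 1 3]); Vb(:,1) = -Vb(:,1);   % resampled version: switched and reflected
Vb = Vb + 0.01*randn(size(Vb));          % plus a little resampling noise

[Q1,D,Q2] = svd(Vb'*V);                  % solve min ||Vb*Q - V|| over orthogonal Q
Q = Q1*Q2';                              % optimal rotation (also fixes switching/signs)
disp(norm(Vb*Q - V,'fro'));              % small: nuisance variability removed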

The following plots serve to illustrate the working of the jackknife, the non-parametric (unconditional) bootstrap and the parametric (conditional) bootstrap for a simple univariate model:

PCA5.gif

Figure PCA 5: Resampling for a zero-intercept least-squares straight-line fit of a single x versus a single y. (a) Original data points (green circles), model (yellow line), fitted points (red asterisks) and fit residuals (red dashes). (b) Model (yellow line) for objects {1,2,4,5} selected by the jackknife (green circles). (c) Model (yellow line) for objects {4,4,4,5,2} drawn with replacement by the non-parametric bootstrap (green circles). (d) Original fitted points (red asterisks) and model (yellow line) for objects (green circles) obtained by adding fit residuals {3,2,3,1,4}, drawn with replacement by the parametric bootstrap (red dashes), to the original fitted points.


To this repertoire of popular resampling methods one should add the noise addition method. This method is similar to the parametric bootstrap in that it works directly with the (measurement) noise in the data. However, noise addition is more flexible, because it can handle correlated and heteroscedastic noise, whereas the parametric bootstrap requires the noise to be iid.

Variance estimation using the noise addition method is explained in:

  • W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling
    Numerical recipes. The art of scientific computing
    Cambridge University Press, Cambridge, UK, 1988, §14.5


When used for bias estimation, it is named simulation extrapolation (SIMEX), see:

  • R.J. Carroll, D. Ruppert and L.A. Stefanski
    Measurement error in nonlinear models
    Chapman & Hall, London, UK, 1995, Chapter 4


A bias estimate is obtained by adding noise at several levels and extrapolating back to the target level where no noise is present. This bias estimate can then be used to obtain an improved, bias-corrected estimate:

PCA6.gif

Figure PCA 6: Bias estimation and correction of the 'naive' estimate using SIMEX.


This process is equivalent to the well-known method of standard additions in analytical chemistry.
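A minimal MATLAB sketch of the SIMEX recipe for the largest eigenvalue (the dimensions, noise level and grid of added-noise levels are assumptions; the extrapolation is taken to be linear in the total noise variance, in line with the form of Equation [5]):

% SIMEX for the largest eigenvalue of X'X: add noise, then extrapolate back.
N = 25; K = 10; A = 2; sigmaX = 0.05; M = 200;
X = randn(N,A)*randn(A,K) + sigmaX*randn(N,K); % one observed data matrix

zeta = [0 0.5 1 1.5 2];                        % added variance, in units of sigmaX^2
lbar = zeros(size(zeta));
for i = 1:numel(zeta)
    l = zeros(M,1);
    for m = 1:M                                % average over M noise realizations
        l(m) = max(svd(X + sqrt(zeta(i))*sigmaX*randn(N,K)))^2;
    end
    lbar(i) = mean(l);
end
c = polyfit(1+zeta, lbar, 1);                  % total noise variance is (1+zeta)*sigmaX^2
fprintf('naive estimate %.3f, SIMEX estimate %.3f\n', lbar(1), polyval(c,0));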

It is emphasized that SIMEX can be used to correct for the type of bias exemplified by Equation [5], i.e., bias that depends explicitly on σ_X². The jackknife can also be used as a bias-correcting device, but the type of bias handled by the jackknife is the one exemplified by Equation [3], i.e., bias that depends explicitly on N.

It is easily verified that the jackknife and the non-parametric bootstrap assess the effect of the limited sample size (N). Generating noiseless data sets as shown in Figure PCA 1 leads to the following estimates of the standard errors in the eigenvalues:

PCA7.gif

Figure PCA 7: Comparison of jackknife and non-parametric bootstrap with Equation [4] for 1000 noiseless 25×2 data matrices. The bootstrap results are based on 1000 draws of rows with replacement. The symbols are explained in the text.


It follows that the jackknife and the non-parametric bootstrap are highly suited for assessing the effect of limited sample size under model I in cases where Equation [4] cannot be used, owing to its restrictive distributional assumption of multivariate normality.

The histograms can be summarized as follows:

Table PCA 2: Estimated standard errors in the eigenvalues for 1000 noiseless 25×2 data matrices.

PCA Table 2.gif


The second column contains the standard deviations of the eigenvalues over the 1000 repetitions. Being based on such a large number of (assumed) independent realizations, these values can be regarded as highly accurate. It is observed that Equation [4] performs slightly better than the resampling alternatives. First, the average is closer to the target values in column two. This observation is, however, somewhat misleading: in general, one expects resampling results to be (slightly) more accurate, because fewer approximations and assumptions are made. Second, the expression-based results tend to be less variable. This favorable aspect has been reported in the context of multivariate calibration in:

  • M.C. Denham
    Choosing the number of factors in partial least squares regression: estimating and minimizing the mean squared error of prediction
    Journal of Chemometrics, 14 (2000) 351-361


Interestingly, the 29% spread in the expression-based results is very close to what can be expected from the form of Equation [4]. Note that this expression must be evaluated using the estimate of the associated eigenvalue. It can be inferred from Equation [4] that plugging in this estimate introduces a relative uncertainty of a factor (2/(N−1))½ = (1/12)½ = 0.29. This factor is identical to the rule of thumb that holds for the common estimate of a variance, which is indeed favorable. It is also consistent with the classical interpretation of the eigenvalues as fractions of the 'explained sample variance'. However, since this observation depends entirely on Equation [4], one might wonder how to interpret the eigenvalues when Equation [6] is appropriate instead.

Finally, it is reiterated that this kind of comparison is believed to be useful for the testing of computer programs.


Consequences of incorrect resampling



The form of Equations [3]-[6] implies that uncertainty estimates can differ greatly, depending on the model assumed. It follows that resampling should be used with caution. The potential effect of incorrect resampling on the standard error in the eigenvalues of PCA was determined using Monte Carlo simulations in:

  • N.M. Faber, L.M.C. Buydens and G. Kateman
    Standard errors in the eigenvalues of a cross-product matrix: theory and applications
    Journal of Chemometrics, 7 (1993) 495-526


Monte Carlo simulations are ideal for testing the adequacy of an uncertainty assessment, because the desired outcome is known in advance. For data that were artificially generated according to model II, the jackknife and the non-parametric bootstrap overestimated the standard error of the dominant eigenvalue by more than a factor of 30.

The consequences of incorrect resampling may be illustrated using the NIST data set. For this data set, it was determined that the measurement noise in the spectra (X) was negligible. In other words, the uncertainty in the calibration results (regression coefficient vector, predictions) was mainly caused by the noise in the predictands (y). The following results are obtained:

Table PCA 3: Estimated standard errors in the eigenvalues of the NIST training set spectra (40×391). All numbers are multiplied by 10². The symbols are explained in the text.

PCA Table 3.gif


The standard errors in the third column are obtained using a highly pessimistic estimate of the noise, namely 10 times the average residual for the 5-PC model, while optimal prediction requires up to 12 PCs for ethanol. Even so, the estimated standard errors are negligible compared with the associated eigenvalues, which is correct. By contrast, Equation [4] and non-parametric resampling, which are mutually consistent, overestimate even this already pessimistic estimate by factors of 30 and 8, respectively. (The bootstrap results are based on 1000 draws of rows with replacement.)


Multivariate classification



The main purpose of classification is to assign objects to one of several groups or classes. The classes are seen as separate populations, and the distinction between them is quantified in terms of population characteristics. Hence the mean μ and covariance matrix Σ of each class play a key role. This is also reflected in the development of new classification methods. For a nice example, see:

  • I.E. Frank
    DASCO - a new classification method
    Chemometrics and Intelligent Laboratory Systems, 4 (1988) 215-222


Clearly, model I or III applies, so that the jackknife and the non-parametric bootstrap are suitable resampling strategies, perhaps supplemented with noise addition in a nested set-up.

For an impressive series of papers concerning implementation of the non-parametric bootstrap, see:

  • R.A. Lodder and G.M. Hieftje
    Quantile BEAST attacks the false-sample problem in near-infrared reflectance analysis
    Applied Spectroscopy, 42 (1988) 1351-1365
  • R.A. Lodder and G.M. Hieftje
    Detection of subpopulations in near-infrared reflectance analysis
    Applied Spectroscopy, 42 (1988) 1500-1512
  • R.A. Lodder and G.M. Hieftje
    Quantile analysis: a method for characterizing data distributions
    Applied Spectroscopy, 42 (1988) 1512-1520


For a comparison of two parametric bootstrap methods, see:

  • B.M. Smith and P.J. Gemperline
    Bootstrap methods for assessing the performance of near-infrared pattern classification techniques
    Journal of Chemometrics, 16 (2002) 241-246


One of the two parametric methods, namely resampling from residuals with replacement, performed poorly. It is likely that this method failed because it is intended to work under model II. In other words, it assesses a source of variability that may be irrelevant here.


Multivariate statistical process control



Determining whether a process is out of control is similar to multivariate classification. Consequently, model I or III applies. So-called contribution plots are useful for identifying the process variables that are responsible for an out-of-control situation. Attaching confidence limits to the contributions makes their interpretation less subjective. For an application of the non-parametric bootstrap to this problem, see:

  • A.K. Conlin, E.B. Martin and A.J. Morris
    Confidence limits for contribution plots
    Journal of Chemometrics, 14 (2000) 725-736


Some criticism is warranted:

  • The authors conclude with "More recent work [15] derives estimates for sp using a different approach; however, this is only valid for covariance-based PCA, so while it may be an improvement on the approach of Jolliffe [7], it would still not be applicable for industrial situations." However, Reference [15] covers correlation-based PCA as well; it just deals with model II. It is reiterated that uncertainty assessment can be fundamentally different, depending on the task to be performed. Model II is, for example, well suited for multivariate calibration, see below.
  • The reported confidence limits will not provide nominal coverage, simply because they have not been corrected for the multiple-testing character of the procedure (e.g. using a Bonferroni correction).


Analysis of variance



Model II applies. For pertinent references and a rigorous derivation of Equations [5] and [6], see:

  • L.A. Goodman and S.J. Haberman
    The analysis of nonadditivity in two-way analysis of variance
    Journal of the American Statistical Association, 85 (1990) 139-145


Parametric resampling should be used.


Second-order bilinear calibration: GRAM



GRAM is a method for curve resolution and calibration that requires only a single calibration sample. The data matrices of the calibration and prediction samples should follow model II (see derivation; PDF, 229 kB), hence the parametric bootstrap and the noise addition method are appropriate resampling strategies.

GRAM is only one of many competing methods that can be applied to this particular data set-up. However, GRAM stands out for two reasons. First, its error analysis has been studied extensively, see below. Second, GRAM has been reported to compare favorably with alternating least squares (ALS), alternating trilinear decomposition (ATLD), alternating coupled vectors resolution (ACOVER), alternating slice-wise diagonalization (ASD), alternating coupled matrices resolution (ACOMAR), self-weighted alternating trilinear decomposition (SWATLD) and pseudo alternating least squares (PALS), see:

  • N.M. Faber
    Towards a rehabilitation of the generalized rank annihilation method (GRAM)
    Analytical and Bioanalytical Chemistry, 372 (2002) 683-687


For a Monte Carlo study of the sensitivity of the GRAM solution to measurement noise, see:

  • K.S. Booksh and B.R. Kowalski
    Error analysis of the generalized rank annihilation method
    Journal of Chemometrics, 8 (1994) 45-63


For the derivation of variance expressions that are consistent with Equation [6], see:

  • N.M. Faber, A. Lorber and B.R. Kowalski
    Generalized rank annihilation method: standard errors in the estimated eigenvalues if the instrumental errors are heteroscedastic and correlated
    Journal of Chemometrics, 11 (1997) 95-109


For a comparison of various strategies applied to variance estimation, see:

  • N.M. Faber
    Quantifying the effect of measurement errors on the uncertainty in bilinear model predictions: a small simulation study
    Analytica Chimica Acta, 439 (2001) 193-201


For successful bias estimation using the noise addition method in SIMEX mode, see:

  • N.M. Faber, J. Ferré and R. Boqué
    Iteratively reweighted generalized rank annihilation method. 1. Improved handling of prediction bias
    Chemometrics and Intelligent Laboratory Systems, 55 (2001) 67-90


The question of correct resampling in connection with GRAM has been discussed in:

  • N.M. Faber
    Comment on a recently proposed resampling method
    Journal of Chemometrics, 15 (2001) 169-188
  • A.K. Smilde
    Reply to 'Comment on a recently proposed resampling method'
    Journal of Chemometrics, 15 (2001) 189-192


Many of the conclusions derived for GRAM should hold equally well for the alternative methods.


Multivariate calibration: regression coefficients



For multiple linear regression (MLR) using ordinary least squares (OLS), expressions are available for the standard error in the regression coefficients when the (full-rank) predictor matrix (X) is corrupted by noise, see:

  • S.D. Hodges and P.G. Moore
    Data uncertainties and least squares regression
    Applied Statistics, 21 (1972) 185-195
  • R.B. Davies and B. Hutton
    The effect of errors in the independent variables in linear regression
    Biometrika, 62 (1975) 383-391


Assuming iid noise in the predictands y and the predictors X, an approximate covariance matrix for the estimated regression coefficient vector follows as

V(\hat{\mathbf{b}}) \approx \left( \sigma_y^2 + \lVert \mathbf{b} \rVert^2\, \sigma_X^2 \right) \left( \mathbf{X}^{\mathrm{T}} \mathbf{X} \right)^{-1} \qquad [7]

where b is the regression coefficient vector, σ_y² and σ_X² denote the variance of the noise in y and X, respectively, and ||·|| signifies the Euclidean norm of a vector, i.e., its length. This approximation, which becomes exact when X is error-free, illustrates that it is the effect of the measurement noise that should be assessed. Consequently, the parametric bootstrap and the noise addition method should perform best.
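A minimal MATLAB sketch of a Monte Carlo check of Equation [7] (the true coefficients, dimensions and noise levels are arbitrary assumptions):

% Monte Carlo check of Equation [7] for MLR with noise in both y and X.
N = 50; K = 3; sigmay = 0.1; sigmaX = 0.05; M = 2000;
X0 = randn(N,K); b = [1; -0.5; 2];             % error-free predictors, true coefficients
y0 = X0*b;
bhat = zeros(M,K);
for m = 1:M
    X = X0 + sigmaX*randn(N,K);                % noisy predictors
    y = y0 + sigmay*randn(N,1);                % noisy predictands
    bhat(m,:) = (X\y)';                        % OLS estimate
end
V_expr = (sigmay^2 + norm(b)^2*sigmaX^2)*inv(X0'*X0);  % Equation [7]
disp([diag(cov(bhat)) diag(V_expr)]);          % Monte Carlo versus expression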

Generally accepted analogues of Equation [7] do not exist for the calibration of rank-deficient X using methods like PCR and PLS. Various resampling schemes have been compared with a simple expression in:

  • N.M. Faber
    Uncertainty estimation for multivariate regression coefficients
    Chemometrics and Intelligent Laboratory Systems, 64 (2002) 169-179


For the PLS calibration of the NIST data set, parametric resampling indeed appears to work well, whereas non-parametric resampling overestimates the standard error in the regression coefficients by a factor of five or more:

PCA8.gif

Figure PCA 8: Standard errors in the PLS coefficients (×10⁻²) for methyl tert-butyl ether (MTBE), ethanol (EtOH) and water (H2O): expression-based versus the ones obtained by the non-parametric (top) and parametric (bottom) bootstrap.


The adequacy of the expression and the parametric bootstrap is visually confirmed by the smoothness of the regression coefficient vector, for example for water (H2O):

PCA9.gif

Figure PCA 9: PLS coefficients for H2O.


From the standard error estimates, one may determine for each coefficient whether its 90% confidence interval includes zero. The non-parametric bootstrap suggests that 142 (out of 391) coefficients are insignificant, versus 28 for the parametric bootstrap. Only the latter number seems reasonable: on the basis of 142 insignificant coefficients, one would expect many more zero crossings than are actually observed. This is further illustrated by plotting only the insignificant coefficients for a limited region:

PCA10.gif

Figure PCA 10: PLS coefficients for H2O between 7600 and 8200 cm⁻¹ for which the 90% interval includes zero: non-parametric (cyan dots) and parametric (green asterisks) bootstrap.
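For concreteness, a minimal MATLAB sketch of this counting procedure (the coefficient and standard error vectors are placeholders; in practice they would come from a bootstrap or from an expression):

% Count the coefficients whose 90% confidence interval includes zero.
b  = randn(391,1); se = 0.05 + 0.5*rand(391,1);  % placeholder values for illustration
z  = 1.645;                                      % two-sided 90% normal quantile
insignificant = abs(b) < z*se;                   % interval [b-z*se, b+z*se] covers zero
fprintf('%d out of %d coefficients are insignificant\n', sum(insignificant), numel(b));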


Several researchers have independently proposed to use the standard errors in the regression coefficients to eliminate uninformative variables, see:

  • V. Centner, D.-L. Massart, O.E. De Noord, S. De Jong, B.M. Vandeginste and C. Sterna
    Elimination of uninformative variables for multivariate calibration
    Analytical Chemistry, 68 (1996) 3851-3858
  • R. Wehrens and W.E. Van der Linden
    Bootstrapping principal component regression models
    Journal of Chemometrics, 11 (1997) 157-171
  • H. Martens and M. Martens
    Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR)
    Food Quality and Preference, 11 (2000) 5-16


It should be noted that this procedure amounts to performing multiple t-tests. Consequently, some sort of correction must be applied to maintain the overall significance level, e.g. a Bonferroni correction. It is well known that this limits the number of variables that can be tested. Moreover, the correlation among the variables is entirely neglected when performing separate t-tests. A simple alternative is to calculate a joint confidence region for a set of variables using a standard F-test, as suggested in:

  • N.M. Faber and B.R. Kowalski
    Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares
    Journal of Chemometrics, 11 (1997) 181-238


Multivariate calibration: scores and loadings



PCA is often used for data visualization (exploratory data analysis). Score plots are inspected to see how the samples are distributed over the calibrated range (trends, clusters, outliers), while loading plots may reveal relationships among the variables. Unfortunately, score plots are not without problems, because their correct interpretation depends critically on the scaling used, see:

  • P. Geladi, M. Manley and T. Lestander
    Scatter plotting in multivariate data analysis
    Journal of Chemometrics, 17 (2003) 503-511


N.B. Scaling problems do not arise in loading plots since the loadings are all normalized to unit length.

The default option in many software packages is to fill out the score plot evenly. This certainly leads to plots that are visually attractive. However, the relative positions of the points can be misleading.

Figure PCA 3 is an example of a score plot obtained with the default scaling. Both axes have the same length, but close examination shows that the ranges are 0.19 and 0.14, respectively. In other words, the spacing of the tick marks is not identical along the two axes, as it should be. The distortion of the original X-space is moderate in this case, however. Geladi et al. give an extreme example, where the second PC is blown up by the default scaling.

A correct interpretation of distances is still possible when uncertainties are attached to the points (e.g. confidence ellipses), because the uncertainties follow the scaling: scaling is a linear operation.

The analogy with the regression coefficients implies that the effect of (measurement) noise should be assessed so that parametric resampling is appropriate here.

Approximate bias and variance expressions can be obtained from Theorem 4 in:

  • L.A. Goodman and S.J. Haberman
    The analysis of nonadditivity in two-way analysis of variance
    Journal of the American Statistical Association, 85 (1990) 139-145


In the following, we will restrict ourselves to variances. Assuming iid noise in X, approximate covariance matrices for the scores and loadings follow as:

PCA eq 8.gif

PCA eq 9.gif

where t, p and l denote a denormalized score vector, a normalized loading vector and an eigenvalue, respectively, the subscript refers to the PC number, A is the number of significant PCs, σ_X² is the variance of the noise in X, and I denotes an appropriately dimensioned identity matrix.

It is important to note that Equations [8] and [9] are obtained, without additional approximations, by simplifying the fully general ones that accommodate correlated and heteroscedastic noise:

PCA eq 10.gif

PCA eq 10.gif

where t, p, u and s denote a denormalized score vector, a normalized loading vector, a normalized score vector and a singular value, respectively, the subscript refers to the PC number, A is the number of significant PCs, N and K are the numbers of rows and columns of X, and Σ_row and Σ_col denote the covariance matrices of the noise for the rows and columns of X.

It is observed that correlated noise leads to correlation between the scores and loadings associated with different PCs.

The adequacy of the simplified expressions [8] and [9] is tested by adding iid noise with standard deviation 0.2 to the data matrices of Figure PCA 1. This noise level amounts to a signal-to-noise ratio of approximately 6. The spread in the scores over 1000 repetitions leads to the following estimates of the covariance matrices:

PCA11.gif

Figure PCA 11: Covariance matrices for scores 1 and 2 of the noisy 25×2 data matrices. The scores are normalized to illustrate that PC 1 is more stable than PC 2: the vertical scale is 10⁻⁴ for PC 1 versus 10⁻³ for PC 2.


It is apparent from these plots that the components of the score vectors are indeed uncorrelated when the noise is uncorrelated. Similar results are obtained for the loadings (not shown).
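A minimal MATLAB sketch of this Monte Carlo experiment for score 1 (the noiseless 25×2 matrix is a placeholder; note the sign alignment, which plays the role of the rotation towards a common reference discussed above):

% Monte Carlo covariance matrix of score 1 under added iid noise.
N = 25; sigmaX = 0.2; M = 1000;
X0 = randn(N,2)*chol([1 0.8; 0.8 1]);          % placeholder noiseless 25x2 data
[U0,S0,V0] = svd(X0,0);
t1ref = U0(:,1)*S0(1,1);                       % reference score vector 1
T1 = zeros(M,N);
for m = 1:M
    [U,S,V] = svd(X0 + sigmaX*randn(N,2),0);
    t1 = U(:,1)*S(1,1);                        % denormalized score 1
    if t1'*t1ref < 0, t1 = -t1; end            % resolve the sign indeterminacy
    T1(m,:) = t1';
end
C1 = cov(T1);                                  % Monte Carlo covariance of score 1
disp(diag(C1)');                               % per-object score variances (their square roots appear in Figure PCA 12)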

Plotting the square root of the diagonal of the covariance matrices yields:

PCA12.gif

Figure PCA 12: Comparison of Monte Carlo results and expression-based estimates of standard errors in the scores of noisy 25×2 data matrices.


The agreement appears to be good enough to allow the construction of confidence ellipses.

This conjecture is tested for the scores and loadings simultaneously as follows. The scores and loadings are scaled symmetrically and displayed together in a biplot. In Matlab, this symmetric scaling is realized as:

[U,S,V] = svd(X,0); T = U * sqrt(S); P = V * sqrt(S);

A little reflection shows that this scaling leads to standard errors in the scores and loadings that are both inversely proportional to the square root of the associated singular value.

It also makes good sense to scale the biplot using the command:

axis('equal');

Then the range of the points should be proportional to the square root of the associated singular value.
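Putting the two commands together, a minimal MATLAB sketch of such a biplot (the data matrix is a placeholder; the confidence ellipses of Figure PCA 13 are omitted):

% Biplot with symmetric scaling of scores and loadings.
X = randn(25,2)*chol([1 0.8; 0.8 1]) + 0.2*randn(25,2);  % placeholder noisy data
[U,S,V] = svd(X,0);
T = U*sqrt(S); P = V*sqrt(S);          % symmetric scaling of scores and loadings
plot(T(:,1),T(:,2),'c.', P(:,1),P(:,2),'gs');
axis('equal');                         % equal tick spacing: undistorted distances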

These considerations lead to a very clear interpretation of the decreasing stability with increasing PC number, as illustrated by the following biplot:

PCA13.gif

Figure PCA 13: Biplot of the scores (cyan dots) and loadings (green filled squares) obtained after adding noise to the (errorless) 25×2 data set depicted in panel (a) of Figure PCA 1. The ellipses describe approximate 95%-confidence regions. The center is also indicated (yellow asterisk).


The plot is nicely filled out, while the distortion of the original X-space caused by the particular scaling is compensated for by the confidence ellipses, which follow the scaling.


Conclusions



Different resampling methods may yield widely different results. For example, the uncertainty estimates provided by the jackknife and the non-parametric bootstrap often bear little resemblance to the effect of measurement noise, which is correctly assessed using the noise addition method or the parametric bootstrap. The latter procedures lead to uncertainty estimates that are often much smaller, but more realistic. Correct resampling should lead to (slightly) more accurate results than the application of a correct analytical expression. However, expression-based results tend to be less variable. Moreover, expressions yield the insight that is required for tuning an analytical procedure to meet specific needs.


References & further information


A list of references is available separately.

For further information, please contact Klaas Faber.