Multivariate (and multiway) calibration methods such as principal component regression (PCR) and partial least squares (PLS) regression require the analyst to select a suitable number of components (also known as latent variables or factors). In practice, this is anything but a trivial task!
Validation-based model selection
Currently, validation-based methods appear to be the standard tool. Using a large independent test set is generally considered best ('test = best'), at least in theory. However, setting aside a separate test set is rather wasteful of samples, so one may not always be available.
Cross-validation is the most popular alternative. However, it is a resampling method, so its correct application is, strictly speaking, limited to data that can be considered a random sample from some population. This may present a serious problem if the data are collected according to an experimental design. Cross-validation is also likely to fail when an existing model is to be updated with a few observations from another population.
A serious weakness is shared by all validation-based methods. Ideally, validation yields a minimum prediction error at the optimum model complexity (see Figure COM 1).
In practice, however, a clear minimum is often not observed. Instead, one has to rely on 'visual inspection' of 'far-from-ideal' plots, and apply 'soft' decision rules such as the 'first local minimum' or the 'start of a plateau'.
Which decision rule is applied actually depends on the data at hand.
It follows that validation-based model selection is inherently subjective in practice.
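To make the point concrete, the soft decision rules mentioned above can be written down as small functions. The following is only a minimal sketch: the function names, the 5 % plateau tolerance and the error values are illustrative assumptions, not part of any published procedure. It shows that three plausible rules can pick three different model complexities from the same validation curve.

```python
def first_local_minimum(errors):
    """Return the 1-based number of components at the first local minimum."""
    for i in range(len(errors) - 1):
        if errors[i] < errors[i + 1]:
            return i + 1
    return len(errors)  # the curve never turns upward


def start_of_plateau(errors, rel_tol=0.05):
    """Return the 1-based number of components after which one more
    component lowers the error by less than rel_tol (relative).
    Assumes all errors are positive; rel_tol is an arbitrary choice."""
    for i in range(len(errors) - 1):
        improvement = (errors[i] - errors[i + 1]) / errors[i]
        if improvement < rel_tol:
            return i + 1
    return len(errors)


def global_minimum(errors):
    """Return the 1-based number of components with the lowest error."""
    return errors.index(min(errors)) + 1


# Hypothetical validation errors for models with 1 to 7 components.
rmsecv = [1.00, 0.60, 0.45, 0.43, 0.44, 0.40, 0.41]

print(first_local_minimum(rmsecv))  # 4
print(start_of_plateau(rmsecv))     # 3
print(global_minimum(rmsecv))       # 6
```

With these made-up numbers the rules select 4, 3 and 6 components respectively, and a different tolerance would change the plateau answer again.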
Figure COM 1: The textbook illustration.
Randomization test
We have developed a randomization test that is intended to make the decision more objective. A tutorial introduction is:
- N.M. Faber
How to avoid over-fitting in PLS regression?
CAC 2006
Download (zipped, 268 kB)
For more practical examples, see:
- N.M. Faber and R. Rajkó
An evergreen problem in multivariate calibration
Spectroscopy Europe, 18 (2006) 24-28
Download from the Spectroscopy Europe site (1,470 kB)
- N.M. Faber and R. Rajkó
How to avoid over-fitting in multivariate calibration - the conventional validation approach and an alternative
Analytica Chimica Acta, 595 (2007) 98-106
- N.M. Faber, J. Mojet and A.A.M. Poelman
A novel randomization test for estimating the number of principal components in external preference mapping
Food Quality and Preference, submitted
- M.P. Gómez-Carracedo, J.M. Andrade, D.N. Rutledge and N.M. Faber
Selecting the optimum number of PLS components for the calibration of ATR-mid-IR spectra of undesigned kerosene samples
Analytica Chimica Acta, 585 (2007) 253-265
- S. Wiklund, D. Nilsson, L. Eriksson, M. Sjöström, S. Wold and K. Faber
A randomization test for PLS component selection
Journal of Chemometrics, 21 (2007) 427-439
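For readers unfamiliar with randomization tests in general, the sketch below shows the generic idea. It is not taken from the papers listed above and should not be read as their procedure. It assumes, purely for illustration, that the test statistic is the absolute covariance between one component's score vector and the response, and that the response values are permuted to build a reference distribution. The function names, the number of permutations and the data are all hypothetical.

```python
import random


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b)) / (n - 1)


def permutation_p_value(scores, y, n_permutations=999, seed=0):
    """Estimate how often a random reordering of y gives an absolute
    covariance with the scores at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(covariance(scores, y))
    shuffled = list(y)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between scores and y
        if abs(covariance(scores, shuffled)) >= observed:
            exceed += 1
    # The observed ordering counts as one of the possible permutations.
    return (exceed + 1) / (n_permutations + 1)


# Hypothetical scores of one component and a response that follows them closely.
scores = [-4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5]
y = [-9.1, -6.8, -5.2, -2.9, -1.1, 0.8, 3.1, 5.0, 7.2, 8.9]

p = permutation_p_value(scores, y)
print(p < 0.05)  # True: this component would be judged significant
```

A small value suggests that the component captures a real relation with the response, while a large value suggests that it is indistinguishable from chance. The significance level used as the cut-off remains a choice for the analyst.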
References & further information
Open a list of references.
For further information, please contact Róbert Rajkó: