|
Analytical figures of merit and other reliability measures enable one to improve or optimize an analytical determination. Often, this task boils down to performing a combinatorial search. A good example is finding the best subset of wavelength channels to keep for model building.
This page is organized as follows:
Best subset selection
The number of combinations grows exponentially with the total number of channels:
|
|
|
Figure OPT 1: The number of combinations as a function of the number of channels. The data are plotted on a logarithmic scale (base 10) for visual clarity.
|
|
Consequently, large-scale optimization techniques are required to do the job within a reasonable time span. The excellent survey of Ronald Shaffer and Gary Small (Learning optimization from nature. Genetic algorithms and simulated annealing, Analytical Chemistry, 69 (1997) 236A-242A) shows that two techniques are commonly deployed, namely genetic algorithms (GAs) and simulated annealing (SA). Unfortunately, there have been few studies comparing the two methods, and consequently guidelines about when to use a GA or SA are far from complete. (In practice, the choice is often a matter of taste.) For example, when applying a GA and SA to best subset selection no clear winner emerged from the following studies:
- C.B. Lucasius, M.L.M. Beckers and G. Kateman
Genetic algorithms in wavelength selection: a comparative study
Analytica Chimica Acta, 286 (1994) 135-153
- U. Hörchner and J.H. Kalivas
Further investigation on a comparative study of simulated annealing and genetic algorithm for wavelength selection
Analytica Chimica Acta, 311 (1995) 1-13
The main problem with these methods is, that certain parameters must be fine-tuned, otherwise the search is not effective and convergence to the global optimum is unnecessarily slow. (Typical parameters are the 'mutation rate' for a GA and the 'cooling schedule' for SA.) We have found this fine-tuning to be relatively easy when applying SA to best subset selection, as illustrated by the following results.
|
|
Worked example
|
Top
|
|
First, we have investigated the convergence behaviour for the data set studied in the articles mentioned above. This data set consists of the UV spectra of four ribonucleic acids (RNAs) measured at 36 channels:
|
|
|
Figure OPT 2. Normalized UV spectra of the RNAs adenine (A), cytidine (C), guanine (G) and uracyl (U). The normalization emphasizes that the spectra of adenine and uracyl are highly overlapped, while the others have distinct features.
|
|
Finding the best subset amounts to searching among 236-1»7x1010 candidates. (The rather insignificant minus 1 is for the full data set, which was already evaluated.) The global optimum for this data set was known through an exhaustive search. The SA algorithm was run 10,000 times with a different random initialization and the number of evaluations required to hit the global optimum was monitored. In most cases, less than 10,000 evaluations were required - the mode lies around 700. It can be inferred from the theory of (semi-) Markov processes that this quantity is distributed as the convolution of a normal and a geometric distribution:
|
|
The quality of this fit was statistically validated by examining studentized residuals:
|
|
|
Figure OPT 4: Studentized residuals for the original frequency counts, calculated as the residual divided by a standard error. The required standard error follows from the multinomial distribution of binned data (independent runs).
|
|
The studentized residuals have a standard deviation 1.08, which suggests a satisfactory fit.
This exercise was repeated for a variety of data sets. To keep the computation time manageable, the distributions were generated for 100 independent runs. The performance of the algorithm was measured in terms of the median of the distributions, rather than the mean, because the distributions are heavily skewed (see Figure OPT 3). Plotting the median against the number of channels yields:
|
|
Comparing this plot with the number of combinations (see scale of Figure OPT 1) shows that the search has been very effective. Studies of John Kalivas and coworkers suggest that the current results can be representative for other selection problems (samples, factors).
|
|
References & further information
|
Top
|
|