Statistics in Analytical Chemistry: Part 8—Calibration Diagnostics

The past seven articles have provided the fundamentals of calibration, including a protocol for designing a calibration study. Once such a study has been performed (i.e., the chosen standards have been analyzed in replicate), the data are available for constructing a calibration curve. (See installments 5 and 6 in American Laboratory June 2003 and July 2003, respectively, for details on calibration design.) It is tempting simply to fit a straight line (SL) using ordinary least squares (OLS) and to evaluate the SL model using only the correlation coefficient, or its square, R². Note that R² represents the proportion of the variation in the response (y-value) that is “explained” by the calibration. If R² is high enough (typically 0.99 or better), the curve is deemed adequate.

However, more rigorous testing should be done to ensure that the selected model and fitting technique are adequate. Statistics can provide the necessary tools for this task. This article is the first of three that will discuss the process in detail. The evaluation involves seven basic steps, all of which are easy to perform with the help of statistical software. The steps are:

  • Plot response versus true concentration
  • Determine the behavior of the standard deviation of the response
  • Fit the proposed model and evaluate the adjusted R² (R²adj)
  • Examine the residuals for nonrandomness
  • Evaluate the p-value for the slope (and any higher-order terms)
  • Perform a lack-of-fit test
  • Plot and evaluate the prediction interval.

Before these steps are discussed in detail, the concept of p-value (“p” stands for “probability”) must be introduced. This statistic will be used throughout the diagnosis process to guide decisions about plausible hypotheses. Most statistical evaluations of data involve hypothesis testing. A proposed assumption (called the null hypothesis) is made for the data; a contrasting assumption is called the alternative hypothesis. When the statistical test is performed, the question being asked is, “What is the probability of getting this set of data (or data at least this unusual, as defined by the alternative hypothesis), purely by chance, if the null (or starting) hypothesis is true?” The p-value is that probability. If the p-value is low (typically less than 1% or 5%, depending on the test being used), then the null hypothesis is rejected in favor of the alternative. Note that a statistical test can never prove a hypothesis to be true or false; one can only decide based on the weight of the evidence, much as in a court of law.
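The question that a p-value answers can be made concrete with a small sketch. This is not the calculation that statistical software performs for the tests in this series (those p-values come from reference distributions such as Student's t); an exact permutation test is swapped in here only because it shows the "purely by chance" idea directly. The data and the function names are hypothetical. The null hypothesis is that the response has no linear relationship with concentration, so every reordering of the responses is equally likely.

```python
import itertools
import statistics

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def permutation_p_value(x, y):
    """Fraction of all reorderings of y whose |slope| is at least the observed |slope|."""
    observed = abs(ols_slope(x, y))
    hits = 0
    total = 0
    for shuffled in itertools.permutations(y):
        total += 1
        # Small tolerance so that ties are not lost to floating-point rounding.
        if abs(ols_slope(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical data (illustration only): one set with a clear trend, one without.
x = [1, 2, 3, 4, 5, 6]
y_trend = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
y_flat = [5.0, 4.8, 5.3, 4.9, 5.2, 5.0]

p_trend = permutation_p_value(x, y_trend)
p_flat = permutation_p_value(x, y_flat)
print(f"p-value with trend:    {p_trend:.4f}")
print(f"p-value without trend: {p_flat:.4f}")
```

A small p-value for the first data set says that a slope this steep would rarely arise by chance if the null hypothesis were true, so the null hypothesis is rejected. The second data set gives no such evidence.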

Step 1: Plot response versus true concentration

As might be expected, the first step in constructing a calibration curve is to plot the response data versus the respective true concentrations. Even without having a model fitted to the points, this plot is usually informative. Any data that are possible outliers (or typos in the data table) may be obvious, and a high degree of curvature may also be detectable. Suspect points should be investigated to see if they are correct and if they belong with the rest of the data set. However, “errant” (or inconvenient) data should not be permanently excluded unless a sound physical reason can be found for taking such action. As much as possible, any calibration experiment should capture the variability that will likely enter into the typical analysis process. (The topic of outliers will be discussed in more detail in later installments.)

Step 2: Determine the behavior of the standard deviation of the response

The importance of this second step cannot be overemphasized. If the computed standard deviation (SD) of the responses changes systematically (e.g., increases or decreases) with concentration, then ordinary least squares is not the appropriate fitting technique. Recall from installment 3 (January 2003) that one of the assumptions behind OLS is that, “The standard deviation of the responses does not change over the range of x values for which the model will be applied. However, in analytical chemistry, this assumption does not always hold; the variability of the response will often increase with increasing concentration.”

To perform this behavior analysis, the SD of the responses is computed separately at each concentration (hence, true replicates must be run). These values are plotted versus true concentration. A straight line (using, for example, OLS) is fitted through the points, resulting in an equation of the form (g + hx), where g is the intercept and h is the slope. Note that in this series of articles, the (a + bx) form of the straight-line equation will be reserved for situations in which the instrumental response is being plotted versus true concentration. For modeling standard deviations, (g + hx) will be used.

Associated with the slope (h) is a p-value, which allows the analyst to decide if the slope is significant. In general, if this p-value is less than 1% (possibly reported by software as 0.01), then the slope is considered to be significant and the SD is declared to change with concentration.
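The SD analysis described above might be sketched as follows. The replicate data are invented for illustration, and the function names (fit_line, t_two_sided_p) are hypothetical. Statistical software reports the slope p-value directly; here it is approximated by numerically integrating the Student's t density so that the sketch needs only the Python standard library.

```python
import math
import statistics

# Hypothetical replicate responses keyed by true concentration (illustration only).
replicates = {
    1.0: [9.9, 10.0, 10.1, 10.0],
    2.0: [14.8, 15.1, 15.1, 15.0],
    5.0: [29.7, 30.3, 30.1, 29.9],
    10.0: [54.4, 55.5, 55.2, 54.9],
    20.0: [104.0, 106.1, 105.3, 104.6],
}

def fit_line(x, y):
    """OLS fit of y = g + h*x; returns g, h, and the standard error of h."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    h = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    g = my - h * mx
    sse = sum((yi - (g + h * xi)) ** 2 for xi, yi in zip(x, y))
    se_h = math.sqrt(sse / (n - 2) / sxx)
    return g, h, se_h

def t_two_sided_p(t, df, steps=20000):
    """Two-sided p-value for Student's t, via Simpson's rule on the density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(t)
    step = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * step)
    central = 2 * total * step / 3  # probability that -|t| < T < |t|
    return max(0.0, 1.0 - central)

# SD of the responses, computed separately at each concentration.
conc = sorted(replicates)
sds = [statistics.stdev(replicates[c]) for c in conc]

# Straight line through the SD values: SD = g + h*x.
g, h, se_h = fit_line(conc, sds)
p_value = t_two_sided_p(h / se_h, df=len(conc) - 2)

print(f"SD model: {g:.4f} + {h:.4f}x, slope p-value = {p_value:.2g}")
if p_value < 0.01:
    print("Slope is significant: SD changes with concentration.")
else:
    print("No significant slope: constant SD is plausible.")
```

The sketch assumes the fitted line is not a perfect fit to the SD values; a perfect fit would give a zero standard error and an undefined t statistic.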

Because OLS assumes a constant SD, all data points are allowed to influence the regression line equally; in other words, each point carries a “weight” of 1. However, if some data are noisier than others (i.e., the standard deviation is not constant), then the more variable points should not be allowed to have as much influence. To incorporate this nonconstant SD into the regression process, a generalization of OLS is used instead: weighted least squares (WLS).
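The difference between the two fitting techniques can be written compactly. The notation below is an assumption for illustration (n data points with concentrations x_i, responses y_i, and weights w_i); it is the standard textbook statement of the two criteria rather than a formula taken from this series.

```latex
\text{OLS:}\quad \min_{a,\,b}\ \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2
\qquad\qquad
\text{WLS:}\quad \min_{a,\,b}\ \sum_{i=1}^{n} w_i \left( y_i - a - b x_i \right)^2
```

Setting every w_i equal to 1 reduces the WLS criterion to the OLS criterion, which is why WLS is described as a generalization of OLS.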

As the name indicates, weighted least squares involves weights. Various formulas have been used by different authors to calculate these weights. However, a robust procedure uses the formula that results from fitting an SL to the SD data (see above discussion). The basic equation for the weight is the reciprocal of the square of the estimated standard deviation:

weight at x = (g + hx)^{-2}

To enable the calculation of root mean square error in original response units, this weight should be normalized by dividing by the mean of all the reciprocal-squared values:

normalized weight at x = (g + hx)^{-2} / Avg[(g + hx)^{-2}]

This formula is evaluated for each concentration and applied to the corresponding data to “weight” them. The result is that the noisy responses have less influence on the calibration curve than do the precise values.
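The weighting procedure can be sketched as below. The values of g and h are hypothetical stand-ins for the coefficients obtained from the SD fit, the calibration data are invented for illustration, and the function names are not from the original articles. The closed-form weighted straight-line fit used here is the standard one; the average in the normalization is taken over all data points.

```python
import statistics

def normalized_weights(x_values, g, h):
    """Weights (g + h*x)^-2, divided by their mean so that they average to 1.

    Assumes the modeled SD, g + h*x, is positive at every x.
    """
    raw = [(g + h * x) ** -2 for x in x_values]
    mean_raw = statistics.fmean(raw)
    return [w / mean_raw for w in raw]

def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    a = yw - b * xw
    return a, b

# Hypothetical calibration data: four replicates at each of five concentrations.
x = [1.0] * 4 + [2.0] * 4 + [5.0] * 4 + [10.0] * 4 + [20.0] * 4
y = [9.9, 10.0, 10.1, 10.0,
     14.8, 15.1, 15.1, 15.0,
     29.7, 30.3, 30.1, 29.9,
     54.4, 55.5, 55.2, 54.9,
     104.0, 106.1, 105.3, 104.6]

# Hypothetical SD-model coefficients (SD = g + h*x), assumed for illustration.
g_sd, h_sd = 0.045, 0.043

w = normalized_weights(x, g_sd, h_sd)
a, b = wls_line(x, y, w)

print(f"weight at lowest concentration:  {w[0]:.3f}")
print(f"weight at highest concentration: {w[-1]:.3f}")
print(f"WLS calibration line: {a:.3f} + {b:.3f}x")
```

Because the modeled SD grows with concentration in this example, the responses at the highest concentration receive the smallest weights.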

While step 2 may seem a computational annoyance, ignoring an SD that changes with concentration will have two negative results. First, the calibration model’s coefficient estimates (a and b) will be noisy. Second (and possibly more important), the prediction interval will be too wide in the well-behaved-data region and too narrow in the noisy-data region.

Steps 3 through 7 will be discussed in the next two installments. Following the details of calibration diagnostics will be a series of articles using these procedures to diagnose real calibration data sets.

Mr. Coleman is an Applied Statistician, Alcoa Technical Center, MST-C, 100 Technical Dr., Alcoa Center, PA 15069, U.S.A. Ms. Vanatta is an Analytical Chemist, Air Liquide-Balazs™ Analytical Services, Box 650311, MS 301, Dallas, TX 75265, U.S.A.; tel: 972-995-7541; fax: 972-995-3204.