Statistics in the Laboratory: Standard Deviation of the Mean

In the last column we discussed the use of pooling to get a better estimate of the standard deviation of the measurement method, essentially the standard deviation of the raw data. But as the last column implied, most of the time individual measurements are averaged and decisions must take into account another standard deviation, the standard deviation of the mean, sometimes called the “standard error” of the mean. It’s helpful to explore this statistic in more detail: first, to understand why statisticians often recommend a “sledgehammer” approach to data collection methods; and, second, to see that there might be a better alternative to this crude tactic. We’ll also see how to answer the question, “How big should my sample size be?”

For the next few columns, we need to discuss in more detail the ways statisticians do their theoretical work and the ways we use their results.

I often say that theoretical statisticians live on another planet (they don’t, of course, but let’s say Saturn), while those of us who apply their results live on Earth. Why do I say that? Because a lot of theoretical statistics makes the unrealistic assumption that there is an infinite amount of data available to us (statisticians call it an infinite population of data). When we have to pay for each measurement, that’s a laughable assumption. We’re often delighted if we have a random sample of that data, perhaps as many as three replicate measurements from which we can calculate a mean.

That last sentence contains a telling phrase: “a random sample of that data.” Statisticians imagine that the infinite population of data contains all possible values we might get when we make measurements. Statisticians view our results as a random draw from that infinite population of possible results that have been sitting there waiting for us. If we were to make another set of measurements on the same sample, we’d get a different set of results. That doesn’t surprise the statisticians (and it shouldn’t surprise us if we adopt their view)—it’s just another random draw of all the results that are just waiting to appear.

On Saturn they talk about a mean, but they call it a “true” mean. They don’t intend to imply that they have a pipeline to the National Institute of Standards and Technology and thus know the absolutely correct value for what the mean represents. When they call it a “true mean,” they’re just saying that it’s based on the infinite amount of data in the population, that’s all.

Statisticians generally use Greek letters for true values—μ for a true mean, σ for a true standard deviation, δ for a true difference, etc.

The technical name for these descriptors (μ, σ, δ) is parameters. You’ve probably been casual about your use of this word, employing it to refer to, say, the pH you’re varying in your experiments, or the yield you get from those experiments, or maybe even constraints (“We have to stay within our budgetary parameters”). You can’t be sloppy like that when you work with a statistician: the word parameter has a very strict meaning.

Because parameters are based on an infinite amount of data, there is no uncertainty in their values. (We’ll see why in a minute.)

So, you’re saying to yourself, “I’m confused. And why would I even worry about what to call these things if I don’t have that infinite amount of data and can’t calculate them, anyway?”

Good point. Here’s a key thing, though. Even though we’ll never know the values of these parameters, we can still use a limited sample of data to guess at their true values. It’s a process called estimation, so the results are called parameter estimates, also called sample statistics.

We use a Roman letter to represent individual measurements (e.g., x1 = 3.6), and we put a “bar” above the letter when we want to indicate an arithmetic average (a mean). For example, if x2 = 4.8 and x3 = 4.5, we would write the mean of x1 through x3 as x̄ = 4.3. Thus, we say that the statistic x̄ is an estimate of the parameter μ. Because there is uncertainty in the measured values that have been “drawn from the population at random,” there is uncertainty in these parameter estimates.

Backing up a bit, how do we measure the uncertainty in measured values? As we discussed in the last column, the estimate s of the true standard deviation σ is given by the familiar equation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1}}$$

where the Greek capital letter sigma (Σ) is the summation operator, and its index i indicates the measurement number from 1 to n. For x1 through x3, s ≈ 0.62.
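Written out for the three example values above, the calculation is:

$$\bar{x} = \frac{3.6 + 4.8 + 4.5}{3} = 4.3$$

$$s = \sqrt{\frac{(3.6 - 4.3)^2 + (4.8 - 4.3)^2 + (4.5 - 4.3)^2}{3 - 1}} = \sqrt{\frac{0.78}{2}} \approx 0.62$$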

Now, let’s go to Saturn for a few minutes. On Saturn we can play with the infinite population of data. Let’s suppose that for the measurements we’ve been making, μ = 4.76 (exactly) and σ = 0.30 (exactly). The estimate s ≈ 0.62 seems a bit high in comparison to σ = 0.30, but parameter estimates can be quite variable when n is small (and to a statistician n = 3 is small), so it isn’t anything to worry about.

We won’t live long enough to look at all of the data in the infinite population, so let’s look at only one million pieces of data and say that’s representative enough. The Gaussian distribution in Figure 1 was obtained by drawing at random one million data points (statistical samples of size n = 1) from the infinite population with μ = 4.76 and σ = 0.30. The data have been “binned” to generate the “histogram distribution” shown in Figure 1. The bin size is 0.04 on the horizontal axis. There are 100 bins from 3 to 7. If a data point had a value between 4.00 and 4.04, for example, it would be placed in bin number 26. The height of each contiguous histogram bar represents the number of data points that end up in that bin. Note that the mean of the one million data points is 4.760 (to three decimal places), and the standard deviation of the one million data points is 0.300 (to three decimal places). Figure 1 is what we would expect to see for the individual measurements. No surprises here.

Figure 1 – The distribution of 1,000,000 individual pieces of data (n = 1) drawn at random from an infinite population with μ = 4.76 and σ = 0.30. See text for discussion.

Figure 2 is a little bit different. For this figure, instead of pulling out one data point at a time, we pulled out two data points at a time and binned their means. So, Figure 2 is based on two million data points, or one million means for which n = 2. The “grand mean,” the “average of the averages” (represented by the symbol x with two bars above it), is equal to 4.760, as expected, but now we see that the “standard deviation of the means” sx̄ = 0.212, less than 0.30. Interesting.

Figure 2 – Yellow: the distribution of 1,000,000 means, each estimated from two pieces of data (n = 2) drawn at random from an infinite population with μ = 4.76 and σ = 0.30. Green in background: the underlying distribution of raw data. See text for discussion.

For Figure 3 we pulled out four data points at a time and binned their means. The grand mean is again 4.760, but sx̄ = 0.15, exactly half of σ = 0.30 for the raw data. What’s going on here?

Figure 3 – Yellow: the distribution of 1,000,000 means, each estimated from four pieces of data (n = 4) drawn at random from an infinite population with μ = 4.76 and σ = 0.30. Green in background: the underlying distribution of raw data. See text for discussion.
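If you’d like to reproduce Figures 1–3 numerically, here is a minimal simulation sketch (it assumes NumPy is available; the random seed and variable names are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed, for reproducibility only
mu, sigma = 4.76, 0.30                # "true" parameters of the infinite population
n_means = 1_000_000                   # one million means per histogram, as in the figures

for n in (1, 2, 4):                   # sample sizes used in Figures 1, 2, and 3
    # Draw n_means random samples of size n and average each one.
    means = rng.normal(mu, sigma, size=(n_means, n)).mean(axis=1)
    print(f"n = {n}: grand mean = {means.mean():.3f}, "
          f"standard deviation of the means = {means.std(ddof=1):.3f}")

# Typical output: standard deviations of the means of about 0.300, 0.212, and 0.150.
```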

When data points are averaged, the negative deviations of some of the data points cancel the positive deviations of other data points. Thus, the estimated means tend to be closer to the true mean and therefore exhibit less variability than the raw data. The relationship between sx̄, s, and n is a “reciprocal square-root” function, the statistician’s “one-over-the-square-root-of-n” effect:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Clearly, as n increases, the uncertainty in the mean decreases.
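Plugging in the simulated value 0.300 for the raw-data standard deviation reproduces the numbers we saw earlier:

$$\frac{0.300}{\sqrt{2}} \approx 0.212 \qquad \text{and} \qquad \frac{0.300}{\sqrt{4}} = 0.150,$$

matching the standard deviations of the means in Figures 2 and 3.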

This relationship holds on Saturn, as well, and shows why on Saturn there is no uncertainty in the mean; if n = ∞, then σx̄ = 0:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sigma}{\sqrt{\infty}} = 0$$

This equation can be rearranged to show in general how the ratio of the standard deviation of the mean to the standard deviation of the raw data decreases as 1/√n:

$$\frac{\sigma_{\bar{x}}}{\sigma} = \frac{1}{\sqrt{n}}$$

Figure 4 illustrates this 1/√n effect. Clearly, as n increases, σx̄ decreases. Doing a few replicates can reduce the uncertainty in the mean by quite a bit. For example, when n = 4, σx̄ is decreased by a factor of 2. But to decrease σx̄ by another factor of 2, the number of experiments must be quadrupled to 16. Clearly, as the number of replicates is increased, the marginal improvement in σx̄ decreases. Stated differently, the first few replicates give a lot of bang for the buck; after that, it gets more and more expensive to decrease σx̄.

Figure 4 – Illustration of the “one-over-the-square-root-of-n” effect. The ratio σx̄/σ decreases as 1/√n.
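In numbers, each further halving of the ratio σx̄/σ costs a quadrupling of n:

$$\frac{1}{\sqrt{1}} = 1.000, \quad \frac{1}{\sqrt{4}} = 0.500, \quad \frac{1}{\sqrt{16}} = 0.250, \quad \frac{1}{\sqrt{64}} = 0.125$$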

Many researchers want to know how big their sample size should be (a legitimate request). Suppose a researcher asks a statistician this question, expecting to get a simple answer: e.g., n = 3. Instead, the statistician turns around and silently walks off in disgust. Why do statisticians behave this way? Because they know there is no simple answer to this question, and they’re going to have to work with the researcher to try to get information that the researcher might not have. Experience has taught them that the best time to get out of a bad deal is at the beginning. They don’t want to go through this excruciating process again.

The researcher might have a pooled estimate of σ for the measurement process, but the researcher’s mean is probably going to be used to make a decision. The question then becomes, “How uncertain can the reported mean be and still make a good decision?” That is, how small does σx̄ have to be? It’s my opinion that because of the ways companies compartmentalize their functions, the researcher making the measurements is often not aware of this last piece of information. It then becomes the statistician’s task to move across the company to discover this piece of information so the sample size can be determined. If you know σ and σx̄, you can calculate the sample size n yourself. At this point, you don’t need the statistician.
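Rearranging the one-over-the-square-root-of-n relationship gives the sample-size formula directly:

$$n = \left(\frac{\sigma}{\sigma_{\bar{x}}}\right)^{2}$$

Here, σ is what the measurement method delivers and σx̄ is what the decision requires.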

Here’s an example. The percentage of toluene in 500 chemical samples of gasoline is to be estimated by making multiple gas chromatographic measurements for each gasoline sample and using the sample mean as an estimate of the toluene percentage. Each measurement costs $50. Previous experience has indicated that individual measurements have a standard deviation of 0.10% toluene (this is σ, the method standard deviation). However, the client requires a standard deviation of 0.025% toluene (this will be σx̄). How big should your sample size be?

You can almost calculate n in your head. If the ratio of σx̄ to σ is 0.025%/0.10% = 1/4, then √n = 4 and n = 16. You must make 16 replicate measurements on each of the 500 chemical samples for a total of 8,000 measurements. But this will cost $400,000. Your client is going to balk at this. They’ll ask, “Isn’t there a cheaper way to get the results we need?”
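Here is that arithmetic as a short script (a sketch using only the numbers given above; the variable names are mine):

```python
import math

sigma = 0.10        # method standard deviation (% toluene), from prior experience
sigma_xbar = 0.025  # required standard deviation of the mean (% toluene)
cost_per_meas = 50  # dollars per GC measurement
n_gasoline = 500    # gasoline samples to be characterized

n_exact = (sigma / sigma_xbar) ** 2       # = 16.0 replicates per sample
n = math.ceil(n_exact)                    # round up to a whole number of replicates
total_measurements = n * n_gasoline
total_cost = total_measurements * cost_per_meas

print(n, total_measurements, total_cost)  # 16 8000 400000
```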

Of course there’s a cheaper way. To get there, let’s look at an assumption statisticians usually make when they solve sample size questions like this. They assume σ is what it is, and it can’t be changed. They then apply the 1/√n sledgehammer to come up with a sample size, as we did above.

But statisticians are often wrong about their assumption, and σ can be changed. Suppose we bought a better chromatograph that gave measurements with σ = 0.025% toluene instead of 0.10% toluene. With that new chromatograph, the calculated sample size would be n = 1. Only 500 measurements would be needed, and the cost of running the samples would be only $25,000.

Figure 5 – An illustration of financial considerations when deciding whether or not to use a more precise measurement method. See text for discussion.

Figure 5 illustrates the idea. Suppose you start out making 16 measurements per sample ($800/sample) using the old chromatograph and you suddenly realize you could save money if you bought a better chromatograph. By the time you’ve finished your 100th sample (1,600 measurements up to this point, an integrated COST of $80,000), you’ve put together the funding (the upper yellow rectangle in the figure, $90,000) and the new chromatograph you’ve purchased has just arrived. Starting with sample 101 you use the new chromatograph and start saving 15 measurements × $50 per measurement = $750 per sample, which you can use to recover the $90,000 cost of the new chromatograph over samples 101 through 220 (the yellow rectangle labeled RECOVER). After that, it’s pure SAVINGS, spending only $50 per sample rather than $800 per sample (the green rectangle). The total cost of the project (red area) will be $190,000 ($90,000 for the chromatograph and $100,000 for the measurements). This is a lot better than the $400,000 it was going to cost. (The total cost would have been only $115,000 if you’d realized the benefits of a better chromatograph at the beginning of the project.)
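The dollar figures in Figure 5 can be checked with a few lines of arithmetic (again a sketch, using only the numbers quoted above; the variable names are mine):

```python
cost_per_meas = 50                     # dollars per GC measurement
cost_old = 16 * cost_per_meas          # $800 per sample with the old chromatograph
cost_new = 1 * cost_per_meas           # $50 per sample with the new chromatograph
instrument = 90_000                    # price of the better chromatograph
n_samples, switch_after = 500, 100     # total samples; samples run before the switch

total_as_described = (instrument
                      + switch_after * cost_old
                      + (n_samples - switch_after) * cost_new)
total_switch_at_start = instrument + n_samples * cost_new
total_no_switch = n_samples * cost_old

print(total_as_described, total_switch_at_start, total_no_switch)
# 190000 115000 400000

# Payback: $750 saved per sample recovers the $90,000 instrument in
print(instrument // (cost_old - cost_new))  # 120 samples (samples 101 through 220)
```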

Don’t try to do with statistics what you can do cheaper with an improved measurement method. The 1/√n sledgehammer isn’t always the best way to solve sample size problems.

In conclusion: a) σx̄ is important for most decision-making, b) you can make σx̄ as small as you want by using a large enough sample size, c) you can calculate your sample size yourself, and d) sometimes it’s less expensive to make σx̄ small just by using a better measurement method with a smaller σ.

In the next column we’ll see how σx̄ can be used to calculate a confidence interval for the mean.

Stanley N. Deming, Ph.D., is an analytical chemist masquerading as a statistician at Statistical Designs, El Paso, Texas, U.S.A.; e-mail: [email protected]; www.statisticaldesigns.com