| Goals: |
|
Along with determining the location of the center of data, it is important to know how spread out the data is. In this subsection, we will study two different terms used to describe how spread out data is in a sample. The first is called, sample variance.
Given a sample of data values, , the sample variance of the data is
where is the mean of the data.
If the data encompasses the entire population, we call it population variance, denoted as , and we replace in the formula above with just .
The role of sample variance is to estimate the population variance, . Ideally, if we have the entire population we would just calculate . Most of the time we only have a sample of the entire population.
Let’s calculate the sample variance of the following samples and compare them.
A random sample of 10 using in Excel.
We shall compute the variance using the expression given in definition of sample variance. The random sample we obtain is given in Figure 3.56 below.
In an adjacent column, we first find the mean .
Notice that the formula has for each data value . Using auto-fill in Excel, we create a column that computes this part of the expression for each data value. Notice that we call the mean in each expression.
Adding up this column and dividing by will result in the sample variance for the randomly generated data. (Note that and .)
A random sample of 100 using in Excel.
We use the command and leave it to the reader to verify the command is true. 1313Note that there is also a . This is for the case when your sample is actually the entire population.. Part of the generated random sample of 100 and the associated sample variance of all 100 data values is then shown in Figure 3.60.
Comparing (a) and (b), we notice that the sample variance is around 0.08. The random samples you obtain should also be somewhat close to 0.08 in sample variance.
Mathematically, variance is a great tool for describing how spread out the data in a sample is. However, in day-to-day talk, the variance has one flaw that is commonly missed. It is not in the same units of measurement as the original data! For example, if the original sample taken was in feet, then the sample variance would actually be in feet squared. This is due to the squaring in the expression in the definition for sample variance. A better way of expressing how far apart the data is in the same units of measurement as the data is when the sample standard deviation.
Given a sample of data values, , the sample standard deviation of the data is
where is the mean of the data.
If the data encompasses the entire population, we call it population variance, denoted as , and we replace in the formula above with just .
Really, sample standard deviation is just the square root of sample variance. The calculation can be quickly done.
Let’s compute the sample standard deviation of 10 randomly generated data values using in Excel by using the formula and by using the command .
Begin by generating the random sample of 10. For simplicity, suppose it is the same as that in Figure 3.56. We saw that the sample variance was 0.085070668. Taking the square root of this number, we obtain the sample standard deviation, as desired.
For a quicker computation, the command will also calculate the sample standard deviation. Be sure to select the original data when using the command. You should also obtain the same and as stated in Figure 3.61.
The larger the variance or standard deviation is, the more spread out the data is. Extreme outliers, therefore, will cause the variance and standard deviation to become larger even though most of the data may be concentrated in an region of the number line.
At what value is 25% of the data explained? Given a random sample, is it possible to determine a value for which 25% of the data in the sample is at or below this value? This is what quartiles will do for us.
Dividing a random sample up into quarters, the values for which 25%, 50% and 75% of the data is at or below are known as quartiles and are respectively denoted as , , and .
We have actually messed with a quartile before but didn’t label it as such. The median is actually a quartile. It is . This is because when calculating the median, we are looking to see what value for which 50% of the data is at or below it.
There are many different approaches to finding and . One approach is to find the median of the lower 50% of the data and the median of the upper 50%, excluding the median of the entire sample. This will then be and , respectively. Excel calculates the and a bit differently – a method beyond the scope of this class. We will rely on calculating the quartiles without using the commands in Excel.
Let’s generate a random sample of 20 data with and calculate , , and .
In Excel, we generate a random sample of 20 data. To find , , and we first want to organize the data from lowest to highest in value, as seen in Figure 3.62
is the median. This can easily be calculated with the command . The authors’ obtain .
Since there are 20 data values, then the median of the entire sample is between the tenth and eleventh data value, if ordered from lowest to highest. Therefore, the lower 50% of the sample will be the first ten data values. Using again, but only on these ten values, we can obtain . The authors’ obtain . Following a similar approach for the upper 50%, we have . To summarize:
| , | , | . |
We define
The interquartile range describes how wide the middle 50% of the data is. Commonly, interquartile range is used to determine the presence of outliers. The following guideline is used to determine if a data value is an outlier. Let be a data value from a random sample.
| If , then is an outlier |
| If , then is an outlier |
Remark: The table above is used as a guideline. If a data value is ruled as an outlier, it need not be removed from the data set. More information and analysis is needed to move forward and delete data from a sample.
Let’s determine if the following data has any outliers.
| 1.3 | 18.3 | 40.3 | 56.7 |
|---|---|---|---|
| 4.2 | 20.1 | 42.4 | 57.8 |
| 7.7 | 21.0 | 42.9 | 60.8 |
| 8.7 | 22.8 | 43.7 | 67.3 |
| 11.5 | 28.3 | 44.7 | 75.3 |
| 11.5 | 29.9 | 46.2 | 81.3 |
| 13.5 | 30.5 | 47.2 | 98.2 |
| 14.5 | 32.5 | 50.7 | 15.0 |
| 53.3 | 35.4 | 20.2 | 20.5 |
Enter the data into Excel. We need to arrange the data from lowest to highest to determine and . Doing so, we found that and . Therefore, we have that
It follows that
and
In Excel, it is easy to see if we have any data that exceeds these values since we already have the data sorted. As a result, you should notice that 98.2 exceeds the guidelines. We would then consider this data value an outlier. However, due to how close it is to the other data along with the importance (not known here) of the data value, we may not delete it from our sample of data.
Answer each of the following statements as True or False.
There is no difference in formula between population variance and sample variance.
Sample standard deviation and sample variance are only a square root operation away from one another.
The variance of data can be described in the same units of measurement that the data is in.
has no relation to the median whatsoever.
If a data value is deemed an outlier, then should always be removed from the data set.
Without using the commands and , create a spreadsheet that will compute the sample standard deviation and sample variance of a sample of 7 data values. Use it to quickly compute the sample standard deviation and sample variance of the following data.
24.4, 15.5, 87.4, 73.4, 22.2, 81.0, 86.4
2, 1, 3, 3, 3, 4, 3
22.0, 21.1, 12.7, 12.0, 23.2, 15.9, 16.0
1.0, 0.3, 0.6, 1.4, 0.4, 0.9, 1.3
Using the command , generate a random sample of 900 data values. Compute the sample standard deviation, , and sample variance, . How do your values compare knowing that and ?
Using the command , generate a random sample of 900 data values. Predict what the population standard deviation and population variance ought to be.
For the following data, calculate , , and .
14, 52, 66, 87, 99, 11, 1, 2, 3, 4, 5, 6, 7, 8, 26, 32, 32, 14, 88
6, 18, 3, 6, 2, 5, 6, 8, 21, 3, 5, 10, 26, 19, 6, 13, 63, 19, 41, 91
1,2, 2, 2, 3, 3, 4, 5, 5, 6, 8, 12, 19, 19, 21, 22, 35, 50, 61, 83
Using the data in the previous example, determine if there are any outliers based on the guidelines stated in this section.