3 Summary Statistics in Excel

3.5 How Spread Out is the Quantitative Data?

Goals:
Know the different measurements of spread: Variance, Standard Deviation, and Quartiles;
Understand the difference between computing a population statistic or a sample statistic;

3.5.1 Variance and Standard Deviation

Along with determining the location of the center of data, it is important to know how spread out the data is. In this subsection, we will study two different terms used to describe how spread out data is in a sample. The first is called, sample variance.

Definition (Variance).

Given a sample of n data values, x1,x2,,xn, the sample variance of the data is

s2=(x1-x¯)2+(x2-x¯)2+(xn-x¯)2n-1=1n-1i=1n(xi-x¯)2,

where x¯ is the mean of the data.

If the data encompasses the entire population, we call it population variance, denoted as σ2, and we replace n-1 in the formula above with just n.

The role of sample variance is to estimate the population variance, σ2. Ideally, if we have the entire population we would just calculate σ2. Most of the time we only have a sample of the entire population.

Example 3.5.1.

Let’s calculate the sample variance of the following samples and compare them.

  • (a)

    A random sample of 10 using 𝚁𝙰𝙽𝙳() in Excel.

    We shall compute the variance using the expression given in definition of sample variance. The random sample we obtain is given in Figure 3.56 below.

    Figure 3.56:

    In an adjacent column, we first find the mean x¯.

    Figure 3.57:

    Notice that the formula has (xi-x¯)2 for each data value xi. Using auto-fill in Excel, we create a column that computes this part of the expression for each data value. Notice that we call the mean x¯ in each expression.

    Figure 3.58:

    Adding up this column and dividing by 9 will result in the sample variance for the randomly generated data. (Note that n=10 and n-1=9.)

    Figure 3.59:
  • (b)

    A random sample of 100 using 𝚁𝙰𝙽𝙳() in Excel.

    We use the command 𝚅𝙰𝚁.𝚂() and leave it to the reader to verify the command is true. 1313Note that there is also a 𝚅𝙰𝚁.𝙿(). This is for the case when your sample is actually the entire population.. Part of the generated random sample of 100 and the associated sample variance of all 100 data values is then shown in Figure 3.60.

    Figure 3.60:

    Comparing (a) and (b), we notice that the sample variance is around 0.08. The random samples you obtain should also be somewhat close to 0.08 in sample variance.

Mathematically, variance is a great tool for describing how spread out the data in a sample is. However, in day-to-day talk, the variance has one flaw that is commonly missed. It is not in the same units of measurement as the original data! For example, if the original sample taken was in feet, then the sample variance would actually be in feet squared. This is due to the squaring in the expression in the definition for sample variance. A better way of expressing how far apart the data is in the same units of measurement as the data is when the sample standard deviation.

Definition (Standard Deviation).

Given a sample of n data values, x1,x2,,xn, the sample standard deviation of the data is

s=(x1-x¯)2+(x2-x¯)2+(xn-x¯)2n-1=1n-1i=1n(xi-x¯)2,

where x¯ is the mean of the data.

If the data encompasses the entire population, we call it population variance, denoted as σ, and we replace n-1 in the formula above with just n.

Really, sample standard deviation is just the square root of sample variance. The calculation can be quickly done.

Example 3.5.2.

Let’s compute the sample standard deviation of 10 randomly generated data values using 𝚁𝙰𝙽𝙳() in Excel by using the formula and by using the command 𝚂𝚃𝙳𝙴𝚅.𝚂().

Begin by generating the random sample of 10. For simplicity, suppose it is the same as that in Figure 3.56. We saw that the sample variance was 0.085070668. Taking the square root of this number, we obtain the sample standard deviation, as desired.

Figure 3.61:

For a quicker computation, the command 𝚂𝚃𝙳𝙴𝚅.𝚂() will also calculate the sample standard deviation. Be sure to select the original data when using the command. You should also obtain the same and as stated in Figure 3.61.

The larger the variance or standard deviation is, the more spread out the data is. Extreme outliers, therefore, will cause the variance and standard deviation to become larger even though most of the data may be concentrated in an region of the number line.

3.5.2 Quartiles

At what value is 25% of the data explained? Given a random sample, is it possible to determine a value for which 25% of the data in the sample is at or below this value? This is what quartiles will do for us.

Definition (Quartiles).

Dividing a random sample up into quarters, the values for which 25%, 50% and 75% of the data is at or below are known as quartiles and are respectively denoted as Q1, Q2, and Q3.

We have actually messed with a quartile before but didn’t label it as such. The median is actually a quartile. It is Q2. This is because when calculating the median, we are looking to see what value for which 50% of the data is at or below it.

There are many different approaches to finding Q1 and Q3. One approach is to find the median of the lower 50% of the data and the median of the upper 50%, excluding the median of the entire sample. This will then be Q1 and Q3, respectively. Excel calculates the Q1 and Q3 a bit differently – a method beyond the scope of this class. We will rely on calculating the quartiles without using the commands in Excel.

Example 3.5.3.

Let’s generate a random sample of 20 data with 𝚁𝙰𝙽𝙳() and calculate Q1, Q2, and Q3.

In Excel, we generate a random sample of 20 data. To find Q1, Q2, and Q3 we first want to organize the data from lowest to highest in value, as seen in Figure 3.62

Figure 3.62: Sample of 20 Data

Q2 is the median. This can easily be calculated with the command 𝙼𝙴𝙳𝙸𝙰𝙽(). The authors’ obtain Q2=0.416114567.

Since there are 20 data values, then the median of the entire sample is between the tenth and eleventh data value, if ordered from lowest to highest. Therefore, the lower 50% of the sample will be the first ten data values. Using 𝙼𝙴𝙳𝙸𝙰𝙽() again, but only on these ten values, we can obtain Q1. The authors’ obtain Q1=0.18474058. Following a similar approach for the upper 50%, we have Q3=0.551750463. To summarize:

Q1=0.18474058, Q2=0.416114567, Q3=0.551750463.

Interquartile Range

Definition (Interquartile Range).

We define

interquartile range (IQR)=Q3-Q1.

The interquartile range describes how wide the middle 50% of the data is. Commonly, interquartile range is used to determine the presence of outliers. The following guideline is used to determine if a data value is an outlier. Let x be a data value from a random sample.

If x<Q1-1.5IQR, then x is an outlier
If x>Q3+1.5IQR, then x is an outlier

Remark: The table above is used as a guideline. If a data value is ruled as an outlier, it need not be removed from the data set. More information and analysis is needed to move forward and delete data from a sample.

Example 3.5.4.

Let’s determine if the following data has any outliers.

1.3 18.3 40.3 56.7
4.2 20.1 42.4 57.8
7.7 21.0 42.9 60.8
8.7 22.8 43.7 67.3
11.5 28.3 44.7 75.3
11.5 29.9 46.2 81.3
13.5 30.5 47.2 98.2
14.5 32.5 50.7 15.0
53.3 35.4 20.2 20.5

Enter the data into Excel. We need to arrange the data from lowest to highest to determine Q1 and Q3. Doing so, we found that Q1=16.7 and Q3=49.0. Therefore, we have that

IQR=Q3-Q1=49.0-16.7=32.3.

It follows that

Q1-1.5IQR=-31.72775

and

Q3+1.5IQR=97.3719394.

In Excel, it is easy to see if we have any data that exceeds these values since we already have the data sorted. As a result, you should notice that 98.2 exceeds the guidelines. We would then consider this data value an outlier. However, due to how close it is to the other data along with the importance (not known here) of the data value, we may not delete it from our sample of data.

Concepts Check: 1. Find the sample standard deviation and sample variance of the data 23.1, 23.2, 23.3, 23.4, 23.5. Answer: s=0.158113883, s2=0.025 2. Find the quartiles of the data 753.2, 55, 58, 62, 73. Answer: Q1=54.1, Q2=58, Q3=67.5 3. Determine if the data in #2 above has any potential outliers. Answer: The value 753.2 can be deemed an outlier.

3.5.3 Exercises

  1. 1.

    Answer each of the following statements as True or False.

    1. (a)

      There is no difference in formula between population variance and sample variance.

    2. (b)

      Sample standard deviation and sample variance are only a square root operation away from one another.

    3. (c)

      The variance of data can be described in the same units of measurement that the data is in.

    4. (d)

      Q2 has no relation to the median whatsoever.

    5. (e)

      If a data value x is deemed an outlier, then x should always be removed from the data set.

  2. 2.

    Without using the commands 𝚂𝚃𝙳𝙴𝚅.𝚂 and 𝚅𝙰𝚁.𝚂, create a spreadsheet that will compute the sample standard deviation and sample variance of a sample of 7 data values. Use it to quickly compute the sample standard deviation and sample variance of the following data.

    1. (a)

      24.4, 15.5, 87.4, 73.4, 22.2, 81.0, 86.4

    2. (b)

      2, 1, 3, 3, 3, 4, 3

    3. (c)

      22.0, 21.1, 12.7, 12.0, 23.2, 15.9, 16.0

    4. (d)

      1.0, 0.3, 0.6, 1.4, 0.4, 0.9, 1.3

  3. 3.

    Using the command =𝙽𝙾𝚁𝙼𝙸𝙽𝚅(𝚁𝙰𝙽𝙳(),𝟸𝟶,𝟺), generate a random sample of 900 data values. Compute the sample standard deviation, s, and sample variance, s2. How do your values compare knowing that σ=4 and σ2=16?

  4. 4.

    Using the command =𝙽𝙾𝚁𝙼𝙸𝙽𝚅(𝚁𝙰𝙽𝙳(),𝟻𝟶,𝟸), generate a random sample of 900 data values. Predict what the population standard deviation and population variance ought to be.

  5. 5.

    For the following data, calculate Q1, Q2, and Q3.

    1. (a)

      14, 52, 66, 87, 99, 11, 1, 2, 3, 4, 5, 6, 7, 8, 26, 32, 32, 14, 88

    2. (b)

      6, 18, 3, 6, 2, 5, 6, 8, 21, 3, 5, 10, 26, 19, 6, 13, 63, 19, 41, 91

    3. (c)

      1,2, 2, 2, 3, 3, 4, 5, 5, 6, 8, 12, 19, 19, 21, 22, 35, 50, 61, 83

  6. 6.

    Using the data in the previous example, determine if there are any outliers based on the guidelines stated in this section.