3 Summary Statistics in Excel

3.3 Understanding Proportions

Goals:
Know what proportions are and how to calculate them;
Know the different ways of expressing a proportion;
Understand the difference between p^ and p.

3.3.1 Proportions

In your daily life, you may have encountered the phrase, “what is the proportion of (fill in the blank)?” Proportions allow us to summarize how often an attribute occurs in relation to the whole.

Definition (Proportion).

Given a sample of n data values and a subset of q data values from the sample having a specified attribute, the sample proportion of the specified attribute, denoted as p^, is the ratio of q to n. That is,

p^=q/n.

If the collection of data values represents the entire population, then the proportion of the specified attribute is referred to as the population proportion of the specified attribute and is denoted as just p.

Remark: It is common to express a proportion as a decimal, fraction, or a percentage. By definition, and written as a decimal, a proportion is a value ranging from 0 to 1.
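The definition and remark can be sketched outside of Excel as well. The Python snippet below is only an illustration: the function name sample_proportion and the made-up data (3 values out of 100 having the attribute) are our own assumptions, not part of the text's workbook. It computes q/n and prints the result as a fraction, a decimal, and a percentage.

```python
from fractions import Fraction


def sample_proportion(data, has_attribute):
    """Return q/n as an exact Fraction, where q counts the values with the attribute."""
    n = len(data)
    if n == 0:
        raise ValueError("a proportion needs at least one data value")
    q = sum(1 for value in data if has_attribute(value))
    return Fraction(q, n)


# Hypothetical sample: 3 of the 100 values carry the attribute of interest (coded as 1).
data = [1] * 3 + [0] * 97
p_hat = sample_proportion(data, lambda value: value == 1)

print(p_hat)                   # as a fraction: 3/100
print(float(p_hat))            # as a decimal: 0.03
print(f"{float(p_hat):.0%}")   # as a percentage: 3%
```

Because q can never be smaller than 0 or larger than n, the value returned always lies between 0 and 1.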

Proportions are easy to calculate. By definition, we must first calculate the frequency of the specified attribute. Frequency distributions, as a result, allow us to calculate proportions quickly. Let’s revisit an example.

Example 3.3.1.

Let’s find p^ for the bin (64.1,65] given in the frequency distribution in Figure 3.28. The frequencies of each bin are given. All we need to do is divide the frequency for bin (64.1,65] by the total number of data values in the sample, n=100. Thus, p^=3/100=0.03=3%.

Example 3.3.2.

Let’s use 𝚁𝙰𝙽𝙳𝙱𝙴𝚃𝚆𝙴𝙴𝙽(𝟷,𝟼) to generate a sample of 250 values and calculate the sample proportion p^ of the value 4 showing up.

Figure 3.44: Snapshot of Random Sample

In cell 𝙰𝟷, type RANDOM SAMPLE. Using 𝚁𝙰𝙽𝙳𝙱𝙴𝚃𝚆𝙴𝙴𝙽(𝟷,𝟼), generate a random sample of 250 data points in cells 𝙰𝟸 through 𝙰𝟸𝟻𝟷. A snapshot of our random sample is in Figure 3.44.

In cells 𝙳𝟷 through 𝙵𝟷, copy the setup as it is in Figure 3.45.

Figure 3.45: Setup of Sample Proportion

Calculate the frequency of each output in cells 𝙴𝟸 through 𝙴𝟽 by typing =𝙵𝚁𝙴𝚀𝚄𝙴𝙽𝙲𝚈(𝙰𝟸:𝙰𝟸𝟻𝟷,𝙳𝟸:𝙳𝟽). In our sample, we obtained 𝟹𝟾 fours. Since the sample size is n=250, the sample proportion is p^=38/250=0.152.
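For readers who want to check the same steps outside of Excel, here is a rough Python sketch. It assumes that random.randint(1, 6) is an acceptable stand-in for the RANDBETWEEN command (both include the endpoints 1 and 6); the seed value and the helper name roll_sample are our own choices, so the counts will not match the 38 fours above.

```python
import random


def roll_sample(n, rng):
    """Generate n whole numbers from 1 to 6, each equally likely."""
    return [rng.randint(1, 6) for _ in range(n)]


rng = random.Random(2024)   # fixed seed (an arbitrary choice) so the run is repeatable
sample = roll_sample(250, rng)

# Frequency of each output 1 through 6, like the frequency column in the worksheet.
frequencies = {face: sample.count(face) for face in range(1, 7)}

q = frequencies[4]          # number of fours in the sample
n = len(sample)             # sample size
p_hat = q / n               # sample proportion of fours

print(frequencies)
print(f"p^ = {q}/{n} = {p_hat:.3f}")
```

Changing or removing the seed plays the same role as generating a fresh random sample in the worksheet.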

Quick question: would the sample proportion p^ be different if another random sample were taken? Of course! In fact, our next example will highlight this.

Example 3.3.3.

What happens to the sample proportion of 4’s if we take multiple random samples of size 250 or more using the command 𝚁𝙰𝙽𝙳𝙱𝙴𝚃𝚆𝙴𝙴𝙽(𝟷,𝟼)?

Keeping the sample size n=250 the same, let’s denote the sample proportion calculated from each random sample as p^i, where i indicates which random sample was generated. Thus, in the previous example we have p^1=0.152. Repeat the previous exercise or, if you still have it open, press F9 on the keyboard to generate a new random sample automatically. All the work is already set up, so we can just record each sample proportion. The following table shows the sample proportions we calculated.

p^1 0.152
p^2 0.172
p^3 0.224
p^4 0.192
p^5 0.196

You should notice that the values vary as hypothesized. From our sample, though, we see that the values are as low as 0.152 and as high as 0.224. Did you obtain any values lower or higher?

It is interesting to notice that, with our large sample size of n=250, very large or very small sample proportions seem difficult to obtain. (Keep regenerating samples. Do you ever get 0.5 or more, or 0.02 or less?) It may still be possible to obtain these sample proportions, but it definitely seems to be difficult to do so.

Increase the sample size to n=1000 and calculate another 5 sample proportions of 4’s. Below is our collection of sample proportions.

p^1 0.174
p^2 0.177
p^3 0.183
p^4 0.168
p^5 0.159

It seems more likely to obtain sample proportions between about 0.15 and 0.19 and more difficult to obtain values outside that range. We can then hypothesize that as n increases, the range of likely values will shrink.
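The hypothesis can be explored with many more than five samples. The Python sketch below is an assumption-laden illustration rather than the worksheet procedure: it uses random.randint(1, 6) in place of the RANDBETWEEN command, draws 200 samples for each sample size (both the 200 and the extra size n=4000 are our own choices), and reports the smallest and largest sample proportion of 4’s that showed up.

```python
import random


def proportion_of_fours(n, rng):
    """Sample proportion of 4's in one random sample of size n."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return rolls.count(4) / n


def spread(n, repeats, rng):
    """Smallest and largest sample proportion over `repeats` samples of size n."""
    proportions = [proportion_of_fours(n, rng) for _ in range(repeats)]
    return min(proportions), max(proportions)


rng = random.Random(7)   # fixed seed (arbitrary) so the run is repeatable
for n in (250, 1000, 4000):
    low, high = spread(n, repeats=200, rng=rng)
    print(f"n={n:>4}: smallest p^={low:.3f}, largest p^={high:.3f}, "
          f"range={high - low:.3f}")
```

If the hypothesis is reasonable, the printed range should tend to get narrower as n grows, although any single run is still subject to chance.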

The sample collected is meant to be a representation of the population. The larger the sample size, the better the sample represents the population (assuming there is no bias in the sample collected). Our intuition then allows us to hypothesize that, with larger sample sizes, the sample proportion should estimate the population proportion more closely.

Example 3.3.4.

Can we predict the population proportion p of a 4 showing up using the command 𝚁𝙰𝙽𝙳𝙱𝙴𝚃𝚆𝙴𝙴𝙽(𝟷,𝟼)?

The command 𝚁𝙰𝙽𝙳𝙱𝙴𝚃𝚆𝙴𝙴𝙽(𝟷,𝟼) assumes an equally likely chance of obtaining any of the values 1 through 6. That is, if we had a sample size of 6, we would expect to see 4 show up once. This is a proportion! Hence, the population proportion of 4’s showing up is p=1/6≈0.1667. Can you see the relationship between the sample proportions in the previous examples and the population proportion?
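To see the relationship numerically, we can place p next to a few simulated sample proportions. The Python sketch below is an illustration under our own assumptions: random.randint(1, 6) stands in for the RANDBETWEEN command, and the seed and the sample sizes are arbitrary choices.

```python
import random
from fractions import Fraction

# Population proportion: one favorable outcome (a 4) among six equally likely outcomes.
outcomes = range(1, 7)
p = Fraction(sum(1 for face in outcomes if face == 4), len(outcomes))
print(p, round(float(p), 4))   # 1/6 0.1667

rng = random.Random(11)        # fixed seed (arbitrary) so the run is repeatable
for n in (250, 1000, 100_000):
    # Each comparison is True (counted as 1) when the simulated value is a 4.
    p_hat = sum(rng.randint(1, 6) == 4 for _ in range(n)) / n
    print(f"n={n}: p^={p_hat:.4f}, |p^ - p|={abs(p_hat - float(p)):.4f}")
```

The gap between p^ and p is not guaranteed to shrink at every step, but it tends to be small when n is large.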

Concepts Check:

1. Given a random sample of categorical data blue, orange, orange, red, green, green, green, orange, black, calculate p^ for orange. Answer: p^=3/9≈0.3333

2. Given a fair 20-sided die, estimate the population proportion, p, of an 11 showing up. Answer: p=1/20=0.05
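Both answers can be checked by counting. The short Python sketch below tallies the nine color values listed in the first question and treats the 20-sided die in the second question as 20 equally likely faces; the variable names are our own.

```python
from collections import Counter
from fractions import Fraction

# Question 1: sample proportion of orange among the nine listed values.
colors = ["blue", "orange", "orange", "red", "green",
          "green", "green", "orange", "black"]
counts = Counter(colors)
n = len(colors)
p_hat_orange = Fraction(counts["orange"], n)
print(counts["orange"], n)                           # 3 9
print(p_hat_orange, round(float(p_hat_orange), 4))   # 1/3 0.3333

# Question 2: one favorable face (the 11) out of 20 equally likely faces.
p_die = Fraction(1, 20)
print(p_die, float(p_die))                           # 1/20 0.05
```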

3.3.2 Exercises

  1. Answer the following as True or False.

    (a) p stands for sample proportion.

    (b) For each sample taken, p^ should never change.

    (c) It is possible to obtain a proportion of 0.

    (d) It is possible to obtain a proportion of 200%.

    (e) Proportions can be written as a fraction, decimal, or percentage.

  2. Generate a sample of 1000 data values using the command 𝙽𝙾𝚁𝙼.𝙸𝙽𝚅(𝚁𝙰𝙽𝙳(),𝟺𝟸,𝟽) and answer the following.

    (a) What is the sample proportion of values that lie within (21,63)?

    (b) What is the sample proportion of values that lie within (28,56)?

    (c) What is the sample proportion of values that lie within (35,49)?

    (d) Do you suspect your answers will be different from those of another student in the class? Why or why not?

  3. Repeat the previous exercise but with 8000 data values. Compare your results with those from the previous exercise.

  4. Use the command

    𝙲𝙷𝙾𝙾𝚂𝙴(𝙼𝙰𝚃𝙲𝙷(𝚁𝙰𝙽𝙳(),{𝟶,0.44,0.6,0.71},𝟷),"𝙾𝚁𝙰𝙽𝙶𝙴","𝙱𝙻𝚄𝙴","𝙶𝚁𝙴𝙴𝙽","𝚈𝙴𝙻𝙻𝙾𝚆")

    to generate a sample of 900 different favorite colors taken over the years. Calculate the sample proportion of Blue showing up. Can you estimate the population proportion of Blue showing up?