Linear regression, sometimes referred to as the least-squares regression line,[53] aims at finding a linear relationship between the random variables $X$ and $Y$. That is, we are looking for $a$ and $b$ so that we can write the expression[54]
| $\hat{y} = a + bx$ | (14.1) |
[53] See the optional section at the end of the chapter on the derivation of the linear regression formula for clarification as to why it is called the least-squares regression line.
[54] This is the equation of a line. In the previous section, we wrote it as $y = mx + b$, the usual way it is written in an algebra course. Due to other uses in statistics of the letters $m$ and $b$, we avoid writing it in this manner.
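In order to do just that, we need to assume we have data $x_1, x_2, \ldots, x_n$ from a random variable $X$ and $y_1, y_2, \ldots, y_n$ from a random variable $Y$ such that $x_i$ is paired with $y_i$, for $i = 1, 2, \ldots, n$. Writing these as pairs, we have
| $(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n)$ |
From this data, we can easily calculate the values of $a$ and $b$ in Eq. (14.1).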
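The slope, or the value $b$, in Eq. (14.1) is calculated using the formula
| $b = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$ | (14.2) |
where $n$ is the number of data values, and
| $\sum x = x_1 + \cdots + x_n, \quad \sum y = y_1 + \cdots + y_n, \quad \sum xy = x_1 y_1 + \cdots + x_n y_n, \quad \sum x^2 = x_1^2 + \cdots + x_n^2$ |
Caution! Be aware that there is a difference between $\sum x^2$ and $\left(\sum x\right)^2$. The first squares each data value and then adds the squares; the second adds the data values and then squares the total.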
The $y$-intercept, or the value of $a$, in Eq. (14.1) is calculated using the formula
| $a = \bar{y} - b\bar{x}$ | (14.3) |
where $\bar{x} = \dfrac{\sum x}{n}$, $\bar{y} = \dfrac{\sum y}{n}$, and $b$ is the value obtained from the formula for the slope.
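The two formulas can be sketched in a few lines of Python. This is only an illustration, assuming the slope is $b = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ and the intercept is $a = \bar{y} - b\bar{x}$. The function name `fit_line` and the four data points are made up for this sketch. The points lie exactly on the line $y = 1 + 2x$, so the formulas should recover $a = 1$ and $b = 2$.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a + b*x from the sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)        # sum of the squares
    sum_x_squared = sum_x ** 2             # square of the sum (not the same thing)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x_squared)  # slope
    a = sum_y / n - b * (sum_x / n)                                   # intercept
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```

Note that the slope must be computed first, because the intercept formula uses it.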
These formulas come from the least-squares criterion: among all possible lines, the line of best fit is the one that minimizes the sum of the squared vertical distances between the data points and the line. The line that minimizes this sum is the line that gives us the values for $a$ and $b$ above.[55]
[55] For a more thorough derivation and the reasoning behind where these formulas come from, please see the optional section on the derivation of the linear regression line at the end of the chapter. Please note that this section requires a knowledge of calculus.
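The least-squares idea can be checked numerically without calculus. The sketch below is an illustration only: the helper `sse` and the four data points are made up, and the slope and intercept are computed from the sum formulas for $b$ and $a$. It then nudges the intercept and slope slightly and confirms that every nudged line has a larger sum of squared vertical distances.

```python
def sse(a, b, xs, ys):
    """Sum of squared vertical distances between the points and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

# Slope and intercept from the sum formulas.
n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

best = sse(a, b, xs, ys)

# Nudge the intercept and the slope; each nudged line should fit worse.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.1, -0.1)]
all_worse = all(sse(a + da, b + db, xs, ys) > best for da, db in nudges)
print(all_worse)  # True
```

A handful of nudges is not a proof, of course; the calculus-based derivation is what guarantees that no other line does better.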
Having computed the values for $a$ and $b$ in our predictive equation, we now have the equation of the line that best fits our data points. The purpose of doing this is so that we can make predictions about the value of $Y$ given the value of $X$. We need to be careful, though. The predictions we make are only valid for the range of data values sampled from the variable $X$ to derive the regression line equation. We can make predictions for values of $X$ outside this range, but these predictions may not be valid because we don’t know the behavior of the system outside the range of values that we originally collected. The system may flatten out, or become vertical, or exhibit sinusoidal behavior. We simply don’t know; therefore, predictions using values of $X$ outside our original range should be carefully analyzed.
Let’s practice using the formulas with an example.
In Excel, create a scatterplot and find the equation of the regression line given the following data.
| $x$ | 2 | 4 | 6 | 8 | 10 |
| $y$ | 7 | 10 | 12 | 16 | 21 |
Solution. Before we calculate the linear regression equation, it is ideal to first analyze the data by creating a scatterplot to see if the data looks somewhat linear. As we will learn later, there are statistical measurement tools we can use to determine how good a fit the regression line is to the data.
Begin by entering the data from the table above in two adjacent columns (for example, columns A and B), as shown in Figure 14.4 below.
Highlight just the data in the table and select the first option within the scatterplot button located under the Insert tab on the ribbon. After doing so and adding titles, we arrive at the scatterplot of the data as seen in Figure 14.5 below.
It can be seen that the points fall nearly on a straight line, but not quite. It seems appropriate that a line would fit the data. Note that the equation that we derive from these data will give somewhat accurate predictions for $y$ given $x$, but they will not be perfect.
In order to calculate the expression for $b$, it is useful to calculate what $xy$, $x^2$, and $y^2$ are for each of the data values. Hence, we can expand our table in Excel to include these values, including the sums of each of the columns.
Note that each sum in our formula for $b$ is the total of the corresponding column; for example, $\sum x = 2 + 4 + 6 + 8 + 10 = 30$, as seen in the column totals in Figure 14.6. A similar deduction can be made for the rest of the values in the formula for $b$.
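For readers who want to check the expanded table without Excel, the following sketch rebuilds it in Python. It assumes the extra columns are the products $xy$ and the squares $x^2$ and $y^2$, and it uses the data from the example; the variable names are chosen only for this illustration.

```python
# Data from the example.
x = [2, 4, 6, 8, 10]
y = [7, 10, 12, 16, 21]

# Extra columns, one entry per data pair.
xy = [xi * yi for xi, yi in zip(x, y)]
x2 = [xi ** 2 for xi in x]
y2 = [yi ** 2 for yi in y]

# Print the table row by row, then the column sums.
print(f"{'x':>5}{'y':>5}{'xy':>6}{'x^2':>6}{'y^2':>6}")
for row in zip(x, y, xy, x2, y2):
    print("{:>5}{:>5}{:>6}{:>6}{:>6}".format(*row))

sums = [sum(col) for col in (x, y, xy, x2, y2)]
print("{:>5}{:>5}{:>6}{:>6}{:>6}".format(*sums))
# sums is [30, 66, 464, 220, 990]
```

The last printed row holds the column totals that are substituted into the formula for the slope.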
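Plugging the necessary sums from Figure 14.6 into the formula for $b$ and noting that $n = 5$, we have
| $b = \dfrac{5(464) - (30)(66)}{5(220) - (30)^2} = \dfrac{340}{200} = 1.7$ |
So the slope of our regression line is 1.7. Using this, we have that
| $a = \bar{y} - b\bar{x} = \dfrac{66}{5} - 1.7\left(\dfrac{30}{5}\right) = 13.2 - 10.2 = 3$ |
Now we have the equation of our regression line:
| $\hat{y} = 3 + 1.7x$ |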
Once we have the regression line, we can use it to make predictions. From the previous example, we could use the equation $\hat{y} = 3 + 1.7x$ to make predictions for $y$ given $x$ by plugging in values for $x$.
We need to remember, though, that only values that fall within our range of data for $X$ are valid to use. Extrapolations beyond our original range of values may not accurately reflect the relationship between $X$ and $Y$. This is because we don’t know if, perhaps, the random variables $X$ and $Y$ exhibit a nonlinear behavior outside of our range of data.
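The warning about extrapolation can be built directly into a prediction helper. The sketch below assumes the fitted line from the example is $\hat{y} = 3 + 1.7x$ (the values that the slope and intercept formulas give for the example data) and that the sampled $x$ values run from 2 to 10. The function `predict` and its return convention are hypothetical, chosen only for this illustration.

```python
A, B = 3.0, 1.7        # intercept and slope, assumed from the worked example
X_MIN, X_MAX = 2, 10   # smallest and largest x values in the sampled data


def predict(x):
    """Return (prediction, inside_range) for the line y = A + B*x."""
    inside_range = X_MIN <= x <= X_MAX
    return A + B * x, inside_range


for x in (5, 12):
    y_hat, ok = predict(x)
    note = "" if ok else "  (extrapolation: treat with caution)"
    print(f"x = {x}: predicted y = {y_hat:.1f}{note}")
```

The helper still returns a number for $x = 12$, but it flags the result so that the reader remembers the line was fitted only to data between 2 and 10.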
We can also use Python to find the regression line.
Write a script in Python to find the line of best fit for the same data as in Figure 14.4.
Solution. Open a text editor, such as Notepad in Windows (which is what we used), and enter the data from Figure 14.4 as two columns, one data pair per row, with one space between the two values in each row. The first row is then 2 7, the second is 4 10, and so on.
Save the file as data1.txt to a directory that you can find on your computer.
Create a new script in Python and save it as Regression.py. Make sure you save it in the same directory in which your data file is saved. Then enter the following.
| from scipy.stats import linregress |
| from numpy import loadtxt |
These commands import the linregress function from the scipy.stats module and the loadtxt function from the numpy module. To import the data from the data1.txt file in Python as variables, enter the following.
| x,y = loadtxt("data1.txt", unpack=True) |
This command, when run, loads the data from the data1.txt file into two variables, $x$ and $y$, column-wise.
The linregress function returns the slope of the regression line and the intercept. In addition, it returns the correlation coefficient and two other values (the p-value and the standard error of the slope) that we are not interested in at the moment. Because we unpack the result into separate variables, we must supply a name for each of these five values when we run the function on the arrays $x$ and $y$. For now, we will only focus on the slope and the intercept, $b$ and $a$.
| b,a,r,p,s = linregress(x,y) |
Let’s have the slope and intercept printed out. Type the following.
| print("The slope of the line is %f and the intercept is %f." % (b, a)) |
Run the script and you should obtain the same results that we have in the previous example.
Answer the following as True or False.
Regression curves can be used for prediction purposes.
The $a$ in Eq. 14.1 is the $y$-intercept.
The slope of a line, $b$, in Eq. 14.1 can never be negative.
There is no difference between $\sum x^2$ and $\left(\sum x\right)^2$.
When creating the regression line, $n$ pertains to the number of data values obtained from random variable $X$.
Create a regression line in Excel for each of the following data sets using the tactics discussed in Example 14.2.1. Be sure to write your answer in $\hat{y} = a + bx$ form when finished.
| $x$ | 22.3 | 19.4 | 18.1 | 21.5 | 24.3 | 25.0 | 19.6 |
| $y$ | 82.1 | 100.4 | 109.2 | 90.3 | 74.9 | 66.5 | 108.3 |
| $x$ | 12.3 | 13.2 | 14.1 | 17.9 | 15.7 | 18.8 | 19.2 |
| $y$ | 6.7 | 7.3 | 8.4 | 10.6 | 9.5 | 12.8 | 14.1 |
| $x$ | 26 | 23 | 24 | 22 | 21 | 20 | 19 | 17 | 14 | 12 |
| $y$ | 36 | 37 | 32 | 39 | 34 | 41 | 42 | 46 | 49 | 52 |
Explain what occurs when attempting to compute the regression line for the following data. Why would this be?
| $x$ | 1 | 1 | 1 | 1 | 1 | 1 |
| $y$ | 3 | 4 | 5 | 6 | 7 | 8 |