Chapter 5 Regression

There are three data sets on GitHub: ‘edu_data.csv’, ‘GRE_data.csv’, and ‘salary_data.csv’. We will give all examples using the ‘edu_data.csv’ file, and you can choose either the ‘GRE_data.csv’ or the ‘salary_data.csv’ file to perform your own analyses.

There are descriptions of all three data sets on GitHub. Choose one that seems interesting to work with.

This chapter will cover estimation of the linear regression coefficients, \(R^2\), significance testing of the regression coefficients, significance testing of \(R^2\), and confidence intervals for the linear regression coefficients. We show examples using the ‘edu_data.csv’ data set, and then you will have the opportunity to cover the same material with your own data set. As always, the first step is to read your data into R.

5.1 Read in the Data

We will use the getURL() and read.csv() functions to read the education data from GitHub into R.

library(RCurl)
## Loading required package: bitops
data_url = getURL('https://raw.githubusercontent.com/TakingStatsByTheHelm/Book/1.-Data-Sets/edu_data.csv')
edu_data = read.csv(text = data_url)

5.2 Simple Linear Regression

Let’s run a simple linear regression! The lm() function will allow us to run intercept-only, simple, and multiple linear regression (remember that multiple linear regression is when we have more than one predictor, and that we’ll talk more about it next week).

Let’s start with an intercept-only model. We’ll use variable ‘api00’ as the outcome variable, which is just academic performance for each school during the year 2000.

\[ api00_i = b_0 + \epsilon_i \]

### Create the object model.01 to store the output. Don't forget to use 'na.action = na.exclude' to remove missing data!

model.01 = lm(api00 ~ 1, data = edu_data, na.action = na.exclude)

### Then, we use the summary() function to get the output from the regression model in a clear way

summary(model.01)
## 
## Call:
## lm(formula = api00 ~ 1, data = edu_data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -226.66  -91.16   -7.66   97.34  321.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  595.660      6.308   94.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 112 on 314 degrees of freedom

Note that the intercept is estimated as 595.660, has a standard error (i.e., \(s_{b_0}\)) of 6.308, a t-value of 94.42 \(\left(\frac{595.660}{6.308}\right)\), and a corresponding p-value smaller than \(2 \times 10^{-16}\) (R prints p-values below this threshold as <2e-16 rather than showing the exact value, which is effectively 0).
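In an intercept-only model, the estimate of \(b_0\) is just the sample mean of the outcome. As a quick sanity check (assuming the data were read in as above), we can compute the mean of ‘api00’ directly and compare it to the intercept estimate of 595.66:

### The sample mean of api00 should match the intercept estimate
mean(edu_data$api00, na.rm = TRUE)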

Next, we’ll use the variable ‘meals’ to predict ‘api00’. You’ll just use the lm() and summary() functions again! Don’t forget the ‘na.action = na.exclude’ to remove missing data.

\[ api00_i = b_0 + b_1 \cdot meals_i + \epsilon_i \]

### Create the object model.02 to store the output to

model.02 = lm(api00 ~ 1 + meals, data = edu_data, na.action = na.exclude)

### Then, we use the summary() function to get the output from the regression model in a clear way

summary(model.02)
## 
## Call:
## lm(formula = api00 ~ 1 + meals, data = edu_data, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -198.351  -46.467   -6.925   47.176  192.169 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 866.4821    11.3087   76.62   <2e-16 ***
## meals        -3.7617     0.1488  -25.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.3 on 313 degrees of freedom
## Multiple R-squared:  0.6713, Adjusted R-squared:  0.6702 
## F-statistic: 639.1 on 1 and 313 DF,  p-value: < 2.2e-16

Notice that we get back estimates for both \(b_0\) and \(b_1\). In this case our \(b_0\) is 866.48, and our \(b_1\) is -3.76. Both are significantly different from 0.

Remember, the variable ‘meals’ indicates the percentage of students within a school that receive free meals.

Therefore, these results suggest that for each 1-percentage-point increase in students receiving free meals within a school, we expect a 3.76-point decrease in the school’s overall academic performance (interpretation of \(b_1\)). Also, if 0 percent of the students receive free meals, then we expect the school’s academic performance to equal 866.48 (interpretation of \(b_0\)).

If we look toward the bottom of the output, we notice that the estimate of \(R^2\) is .67, which indicates that 67% of the variability in academic performance is accounted for by the percentage of students that receive free meals.

Also note that the F-test on the last line of the output (\(F(1, 313) = 639.1\), \(p < 2.2 \times 10^{-16}\)) shows that the \(R^2\) value is significantly different from 0.
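If you want to pull these values out of the fitted model directly (rather than reading them off the printed summary), summary() returns them as components of a list. A minimal sketch:

### Extract R-squared and the F-statistic (value, numerator df, denominator df)
summary(model.02)$r.squared
summary(model.02)$fstatistic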

5.3 Confidence Intervals for Regression Coefficients

To calculate confidence intervals for the regression coefficients, just use the confint() function.

confint(model.02)
##                  2.5 %     97.5 %
## (Intercept) 844.231409 888.732875
## meals        -4.054522  -3.468969

By default, the confint() function will calculate 95% confidence intervals.

Note that the 95% confidence interval for \(b_0\) is \([844.23, 888.73]\), and the 95% confidence interval for \(b_1\) is \([-4.05, -3.47]\).

Therefore, if we were able to repeat this experiment over and over (i.e., collect academic performance and percentage of free meals on 315 randomly sampled schools each time) and compute a 95% confidence interval each time, we would expect 95% of those intervals to contain the true \(b_0\), and likewise 95% of them to contain the true \(b_1\).
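If you want a different confidence level, pass the ‘level’ argument to confint(); for example, 99% intervals:

### Wider intervals, since we demand more confidence
confint(model.02, level = 0.99)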

5.4 Confidence Intervals for Predicted Values

To calculate confidence intervals for predicted values, use the predict() function. We need to tell the predict() function the model we are predicting from (in this case ‘model.02’), the value for \(X^*\) (in this case 10), and the type of interval we want (‘confidence’).

#Follow this same order in the function. Put the model first, then the 'newdata = ' (but be sure to change the variable name to the one in your model, not meals!), and use "interval = 'confidence'"
predict(model.02, newdata = data.frame(meals = 10), interval = 'confidence')
##        fit      lwr      upr
## 1 828.8647 809.3648 848.3645

Therefore, when a school has 10% of students receiving free meals, we expect that school to have an academic performance of 828.86, and (if we were to repeat this experiment over and over) 95% of the intervals constructed this way would contain the true expected value; here the interval is \([809.36, 848.36]\).
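The same call works for several values of \(X^*\) at once. For example, to get expected values and confidence intervals at 0%, 10%, and 50% free meals in one step:

### Pass a vector of values inside the data frame; one row of output per value
predict(model.02, newdata = data.frame(meals = c(0, 10, 50)), interval = 'confidence')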

5.5 Mean Centering

When we perform multiple regression, it is best to mean-center the predictors so that the intercept has more meaning. Let’s begin by mean centering the % of meals and enrollment.

edu_data$meals_c = edu_data$meals - mean(edu_data$meals, na.rm = TRUE)

edu_data$enroll_c = edu_data$enroll - mean(edu_data$enroll, na.rm = TRUE)
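As a quick check, the mean of each centered predictor should be zero (up to floating-point rounding):

### Both of these should print values that are (numerically) 0
mean(edu_data$meals_c, na.rm = TRUE)
mean(edu_data$enroll_c, na.rm = TRUE)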

Now that we have mean-centered the predictors that we are interested in, we can use them as predictors within multiple regression!

5.6 Multiple Regression

Let’s run a multiple regression! The lm() function will allow us to run multiple linear regression (remember last week we used lm() to calculate intercept-only and simple linear regression).

We’ll use variable ‘api00’ as the outcome variable, which is just academic performance for each school during the year 2000, and we’ll predict it using the centered enrollment and the centered % of meals.

To keep the notation consistent with class, let’s define

\(X_{1i} = enroll_i\) and \(X_{2i} = meals_i\)

Then,

\(x_{1i} = X_{1i} - \bar{X}_{1}\)

\(x_{2i} = X_{2i} - \bar{X}_{2}\)

\[ api00_i = b_0 + b_1\cdot x_{1i} + b_2\cdot x_{2i} + \epsilon_i \]

### Create the object model.01 to store the output. Don't forget to use 'na.action = na.exclude' to remove missing data!

model.01 = lm(api00 ~ 1 + enroll_c + meals_c , data = edu_data, na.action = na.exclude)

### Then, we use the summary() function to get the output from the regression model in a clear way

summary(model.01)
## 
## Call:
## lm(formula = api00 ~ 1 + enroll_c + meals_c, data = edu_data, 
##     na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.230  -42.477   -8.259   40.875  164.567 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 595.66032    3.49147  170.60  < 2e-16 ***
## enroll_c     -0.07911    0.01582   -5.00  9.6e-07 ***
## meals_c      -3.49017    0.15335  -22.76  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.97 on 312 degrees of freedom
## Multiple R-squared:  0.6956, Adjusted R-squared:  0.6937 
## F-statistic: 356.5 on 2 and 312 DF,  p-value: < 2.2e-16

As before, we get estimates of the intercept and the slopes, their standard errors, t-values, and corresponding p-values.

The intercept \(b_0\) indicates the expected academic performance for a school with average enrollment and average % of free meals.

The slope \(b_1\) indicates that for each additional student enrolled at the school, we expect a 0.08-point decrease in academic performance. The slope \(b_2\) indicates that for each additional percentage point of students receiving free meals, we expect a 3.49-point decrease in academic performance. We also note that both slopes are significantly different from 0.
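We can verify the interpretation of \(b_0\) by asking the model for the predicted academic performance of a school that is exactly at the mean on both predictors (i.e., both centered predictors equal 0); this should reproduce the intercept, 595.66:

### Predict for a school at the mean enrollment and mean % of free meals
predict(model.01, newdata = data.frame(enroll_c = 0, meals_c = 0))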

5.7 Interaction

Let’s run a multiple regression with an interaction term! The lm() function will allow us to run multiple linear regression with interactions.

We’ll use variable ‘api00’ as the outcome variable, which is just academic performance for each school during the year 2000, and we’ll predict it using the centered enrollment, the centered % of meals, and their interaction.

To keep the notation consistent with class, let’s define

\(X_{1i} = enroll_i\) and \(X_{2i} = meals_i\)

Then,

\(x_{1i} = X_{1i} - \bar{X}_{1}\)

\(x_{2i} = X_{2i} - \bar{X}_{2}\)

So we have the model

\[ api00_i = b_0 + b_1\cdot x_{1i} + b_2\cdot x_{2i} + b_3\cdot x_{1i} \cdot x_{2i} + \epsilon_i \]

### Create the object model.02 to store the output. Don't forget to use 'na.action = na.exclude' to remove missing data!

# Form the interaction using the I() function
model.02 = lm(api00 ~ 1 + enroll_c + meals_c + I(enroll_c*meals_c), data = edu_data, na.action = na.exclude)

### Then, we use the summary() function to get the output from the regression model in a clear way

summary(model.02)
## 
## Call:
## lm(formula = api00 ~ 1 + enroll_c + meals_c + I(enroll_c * meals_c), 
##     data = edu_data, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -207.075  -42.555   -5.186   41.436  157.940 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.987e+02  3.810e+00 157.154  < 2e-16 ***
## enroll_c              -6.170e-02  1.806e-02  -3.417 0.000717 ***
## meals_c               -3.573e+00  1.583e-01 -22.573  < 2e-16 ***
## I(enroll_c * meals_c) -1.513e-03  7.671e-04  -1.972 0.049490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.68 on 311 degrees of freedom
## Multiple R-squared:  0.6994, Adjusted R-squared:  0.6965 
## F-statistic: 241.2 on 3 and 311 DF,  p-value: < 2.2e-16

We can choose either enrollment or meals as the moderator variable. Let’s choose meals; then we can rewrite the equation

\[ api00_i = b_0 + b_1\cdot x_{1i} + b_2\cdot x_{2i} + b_3\cdot x_{1i} \cdot x_{2i} + \epsilon_i \]

as

\[ api00_i = (b_0 + b_2\cdot x_{2i}) + (b_1 + b_3 \cdot x_{2i})\cdot x_{1i} + \epsilon_i \]

Then we have

\[ a_0 = b_0 + b_2\cdot x_{2i} \] \[ a_1 = b_1 + b_3 \cdot x_{2i} \]

and

\[ api00_i = a_0 + a_1\cdot x_{1i} + \epsilon_i \]

Filling in these equations based on the output gives us

\[ a_0 = 598.7 - 3.57\cdot x_{2i} \] \[ a_1 = -.06 - .002 \cdot x_{2i} \]

and

\[ api00_i = a_0 + a_1\cdot x_{1i} + \epsilon_i \]

Therefore, we have a relation between \(x_{1i}\) and \(api00_i\), and that relation changes as a function of \(x_{2i}\).
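To see this numerically, we can plug a few values of \(x_{2i}\) into the equation for \(a_1\) using the estimated coefficients. The sketch below evaluates the simple slope of centered enrollment at one standard deviation below the mean, at the mean, and at one standard deviation above the mean of centered meals (the helper function a1_at is ours, introduced just for this illustration):

### Extract the coefficients from the interaction model
b = coef(model.02)
sd_meals = sd(edu_data$meals_c, na.rm = TRUE)

### Simple slope of enroll_c as a function of meals_c: a_1 = b_1 + b_3 * x_2
a1_at = function(x2) b['enroll_c'] + b['I(enroll_c * meals_c)'] * x2
a1_at(c(-sd_meals, 0, sd_meals))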

5.8 Matrix Notation and Regression

Let’s begin by noticing that \(y_i\) can be written as a column vector. More specifically, if we have \(N\) individuals within our sample, then \[ y_i = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] \]

Ok, then we can write the intercept-only regression model as

\[ y_i = b_0 + \epsilon_i\] or we can write it in matrix notation as

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right] [b_0] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

Now, let’s extend this equation to include a single predictor. In particular, if we have

\[ y_i = b_0 + b_1(x_{1i}) + \epsilon_i \]

then we can rewrite equation in matrix notation as

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{cc} 1 & x_{11}\\ 1 & x_{12}\\ \vdots & \vdots \\ 1 & x_{1N}\end{array} \right] \left[ \begin{array}{c}b_0\\ b_1\end{array}\right] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

Finally, extending the equation that contains \(q\) predictor variables, we have \[ y_i = b_0 + b_1(x_{1i}) + b_2(x_{2i}) + ... + b_q(x_{qi}) + \epsilon_i \]

which is written in matrix notation as

\[ \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] = \left[ \begin{array}{cccc} 1 & x_{11} & \ldots & x_{q1}\\ 1 & x_{12} & \ldots & x_{q2}\\ \vdots & \vdots & & \vdots \\ 1 & x_{1N} & \ldots & x_{qN}\end{array} \right] \left[ \begin{array}{c}b_0\\ b_1\\ \vdots\\ b_q\end{array}\right] + \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right]\]

When we implement this model within R, it is important to know that we refer to \(y_i\) and \(x_{1}, \ldots, x_{q}\) as variables within a data frame. We’ll emphasize this within the examples to come. Ok, now let’s simplify the notation a bit more. Let’s define

\[ \boldsymbol{Y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right];\ \boldsymbol{X} = \left[ \begin{array}{cccc} 1 & x_{11} & \ldots & x_{q1}\\ 1 & x_{12} & \ldots & x_{q2}\\ \vdots & \vdots & & \vdots \\ 1 & x_{1N} & \ldots & x_{qN}\end{array} \right];\ \boldsymbol{b} = \left[ \begin{array}{c}b_0\\ b_1\\ \vdots\\ b_q\end{array}\right];\ \boldsymbol{\epsilon} = \left[ \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{array} \right] \]

Then, the regression model may be compactly written as

\[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{b} + \boldsymbol{\epsilon} \]

Finally, we would like to solve for \(\boldsymbol{b}\). Those are the focal parameter estimates. First, begin by premultiplying both sides of the equation by \(\boldsymbol{X'}\). (Under least squares, the residuals are orthogonal to the columns of \(\boldsymbol{X}\), so \(\boldsymbol{X'}\boldsymbol{\epsilon} = \boldsymbol{0}\) and the error term drops out.)

\[ \boldsymbol{X'}\boldsymbol{Y} = \boldsymbol{X'}\boldsymbol{X}\boldsymbol{b} \]

Then premultiply both sides by \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\)

\[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\boldsymbol{X'}\boldsymbol{Y} = \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\boldsymbol{X'}\boldsymbol{X}\boldsymbol{b} \]

which simplifies to \[ \left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\boldsymbol{X'}\boldsymbol{Y} = \boldsymbol{b} \]

Therefore, we can easily calculate the estimates of the parameters using matrix algebra! And this form of matrix manipulation is, quite literally, what the computer program performs to estimate the parameters within a multiple regression.
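As a check on the algebra, we can rebuild \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) from the multiple regression fit in Section 5.6 and compute \(\left(\boldsymbol{X'}\boldsymbol{X}\right)^{-1}\boldsymbol{X'}\boldsymbol{Y}\) by hand; the result should match the coefficients that lm() reported. A minimal sketch (t() is the transpose, %*% is matrix multiplication, and solve() computes the inverse):

### Recover the design matrix and outcome vector used by model.01
X = model.matrix(model.01)
Y = model.response(model.frame(model.01))

### b = (X'X)^{-1} X'Y
b = solve(t(X) %*% X) %*% t(X) %*% Y
b

### Compare to the estimates lm() produced
coef(model.01)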

5.9 Summary