Chapter 4 ANOVA

This chapter uses the education data to demonstrate one-way and two-way between-subjects analysis of variance (ANOVA). It then demonstrates how to conduct a Tukey post-hoc test to determine which specific group means differ.

4.1 Read in the Data

We will use the getURL() function from the RCurl package and the read.csv() function to read the education data from GitHub into R.

library(RCurl)
## Loading required package: bitops
data_url = getURL('https://raw.githubusercontent.com/TakingStatsByTheHelm/Book/1.-Data-Sets/edu_data.csv')
edu_data = read.csv(text = data_url)

4.2 One-Way ANOVA

Let’s run a one-way analysis of variance to see whether mean academic performance (api00) differs across levels of parents’ college education. The variable ‘some_col’ is coded as follows: 0 = 0%-10% of parents had some college education, 1 = 11%-20% of parents had some college education, 2 = >20% of parents had some college education.

# use the aov() function to fit the analysis of variance
model = aov(api00 ~ as.factor(some_col), data = edu_data)

# use the summary function to get the ANOVA summary table back.
summary(model)
##                      Df  Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(some_col)   2  492909  246455   22.33 8.63e-10 ***
## Residuals           312 3443393   11037                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary() function gives back the degrees of freedom, the sum of squares between (in this case, labeled ‘as.factor(some_col)’), and the sum of squares within (labeled ‘Residuals’). We also see the mean squares, the F-value, and the corresponding p-value.
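The columns of the summary table are related by simple arithmetic, which can be checked by hand. The following is a minimal sketch that uses only the numbers printed in the table above; the variable names are illustrative, and because the printed sums of squares are rounded, the recomputed p-value should land close to, but not necessarily exactly on, the reported value.

```r
# Values copied from the one-way ANOVA summary table above.
ss_between = 492909
ss_within  = 3443393
df_between = 2
df_within  = 312

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within  = ss_within / df_within

# The F-value is the ratio of the two mean squares.
f_value = ms_between / ms_within

# The p-value is the upper-tail probability of the F distribution.
p_value = pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 2)
signif(p_value, 3)
```

The degrees of freedom also tell us about the design: with three groups, the between-groups df is 3 - 1 = 2, and a within-groups df of 312 implies 312 + 3 = 315 schools in the analysis.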

Based on our results from the ANOVA (F(2, 312) = 22.33, p < .001), we can see that at least one of the academic performance means differs across the three levels of parents’ college education.

4.3 Two-Way ANOVA

Let’s run an analysis of variance with two factors and an interaction. Let’s see if there is any difference in the means for academic performance for year-round schools, parents’ level of college education, and their interaction. In this case, variable ‘some_col’ is coded 0 = 0%-10% of parents had some college education, 1 = 11%-20% of parents had some college education, 2 = >20% of parents had some college education; and variable ‘yr_rnd’ equals 0 for schools that do not follow year-round schedules, and 1 for schools that do.

# use the aov() function to fit the analysis of variance
model = aov(api00 ~ as.factor(some_col) + as.factor(yr_rnd) + as.factor(some_col):as.factor(yr_rnd), data = edu_data)

# use the summary function to get the ANOVA summary table back.
summary(model)
##                                        Df  Sum Sq Mean Sq F value   Pr(>F)
## as.factor(some_col)                     2  492909  246455  27.328 1.18e-11
## as.factor(yr_rnd)                       1  553938  553938  61.422 7.50e-14
## as.factor(some_col):as.factor(yr_rnd)   2  102722   51361   5.695  0.00373
## Residuals                             309 2786733    9019                 
##                                          
## as.factor(some_col)                   ***
## as.factor(yr_rnd)                     ***
## as.factor(some_col):as.factor(yr_rnd) ** 
## Residuals                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary() function gives back the sum of squares for the main effect of ‘some_col’, the sum of squares for the main effect of ‘yr_rnd’, the sum of squares for their interaction (labeled ‘as.factor(some_col):as.factor(yr_rnd)’), and the sum of squares within (labeled ‘Residuals’). We also see the degrees of freedom, the mean squares, the F-values, and the corresponding p-values.
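Comparing the two summary tables shows how the two-way model re-partitions the same total variability. The sketch below uses only the sums of squares printed in the one-way and two-way tables; the vector names are illustrative. Note that aov() reports sequential sums of squares, which is why the ‘some_col’ line, entered first in both models, is unchanged.

```r
# Sums of squares copied from the two summary tables above.
one_way = c(some_col = 492909, Residuals = 3443393)
two_way = c(some_col = 492909, yr_rnd = 553938,
            interaction = 102722, Residuals = 2786733)

# Both models split the same total variability in api00.
sum(one_way)
sum(two_way)

# The two added terms account for what left the residual line.
one_way["Residuals"] - two_way["Residuals"]
two_way["yr_rnd"] + two_way["interaction"]

# F for the interaction: its mean square over the residual mean square
# (2 and 309 are the degrees of freedom from the two-way table).
f_interaction = (two_way["interaction"] / 2) / (two_way["Residuals"] / 309)
round(unname(f_interaction), 3)
```

Because the residual sum of squares shrinks once ‘yr_rnd’ and the interaction are in the model, the F-value for ‘some_col’ is larger in the two-way table (27.328) than in the one-way table (22.33), even though its sum of squares is identical.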

Based on our results from the ANOVA, we can see that there are significant main effects for the percentage of parents that received some college education and for year-round schooling, as well as a significant interaction between the two. Accordingly, average academic performance differs across the groups of parents’ college education, and the size of those differences depends on whether a school follows a year-round schedule.

Now, we would like to know more about the significant interaction. We perform post-hoc tests to determine which of the specific means are different from one another.

4.4 Tukey’s Post-Hoc Test

# use the TukeyHSD() function to run Tukey's post-hoc test on the fitted model
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = api00 ~ as.factor(some_col) + as.factor(yr_rnd) + as.factor(some_col):as.factor(yr_rnd), data = edu_data)
## 
## $`as.factor(some_col)`
##         diff       lwr       upr     p adj
## 1-0  8.96207 -25.75770  43.68184 0.8159125
## 2-0 84.94696  51.68407 118.20984 0.0000000
## 2-1 75.98489  47.37104 104.59874 0.0000000
## 
## $`as.factor(yr_rnd)`
##         diff       lwr       upr p adj
## 1-0 -90.3276 -113.6334 -67.02182     0
## 
## $`as.factor(some_col):as.factor(yr_rnd)`
##                diff          lwr        upr     p adj
## 1:0-0:0  -14.087179  -66.8994247   38.72507 0.9731331
## 2:0-0:0   41.136232   -6.7494899   89.02195 0.1383624
## 0:1-0:0 -167.069697 -237.9166256  -96.22277 0.0000000
## 1:1-0:0  -92.887879 -150.6261308  -35.14963 0.0000846
## 2:1-0:0  -22.850000  -91.6857250   45.98573 0.9324299
## 2:0-1:0   55.223411   12.9631793   97.48364 0.0029017
## 0:1-1:0 -152.982517 -220.1552215  -85.80981 0.0000000
## 1:1-1:0  -78.800699 -131.9663895  -25.63501 0.0004038
## 2:1-1:0   -8.762821  -73.8108175   56.28518 0.9988786
## 0:1-2:0 -208.205929 -271.5784882 -144.83337 0.0000000
## 1:1-2:0 -134.024111 -182.2993602  -85.74886 0.0000000
## 2:1-2:0  -63.986232 -125.1021137   -2.87035 0.0341273
## 1:1-0:1   74.181818    3.0710267  145.29261 0.0352191
## 2:1-0:1  144.219697   63.8368881  224.60251 0.0000070
## 2:1-1:1   70.037879    0.9306114  139.14515 0.0449087

The first two components of the output contain the tests for the marginal effects. More specifically, we see the three differences for the marginal means of ‘some_col’, and then we see the differences for the marginal means of ‘yr_rnd’.

For ‘some_col’, we see that schools where 0%-10% of parents had some college education do not significantly differ from schools where 11%-20% of parents had some college education (the 1 vs. 0 comparison). However, schools where >20% of parents had some college education differed from both schools with 0%-10% and schools with 11%-20% of parents with some college education (the 2 vs. 0 and 2 vs. 1 comparisons).

Moreover, we see a main effect of year-round schooling (the 1 vs. 0 comparison in the second component): year-round schools scored about 90 points lower on average than schools that do not follow a year-round schedule.

Finally, within the third subsection of the post-hoc output, we see differences for each pair of cell means. In particular, the row labeled ‘1:0-0:0’ examines the difference between schools with 0%-10% of parents with college education and non-year-round calendars (i.e., the 0:0 part), and schools with 11%-20% of parents with college education and non-year-round calendars (i.e., the 1:0 part). We do not see a significant difference between these two cell means (p > .05). The remaining rows describe the differences between the other pairs of cell means.
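With fifteen cell-mean comparisons, it can help to pull rows out of the Tukey output programmatically rather than reading the whole printout. The following is a rough sketch under stated assumptions: it uses a simulated toy data frame (hypothetical values, not the education data) with the same column names, so the numbers it produces are meaningless; only the indexing pattern carries over. It relies on the `which` argument of TukeyHSD() and on the fact that the result is a list of matrices with a "p adj" column.

```r
# Hypothetical toy data standing in for edu_data (simulated values).
set.seed(42)
toy = data.frame(
  some_col = factor(rep(0:2, each = 20)),
  yr_rnd   = factor(rep(0:1, times = 30)),
  api00    = rnorm(60, mean = 650, sd = 100)
)

# `some_col * yr_rnd` expands to both main effects plus their interaction.
toy_model = aov(api00 ~ some_col * yr_rnd, data = toy)

# Request only the interaction (cell-mean) comparisons.
tukey_cells = TukeyHSD(toy_model, which = "some_col:yr_rnd")

# The result is a list of matrices; take the one for the interaction term.
cell_table = tukey_cells[["some_col:yr_rnd"]]

# Look up a single comparison by its row label.
cell_table["1:0-0:0", ]

# Keep only the comparisons whose adjusted p-value is below .05.
cell_table[cell_table[, "p adj"] < 0.05, , drop = FALSE]
```

For the model fitted in this chapter, the term name would be the longer label shown in the output, ‘as.factor(some_col):as.factor(yr_rnd)’, unless the variables are converted to factors before fitting as in the toy example.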

4.5 Summary