Chapter 4 ANOVA
This chapter uses the education data to demonstrate one-way and two-way between-subjects analysis of variance (ANOVA). It then demonstrates how to conduct a Tukey post-hoc test to determine which specific group means differ.
4.1 Read in the Data
We will use the getURL() function from the RCurl package and the read.csv() function to read the education data from GitHub into R.
library(RCurl)
## Loading required package: bitops
data_url = getURL('https://raw.githubusercontent.com/TakingStatsByTheHelm/Book/1.-Data-Sets/edu_data.csv')
edu_data = read.csv(text = data_url)
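As a small illustration of what the text argument of read.csv() does, the sketch below parses a CSV string held in memory, which is the form of input that getURL() hands back. The three rows are made-up values, not records from the education data; only the column names follow the ones used in this chapter.

```r
# A made-up CSV string standing in for the text that getURL() would return.
# These values are hypothetical and are NOT taken from edu_data.csv.
csv_text = "api00,some_col,yr_rnd
693,0,0
570,1,1
546,2,0"

# read.csv() parses the string directly when it is passed via `text`
toy_data = read.csv(text = csv_text)

# inspect the column types and the first rows
str(toy_data)
head(toy_data)
```

Running str() on the real edu_data in the same way is a quick check that the download worked and that the columns used below are present.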
4.2 One-Way ANOVA
Let’s run a one-way analysis of variance to see whether mean academic performance (api00) differs across the levels of parents’ college education. The variable ‘some_col’ is coded as follows: 0 = 0%-10% of parents had some college education, 1 = 11%-20% of parents had some college education, and 2 = more than 20% of parents had some college education.
# use the aov() function to fit the analysis of variance
model = aov(api00 ~ as.factor(some_col), data = edu_data)
# use the summary function to get the ANOVA summary table back.
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(some_col) 2 492909 246455 22.33 8.63e-10 ***
## Residuals 312 3443393 11037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary() function gives back the degrees of freedom, the sums of squares between (in this case, labeled ‘as.factor(some_col)’), and the sums of squares within (labeled ‘Residuals’). We also see the mean squares, the F-value, and the corresponding p-value.
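The columns of the summary table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and the F-value is the ratio of the two mean squares. The following sketch recomputes those quantities from the numbers printed in the table above; the variable names are ours and are only for illustration.

```r
# values copied from the one-way ANOVA summary table above
ss_between = 492909
ss_within = 3443393
df_between = 2
df_within = 312

# mean square = sum of squares / degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of the between and within mean squares
f_value = ms_between / ms_within

# upper-tail probability of the F distribution gives the p-value
p_value = pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(ms_between = ms_between, ms_within = ms_within, f_value = f_value), 2)
p_value
```

Because the table prints rounded sums of squares, the recomputed values should agree with the table up to rounding rather than exactly.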
Based on our results from the ANOVA, F(2, 312) = 22.33, p < .001, we can conclude that at least one of the academic performance means differs across the three levels of parents’ college education.
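To see where the between and within sums of squares come from, here is a minimal sketch that builds them by hand and compares the resulting F-value with the one from aov(). It uses PlantGrowth, a data set that ships with R, as a stand-in so that the example runs without downloading anything; it is not the education data.

```r
# PlantGrowth ships with R: 30 plant weights in 3 groups (stand-in data)
y = PlantGrowth$weight
g = PlantGrowth$group

grand_mean = mean(y)
group_means = tapply(y, g, mean)
group_sizes = tapply(y, g, length)

# between: squared distance of each group mean from the grand mean, weighted by n
ss_between = sum(group_sizes * (group_means - grand_mean)^2)

# within: squared distance of each observation from its own group mean
ss_within = sum((y - group_means[as.character(g)])^2)

k = length(group_means)  # number of groups
n = length(y)            # number of observations
f_by_hand = (ss_between / (k - 1)) / (ss_within / (n - k))

# compare with the F value reported by aov()
fit = aov(weight ~ group, data = PlantGrowth)
f_from_aov = summary(fit)[[1]][["F value"]][1]
c(by_hand = f_by_hand, aov = f_from_aov)
```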
4.3 Two-Way ANOVA
Let’s run an analysis of variance with two factors and an interaction to see whether mean academic performance differs by year-round schooling, by parents’ level of college education, and by their interaction. The variable ‘some_col’ is coded 0 = 0%-10% of parents had some college education, 1 = 11%-20% of parents had some college education, and 2 = more than 20% of parents had some college education; the variable ‘yr_rnd’ equals 0 for schools that do not follow a year-round schedule and 1 for schools that do.
# use the aov() function to fit the analysis of variance
model = aov(api00 ~ as.factor(some_col) + as.factor(yr_rnd) + as.factor(some_col):as.factor(yr_rnd), data = edu_data)
# use the summary function to get the ANOVA summary table back.
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(some_col) 2 492909 246455 27.328 1.18e-11
## as.factor(yr_rnd) 1 553938 553938 61.422 7.50e-14
## as.factor(some_col):as.factor(yr_rnd) 2 102722 51361 5.695 0.00373
## Residuals 309 2786733 9019
##
## as.factor(some_col) ***
## as.factor(yr_rnd) ***
## as.factor(some_col):as.factor(yr_rnd) **
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary() function gives back the degrees of freedom and the sums of squares for the main effect of ‘some_col’, the main effect of ‘yr_rnd’, their interaction, and the within-groups variation (labeled ‘Residuals’). We also see the mean squares, the F-values, and the corresponding p-values.
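The three effect rows in the table correspond to the three terms written out in the model formula. R's formula syntax also offers the * operator as shorthand for two main effects plus their interaction. The sketch below checks that both spellings give the same summary table; it uses warpbreaks, a data set that ships with R, as a stand-in rather than the education data.

```r
# warpbreaks ships with R: breaks by wool (2 levels) and tension (3 levels)
# both predictors are already stored as factors (stand-in data)
long_form = aov(breaks ~ wool + tension + wool:tension, data = warpbreaks)
short_form = aov(breaks ~ wool * tension, data = warpbreaks)

# the two tables list the same terms with the same values
summary(long_form)
summary(short_form)
```

For the education model this would presumably mean that as.factor(some_col) * as.factor(yr_rnd) could replace the longer right-hand side, though the chapter's output was produced with the long form.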
Based on our results from the ANOVA, we can see that there are significant main effects for the percentage of parents that received some college education and for year-round schooling, as well as a significant interaction between the two. Accordingly, average academic performance differs across the levels of parents’ college education, and the size of that difference depends on whether a school follows a year-round schedule.
Now, we would like to know more about the significant interaction. We perform post-hoc tests to determine which of the specific means are different from one another.
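Before running the post-hoc tests, it can help to look at the cell means that those tests will compare. The following sketch tabulates the mean of the outcome for every combination of two factors. It uses warpbreaks, a data set that ships with R, as a stand-in; with the education data the same pattern would apply to api00, some_col, and yr_rnd.

```r
# warpbreaks ships with R; wool and tension are already factors (stand-in data)
cell_means = tapply(warpbreaks$breaks,
                    list(wool = warpbreaks$wool, tension = warpbreaks$tension),
                    mean)

# one mean per combination of the two factors
round(cell_means, 2)

# difference between the two wool types at each tension level;
# if these differences are far from equal, an interaction is plausible
round(cell_means["A", ] - cell_means["B", ], 2)
```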
4.4 Tukey’s Post-Hoc Test
# use the TukeyHSD() function to run Tukey's post-hoc test on the fitted model
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = api00 ~ as.factor(some_col) + as.factor(yr_rnd) + as.factor(some_col):as.factor(yr_rnd), data = edu_data)
##
## $`as.factor(some_col)`
## diff lwr upr p adj
## 1-0 8.96207 -25.75770 43.68184 0.8159125
## 2-0 84.94696 51.68407 118.20984 0.0000000
## 2-1 75.98489 47.37104 104.59874 0.0000000
##
## $`as.factor(yr_rnd)`
## diff lwr upr p adj
## 1-0 -90.3276 -113.6334 -67.02182 0
##
## $`as.factor(some_col):as.factor(yr_rnd)`
## diff lwr upr p adj
## 1:0-0:0 -14.087179 -66.8994247 38.72507 0.9731331
## 2:0-0:0 41.136232 -6.7494899 89.02195 0.1383624
## 0:1-0:0 -167.069697 -237.9166256 -96.22277 0.0000000
## 1:1-0:0 -92.887879 -150.6261308 -35.14963 0.0000846
## 2:1-0:0 -22.850000 -91.6857250 45.98573 0.9324299
## 2:0-1:0 55.223411 12.9631793 97.48364 0.0029017
## 0:1-1:0 -152.982517 -220.1552215 -85.80981 0.0000000
## 1:1-1:0 -78.800699 -131.9663895 -25.63501 0.0004038
## 2:1-1:0 -8.762821 -73.8108175 56.28518 0.9988786
## 0:1-2:0 -208.205929 -271.5784882 -144.83337 0.0000000
## 1:1-2:0 -134.024111 -182.2993602 -85.74886 0.0000000
## 2:1-2:0 -63.986232 -125.1021137 -2.87035 0.0341273
## 1:1-0:1 74.181818 3.0710267 145.29261 0.0352191
## 2:1-0:1 144.219697 63.8368881 224.60251 0.0000070
## 2:1-1:1 70.037879 0.9306114 139.14515 0.0449087
The first two components of the output contain the tests for the marginal effects: the three pairwise differences between the marginal means of ‘some_col’, followed by the single difference between the marginal means of ‘yr_rnd’.
More specifically, we see that schools where 0%-10% of parents had some college education do not significantly differ from schools where 11%-20% of parents had some college education (the 1 vs. 0 comparison, p = .82). However, schools where more than 20% of parents had some college education scored higher than both of the other groups (the 2 vs. 0 and 2 vs. 1 comparisons, both p < .001).
Moreover, we see a main effect of year-round schooling (the 1 vs. 0 comparison in the second component): year-round schools scored about 90 points lower on average than schools that do not follow a year-round schedule.
Finally, within the third component of the post-hoc output, we see the pairwise differences between the cell means. In particular, the row labeled ‘1:0-0:0’ examines the difference between schools with 11%-20% of parents with college education and non-year-round calendars (i.e., the 1:0 part) and schools with 0%-10% of parents with college education and non-year-round calendars (i.e., the 0:0 part). We do not see a significant difference between these two cell means (p = .97). The remaining rows describe the other differences between the cell means.
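With six cells there are 15 pairwise comparisons, so it can be convenient to pull out only the rows whose adjusted p-value falls below .05. A minimal sketch of that filtering step follows. It uses warpbreaks, a data set that ships with R, as a stand-in for the education data, and the object names are ours.

```r
# warpbreaks ships with R; it has 2 x 3 = 6 cells, like the design above
fit = aov(breaks ~ wool * tension, data = warpbreaks)

# request only the comparisons between cell means
tukey = TukeyHSD(fit, which = "wool:tension")

# each component of the result is a matrix with columns diff, lwr, upr, p adj
cells = tukey[["wool:tension"]]
nrow(cells) == choose(6, 2)  # one row per pair of cells

# keep the rows with an adjusted p-value below .05, smallest first
significant = cells[cells[, "p adj"] < 0.05, , drop = FALSE]
significant[order(significant[, "p adj"]), , drop = FALSE]
```

Applied to the model in this chapter, the component name would be the one printed in the output above, that is, as.factor(some_col):as.factor(yr_rnd).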