How to perform an ANOVA and post-hoc tests using R
An in-depth tutorial on ANOVA analysis in R: assumptions, tests, base R functions, post-hoc tests, and rstatix package.
This week I made a tutorial for how to perform an ANOVA analysis using R for the Psych Rstats Club, an open science peer-based learning platform for statistics and R. Here’s the script, enjoy!
ANOVA
ANOVA, which stands for “analysis of variance,” is an extension of the independent two-sample t-test. We use ANOVA when we want to determine if the means of more than two groups are significantly different. In this case, we have several groups defined by a single grouping variable, which is sometimes referred to as the factor variable.
For this demonstration, we will be using the iris
dataset. As always, a good first step is to inspect the data:
# inspect data
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Remember we can run a line of code by click on “Run” when you’re in that line, or you can use the keyboard shortcut control+enter or command+enter in mac.
In this dataset, we have the length and the width of the sepal and the petals of the flowers of each species. In this excercise, we would like to know if the different species differ by the width of their sepals.
Before we dive into the statistical aspects of ANOVA, let’s briefly discuss its assumptions. ANOVA assumes that the data is normally distributed, the variances are unknown but equal between groups, and the samples are independent.
To check the assumption of normality, we can use the Shapiro-Wilk test. We have to run the shapiro.test()
function for the dependent variable (the sepal width) of each species, so we use a specific syntax to subset the dataset.
# shapiro test (normality assumption)
shapiro.test(iris$Sepal.Width[iris$Species == "setosa"])
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Width[iris$Species == "setosa"]
## W = 0.97172, p-value = 0.2715
# first we retrieve the values of the sepal width and then we subset those sepal width values for the setosa species
# then we repeat it for each species, copying and pasting it
shapiro.test(iris$Sepal.Width[iris$Species == "virginica"])
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Width[iris$Species == "virginica"]
## W = 0.96739, p-value = 0.1809
shapiro.test(iris$Sepal.Width[iris$Species == "versicolor"])
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Width[iris$Species == "versicolor"]
## W = 0.97413, p-value = 0.338
Based on the results of these tests, none of them yielded statistically significant results, indicating that the data is normally distributed.
To assess the assumption of homogeneity of variances, we can use the Levene’s test, from the car
package. The first argument is the formula: remember that we always use first the dependent variable, then the tilde and last the grouping variable. The second argument is the dataset.
#Levene's Test (homoscedasticity assumption)
car::leveneTest(Sepal.Width ~ Species, data = iris)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.5902 0.5555
## 147
The Levene’s test assesses whether the variance of the Sepal Width is equal across different species groups. In this case, since the p-value is not statistically significant, we can conclude that the assumption of homogeneity of variances is met.
Now that we have checked the assumptions, we can proceed with the ANOVA analysis. However, it is important to note that if your data is not normally distributed, you may consider using non-parametric tests such as the Kruskal-Wallis or Friedman tests. Additionally, if the assumption of homogeneity of variances is violated, a Welch ANOVA can be used.
Let’s perform the ANOVA analysis using base R functions:
# anova with base R
iris_anova <- aov(Sepal.Width ~ Species, data = iris)
# the function is called aov (analysisi of variance) and the formula is the same we used for the levene's test
# to see the result
summary(iris_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary of the ANOVA output shows that the ANOVA analysis is statistically significant. However, it doesn’t provide information about which specific groups differ from each other. To determine the pairwise differences, we can use post-hoc tests such as the Tukey’s Honestly Significant Difference (HSD) test, also called the Tukey correction:
# Tukey's correction
TukeyHSD(iris_anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Width ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor 0.204 0.04314472 0.3648553 0.0087802
#tukeyhsd(iris_anova) # error!
Please remember that R is case-sensitive, so it’s important to pay attention to capitalization.
The Tukey’s test reveals that all three groups are statistically different from each other. Alternatively, you can also use other post-hoc tests, such as the Bonferroni correction or Holm’s method, to adjust the p-values for multiple comparisons. T
# Bonferroni correction
# we use a function that calculates a t test, and we specify we want a bonferroni correction
# the first argument is the dependent variable and the second one is the grouping or factor variable.
pairwise.t.test(iris$Sepal.Width, iris$Species, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Sepal.Width and iris$Species
##
## setosa versicolor
## versicolor < 2e-16 -
## virginica 1.4e-09 0.0094
##
## P value adjustment method: bonferroni
You can also use Holm’s method, with the same syntax:
# Holm correction
pairwise.t.test(iris$Sepal.Width, iris$Species, p.adjust.method = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Sepal.Width and iris$Species
##
## setosa versicolor
## versicolor < 2e-16 -
## virginica 9.1e-10 0.0031
##
## P value adjustment method: holm
Although the p-values may slightly differ between these corrections, the overall conclusion remains the same.
The rstatix
package
Now, let’s explore a package called rstatix
, which simplifies the ANOVA analysis and is also compatible with tidyverse functions:
# to install the package
install.packages("rstatix")
# to load the package
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
With rstatix, we can perform the ANOVA analysis:
# anova with rstatix
anova_test(formula = Sepal.Width ~ Species, data = iris)
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Species 2 147 49.16 4.49e-17 * 0.401
The summary results provided by the anova_test
function include not only the p-value but also the effect size. By default, the anova_test
function calculates a generalized eta squared, but you can also calculate a partial eta squared if needed:
# partial eta squared
anova_test(data = iris, formula = Sepal.Width ~ Species, effect.size = "pes")
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 pes
## 1 Species 2 147 49.16 4.49e-17 * 0.401
# in this case it's the same
rstatix also provides its own function to perform the Tukey correction:
# Tukey's correction with rstatix
tukey_hsd(x = iris, formula = Sepal.Width ~ Species)
## # A tibble: 3 × 9
## term group1 group2 null.value estimate conf.low conf.high p.adj
## * <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Species setosa versicolor 0 -0.658 -0.819 -0.497 3.10e-14
## 2 Species setosa virginica 0 -0.454 -0.615 -0.293 1.36e- 9
## 3 Species versicolor virginica 0 0.204 0.0431 0.365 8.78e- 3
## # ℹ 1 more variable: p.adj.signif <chr>
Overall, these are some approaches you can use to conduct ANOVA analysis in R. I hope this explanation helps clarify the process for you. If you have any questions or need further clarification, please let me know!
As always, remember you can suscribe to my blog to stay updated, and if you have any questions, don’t hesitate to contact me. And if you like what I do, you can buy me a cafecito from Argentina or a kofi.