A comparison of formula and tidyverse syntaxes
Amelia McNamara @AmeliaMN
This talk is based on a paper I wrote, which is available as a pre-print via arXiv, https://arxiv.org/abs/2201.12960
Penguin data and penguin art by Allison Horst.
base
formula
tidyverse
base
library(palmerpenguins)  # provides the penguins data
par(mfrow = c(1, 3))     # three panels side by side
plot(penguins$flipper_length_mm[penguins$species == "Adelie"],
     penguins$bill_length_mm[penguins$species == "Adelie"])
plot(penguins$flipper_length_mm[penguins$species == "Chinstrap"],
     penguins$bill_length_mm[penguins$species == "Chinstrap"])
plot(penguins$flipper_length_mm[penguins$species == "Gentoo"],
     penguins$bill_length_mm[penguins$species == "Gentoo"])
# apply solution?
formula
tidyverse
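The formula and tidyverse versions of this plot did not survive in this text. As a rough sketch of what they might look like (assuming the ggformula package for the formula syntax and ggplot2 for the tidyverse syntax, and that penguins comes from palmerpenguins; the object names p_formula and p_tidy are only for illustration):

```r
library(palmerpenguins)  # assumed source of the penguins data
library(ggformula)       # formula-style interface built on ggplot2
library(ggplot2)

# formula syntax: y ~ x | facetting variable
p_formula <- gf_point(bill_length_mm ~ flipper_length_mm | species,
                      data = penguins)

# tidyverse syntax: aesthetics mapped in aes(), facets added as a layer
p_tidy <- ggplot(penguins,
                 aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point() +
  facet_wrap(~species)
```

Printing either object draws one panel per species, which is what the base version needs three separate plot() calls to do.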
base
formula/tidyverse
There is debate in the statistics education community about which syntax is best to teach, particularly for the “first course.” All three syntaxes have their proponents, but the folks most interested in pedagogy tend to argue about whether formula or tidyverse is best.
mosaic package, which Horton coauthored with Randy Pruim and Danny Kaplan.
tidyverse syntax and one to use formula syntax
| | formula | tidyverse |
|---|---|---|
| No | 10 | 9 |
| Yes, but not with R | 2 | 4 |
I was able to gather lots of data:
The getParseData() function from utils allows you to parse R code and find all sorts of things about the parse tree. I just filtered for functions.
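A minimal sketch of that filtering step, assuming the code to analyze is available as a string (the snippet in code is a made-up example, not the actual lab material):

```r
# Hypothetical snippet of lab code to analyze
code <- "
library(mosaic)
set.seed(42)
m <- lm(bill_length_mm ~ flipper_length_mm, data = penguins)
summary(m)
"

# keep.source = TRUE is needed so the parse data is retained
parsed <- parse(text = code, keep.source = TRUE)
pd <- utils::getParseData(parsed)

# Function calls are tagged with the token SYMBOL_FUNCTION_CALL
funs <- sort(unique(pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]))
funs
```

The result is a character vector of the distinct function names called in the snippet.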
Like “Enough R for Intro Stats” by Randy Pruim.
The formula section saw a total of 37 functions and the tidyverse section saw 50, with an overlap of 18 functions between the two sections.
Neither of these numbers is very large!
The functions both sections of students saw included helper functions like library(), set.seed(), and set() (a function in the knitr options included at the top of each RMarkdown document), statistics like mean(), sd(), and cor(), and modeling-related functions like aov(), lm(), summary(), and predict().
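Once each section's functions are in a character vector, the overlap can be worked out with base set operations. The vectors below are hypothetical and much shorter than the real lists; they only illustrate the bookkeeping:

```r
# Hypothetical function lists for illustration only (not the course data)
formula_funs   <- c("library", "set.seed", "mean", "lm", "tally", "gf_point")
tidyverse_funs <- c("library", "set.seed", "mean", "lm",
                    "summarize", "group_by", "ggplot")

shared       <- intersect(formula_funs, tidyverse_funs)  # seen by both
only_formula <- setdiff(formula_funs, tidyverse_funs)
only_tidy    <- setdiff(tidyverse_funs, formula_funs)

# Distinct functions overall: size of A + size of B - size of the overlap
n_distinct <- length(formula_funs) + length(tidyverse_funs) - length(shared)
```

Applying the same arithmetic to the counts reported above gives 37 + 50 - 18 = 69 distinct functions across the two sections.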
Dealing with missing data.
Solutions:
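The solutions from the slide are not reproduced in this text. As one hedged illustration of how missing values are commonly handled in each syntax (assuming the palmerpenguins data and the mosaic and dplyr packages; this is not necessarily what was taught in either section):

```r
library(palmerpenguins)  # assumed source of the penguins data
library(dplyr)
library(mosaic)

# base: drop NAs inside the summary function
base_mean <- base::mean(penguins$bill_length_mm, na.rm = TRUE)

# formula: same argument, formula interface from mosaic
formula_mean <- mosaic::mean(~bill_length_mm, data = penguins, na.rm = TRUE)

# tidyverse: filter out the missing rows, then summarize
tidy_mean <- penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  summarize(avg = mean(bill_length_mm)) %>%
  pull(avg)
```

All three approaches give the same mean; they differ in whether the missing values are removed inside the summary function or in a separate step.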
Explaining things with two categorical variables (two-way tables, inference)
2-sample test for equality of proportions with continuity correction
data: tally(island ~ sex)
X-squared = 2.8876e-30, df = 1, p-value = 1
alternative hypothesis: two.sided
95 percent confidence interval:
-0.1248439 0.1147680
sample estimates:
prop 1 prop 2
0.5673759 0.5724138
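The call that produced this output is not shown. The sample estimates are consistent with the penguins data restricted to Biscoe and Dream islands, so here is a sketch in base R under that assumption (the subsetting step and the object names two_islands, tab, and res are my guesses, not taken from the source):

```r
library(palmerpenguins)  # assumed source of the penguins data

# Assumption: keep only two islands and drop penguins with unknown sex
two_islands <- droplevels(
  subset(penguins, island %in% c("Biscoe", "Dream") & !is.na(sex))
)

# Rows are the groups (sex); first column is treated as "success" (Biscoe)
tab <- table(two_islands$sex, two_islands$island)

res <- prop.test(tab)  # continuity correction is on by default
res$estimate
```

With mosaic loaded, the formula interface lets the same test be written directly from a formula such as island ~ sex, which is where the tally(island ~ sex) label in the output above comes from.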
Not much difference between sections
Fit a linear mixed-effects model, using month as a categorical variable. The lme4 package doesn’t provide p-values, but we can look at confidence intervals.
| | 2.5% | 97.5% |
|---|---|---|
| .sig01 | 3.3 | 6.1 |
| .sigma | 4.5 | 5.9 |
| (Intercept) | 8.4 | 14.4 |
| sectiontidyverse | -6.2 | 2.2 |
| monthOctober | 1.2 | 7.5 |
| monthNovember | -4.9 | 1.5 |
| monthDecember | -5.5 | 0.9 |
| sectiontidyverse:monthOctober | 0.4 | 9.4 |
| sectiontidyverse:monthDecember | 0.7 | 9.7 |
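The model call itself is not shown, so the following is only a sketch of how intervals with these row names could be produced with lme4. It uses simulated stand-in data; the response y, the grouping variable student, and the random-intercept structure are all assumptions made for illustration:

```r
library(lme4)

set.seed(1)
months <- c("September", "October", "November", "December")

# Simulated stand-in data: 24 hypothetical students, one row per month
d <- expand.grid(student = factor(1:24),
                 month = factor(months, levels = months))
d$section <- ifelse(as.integer(d$student) <= 12, "formula", "tidyverse")

u <- rnorm(24, sd = 4)  # student-level random intercepts
d$y <- 10 + u[as.integer(d$student)] + rnorm(nrow(d), sd = 5)

# Section-by-month interaction with a random intercept per student (assumed)
fit <- lmer(y ~ section * month + (1 | student), data = d)

# Profile confidence intervals; .sig01 and .sigma are the random-intercept
# and residual standard deviations
ci <- confint(fit)
round(ci, 1)
```

An interval that excludes zero, such as the two interaction rows in the table above, is the informal stand-in for a significant effect here.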
It depends!
I used formula this past semester when I taught the R labs again. Most students won’t go on to another stat class, and are mostly “users” of R rather than “doers.”
If most of my students were stat majors, I would definitely teach tidyverse. If it was intro data science, I would teach tidyverse.
My big suggestion is to be consistent.
Constraints breed creativity!
Use getParseData() to learn how many functions you are using, and streamline.
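For course materials written in RMarkdown, one way to do this is to extract the code chunks first and then tally the function calls. This is a sketch, assuming knitr is installed; count_functions is a hypothetical helper name, not part of any package:

```r
library(knitr)

# Hypothetical helper: tally function calls in one RMarkdown file
count_functions <- function(rmd_path) {
  # Extract just the R code from the chunks into a temporary script
  r_path <- tempfile(fileext = ".R")
  purl(rmd_path, output = r_path, documentation = 0, quiet = TRUE)

  # Parse with source references kept so getParseData() has data to return
  pd <- utils::getParseData(parse(r_path, keep.source = TRUE))
  calls <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]

  # Most frequently used functions first
  sort(table(calls), decreasing = TRUE)
}
```

Calling count_functions("lab01.Rmd") on a lab file (the file name is a placeholder) returns a frequency table; its length is the number of distinct functions students meet in that document.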