--- title: "The tidycat package: expand broom::tidy() output for categorical parameter estimates" output: html_document: fig_caption: false toc: true toc_float: collapsed: false smooth_scroll: true toc_depth: 2 vignette: > %\VignetteIndexEntry{The tidycat package: expand broom::tidy() outputs for categorical parameter estimates} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction The `tidycat` package includes the `tidy_categorical()` function to expand `broom::tidy()` outputs for categorical parameter estimates. ## Installation You can install the released version of tidycat from [CRAN](https://CRAN.R-project.org) with: ``` r install.packages("tidycat") ``` And the development version from [GitHub](https://github.com/) with: ``` r # install.packages("devtools") devtools::install_github("guyabel/tidycat") ``` ## Additional columns for categorical parameter estimates The `tidy()` function in the broom package takes the messy output of built-in functions in R, such as `lm()`, and turns them into tidy data frames. ```{r, message=FALSE, warning=FALSE} library(dplyr) library(broom) m0 <- esoph %>% mutate_if(is.factor, ~factor(., ordered = FALSE)) %>% glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp, data = ., family = binomial()) # tidy tidy(m0) ``` Note: Currently ordered factor not supported in `tidycat`, hence their removal in `mutate_if()` above The `tidy_categorical()` function adds further columns (`variable`, `level` and `effect`) to the `broom::tidy()` output to help manage categorical variables ```{r} library(tidycat) m0 %>% tidy() %>% tidy_categorical(m = m0, include_reference = FALSE) ``` ## Additional rows for reference categories Include additional rows for reference category terms and a column to indicate their location by setting `include_reference = TRUE` (default). Setting `exponentiate = TRUE` ensures the parameter estimates in the reference group are set to one instead of zero (even odds in the logistic regression example below). ```{r} m0 %>% tidy(exponentiate = TRUE) %>% tidy_categorical(m = m0, exponentiate = TRUE, reference_label = "Baseline") %>% select(-statistic, -p.value) ``` ## Standard coefficient plots The results from `broom::tidy()` can be used to quickly plot estimated coefficients and their confidence intervals. ```{r, fig.width=6, fig.height=4} # store parameter estimates and confidence intervals (except for the intercept) d0 <- m0 %>% tidy(conf.int = TRUE) %>% slice(-1) d0 library(ggplot2) library(tidyr) ggplot(data = d0, mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange() ``` ## Enhanced coefficient plots The additional columns from `tidy_categorical()` can be used to group together terms from the same categorical variable by setting `colour = variable` ```{r, fig.width=6, fig.height=4} d0 <- m0 %>% tidy(conf.int = TRUE) %>% tidy_categorical(m = m0, include_reference = FALSE) %>% slice(-1) d0 %>% select(-(3:5)) ggplot(data = d0, mapping = aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, colour = variable)) + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange() ``` The additional rows from `tidy_categorical()` can be used to include the reference categories in a coefficient plot, allowing the reader to better grasp the meaning of the parameter estimates in each categorical variable. Using `ggforce::facet_col()` the terms of each variable can be separated to further improve the presentation of the coefficient plot. ```{r, fig.width=6, fig.height=4} d0 <- m0 %>% tidy(conf.int = TRUE) %>% tidy_categorical(m = m0) %>% slice(-1) d0 %>% select(-(3:5)) library(ggforce) ggplot(data = d0, mapping = aes(x = level, y = estimate, colour = reference, ymin = conf.low, ymax = conf.high)) + facet_col(facets = vars(variable), scales = "free_y", space = "free") + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange() ``` Note the switch of the `x` aesthetic to the `level` column rather than `term`. Alternatively, horizontal plots can be obtained using `ggforce::facet_row()` and loosing `coord_flip()`; ```{r, fig.width=6, fig.height=4} ggplot(data = d0, mapping = aes(x = level, y = estimate, ymin = conf.low, ymax = conf.high, colour = reference)) + facet_row(facets = vars(variable), scales = "free_x", space = "free") + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` ## Interactions Models with interactions can also be handled in `tidy_categorical()`. Using the `mtcars` data we can create three types of interactions (between two numeric variables, between a numeric variable and categorical variable and between two categorical variables) ```{r} m1 <- mtcars %>% mutate(engine = recode_factor(vs, `0` = "straight", `1` = "V-shaped"), transmission = recode_factor(am, `0` = "automatic", `1` = "manual")) %>% lm(mpg ~ as.factor(cyl) + wt * hp + wt * transmission + engine * transmission , data = .) tidy(m1) ``` Setting `n_level = TRUE` creates an additional column to monitor the number of observations in each of level of the categorical variables, including interaction terms in the model: ```{r} d1 <- m1 %>% tidy(conf.int = TRUE) %>% tidy_categorical(m = m1, n_level = TRUE) %>% slice(-1) d1 %>% select(-(2:7)) ``` We can use similar plotting code as above to plot the interactions: ```{r, fig.width=6, fig.height=6} ggplot(data = d1, mapping = aes(x = level, y = estimate, colour = reference, ymin = conf.low, ymax = conf.high)) + facet_col(facets = "variable", scales = "free_y", space = "free") + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange() ``` The empty levels can be dropped by filtering on the `n_level` column for categories with more than zero observations and not `NA` in term column. ```{r, fig.width=6, fig.height=5} d1 %>% dplyr::filter(n_level > 0 | !is.na(term)) %>% ggplot(mapping = aes(x = level, y = estimate, colour = reference, ymin = conf.low, ymax = conf.high)) + facet_col(facets = "variable", scales = "free_y", space = "free") + coord_flip() + geom_hline(yintercept = 0, linetype = "dashed") + geom_pointrange() ``` ## Issues If you have any trouble or suggestions please let me know by creating an issue on the [tidycat Github](https://github.com/guyabel/tidycat/issues)