# Different ways of calculating rowmeans on selected variables in a tidyverse framework

R
Tidyverse


Author

Gert Stulp

Published

March 24, 2018

Recently, I have been trying to force myself to do all my coding in one consistent framework. In practice this means switching completely to `tidyverse`/`dplyr`-style coding, mostly for reasons of readability and teaching, although it also has a slight OCD-component for myself: I find it weird to use the `apply`-family of functions within `dplyr`-pipes. Mostly, this transition is effortless. However, when it comes to applying functions to “rows” (values spread across different columns on one row), the process has been painful. It took me a long time to figure out how to do it, and I am still not entirely confident in my abilities. I understand the tidy data principles, but something like calculating row means, or counting how many variables have missing values for a particular individual (row), is so common that I imagined it would have a bigger role in the `dplyr`-framework.

I’ll start off by calculating row means in several different ways within the `dplyr`-programming framework, both for all variables and for a selection of variables. Note: calculating row means is of course rather easy, and `rowMeans` is the obvious choice; I have chosen this example precisely because it makes it easy to see what is going on in the other approaches.

### packages

```r
library(tidyverse)
```

### data

We’ll use this dataframe:

```r
df <- data.frame(
  var1 = c(1, 2),
  var2 = c(2, 4),
  var3 = c(3, 6)
)
select_vars <- c("var2", "var3")
```

### rowmeans

The easiest way is to use the base R `rowMeans` function:

```r
# all_of() marks select_vars as an external character vector of column names
df %>% mutate(mean_all = rowMeans(.),
              mean_sel = rowMeans(select(., all_of(select_vars))))
```
```
  var1 var2 var3 mean_all mean_sel
1    1    2    3        2      2.5
2    2    4    6        4      5.0
```

Easy. Here I would also consider `rowMeans(.[select_vars])` but that clashes with my own rule described above to try to get everything within one framework. `rowMeans` should be the preferred choice I think.
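The same pattern also covers the other common row operation mentioned in the introduction, counting missing values per row. This is a minimal sketch, not part of the original analysis: the data frame `df_na` and the column name `n_missing` are made up for illustration.

```r
library(dplyr)

# Hypothetical example data with some missing values
df_na <- data.frame(
  var1 = c(1, NA),
  var2 = c(2, 4),
  var3 = c(NA, 6)
)
select_vars <- c("var2", "var3")

res_na <- df_na %>%
  mutate(
    # number of missing values among the selected variables, per row
    n_missing = rowSums(is.na(select(., all_of(select_vars)))),
    # row mean of the selected variables, ignoring missing values
    mean_sel = rowMeans(select(., all_of(select_vars)), na.rm = TRUE)
  )
```

Inside `mutate`, `.` still refers to the data frame that entered the pipe, so `n_missing` itself is not part of the selection in the second line.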

### apply

We can do the same using the `apply`-functions. This is just for illustration: there is no reason to do this, because `rowMeans` does an excellent (and faster) job.

```r
df %>% mutate(mean_all = apply(., 1, mean),
              mean_sel = apply(select(., all_of(select_vars)), 1, mean))
```
```
  var1 var2 var3 mean_all mean_sel
1    1    2    3        2      2.5
2    2    4    6        4      5.0
```

Although this is again fine, I’d rather not use this option, because then, for instance in teaching, you would also have to explain a bit about the apply functions. I think it is easier to keep everything within one framework, although this is certainly not required.

So let’s try approaches that are more “consistent” with the `tidyverse`-framework: `pmap` and `rowwise`. It took me a long time to get this to work; the `pmap`-documentation is not (yet) very extensive.

### pmap

The `map`-functions from `purrr` seem handy in this respect.

This is the solution, and we’ll see a breakdown later (mostly for myself because I didn’t quite understand it initially).

```r
df %>% mutate(mean_all = pmap_dbl(., function(...) mean(c(...))),
              mean_sel = pmap_dbl(select(., all_of(select_vars)),
                                  function(...) mean(c(...))))
```
```
  var1 var2 var3 mean_all mean_sel
1    1    2    3        2      2.5
2    2    4    6        4      5.0
```

Let’s see what is going on by looking at the easier cases of `map` and `map2`. The `map`-function applies a function to each element of a list or vector. `map` only iterates over one list. Inside `mutate`, you can pass a column of the dataframe as the first argument, and each value in that column is treated as one element.

#### map

```r
df %>% mutate(mean_var2 = map_dbl(var2, mean))
```
```
  var1 var2 var3 mean_var2
1    1    2    3         2
2    2    4    6         4
```

It is obviously not very sensible to calculate a mean from one number. So let’s try two numbers:

#### map2

With `map2` you can specify two lists, again by using variable names from the dataframe, and `map2` will iterate through all the “rows” (or elements of the lists) in parallel. If we do the following:

```r
df %>% mutate(mean_var2_3 = map2_dbl(var2, var3, mean))
```
```
  var1 var2 var3 mean_var2_3
1    1    2    3           2
2    2    4    6           4
```

The code works, but we don’t get the result we want: we get the value of var2 for each row, not the mean of var2 and var3. This is because `mean` does not treat its second argument as data. `map2_dbl` calls `mean(var2, var3)` for each row, and the second positional argument of the default `mean` method is `trim`, not another number to average. All the data has to be passed in the first argument, so only the mean of var2 is calculated (for each element in the list of var2).
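A quick check outside the pipe makes this visible. This is a minimal sketch using the values from the first row; the variable names `m_two_args` and `m_combined` are made up for illustration.

```r
# Default method: mean(x, trim = 0, na.rm = FALSE, ...)
m_two_args <- mean(2, 3)     # 3 is matched to `trim`, so only 2 is averaged
m_combined <- mean(c(2, 3))  # both numbers are passed together in `x`
```

Only the second call averages both numbers, which is what the row mean needs.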

Let’s improve:

```r
df %>% mutate(mean_var2_3 = map2_dbl(var2, var3, function(x, y) mean(c(x, y))))
```
```
  var1 var2 var3 mean_var2_3
1    1    2    3         2.5
2    2    4    6         5.0
```

This does what we want! The trick is that in the previous case, `mean` didn’t see the two arguments (var2 and var3) as numbers to average. So now we specify explicitly that `var2` and `var3` need to be combined into one vector (by using `c()`). The `mean` function then has only one argument passed to it, namely the combination of the numbers in var2 and var3 (for each row!).

#### pmap

The `pmap` call at the start of this section makes more sense now (at least to me). If we want to make use of more than two variables (or lists), we can use the `pmap` function. We use `...` rather than specifying the columns/lists separately and, in the end, combine all the variables with `c()`. The nice thing is that we can now also use the `select_vars` information that we had stored:

```r
df %>% mutate(mean_all = pmap_dbl(., function(...) mean(c(...))),
              mean_sel = pmap_dbl(select(., all_of(select_vars)),
                                  function(...) mean(c(...))))
```
```
  var1 var2 var3 mean_all mean_sel
1    1    2    3        2      2.5
2    2    4    6        4      5.0
```
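Because `pmap` passes each column to the function as a named argument, the function can also name the columns it needs instead of collecting everything with `...`. This is a minimal sketch of that variant; the names `res_named` and `mean_named` are made up for illustration.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  var1 = c(1, 2),
  var2 = c(2, 4),
  var3 = c(3, 6)
)

# Argument names must match the column names of the data frame
res_named <- df %>%
  mutate(mean_named = pmap_dbl(., function(var1, var2, var3) {
    mean(c(var1, var2, var3))
  }))
```

This is more explicit, but it gives up the flexibility of `...`: the function has to list every column that is passed in, so it does not combine as easily with `select_vars`.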

### rowwise

dplyr also has its `rowwise` function. It works much less well than you would expect from such a simple-sounding function. This is the “solution”:

```r
all_vars_quo <- quo(c(var1, var2, var3))
select_vars_quo <- quo(c(var2, var3))

df %>%
  rowwise() %>%
  mutate(mean_all = mean(!!all_vars_quo),
         mean_sel = mean(!!select_vars_quo))
```
```
# A tibble: 2 × 5
# Rowwise:
   var1  var2  var3 mean_all mean_sel
  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1     1     2     3        2      2.5
2     2     4     6        4      5
```

I don’t think this is ideal at all. This may be due to my limited understanding of quosures and non-standard evaluation (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html). I also wonder whether `rowwise` is really meant for the tricks we’re after; it seems strongly tied to `do`, which appears to be used mostly for (statistical) model building.

Also, a huge benefit for me is the “readability” of tidyverse/dplyr R-code (for myself and for teaching). With the `!!` and `quo` the code is much less readable, I think. It is also unfortunate that we can’t use the `select_vars <- c("var2", "var3")` bit that we have been able to use in all the other approaches.

I think that it would be ideal if there was a `rowwise` function that would work in this way (not unlike `mutate_at`). Note that this is hypothetical code, not something that works:

```r
# Hypothetical syntax: rowwise() does not work this way
df %>%
  rowwise() %>%
  mutate(mean_all = mean(.),
         mean_sel = mean(select(., select_vars)))
```

I am probably not understanding something fundamental about non-standard evaluation and I probably fail to see what `rowwise` is for, but I do think the above code reads more like the rowMeans, apply, and pmap coding. If anyone can tell me why this is not an option, I would very much like to hear! Alternatively, I can just not worry, be grateful for the alternatives, and move along.
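For what it is worth, dplyr 1.0.0 and later provide `c_across()`, which comes close to this wish and accepts a character vector of column names. This is a minimal sketch, assuming dplyr >= 1.0.0 is installed; `all_vars` and `res_rowwise` are names introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  var1 = c(1, 2),
  var2 = c(2, 4),
  var3 = c(3, 6)
)
all_vars <- c("var1", "var2", "var3")  # hypothetical helper vector
select_vars <- c("var2", "var3")

res_rowwise <- df %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into a vector
  mutate(mean_all = mean(c_across(all_of(all_vars))),
         mean_sel = mean(c_across(all_of(select_vars)))) %>%
  ungroup()
```

Naming the columns for `mean_all` explicitly, rather than using `everything()`, keeps columns created earlier in the same `mutate` out of later selections.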

### testing for speed

Similar to the post by Winston Chang (https://rpubs.com/wch/200398), I will now test these alternatives for speed (while shamelessly copying Chang’s code):

```r
# https://rpubs.com/wch/200398
run_benchmark <- function(nrow) {
  # Make some data
  df <- data.frame(
    x = rnorm(nrow),
    y = runif(nrow),
    z = runif(nrow)
  )

  res <- list(
    rowMeans = system.time(mutate(df, mean_all = rowMeans(df))),
    apply = system.time(mutate(df, mean_all = apply(df, 1, mean))),
    pmap = system.time(
      mutate(df, mean_all = pmap_dbl(df, function(...) mean(c(...))))),
    # as.list is faster (https://rpubs.com/wch/200398)
    pmap_list = system.time(
      mutate(df, mean_all = pmap_dbl(as.list(df), function(...) mean(c(...))))),
    rowwise = system.time(
      mutate(rowwise(df), mean_all = mean(c(x, y, z))))
  )

  # Get elapsed times
  res <- lapply(res, `[[`, "elapsed")

  res <- c(nrow = nrow, res)
  res
}

# Run the benchmarks for various size data
all_times <- lapply(1:5, function(n) {
  run_benchmark(10^n)
})

# Convert to data frame
times <- lapply(all_times, as.data.frame)
times <- do.call(rbind, times)
```

### visualise results

```r
times_long <- times %>% gather(rowMeans, apply, pmap, pmap_list, rowwise,
                               key = "Function", value = "Time")

ggplot(times_long, aes(x = nrow, y = Time,
                       colour = reorder(Function, Time, mean))) +
  geom_point() +
  geom_line() +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme_minimal() +
  labs(x = "Number of rows", y = "Time (seconds)")
```

`rowMeans` is a clear winner, but it obviously cannot be used when you need to compute something other than a mean for each row. `apply` is a bit faster than the `pmap`-functions. `pmap` performs better than `pmap_list` (which is the opposite of the situation in Chang’s post). `rowwise` seems to be the least ideal alternative, both in terms of the time needed and the code. For functions other than calculating means, I will probably stick to (`p`)`map`, because that also lends itself better to statistical modelling.