1

I am trying to use ddply on several columns selected by a regular expression, but I could not get it to work. I prepared a small example below. Is there a way to use ddply on several variables at once, or did I miss something in the manual?

df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))
ddply(df,.(N), summarise, low=mean("low.."), high=mean("high.."))
2
  • What does your expected output look like? Commented Nov 24, 2014 at 12:32
  • I thought this might be clear. I expect the mean between low_1 and low_2 and the mean between high_1 and high_2. So I will test your dplyr comment and I think this might help. Commented Nov 24, 2014 at 12:44

4 Answers

1

You can use colwise to calculate the same statistic on multiple columns, for example:

ddply(df, .(N), colwise(mean))

  N      low_1      low_2     high_1      high_2
1 1 -1.3105923 -0.5507862  0.6304232 -0.04553457
2 2 -0.1586676  0.6820199 -0.8220206  0.93301381
3 3  0.4434761  0.4337073 -1.2988521  0.84412693
4 4  0.2522467 -0.1393690  0.2361361  1.64288051
5 5  0.4118032  0.4358705 -0.3529169  0.98916518

To use a regular expression on the column names, you can do something like the following:

  1. Use a regular expression with grep() to identify all columns you're interested in.
  2. Extract the column number of the grouping variable.
  3. Pass a subset of the data to ddply, where the subset consists of only those columns identified in steps 1 and 2.

Try this:

idx <- grep("low", names(df))
idk <- which(names(df) == "N")
ddply(df[, c(idx, idk)], .(N), colwise(mean))

  N      low_1      low_2
1 1 -1.3105923 -0.5507862
2 2 -0.1586676  0.6820199
3 3  0.4434761  0.4337073
4 4  0.2522467 -0.1393690
5 5  0.4118032  0.4358705
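
If the goal is instead a single pooled mean over all matching columns per group (one `low` and one `high` value for each N), the same grep() idea can be combined with a custom function passed to ddply. This is only a minimal sketch under that reading of the question; the names `low_cols`, `high_cols` and `pooled` are made up for illustration.

```r
library(plyr)

set.seed(4711)
df <- data.frame(low_1 = rnorm(5), low_2 = rnorm(5),
                 high_1 = rnorm(5), high_2 = rnorm(5),
                 N = c(1, 2, 3, 4, 5))

# Select column names by regular expression
low_cols  <- grep("^low_",  names(df), value = TRUE)
high_cols <- grep("^high_", names(df), value = TRUE)

# For each N, pool every value in the matching columns and take one mean
pooled <- ddply(df, .(N), function(d) {
  data.frame(
    low  = mean(as.matrix(d[low_cols])),
    high = mean(as.matrix(d[high_cols]))
  )
})
pooled
```

Converting the subset to a matrix first means mean() sees every value from every matching column, so this should also work when a group has more than one row.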

1 Comment

My understanding of the desired output (based on OP's ddply) is that they want to calculate the mean of all "low" columns (aggregated into a single column) by group of N, and the same for all "high" columns, but I may be wrong. The dplyr equivalent of your second option, I think, would be df %>% group_by(N) %>% summarise_each(funs(mean), contains("low"))
0

As it stands, you need to pass a different argument for each statistic that you are calculating.

ddply(
  df,
  .(N), 
  summarise, 
  low_1  = mean(low_1), 
  low_2  = mean(low_2), 
  high_1 = mean(high_1), 
  high_2 = mean(high_2)
)

The idiomatic way of calculating this is to reshape your data to long format before calculating the stats.

library(plyr)
library(reshape2)
library(stringr)
df_long <- melt(df, id.vars = "N")
matches <- str_match(df_long$variable, "(low|high)_([[:digit:]])")
df_long <- within(
  df_long,
  {
    height <- matches[, 2]
    group <- as.integer(matches[, 3])
  }
)
ddply(
  df_long,
  .(N, height, group), 
  summarize, 
  mean_value = mean(value)
)

If you prefer, you can use mutate rather than within, and the call to ddply can be replaced with modern dplyr syntax.

library(dplyr)  # provides %>%, group_by and summarize; load it after plyr
df_long %>%
  group_by(N, height, group) %>%
  summarize(mean_value = mean(value))
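
With a recent tidyr (1.0.0 or later, which is an assumption about your setup), the melt and str_match steps can probably be folded into a single pivot_longer() call that splits the column names on the underscore. A minimal sketch, where `df_summary` is just an illustrative name:

```r
library(dplyr)
library(tidyr)

set.seed(4711)
df <- data.frame(low_1 = rnorm(5), low_2 = rnorm(5),
                 high_1 = rnorm(5), high_2 = rnorm(5),
                 N = c(1, 2, 3, 4, 5))

df_summary <- df %>%
  # "low_1" becomes height = "low", group = "1" (group stays a character column)
  pivot_longer(cols = -N,
               names_to = c("height", "group"),
               names_sep = "_") %>%
  group_by(N, height, group) %>%
  summarise(mean_value = mean(value), .groups = "drop")

df_summary
```

Splitting on the underscore relies on each column name containing exactly one underscore; names with more parts would need names_pattern instead.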

1 Comment

I would use colwise or numcolwise to do the same thing, without any reshaping.
0

Here's an approach with dplyr and tidyr that I think results in the desired output:

require(dplyr) # if not yet installed, first run: install.packages("dplyr")
require(tidyr) # if not yet installed, first run: install.packages("tidyr")

gather(df, group, val, -N) %>%     # reshape the data to long format
  mutate(group = gsub("_\\d+$", "", group)) %>%   # delete the trailing "_<number>" from low_x and high_x in the "group" column
  group_by(N, group) %>%           # group the data based on N and group (low/high)
  summarise(val = mean(val)) %>%   # apply the mean
  ungroup() %>%                    # ungroup the data
  spread(group, val)               # reshape to wide format so that low and high are separate columns

#Source: local data frame [5 x 3]
#
#  N        high         low
#1 1  0.29702057  0.15541153
#2 2 -1.02057669  1.09399446
#3 3  0.20745563  0.11582517
#4 4 -0.05573833 -0.22570064
#5 5  0.61697307 -0.06831203

It will work with any number of low_X and high_X columns.

Note: make sure you load dplyr after plyr to avoid function name conflicts.

data

set.seed(4711)
df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))

4 Comments

If I get this right, 'group' will use all other columns and not N? Can I specify the column names by regular expression? For example, I have imputed_value_0h_1, imputed_value_0h_2, imputed_value_2h_1 and imputed_value_2h_2 in my real dataset, but I am not able to adapt this example to my data...
You mean that you want to exclude those columns with imputed_value_XXX ? Or what do you want to do with them? It would be best if you could edit your question to include these details.
@user7601, can you clarify what you mean in your comment above? Thanks.
@user7601, if you want to remove extra columns that you have in your data, so that they are not used in the "group" column, you could do the following: df %>% select(-contains("imputed_value")) %>% gather(group, val, -N) %>% ... replace the first line of my code with this and then add the rest of the code.
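
If the intent is the opposite, that is, to keep only columns whose names match a regular expression and average them per prefix, one possible sketch uses the matches() selection helper. The data frame `dat` below is invented purely to mimic the column names mentioned in the comments, and the code assumes tidyr 1.0.0 or later for pivot_longer() and pivot_wider().

```r
library(dplyr)
library(tidyr)

# Hypothetical data that only mimics the column names from the comments
set.seed(1)
dat <- data.frame(
  N = 1:3,
  imputed_value_0h_1 = rnorm(3),
  imputed_value_0h_2 = rnorm(3),
  imputed_value_2h_1 = rnorm(3),
  imputed_value_2h_2 = rnorm(3),
  other = letters[1:3]          # an unrelated column that should be ignored
)

res <- dat %>%
  # keep N plus every column whose name matches the regular expression
  select(N, matches("^imputed_value_\\d+h_\\d+$")) %>%
  pivot_longer(cols = -N, names_to = "group", values_to = "val") %>%
  # strip the trailing replicate number: imputed_value_0h_1 -> imputed_value_0h
  mutate(group = sub("_\\d+$", "", group)) %>%
  group_by(N, group) %>%
  summarise(val = mean(val), .groups = "drop") %>%
  pivot_wider(names_from = group, values_from = val)

res
```

The result has one row per N and one column per prefix (here imputed_value_0h and imputed_value_2h).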
0

You can do something like this:

ddply(df, .(N), summarise,
      low  = mean(sapply(grep("low",  colnames(df), value = TRUE), function(x) get(x))),
      high = mean(sapply(grep("high", colnames(df), value = TRUE), function(x) get(x))))

which gives this output:

  N         low        high
1 1  0.94613752  1.47197645
2 2 -0.68887596 -0.05779876
3 3 -0.28589753 -0.55694341
4 4 -0.01378869  0.28204629
5 5 -0.08681600  0.88544497

data:

> dput(df)
structure(list(low_1 = c(0.885675347945903, -1.30343272566325, -2.44201300062675, -1.27709377574332, -0.794159839824383), 
               low_2 = c(1.00659968581264,-0.0743191876393787, 1.87021794472605, 1.24951638739919, 0.620527846366092), 
               high_1 = c(0.630374573470948, 0.169009703225843, -0.573629421621814, 0.340752780334754, 0.417022085050569), 
               high_2 = c(2.31357832822303,-0.284607218026423, -0.540257400090053, 0.223339795927736, 1.35386785598766), 
               N = c(1, 2, 3, 4, 5)), 
               .Names = c("low_1", "low_2", "high_1", "high_2", "N"), 
               row.names = c(NA, -5L), class = "data.frame")
