Using regex in ddply variables

Question

I am trying to use ddply on several columns selected with a regular expression, but I could not get it to work. I prepared a small example below. Is there a way to use ddply on several variables at once, or did I miss something in the manual?

df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))
ddply(df,.(N), summarise, low=mean("low.."), high=mean("high.."))

I thought this might be clear: I expect the mean of low_1 and low_2, and the mean of high_1 and high_2. I will test your dplyr suggestion; I think it might help.
– drmariod, Commented Nov 24, 2014 at 12:44

Andrie · Accepted Answer · 2014-11-24 12:33:15Z

1

You can use colwise to calculate the same statistic on multiple columns, for example:

ddply(df, .(N), colwise(mean))

  N      low_1      low_2     high_1      high_2
1 1 -1.3105923 -0.5507862  0.6304232 -0.04553457
2 2 -0.1586676  0.6820199 -0.8220206  0.93301381
3 3  0.4434761  0.4337073 -1.2988521  0.84412693
4 4  0.2522467 -0.1393690  0.2361361  1.64288051
5 5  0.4118032  0.4358705 -0.3529169  0.98916518

To use a regular expression on the column names, you can do something like the following:

1. Use a regular expression with grep() to identify all columns you're interested in.
2. Extract the column number of the grouping variable.
3. Pass a subset of the data to ddply, where the subset consists of only those columns identified in steps 1 and 2.

Try this:

idx <- grep("low", names(df))
idk <- which(names(df) == "N")
ddply(df[, c(idx, idk)], .(N), colwise(mean))

  N      low_1      low_2
1 1 -1.3105923 -0.5507862
2 2 -0.1586676  0.6820199
3 3  0.4434761  0.4337073
4 4  0.2522467 -0.1393690
5 5  0.4118032  0.4358705
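
The three steps above can also be sketched without plyr. The following is a minimal illustration, not part of the original answer: it swaps base R's aggregate() in for ddply() and colwise(), and it rebuilds a df in the same shape as the question's (the seed is an arbitrary choice to make the example reproducible).

```r
# Example data in the same shape as the question's df (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(low_1 = rnorm(5), low_2 = rnorm(5),
                 high_1 = rnorm(5), high_2 = rnorm(5),
                 N = c(1, 2, 3, 4, 5))

# Step 1: find the columns whose names match the regular expression
idx <- grep("^low", names(df))

# Steps 2 and 3: keep only the matched columns and average them per group of N.
# aggregate() is used here in place of ddply(..., colwise(mean)).
low_means <- aggregate(df[, idx, drop = FALSE], by = list(N = df$N), FUN = mean)
low_means
```

Anchoring the pattern with ^ avoids accidentally matching a column such as "below_1"; drop the anchor if substring matching is what you want.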

edited Nov 24, 2014 at 12:33

answered Nov 24, 2014 at 12:27

Andrie

1 Comment

talat Over a year ago

My understanding of the desired output (based on OP's ddply call) is that they want the mean of all "low" columns (aggregated into a single column) by group of N, and the same for all "high" columns, but I may be wrong. The dplyr equivalent of your second option, I think, would be df %>% group_by(N) %>% summarise_each(funs(mean), contains("low"))

Richie Cotton · Accepted Answer · 2014-11-24 12:23:14Z

0

As it stands, you need to pass a different argument for each statistic that you are calculating.

ddply(
  df,
  .(N), 
  summarise, 
  low_1  = mean(low_1), 
  low_2  = mean(low_2), 
  high_1 = mean(high_1), 
  high_2 = mean(high_2)
)

The idiomatic way of calculating this is to reshape your data to long format before calculating the stats.

library(plyr)
library(reshape2)
library(stringr)
df_long <- melt(df, id.vars = "N")
matches <- str_match(df_long$variable, "(low|high)_([[:digit:]])")
df_long <- within(
  df_long,
  {
    height <- matches[, 2]
    group <- as.integer(matches[, 3])
  }
)
ddply(
  df_long,
  .(N, height, group), 
  summarize, 
  mean_value = mean(value)
)
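
The key step above is the str_match() call, which splits each column name into its prefix and its index. As a rough sketch (not from the original answer), the same split can be done with base R's sub() and backreferences; the vars vector below is just the question's column names typed out by hand.

```r
# Column names from the question, written out by hand for illustration
vars <- c("low_1", "low_2", "high_1", "high_2")

# Two capture groups: the prefix (low or high) and the numeric index
pattern <- "^(low|high)_([[:digit:]]+)$"

# "\\1" keeps only the first capture group, "\\2" only the second
height <- sub(pattern, "\\1", vars)
group  <- as.integer(sub(pattern, "\\2", vars))

data.frame(variable = vars, height = height, group = group)
```

Unlike the single-digit pattern used with str_match(), this pattern uses + so that indices such as 10 are also captured in full.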

If you prefer, you can use mutate rather than within, and the call to ddply can be replaced with modern dplyr syntax.

library(dplyr)  # load after plyr so that dplyr's summarize is the one used

df_long %>%
  group_by(N, height, group) %>%
  summarize(mean_value = mean(value))

answered Nov 24, 2014 at 12:23

Richie Cotton

1 Comment

Andrie Over a year ago

I would use colwise or numcolwise to do the same thing, without any reshaping.

talat · Accepted Answer · 2014-11-24 13:28:02Z

0

Here's an approach with dplyr and tidyr that I think results in the desired output:

require(dplyr) # if not yet installed, first run: install.packages("dplyr")
require(tidyr) # if not yet installed, first run: install.packages("tidyr")

gather(df, group, val, -N) %>%     # reshape the data to long format
  mutate(group = gsub("_\\d+$", "", group)) %>%    # strip the trailing _<number> from low_x and high_x in the "group" column
  group_by(N, group) %>%           # group the data based on N and group (low/high)
  summarise(val = mean(val)) %>%   # apply the mean
  ungroup() %>%                    # ungroup the data
  spread(group, val)               # reshape to wide format so that low and high are separate columns

#Source: local data frame [5 x 3]
#
#  N        high         low
#1 1  0.29702057  0.15541153
#2 2 -1.02057669  1.09399446
#3 3  0.20745563  0.11582517
#4 4 -0.05573833 -0.22570064
#5 5  0.61697307 -0.06831203

It will work with any number of low_X and high_X columns.
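
To see why, here is a small sketch of the name-stripping step on its own. It uses sub() with a pattern that removes a trailing underscore and digits, applied to a hand-written vector of column names; low_10 is a hypothetical example, and the imputed_value_* names are borrowed from the comments below.

```r
# Hypothetical column names: any prefix followed by "_<number>"
cols <- c("low_1", "low_2", "low_10", "high_1",
          "imputed_value_0h_1", "imputed_value_2h_2")

# Remove only the trailing "_<digits>", leaving the rest of the name intact
prefix <- sub("_\\d+$", "", cols)

# Columns that share a prefix end up in the same group
groups <- split(cols, prefix)
groups
```

Because the pattern is anchored at the end of the name with $, digits inside the prefix (such as the 0 in imputed_value_0h) are left alone.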

Note: make sure you load dplyr after plyr to avoid function name conflicts.

data

set.seed(4711)
df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))

edited Nov 24, 2014 at 13:28

answered Nov 24, 2014 at 12:52

talat

4 Comments

drmariod Over a year ago

If I get this right, 'group' will use all other columns and not N? Can I specify the column names by regular expression? For example, I have imputed_value_0h_1, imputed_value_0h_2, imputed_value_2h_1 and imputed_value_2h_2 in my real dataset, but I am not able to adapt this example to my data...

talat Over a year ago

You mean that you want to exclude those columns with imputed_value_XXX? Or what do you want to do with them? It would be best if you could edit your question to include these details.

talat Over a year ago

@user7601, Can you clarify what you mean in your comment above? thanks

talat Over a year ago

@user7601, if you want to remove extra columns from your data so that they are not used in the "group" column, you could do the following: df %>% select(-contains("imputed_value")) %>% gather(group, val, -N) %>% ... Replace the first line of my code with this, then add the rest of the code.

Cath · Accepted Answer · 2014-11-24 13:57:26Z

You can do something like:

ddply(df, .(N), summarise,
      low  = mean(sapply(grep("low", colnames(df), value = TRUE), function(x) get(x))),
      high = mean(sapply(grep("high", colnames(df), value = TRUE), function(x) get(x))))

which gives this output:

  N         low        high
1 1  0.94613752  1.47197645
2 2 -0.68887596 -0.05779876
3 3 -0.28589753 -0.55694341
4 4 -0.01378869  0.28204629
5 5 -0.08681600  0.88544497

data:

> dput(df)
structure(list(low_1 = c(0.885675347945903, -1.30343272566325, -2.44201300062675, -1.27709377574332, -0.794159839824383), 
               low_2 = c(1.00659968581264,-0.0743191876393787, 1.87021794472605, 1.24951638739919, 0.620527846366092), 
               high_1 = c(0.630374573470948, 0.169009703225843, -0.573629421621814, 0.340752780334754, 0.417022085050569), 
               high_2 = c(2.31357832822303,-0.284607218026423, -0.540257400090053, 0.223339795927736, 1.35386785598766), 
               N = c(1, 2, 3, 4, 5)), 
               .Names = c("low_1", "low_2", "high_1", "high_2", "N"), 
               row.names = c(NA, -5L), class = "data.frame")
