Subsetting data frame by vector of elements

Question

I spent about 20 minutes looking through previous questions, but could not find what I am looking for. I have a large data frame I want to subset down based on a list of names, but the names in the data frame can also have a postfix not indicated in the list.

In other words, is there a simpler, generic way (one that works for an arbitrary number of postfixes) to do the following:

data <- data.frame("name"=c("name1","name1_post1","name2","name2_post1",
                            "name2_post2","name3","name4"),
                   "data"=rnorm(7,0,1),
                   stringsAsFactors=FALSE)

names <- c("name2","name3")

subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]

Edit, in response to @Arun's answer: the names in my data actually include more than one underscore, which makes the problem more complicated.

data <- data.frame("name"=c("name1_target_time","name1_target_time_post1","name2_target_time","name2_target_time_post1",
                            "name2_target_time_post2","name3_target_time","name4_target_time"),
                   "data"=rnorm(7,0,1),
                   stringsAsFactors=FALSE)

names <- c("name2_target_time","name3_target_time")

subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]

Not at all. I am just saying I spent time looking through previous questions before posting.
– dayne, Commented Apr 16, 2013 at 20:03
You didn't spend enough time. And you are looking for a very specific solution (using grep). Isn't that too much?
– agstudy, Commented Apr 16, 2013 at 20:06
@agstudy sorry if I offended you. I am just trying to learn.
– dayne, Commented Apr 16, 2013 at 20:09
I am not offended. I am just trying to tell you that spending 20 minutes looking for a solution is not the right way to learn.
– agstudy, Commented Apr 16, 2013 at 20:11

Arun · Accepted Answer · 2013-04-16 20:08:13Z

3

Edit: solution using regular expressions (following OP's follow-up in comment):

data[grepl(paste(names, collapse="|"), data$name), ]
#          name       data
# 3       name2  1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6       name3  0.4220084

On your new data, the same command gives:

#                      name      data
# 3       name2_target_time 0.6295361
# 4 name2_target_time_post1 0.8951720
# 5 name2_target_time_post2 0.6602126
# 6       name3_target_time 2.2734835
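One caveat with the pattern above: it is unanchored, so it matches any row whose name merely contains one of the targets. If that matters for your data, a rough sketch of an anchored variant follows. The extra rows ("name2_target_time2", "xname3_target_time") are made up for illustration, and the sketch assumes the target names contain no regex metacharacters and that a postfix always starts with an underscore.

```r
# Example data in the shape used in the question; the values and the
# two extra "near miss" rows are hypothetical.
data <- data.frame(name = c("name1_target_time", "name2_target_time",
                            "name2_target_time_post1", "name2_target_time2",
                            "xname3_target_time", "name3_target_time"),
                   data = 1:6,
                   stringsAsFactors = FALSE)
names <- c("name2_target_time", "name3_target_time")

# Unanchored alternation: TRUE wherever a target appears anywhere in the string.
loose <- grepl(paste(names, collapse = "|"), data$name)

# Anchored alternation: a target at the start, optionally followed by
# "_" plus any postfix, and nothing else.
pattern <- paste0("^(", paste(names, collapse = "|"), ")(_.*)?$")
strict <- grepl(pattern, data$name)

# Rows kept by the loose pattern but rejected by the anchored one.
data$name[loose & !strict]

data[strict, ]
```

The anchored version keeps the exact names and their postfixed forms while dropping names that only contain a target as a substring.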

As @flodel shows in the comments, this works as well:

subset(data, sub("_post\\d+$", "", name) %in% names)
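To see why this works, it can help to look at the intermediate result of sub() on its own. The small vector below is made up for illustration; the sketch assumes, as in the question, that every postfix has the form "_post" followed by digits.

```r
# Hypothetical names in the same style as the question's data.
name <- c("name2_target_time", "name2_target_time_post1",
          "name2_target_time_post12", "name4_target_time_post1")
names <- c("name2_target_time", "name3_target_time")

# Strip a trailing "_post<digits>" if present; names without that
# postfix pass through unchanged.
base <- sub("_post\\d+$", "", name)
base

# Exact matching on the stripped names, so partial matches cannot slip in.
keep <- base %in% names
keep
```

Because the comparison is done with %in% on the stripped name, a row is kept only when its base name is exactly one of the targets.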

Old solution:

data[sapply(strsplit(data$name, "_"), "[[", 1) %in% names, ]

#          name       data
# 3       name2  1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6       name3  0.4220084

The idea: first split each string at _ using strsplit. This returns a list with one character vector per name. For example, name2 yields just name2, while name2_post1 yields two pieces, name2 and post1. Wrapping the call in sapply and using [[ with 1 selects the first piece of each vector. We can then use %in% to check whether those first pieces are present in names.
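The intermediate steps can be sketched as follows. The vectors simple and multi are illustrative; the second half shows why this old solution stops working once the base names themselves contain underscores, as in the OP's updated data.

```r
# Names without underscores in the base part.
simple <- c("name2", "name2_post1")
pieces <- strsplit(simple, "_")      # list("name2", c("name2", "post1"))
first <- sapply(pieces, "[[", 1)     # c("name2", "name2")
first %in% c("name2", "name3")

# Names whose base part contains underscores: the first piece is only
# "name2", so it no longer equals the full target "name2_target_time".
multi <- c("name2_target_time", "name2_target_time_post1")
first_multi <- sapply(strsplit(multi, "_"), "[[", 1)
first_multi %in% "name2_target_time"
```

This is the reason the regular-expression solutions at the top of the answer are the better fit for the updated data.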

edited Apr 16, 2013 at 20:08

answered Apr 16, 2013 at 19:52

Arun


8 Comments

dayne Over a year ago

That's really close (upvote). The problem is the real names I am working with have multiple underscores before the postfix. For example "name1_target1_time_1_postfix". I am really looking for some kind of grep function that will check one list for partial matches of another list.

dayne Over a year ago

There are a lot of ways to do it; I really asked the question to learn more about coding in R. It seems strange to me that there isn't a grep function that will look for multiple patterns.

Arun Over a year ago

It'd be nice if you could edit your post accordingly, showing the input and the expected output.

flodel Over a year ago

Like that? subset(data, sub("_post\\d+$", "", name) %in% names)

agstudy Over a year ago

@Arun +1 because you are patient!


gwatson · Accepted Answer · 2013-04-16 20:08:09Z

0

A grep solution would probably look something like the following:

# Match against the name column; names(data) would return the column names instead.
subset <- data[grep("(name2)|(name3)", data$name), ]
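As a small sketch of how grep relates to grepl here, assuming the pattern is applied to the name column: grep returns the integer positions of the matching rows, while grepl returns one TRUE/FALSE per row, and either can be used as a row index. The data values below are arbitrary.

```r
# Example data in the shape used in the question (values are arbitrary).
data <- data.frame(name = c("name1", "name2", "name2_post1", "name3"),
                   data = c(0.1, 0.2, 0.3, 0.4),
                   stringsAsFactors = FALSE)
names <- c("name2", "name3")

# Build the alternation from the vector instead of typing it by hand.
pattern <- paste(names, collapse = "|")

idx <- grep(pattern, data$name)    # integer positions of matching rows
lgl <- grepl(pattern, data$name)   # logical vector, one entry per row

data[idx, ]
data[lgl, ]
```

The logical form is usually the safer one to negate: !lgl still works when nothing matches, whereas -idx with an empty idx selects no rows at all.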

answered Apr 16, 2013 at 20:08

gwatson
