Dplyr select based on multiple strings in a column

Question

I have a data frame with the following columns:

sample.data

  a_b_c d_b_e r_f_g c_b_a
1     1     1     1     1
2     2     2     2     2
3     3     3     3     3
4     4     4     4     4

How do I select only columns that contain both let's say "a" and "c" in the column name?

The output data frame should contain only the columns a_b_c and c_b_a, because both of these names contain the string "a" and the string "c".
– itthrill, Commented Mar 24, 2018 at 2:32

tyluRp · Accepted Answer · 2018-03-24 05:43:15Z

4

To select variables whose names contain both "a" and "c", we could do:

library(dplyr)

df %>% 
  select(matches("(a.*c)|(c.*a)"))

  a_b_c c_b_a
1     1     1
2     2     2
3     3     3
4     4     4

Note that a_a_e is not selected because it doesn't contain "c", and c_f_g is not selected because it doesn't contain "a". Repeating one letter is not enough either: a_a_e has two a's but no "c", so it is still excluded.
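To see why this alternation behaves like an "and", it can help to test the pattern directly against the column names with base R's grepl(). This is only a minimal sketch using the names from the Data section of this answer:

```r
# Column names from this answer's example data
nms <- c("a_b_c", "a_a_e", "c_f_g", "c_b_a")

# "(a.*c)" matches an "a" followed later by a "c";
# "(c.*a)" matches the reverse order.
# Either order means both letters occur somewhere in the name.
pattern <- "(a.*c)|(c.*a)"

both <- grepl(pattern, nms)
nms[both]
# "a_b_c" "c_b_a"
```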

We could also use str_subset:

library(dplyr)
library(stringr)

df %>% 
  select(str_subset(names(df), "(a.*c)|(c.*a)"))

Data:

df <- data.frame(
  a_b_c = 1:4,
  a_a_e = 1:4,
  c_f_g = 1:4,
  c_b_a = 1:4
)
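If the matched names are stored in a variable before being passed to select(), recent tidyselect versions expect that variable to be wrapped in all_of(). A hedged sketch of the str_subset approach written that way, assuming a dplyr version that exports all_of() (tidyselect 1.0.0 or later); keep and result are illustrative names:

```r
library(dplyr)
library(stringr)

df <- data.frame(a_b_c = 1:4, a_a_e = 1:4, c_f_g = 1:4, c_b_a = 1:4)

# Compute the matching names first, as an ordinary character vector
keep <- str_subset(names(df), "(a.*c)|(c.*a)")

# all_of() marks `keep` as an external vector of column names
result <- df %>% select(all_of(keep))
names(result)
# "a_b_c" "c_b_a"
```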

edited Mar 24, 2018 at 5:43

answered Mar 24, 2018 at 3:08

tyluRp


2 Comments

itthrill Over a year ago

This is precisely what I was looking for. Thanks.

tyluRp Over a year ago

Np, happy to help

Vinicius Barcelos · Answer · 2018-03-24 02:39:41Z

2

Try df %>% dplyr::select(matches("(a|c)"))

library(dplyr)
df <- data.frame(
  a_b_c=1:4,
  d_b_e=1:4,
  r_f_g=1:4,
  c_b_a=1:4
)

Results

> df %>% dplyr::select(matches("(a|c)"))
  a_b_c c_b_a
1     1     1
2     2     2
3     3     3
4     4     4
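Note that "(a|c)" is an alternation, so it keeps names containing either letter; it returns the desired columns here only because d_b_e and r_f_g contain neither. The sketch below adds a hypothetical column a_x_e (not in the question's data) to show the difference, and shows one way to require both letters by combining helpers with &, assuming tidyselect 1.0.0 or later:

```r
library(dplyr)

# Question's data with r_f_g swapped for a hypothetical column a_x_e,
# which contains "a" but not "c"
df2 <- data.frame(a_b_c = 1:4, d_b_e = 1:4, a_x_e = 1:4, c_b_a = 1:4)

# Alternation: keeps every name containing "a" OR "c"
either <- df2 %>% select(matches("(a|c)"))
names(either)
# "a_b_c" "a_x_e" "c_b_a"

# Conjunction: keeps only names containing "a" AND "c"
both <- df2 %>% select(contains("a") & contains("c"))
names(both)
# "a_b_c" "c_b_a"
```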

answered Mar 24, 2018 at 2:39

Vinicius Barcelos


3 Comments

Frank Over a year ago

Seems like the OP might require that both letters appear, since they say "a and c", not "a or c". They haven't clarified yet, though...

itthrill Over a year ago

It's "a" and "c" that I am trying to get.

Frank Over a year ago

@MadhukarJha I guess df %>% select(intersect(contains("a"), contains("c"))) in that case.

Cybernetic · Answer · 2018-03-24 02:51:44Z

0

If you want to see how it works under the hood, use the following function:

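contain_both <- function(data_frame, letter_a, letter_b) {
  keep_columns <- integer(0)
  for (i in seq_len(ncol(data_frame))) {
    # Split the column name on "_" into its individual tokens
    has_letters <- unlist(strsplit(names(data_frame)[i], '_'))
    # Keep the column only if both letters appear among the tokens
    if (is.element(letter_a, has_letters) && is.element(letter_b, has_letters)) {
      keep_columns <- c(keep_columns, i)
    }
  }
  # drop = FALSE returns a data frame even when only one column matches
  data_frame[, keep_columns, drop = FALSE]
}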

Data:

df <- data.frame(seq(1:4), seq(1:4), seq(1:4), seq(1:4))
names(df) <- c('a_b_c', 'd_b_e', 'r_f_g', 'c_b_a')

Just pass in your data frame, along with your 2 letter choices:

Usage:

contain_both(df, 'b', 'c')
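The same token-based check can also be written without an explicit loop. A hedged sketch (contain_both_tokens is a hypothetical name, not part of this answer) that splits each column name on "_" and keeps the columns whose tokens include both letters:

```r
# Hypothetical vectorized variant of the loop-based function
contain_both_tokens <- function(data, x, y) {
  # One character vector of tokens per column name
  tokens <- strsplit(names(data), "_", fixed = TRUE)
  # TRUE for columns whose tokens include both x and y
  keep <- vapply(tokens, function(tk) all(c(x, y) %in% tk), logical(1))
  data[, keep, drop = FALSE]
}

df <- data.frame(a_b_c = 1:4, d_b_e = 1:4, r_f_g = 1:4, c_b_a = 1:4)
res <- contain_both_tokens(df, "a", "c")
names(res)
# "a_b_c" "c_b_a"
```

Like the loop version, this compares whole tokens between underscores, so a column named ab_c would not count as containing "a".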

edited Mar 24, 2018 at 2:51

answered Mar 24, 2018 at 2:44

Cybernetic


1 Comment

itthrill Over a year ago

It would have been great to see something simpler, i.e. using a native function, but this does the trick.

user7905871 · Answer · 2018-03-24 03:04:20Z

0

Hope this is what you are looking for:

> a_b_c <- c(1,2,3,4)
> d_b_e <- c(1,2,3,4)
> yy <- cbind(a_b_c, d_b_e)
> yy
     a_b_c d_b_e
[1,]     1     1
[2,]     2     2
[3,]     3     3
[4,]     4     4
> yy <- as.data.frame(yy)
> yy
  a_b_c d_b_e
1     1     1
2     2     2
3     3     3
4     4     4
> y <- yy[which(names(yy) %in% "a_b_c")]
> y
  a_b_c
1     1
2     2
3     3
4     4

In your example, you can use this:

 y <- sample.data[which(names(sample.data) %in% c("a_b_c", "c_b_a"))]
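To avoid hard-coding the column names, one base R option is to test each name for both letters and combine the two results with &. This is only a sketch; the data frame below recreates the question's sample.data, and nms and has_both are illustrative names:

```r
# Recreate the question's sample data
sample.data <- data.frame(a_b_c = 1:4, d_b_e = 1:4, r_f_g = 1:4, c_b_a = 1:4)

nms <- names(sample.data)

# fixed = TRUE treats "a" and "c" as literal strings, not regular expressions
has_both <- grepl("a", nms, fixed = TRUE) & grepl("c", nms, fixed = TRUE)

# Single-bracket indexing with a logical vector keeps a data frame
y <- sample.data[has_both]
names(y)
# "a_b_c" "c_b_a"
```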

edited Mar 24, 2018 at 3:04

answered Mar 24, 2018 at 2:28

user7905871

4 Comments

itthrill Over a year ago

No that is not what I am looking at.

user7905871 Over a year ago

Ok, could you please write your expected output.

itthrill Over a year ago

"In your example, you can use this: y <- sample.data[which(names(sample.data) %in% c("a_b_c", "c_b_a"))]" This is a specific example. I want to select all the columns that contain "a" and "c", which means I want "a_c_b_d" as well as "u_a_d_c" and so on.

user7905871 Over a year ago

You mean you want a general solution. OK.
