String matching over multiple columns with specific string names

Question

I am interested in doing string detection over a set of columns. If that string (which in this case is ZSD) is found, I want to return the column number/name. If multiple matches are found, I want to return the last column name/number with that string.

Input

My input is this:

a.zsd                b.zsd   c.zsd   d.zsd
'ZSD'                'ZAD'   NA      'ZAD'
'ZAD'                NA      NA      'ZSD'
NA                   NA      'ZAD'   NA
'Not Achieved ZSD'   NA      'ZAD'   NA
'ZSD'                'ZSD'   NA      'ZSD'
NA                   NA      NA      NA

Output

My required output is a new column zsd.level:

a.zsd                b.zsd   c.zsd   d.zsd   zsd.level
'ZSD'                'ZAD'   NA      'ZAD'   a
'ZAD'                NA      NA      'ZSD'   d
NA                   NA      'ZAD'   NA      NA
'Not Achieved ZSD'   NA      'ZAD'   NA      a
'ZSD'                'ZSD'   NA      'ZSD'   d
NA                   NA      NA      NA      NA

Info:

My data frame has over a hundred columns. I am ONLY interested in the columns whose names end with the string .zsd. These columns contain either NA or one of the following string values: ZAD, ZSD, Not Achieved ZSD.

I am just interested in detecting the presence of the string ZSD. If not found in any of the columns, it should return NA in the output column (zsd.level). If the string is found in multiple columns, I want to return the last column name/number that contains the string.

My question is similar to, but not exactly the same as, this post: dplyr filter with condition on multiple columns

dput

dput(df)

structure(list(a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA), 
               b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA), 
               c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA), 
               d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA)), 
               class = "data.frame", row.names = c(NA, -6L))

Partial Solution

To select those columns with names ending in .zsd, I can do

library(stringr)
library(tidyverse)

df %>%
  select(ends_with(".zsd"))

To filter rows where a single column (here a.zsd) contains the string ZSD, I can do

df %>%
  filter(str_detect(a.zsd, "ZSD"))

But how can I put multiple conditions together? Any help would be greatly appreciated.

"If multiple matches are found, I want to return the last column with that string."
– Martin Gal, Commented Oct 17, 2021 at 22:36
Sorry, it's a string. I should update and clarify that it's just in one column.
– Sandy, Commented Oct 17, 2021 at 22:36

TarJae · Accepted Answer · 2021-10-17 22:41:41Z

2

We could do it this way:

library(dplyr)
library(tidyr)
library(stringr)

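df %>%
  # For each .zsd column, record the column name where the value contains "ZSD"
  mutate(across(ends_with(".zsd"),
                ~ case_when(str_detect(.x, "ZSD") ~ cur_column()),
                .names = "new_{.col}")) %>%
  # Paste the recorded names together, dropping NAs
  unite(zsd_level, starts_with("new_"), na.rm = TRUE, sep = " ") %>%
  # Remove the literal ".zsd" suffix (dot escaped), then keep the last character
  mutate(zsd_level = str_remove_all(zsd_level, "\\.zsd"),
         zsd_level = str_sub(zsd_level, -1))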

output:

             a.zsd b.zsd c.zsd d.zsd zsd_level
1              ZSD   ZAD  <NA>   ZAD         a
2              ZAD  <NA>  <NA>   ZSD         d
3             <NA>  <NA>   ZAD  <NA>          
4 Not Achieved ZSD  <NA>   ZAD  <NA>         a
5              ZSD   ZSD  <NA>   ZSD         d
6             <NA>  <NA>  <NA>  <NA>
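
The output above leaves zsd_level as an empty string for rows without a match, while the question asks for NA. One possible way to get there, as a sketch only: append dplyr's na_if() to the same pipeline. The object name res is made up for this illustration, and the data frame is rebuilt from the dput in the question so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample data, rebuilt from the dput in the question
df <- data.frame(
  a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA),
  b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA),
  c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA),
  d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA),
  stringsAsFactors = FALSE
)

# `res` is a hypothetical name used only in this sketch
res <- df %>%
  mutate(across(ends_with(".zsd"),
                ~ case_when(str_detect(.x, "ZSD") ~ cur_column()),
                .names = "new_{.col}")) %>%
  unite(zsd_level, starts_with("new_"), na.rm = TRUE, sep = " ") %>%
  mutate(zsd_level = str_remove_all(zsd_level, "\\.zsd"),
         zsd_level = str_sub(zsd_level, -1),
         # Rows with no match end up as ""; turn those into NA
         zsd_level = na_if(zsd_level, ""))

res$zsd_level
```

Note that str_sub(zsd_level, -1) keeps a single character, so this still assumes one-letter column prefixes such as a, b, c, d.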

edited Oct 17, 2021 at 22:41

answered Oct 17, 2021 at 22:35

TarJae

3 Comments

TarJae Over a year ago

See my update. Now it should be adequate!

Sandy Over a year ago

That's great, it's mostly working. I can improve on including the NA value when the string is missing. I will now check on my original data. Thanks again!

Sandy Over a year ago

Could you please have a look at the following post <stackoverflow.com/questions/69609589/…>

Martin Gal · Accepted Answer · 2021-10-18 03:23:38Z

2

Another option, but a little bit more complicated than dear TarJae's:

library(dplyr)
library(tidyr)
library(stringr)

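df %>% 
  mutate(rn = row_number()) %>% 
  # Reshape only the .zsd columns so any other columns are left alone
  pivot_longer(ends_with(".zsd")) %>% 
  group_by(rn) %>% 
  # Keep cells containing "ZSD"; filter() drops rows where the test is NA
  filter(str_detect(value, "ZSD")) %>% 
  # Last matching column per row
  slice_tail() %>% 
  # Escape the dot so only a literal ".zsd" suffix is removed
  summarise(name = str_remove(name, "\\.zsd$")) %>% 
  # Bring back the rows without a match; their name is NA
  right_join(df %>% mutate(rn = row_number()), by = "rn") %>%
  arrange(rn) %>% 
  ungroup() %>% 
  select(ends_with("zsd"), zsd.level = name)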

This returns

# A tibble: 6 x 5
  a.zsd            b.zsd c.zsd d.zsd zsd.level
  <chr>            <chr> <chr> <chr> <chr>    
1 ZSD              ZAD   NA    ZAD   a        
2 ZAD              NA    NA    ZSD   d        
3 NA               NA    ZAD   NA    NA       
4 Not Achieved ZSD NA    ZAD   NA    a        
5 ZSD              ZSD   NA    ZSD   d        
6 NA               NA    NA    NA    NA
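
For comparison, the same "last matching column" rule can be sketched in base R without reshaping. This is not part of either answer, just an illustration; the helper names zsd_cols, vals, hits and last_hit are invented here, and the data frame is rebuilt from the dput in the question.

```r
# Sample data, rebuilt from the dput in the question
df <- data.frame(
  a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA),
  b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA),
  c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA),
  d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA),
  stringsAsFactors = FALSE
)

# Columns of interest: names ending in a literal ".zsd"
zsd_cols <- grep("\\.zsd$", names(df), value = TRUE)

# Logical matrix: TRUE where a cell contains "ZSD"; NA cells count as no match
vals <- as.matrix(df[zsd_cols])
hits <- !is.na(vals) & grepl("ZSD", vals, fixed = TRUE)

# Position of the last matching column in each row, NA if the row has none
last_hit <- apply(hits, 1, function(r) if (any(r)) max(which(r)) else NA_integer_)

# Map positions back to column names and strip the ".zsd" suffix
df$zsd.level <- sub("\\.zsd$", "", zsd_cols[last_hit])

df$zsd.level
```

Because the suffix is removed from the full column name, this variant does not depend on the prefixes being a single character, and rows without a match come out as NA directly.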

edited Oct 18, 2021 at 3:23

Sandy

answered Oct 17, 2021 at 22:49

Martin Gal

3 Comments

TarJae Over a year ago

Nice solution. I like it.

Sandy Over a year ago

@Martin Gal, Could you please have a look at the following post <stackoverflow.com/questions/69609589/…>

Sandy Over a year ago

Thank you for the solution, however, it does not work on the original data.
