String matching over multiple columns with specific strings and value comparisons in associated names

Question

I am interested in doing string detection and value comparison across a range of columns. If the string (which in this case is ZSD) is found in columns, then their corresponding values from another column need to be compared.

Input

My input is as below:

a.zsd                 a.test b.zsd b.test c.zsd c.test d.zsd d.test
'ZSD'                 0.0   'ZAD'  1.0    NA    0.5   'ZAD'    1.0
'ZAD'                 1.0    NA    0.0    NA    0.5   'ZSD'    0.0
NA                    0.5    NA    0.5   'ZAD'  0.5    NA      0.5
'Not Achieved ZSD'    0.0    NA    0.5   'ZAD'  0.5    NA      0.5
'ZSD'                 1.0   'ZSD'  0.5    NA    0.5   'ZSD'    0.0
NA                    0.0    NA    0.0    NA    0.5    NA      0.0
NA                    1.0   'ZSD'  0.0   'ZSD'  0.5   'ZSD'    1.0

Output

In my output, I want two additional columns smallest.test and zsd.level:

a.zsd                 a.test b.zsd b.test c.zsd c.test d.zsd d.test smallest.test zsd.level
'ZSD'                 0.0   'ZAD'  1.0    NA    0.5   'ZAD'    1.0  0.0           a
'ZAD'                 1.0    NA    0.0    NA    0.5   'ZSD'    0.0  0.0           d
NA                    0.5    NA    0.5   'ZAD'  0.5    NA      0.5  0.0           NA        
'Not Achieved ZSD'    0.0    NA    0.5   'ZAD'  0.5    NA      0.5  0.0           a
'ZSD'                 1.0   'ZSD'  0.5    NA    0.5   'ZSD'    0.0  0.0           d
NA                    0.0    NA    0.0    NA    0.5    NA      0.0  0.0           NA
NA                    1.0   'ZSD'  0.0   'ZSD'  0.5   'ZSD'    1.0  0.0           b

Info:

My data frame has over a hundred columns. I am interested in ONLY some of the columns having a name that ends at a string .zsd. These columns can either have NA or one of the following string values ZAD, ZSD, Not Achieved ZSD. Each column with the .zsd string name, has an associated .test column.

Requirements

I want two new columns in the output smallest.test and zsd.level. The requirements are as below:

Iterate through the column names ending with the string .zsd
Across those columns detect the string ZSD
If the ZSD string is found in only one of the columns, return the names of that column in the output column zsd.level and also return the corresponding value from the column name ending at .test to be returned to the output column smallest.test.
If none of the columns contains the string ZSD, return NA in the output column zsd.level and return 0.0 in the corresponding output column smallest.test.
If more than one columns contain the string ZSD, pick the column with the least value of the corresponding .test column and return in the output.
If more than one columns contain the string ZSD, and they all have the same value of the corresponding .test column, then pick the last column name for the output and the corresponding value of the .test for the output.

dput()


    dput(df)
    structure(list(a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA, NA), 
                   a.test = c(0, 1, 0.5, 0, 1, 0, 1), 
                   b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA, "ZSD"), 
                   b.test = c(1, 0, 0.5, 0.5, 0.5, 0, 0), 
                   c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA, "ZSD"), 
                   c.test = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), 
                   d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA, "ZSD"), 
                   d.test = c(1, 0, 0.5, 0.5, 0, 0, 1)), 
                   class = "data.frame", row.names = c(NA, -7L))

Partial solution

Based on the following post: String matching over multiple columns with specific string names, this code can iterate and select the .zsd columns and return the highest column name in the output. But it does not take into account the corresponding values of the .test field. Any help on this would be greatly appreciated.

library(dplyr)
library(tidyr)
library(stringr)

df %>%  
  mutate(across(contains("zsd"), ~case_when(str_detect(., "ZSD") ~ cur_column()), .names = 'new_{col}')) %>%
  unite(zsd_level, starts_with('new'), na.rm = TRUE, sep = ' ') %>% 
  mutate(zsd_level = str_remove_all(zsd_level, ".zsd"),
         zsd_level = str_sub(zsd_level, -1))

xilliam · Accepted Answer · 2021-10-19 07:52:02Z

2

Here is a base solution involving which.min that presumes the corresponding '.test' and '.zsd' columns are adjacent.

The first block of code can be used for this purpose, but if a given row has ties in the minima, it will give the first instance. Further below is a function that can be used to select either the first or last instance of the minima.

# sample data
df <- structure(list(a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA, NA), 
               a.test = c(0, 1, 0.5, 0, 1, 0, 1), 
               b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA, "ZSD"), 
               b.test = c(1, 0, 0.5, 0.5, 0.5, 0, 0), 
               c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA, "ZSD"), 
               c.test = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), 
               d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA, "ZSD"), 
               d.test = c(1, 0, 0.5, 0.5, 0, 0, 1)), 
          class = "data.frame", row.names = c(NA, -7L))


# select .zsd columns
zsd_cols <- grep(".zsd", names(df), value = TRUE)
zsd_df <- df[, zsd_cols]

# select .test columns
test_cols <- gsub("zsd", "test",zsd_cols)
test_df <- df[, test_cols]

# convert "Not Achieved ZSD" to "ZSD"
zsd_df[zsd_df == "Not Achieved ZSD" ] <- "ZSD"

# assign NA to non "ZSD" cells
zsd_df[zsd_df != "ZSD"] <- NA

# assign 999 test_df values whose corresponding zsd_df is NA
test_df[is.na(zsd_df)] <- 999

# return cols which hold minimum
nams <- names(test_df)[apply(test_df, 1 ,which.min)]

# scrub .test suffix
nams <- gsub(".test", "", nams)

# return mins
mins <- apply(test_df, 1 ,min)

# assign values less than 999 as smallest test, or zero
df$smallest.test <- ifelse(mins < 999, mins, 0)

# assign name if corresponding min less than 999 or NA
df$zsd_level <- ifelse(mins < 999, nams, NA)

> df
             a.zsd a.test b.zsd b.test c.zsd c.test d.zsd d.test smallest.test zsd_level
1              ZSD    0.0   ZAD    1.0  <NA>    0.5   ZAD    1.0             0         a
2              ZAD    1.0  <NA>    0.0  <NA>    0.5   ZSD    0.0             0         d
3             <NA>    0.5  <NA>    0.5   ZAD    0.5  <NA>    0.5             0      <NA>
4 Not Achieved ZSD    0.0  <NA>    0.5   ZAD    0.5  <NA>    0.5             0         a
5              ZSD    1.0   ZSD    0.5  <NA>    0.5   ZSD    0.0             0         d
6             <NA>    0.0  <NA>    0.0  <NA>    0.5  <NA>    0.0             0      <NA>
7             <NA>    1.0   ZSD    0.0   ZSD    0.5   ZSD    1.0             0         b

EDIT

Same idea, but a function that can obtain the first or last minimum when ties are present in a row:

# adjusted sample data
df <- data.frame(a.zsd = c("ZSD", "ZAD", NA, 
                        "Not Achieved ZSD", "ZSD", NA, NA), 
                 a.test = c(0, 1, 0.5, 0, 1, 0, 1), 
                 b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA, "ZSD"), 
                 b.test = c(1, 0, 0.5, 0.5, 0.0, 0, 0), # adjusted 
                 c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA, "ZSD"), 
                 c.test = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), 
                 d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA, "ZSD"), 
                 d.test = c(1, 0, 0.5, 0.5, 0, 0, 1))


appendMin <- function(df, last_min = TRUE){
  
  # select .zsd columns
  zsd_cols <- grep(".zsd", names(df), value = TRUE)
  zsd_df <- df[, zsd_cols]
  if(last_min) { zsd_df <- rev(zsd_df) } # for last min
  
  # select .test columns
  test_cols <- gsub("zsd", "test",zsd_cols)
  test_df <- df[, test_cols]
  if(last_min) { test_df <- rev(test_df) } # for last min
  
  # convert "Not Achieved ZSD" to "ZSD"
  zsd_df[zsd_df == "Not Achieved ZSD" ] <- "ZSD"
  
  # assign NA to non "ZSD" cells
  zsd_df[zsd_df != "ZSD" ] <- NA
  
  # assign 999 test_df values whose corresponding zsd_df is NA
  test_df[is.na(zsd_df)] <- 999
  
  # return cols which hold the first minimum 
  nams <- names(test_df)[apply(test_df, 1 ,which.min)]
  
  # scrub .test suffix
  nams <- gsub(".test", "", nams)
  # return mins
  mins <- apply(test_df, 1 ,min)
  
  # assign values less than 999 as smallest test, or zero
  df$smallest.test <- ifelse(mins < 999, mins, 0)
  
  # assign name if corresponding min less than 999 or NA
  df$zsd_level <- ifelse(mins < 999, nams, NA)
  
  return(df)
  
}


ties_first <- appendMin(df, last_min = FALSE)
ties_last <- appendMin(df, last_min = TRUE)

edited Oct 19, 2021 at 7:52

answered Oct 18, 2021 at 10:35

xilliam

2,2972 gold badges18 silver badges29 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Sandy Over a year ago

Thank you so much xillian I like your solution. It mostly works except in row 4. Although the string is 'Not Achieved ZSD', it should still consider it because of the presence of the string ZSD.

xilliam Over a year ago

I see, I'll update the answer.

Sandy Over a year ago

I tried the following, but it did not work...: zsd_df[(is.na(zsd_df)) || (zsd_df == "ZAD")] <- NA

Sandy Over a year ago

I just noticed that in the code, which.min picks the first minimum entry. @xilliam, is it possible that we modify the code so that it picks the last minimum entry?

xilliam Over a year ago

I edited to include a function that can return either the first or last minimum in cases of ties.

|

Collectives™ on Stack Overflow

String matching over multiple columns with specific strings and value comparisons in associated names

Input

Output

Info:

Requirements

dput()

Partial solution

1 Answer 1

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

Input

Output

Info:

Requirements

dput()

Partial solution

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related