Change order of multiple optional substrings

Question

That's a bit like this question, but I have multiple substrings that may or may not occur.

The substrings code for two different dimensions, in my example "test" and "eye". They can occur in any imaginable order. The variables can be coded in different ways - in my example, "method|test" would be two ways to code for "test", as well as "r|re|l|le" different ways to code for eyes.

I found a convoluted solution, which is using a chain of seven (!) gsub calls, and I wondered if there is a more concise way.

x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old", 
       "test_l_old", "re_test_new","new_le_method", "new_r_test")
x
#>  [1] "id"            "r_test"        "l_method"      "test_re"      
#>  [5] "method_le"     "test_r_old"    "test_l_old"    "re_test_new"  
#>  [9] "new_le_method" "new_r_test"

Desired output

#>  [1] "id"         "r_test"     "l_test"     "r_test"     "l_test"    
#>  [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"

How I got there (convoluted)

## Unify codes for variables, I use the underscores to make it more unique for future regex 
clean_test<- gsub("(?<![a-z])(test|method)(?![a-z])", "_test_", tolower(x), perl = TRUE)
clean_r <- gsub("(?<![a-z])(r|re)(?![a-z])", "_r_", tolower(clean_test), perl = TRUE)
clean_l <- gsub("(?<![a-z])(l|le)(?![a-z])", "_l_", tolower(clean_r), perl = TRUE)

## Now sort, one after the other
sort_eye <- gsub("(.*)(_r_|_l_)(.*)", "\\2\\1\\3", clean_l, perl = TRUE)
sort_test <- gsub("(_r_|_l_)(.*)(_test_)(.*)", "\\1\\3\\2\\4", sort_eye, perl = TRUE)

## Remove underscores
clean_underscore_mult <- gsub("_{2,}", "_", sort_test)
clean_underscore_ends <- gsub("^_|_$", "", clean_underscore_mult)

clean_underscore_ends
#>  [1] "id"         "r_test"     "l_test"     "r_test"     "l_test"    
#>  [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"

I'd be already very very grateful for a suggestion how to better proceed from ## Now sort, one after the other downwards...

MichaelChirico · Accepted Answer · 2020-05-30 14:36:44Z

3

How about tokenizing the string and using lookup tables instead? I'll use data.table to assist but the idea fits naturally with other data grammars as well

library(data.table)
# build into a table, keeping track of an ID 
#   to say which element it came from originally
l = strsplit(x, '_', fixed=TRUE)
DT = data.table(id = rep(seq_along(l), lengths(l)), token = unlist(l))

Now build a lookup table:

# defined using fread to make it easier to see
#   token & match side-by-side; only define tokens
#   that actually need to be changed here
lookups = fread('
token,match
le,l
re,r
method,test
')

Now combine:

# default value is the token itself
DT[ , match := token]
# replace anything matched
DT[lookups, match := i.match, on = 'token']

Next use factor ordering to get the tokens in the right order:

# the more general [where you don't have an exact list of all the possible
#   tokens ready at hand] is a bit messier -- you might do something
#   similar to setdiff(unique(match), lookups$match)
DT[ , match := factor(match, levels = c('id', 'r', 'l', 'test', 'old', 'new'))]
# sort to this new order
setorder(DT, id, match)

Finally combine again (an aggregation) to get the output:

DT[ , paste(match, collapse='_'), by = id]$V1
#  [1] "id"         "r_test"     "l_test"     "r_test"     "l_test" 
#  [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"

answered May 30, 2020 at 14:36

MichaelChirico

34.9k17 gold badges122 silver badges209 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

tjebo Over a year ago

I've translated it to base R now - and suggest a small function. Thanks for you huge inspiration.

Chris Ruehlemann · Accepted Answer · 2020-05-30 15:28:52Z

2

Here's a one-liner with nested sub that transforms x without any intermediary steps:

sub("^(\\w+)_(r|re|l|le)", "\\2_\\1", 
     sub("method", "test", 
          sub("(l|r)e", "\\1", 
               sub("(^new)_(\\w+_\\w+)$", "\\2_\\1", x))))

# [1] "id"  "r_test"  "l_test"  "r_test"  "l_test"  "r_test_old" 
# [7] "l_test_old"  "r_test_new"  "l_test_new" "r_test_new"

Data:

x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old", 
       "test_l_old", "re_test_new","new_le_method", "new_r_test")

edited May 30, 2020 at 15:28

answered May 30, 2020 at 14:38

Chris Ruehlemann

21.5k4 gold badges15 silver badges45 bronze badges

5 Comments

tjebo Over a year ago

thank you! I do learn a lot of regrex from you. +1 I will accept @michaelchirico's answer though, because they hit the nail on the head with what I wanted to achieve. I hope that;s alright

Chris Ruehlemann Over a year ago

@akrun I'm not sure I understand: are you referring to my calling my code a "one-liner"? My definition of it would be: whatever its width, it's a one-liner as long as it takes a single click on "Run".

Chris Ruehlemann Over a year ago

@Tjebo Do feel free to accept whatever suits your needs and your data best. But given that your query was speicifcally geared towards using gsub(or, in this case, subI do admit I'm surprised you accept an answer that takesa completeyl different route and, I should like to add, is way less compact.

Chris Ruehlemann Over a year ago

@akrun It's a one-liner even literally: you can put the whole code on a single line and execute it in one go.

akrun Over a year ago

It's okay. I was curious about your definition.

tjebo · Accepted Answer · 2020-05-31 21:19:12Z

Much inspired and building on user MichaelChirico's answer, this is a function using base R only, which (in theory) should work with any number of substrings to sort. The list defines the sort (by its elements), and it specifies all ways to code for the default tokens (the list names).

## I've added some more ways to code for right and left eyes, as well as different further strings that are not known. 

x <- c("id", "r_random_test_old", "r_test", "r_test_else", "l_method", "test_re", "method_le", "test_od_old", 
       "test_os_old", "re_mth_new","new_le_method", "new_r_test_random")
x
#>  [1] "id"                "r_random_test_old" "r_test"           
#>  [4] "r_test_else"       "l_method"          "test_re"          
#>  [7] "method_le"         "test_od_old"       "test_os_old"      
#> [10] "re_mth_new"        "new_le_method"     "new_r_test_random"

sort_substr(x, list(r = c("od","re"), l = c("os","le"), test = c("method", "mth"), time = c("old","new")))
#>  [1] "id"                 "r_test_time_random" "r_test"            
#>  [4] "r_test_else"        "l_test"             "r_test"            
#>  [7] "l_test"             "r_test_time"        "l_test_time"       
#> [10] "r_test_time"        "l_test_time"        "r_test_time_random"

sort_substr

sort_substr <- function(x, list_substr) {
  lookups <- data.frame(match = rep(names(list_substr), lengths(list_substr)), 
                        token = unlist(list_substr))
  l <- strsplit(x, "_", fixed = TRUE)
  DF <- data.frame(id = rep(seq_along(l), lengths(l)), token = unlist(l))
  match_token <- lookups$match[match(DF$token, lookups$token)]
  DF$match <- ifelse(is.na(match_token), DF$token, match_token)
  rest_token <- base::setdiff(DF$match, names(list_substr))
  DF$match <- factor(DF$match, levels = c(names(list_substr), rest_token))
  DF <- DF[with(DF, order(id, match)), ]
  out <- vapply(split(DF$match, DF$id), 
         paste, collapse = "_", 
         FUN.VALUE = character(1), 
         USE.NAMES = FALSE)
  out
}

Collectives™ on Stack Overflow

Change order of multiple optional substrings

3 Answers 3

1 Comment

5 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

5 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related