2

That's a bit like this question, but I have multiple substrings that may or may not occur.

The substrings code for two different dimensions, in my example "test" and "eye". They can occur in any imaginable order. The variables can be coded in different ways - in my example, "method|test" would be two ways to code for "test", as well as "r|re|l|le" different ways to code for eyes.

I found a convoluted solution, which is using a chain of seven (!) gsub calls, and I wondered if there is a more concise way.

x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old", 
       "test_l_old", "re_test_new","new_le_method", "new_r_test")
x
#>  [1] "id"            "r_test"        "l_method"      "test_re"      
#>  [5] "method_le"     "test_r_old"    "test_l_old"    "re_test_new"  
#>  [9] "new_le_method" "new_r_test"

Desired output

#>  [1] "id"         "r_test"     "l_test"     "r_test"     "l_test"    
#>  [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"

How I got there (convoluted)

## Unify codes for variables, I use the underscores to make it more unique for future regex 
clean_test<- gsub("(?<![a-z])(test|method)(?![a-z])", "_test_", tolower(x), perl = TRUE)
clean_r <- gsub("(?<![a-z])(r|re)(?![a-z])", "_r_", tolower(clean_test), perl = TRUE)
clean_l <- gsub("(?<![a-z])(l|le)(?![a-z])", "_l_", tolower(clean_r), perl = TRUE)

## Now sort, one after the other
sort_eye <- gsub("(.*)(_r_|_l_)(.*)", "\\2\\1\\3", clean_l, perl = TRUE)
sort_test <- gsub("(_r_|_l_)(.*)(_test_)(.*)", "\\1\\3\\2\\4", sort_eye, perl = TRUE)

## Remove underscores
clean_underscore_mult <- gsub("_{2,}", "_", sort_test)
clean_underscore_ends <- gsub("^_|_$", "", clean_underscore_mult)

clean_underscore_ends
#>  [1] "id"         "r_test"     "l_test"     "r_test"     "l_test"    
#>  [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"

I'd be already very very grateful for a suggestion how to better proceed from ## Now sort, one after the other downwards...

0

3 Answers 3

3

How about tokenizing the string and using lookup tables instead? I'll use data.table to assist but the idea fits naturally with other data grammars as well

library(data.table)
# build into a table, keeping track of an ID 
#   to say which element it came from originally
l = strsplit(x, '_', fixed=TRUE)
DT = data.table(id = rep(seq_along(l), lengths(l)), token = unlist(l))

Now build a lookup table:

# defined using fread to make it easier to see
#   token & match side-by-side; only define tokens
#   that actually need to be changed here
lookups = fread('
token,match
le,l
re,r
method,test
')

Now combine:

# default value is the token itself
DT[ , match := token]
# replace anything matched
DT[lookups, match := i.match, on = 'token']

Next use factor ordering to get the tokens in the right order:

# the more general [where you don't have an exact list of all the possible
#   tokens ready at hand] is a bit messier -- you might do something
#   similar to setdiff(unique(match), lookups$match)
DT[ , match := factor(match, levels = c('id', 'r', 'l', 'test', 'old', 'new'))]
# sort to this new order
setorder(DT, id, match)

Finally combine again (an aggregation) to get the output:

DT[ , paste(match, collapse='_'), by = id]$V1
#  [1] "id"         "r_test"     "l_test"     "r_test"     "l_test" 
#  [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
Sign up to request clarification or add additional context in comments.

1 Comment

I've translated it to base R now - and suggest a small function. Thanks for you huge inspiration.
2

Here's a one-liner with nested sub that transforms x without any intermediary steps:

sub("^(\\w+)_(r|re|l|le)", "\\2_\\1", 
     sub("method", "test", 
          sub("(l|r)e", "\\1", 
               sub("(^new)_(\\w+_\\w+)$", "\\2_\\1", x))))

# [1] "id"  "r_test"  "l_test"  "r_test"  "l_test"  "r_test_old" 
# [7] "l_test_old"  "r_test_new"  "l_test_new" "r_test_new"

Data:

x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old", 
       "test_l_old", "re_test_new","new_le_method", "new_r_test")

5 Comments

thank you! I do learn a lot of regrex from you. +1 I will accept @michaelchirico's answer though, because they hit the nail on the head with what I wanted to achieve. I hope that;s alright
@akrun I'm not sure I understand: are you referring to my calling my code a "one-liner"? My definition of it would be: whatever its width, it's a one-liner as long as it takes a single click on "Run".
@Tjebo Do feel free to accept whatever suits your needs and your data best. But given that your query was speicifcally geared towards using gsub(or, in this case, subI do admit I'm surprised you accept an answer that takesa completeyl different route and, I should like to add, is way less compact.
@akrun It's a one-liner even literally: you can put the whole code on a single line and execute it in one go.
It's okay. I was curious about your definition.
0

Much inspired and building on user MichaelChirico's answer, this is a function using base R only, which (in theory) should work with any number of substrings to sort. The list defines the sort (by its elements), and it specifies all ways to code for the default tokens (the list names).

## I've added some more ways to code for right and left eyes, as well as different further strings that are not known. 

x <- c("id", "r_random_test_old", "r_test", "r_test_else", "l_method", "test_re", "method_le", "test_od_old", 
       "test_os_old", "re_mth_new","new_le_method", "new_r_test_random")
x
#>  [1] "id"                "r_random_test_old" "r_test"           
#>  [4] "r_test_else"       "l_method"          "test_re"          
#>  [7] "method_le"         "test_od_old"       "test_os_old"      
#> [10] "re_mth_new"        "new_le_method"     "new_r_test_random"

sort_substr(x, list(r = c("od","re"), l = c("os","le"), test = c("method", "mth"), time = c("old","new")))
#>  [1] "id"                 "r_test_time_random" "r_test"            
#>  [4] "r_test_else"        "l_test"             "r_test"            
#>  [7] "l_test"             "r_test_time"        "l_test_time"       
#> [10] "r_test_time"        "l_test_time"        "r_test_time_random"

sort_substr

sort_substr <- function(x, list_substr) {
  lookups <- data.frame(match = rep(names(list_substr), lengths(list_substr)), 
                        token = unlist(list_substr))
  l <- strsplit(x, "_", fixed = TRUE)
  DF <- data.frame(id = rep(seq_along(l), lengths(l)), token = unlist(l))
  match_token <- lookups$match[match(DF$token, lookups$token)]
  DF$match <- ifelse(is.na(match_token), DF$token, match_token)
  rest_token <- base::setdiff(DF$match, names(list_substr))
  DF$match <- factor(DF$match, levels = c(names(list_substr), rest_token))
  DF <- DF[with(DF, order(id, match)), ]
  out <- vapply(split(DF$match, DF$id), 
         paste, collapse = "_", 
         FUN.VALUE = character(1), 
         USE.NAMES = FALSE)
  out
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.