I have a column in a data.table full of strings in the format string+integer, e.g.

string1, string2, string3, string4, string5, ...

When I use sort(), the strings come out in the wrong order:

string1, string10, string11, string12, string13, ..., string2, string20, 
string21, string22, string23, ....

How would I sort these to be in the order

string01, string02, string03, string04, string05, ..., string10, string11,
string12, etc.

One method could be to add a leading 0 to each single-digit integer (1-9). I suspect you would extract the string part with str_extract(dt$string_column, "[a-z]+") and then add a 0 to each single-digit integer, somehow with sprintf().
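
The zero-padding idea can be sketched in base R as follows. This is only a minimal illustration, assuming every entry is a non-digit prefix followed by an integer below 100; the vector x and the names prefix, num and padded are made up for the example.

```r
# Hypothetical input; stands in for dt$string_column.
x <- c("string1", "string10", "string2", "string11")

prefix <- sub("\\d+$", "", x)             # drop the trailing digits -> "string"
num    <- as.integer(sub("^\\D+", "", x)) # drop the leading non-digits -> 1, 10, 2, 11

# "%02d" pads integers to two digits with leading zeros.
padded <- sprintf("%s%02d", prefix, num)
sorted_padded <- sort(padded)
sorted_padded
```

With more than two digits the format width would need to grow accordingly (for example "%03d").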

4 Answers


We can strip the non-digit characters and order the rows by the integer that remains:

dt[order(as.integer(gsub("\\D+", "", col1)))]
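
A self-contained version of the one-liner above might look like this. The table dt and its column col1 are placeholder data, and the sketch assumes every entry contains digits.

```r
library(data.table)

# Placeholder data standing in for the real table.
dt <- data.table(col1 = c("string1", "string10", "string2", "string20"))

# gsub() removes every run of non-digits, leaving only the number as text;
# as.integer() makes the comparison numeric instead of character-based.
num <- as.integer(gsub("\\D+", "", dt$col1))

# order() returns row positions, which data.table uses to rearrange the rows.
sorted <- dt[order(num)]
sorted
```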

3 Comments

What would be the difference between using sort() and order() here? It appears sort() leads to the same errors.
The other problem I'm having with this: there are two strings labeled stringX and stringZ. These do not have integers and should be put last. The above as.integer() will result in NAs unless I somehow exclude these.
@ShanZhengYang Sorry, in your post you didn't mention all these things and the reproducible example was incomplete.
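
Both points raised in these comments can be illustrated in base R. This is a hedged sketch rather than the answerer's code: the vector x is example data, and it relies on order() placing NA keys last (na.last = TRUE, spelled out here for clarity).

```r
# Example data, including two entries without a number.
x <- c("string1", "string3", "stringZ", "string2", "stringX", "string10")

# sort() compares the strings themselves, so "string10" lands before "string2".
sort(x)

# order() returns the permutation that sorts its keys, and the keys may differ
# from the values being rearranged. gsub() leaves "" for entries without
# digits, which as.integer() turns into NA.
num <- as.integer(gsub("\\D+", "", x))

# Primary key: the number (NAs last); secondary key: the string itself,
# so the non-numeric entries end up last in alphabetical order.
sorted <- x[order(num, x, na.last = TRUE)]
sorted
```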

You could go for mixedsort in gtools:

vec <- c("string1", "string10", "string11", "string12", "string13","string2", 
         "string20", "string21", "string22", "string23")

library(gtools)
mixedsort(vec)

#[1] "string1"  "string2"  "string10" "string11" "string12" "string13"
#[7] "string20" "string21" "string22" "string23"
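
Since the question is about a data.table column rather than a plain vector, the companion function mixedorder() from the same package may be more convenient, because it returns positions instead of sorted values. A small sketch, with dt and col1 as placeholder names:

```r
library(gtools)
library(data.table)

# Placeholder data standing in for the real table.
dt <- data.table(col1 = c("string1", "string10", "string2", "string20"))

# mixedorder() gives the row permutation, so whole rows move together.
sorted <- dt[mixedorder(col1)]
sorted
```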


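You could use str_extract() from the stringr package to pull out the digits and order by them, then append the strings without digits at the end:

library(stringr)
x <- c("string1", "string3", "stringZ", "string2", "stringX", "string10")
has_num <- grepl("\\d+", x)  # TRUE where the string contains digits
# numbered strings ordered by their integer part, then the rest alphabetically
c(x[has_num][order(as.integer(str_extract(x[has_num], "\\d+")))],
  sort(x[!has_num]))
#[1] "string1"  "string2"  "string3"  "string10" "stringX"  "stringZ"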

1 Comment

(Apologies for not mentioning this in the original question) This approach would work, except for one problem: there are two strings labeled stringX and stringZ. These do not have integers and should be put last. The above as.integer() will result in NAs unless I somehow exclude these.

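Assuming the strings look like the ones below, you can left-pad the single-digit numbers with a 0 and then sort:

library(data.table)
library(stringr)

xstring <- data.table(x = c("string1", "string11", "string2", "string10", "stringx"))
# digits that follow "string" ("" when there are none)
extracts <- str_extract(xstring$x, "(?<=string)(\\d*)")
# prepend "0" to single-digit numbers; leave two-digit and empty matches alone
y_string <- ifelse(nchar(extracts) == 2 | extracts == "", extracts, paste0("0", extracts))
fin_string <- str_replace(xstring$x, "(?<=string)(\\d*)", y_string)
sort(fin_string)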

Output:

> sort(fin_string)
[1] "string01" "string02" "string10" "string11"
[5] "stringx"
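
The approach above is written for numbers of at most two digits. One way it might be generalized, sketched here in base R with made-up data, is to compute the padding width from the longest number and let formatC() add the zeros. The names digits, width, padded_digits and padded are illustrative only.

```r
# Made-up data with a three-digit number and one entry without digits.
x <- c("string1", "string11", "string2", "string100", "stringx")

digits <- sub("^\\D+", "", x)   # trailing number as text, "" when absent
width  <- max(nchar(digits))    # widest number decides the padding

# formatC(..., flag = "0") left-pads with zeros; entries without digits stay "".
padded_digits <- ifelse(digits == "", "",
                        formatC(as.integer(digits), width = width, flag = "0"))

padded <- paste0(sub("\\d+$", "", x), padded_digits)
sorted_padded <- sort(padded)
sorted_padded
```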
