I’m struggling to get a bit of regular expression code to work. I have a long list of strings from which I need to extract a part. I only need the strings that start with “WER”, and from each of those I only need the final part of the string, beginning at (and including) the last letter.

test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")

Here is the line of code I have so far. It works, but only for one letter (H), whereas the strings in my list can contain any letter.

ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")

What I’m hoping to achieve is the following:

H987654
G789456
F12

2 Answers

You can use the following pattern with gsub:

> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] ""        "H987654" "G789456" "F12"     "" 

This pattern matches:

  • ^ - start of a string
  • (?: - start of an alternation group with 2 alternatives:
    • WER.*([a-zA-Z]\\d*) - WER char sequence followed with 0+ any characters (.*) as many as possible up to the last letter ([a-zA-Z]) followed by 0+ digits (\\d*) (replace with \\d+ to match 1+ digits, to require at least 1 digit)
    • | - or
    • .* - any 0+ characters (this alternative leaves group 1 empty, so the replacement yields an empty string)
  • )$ - closes the alternation group and matches the end of the string with $.
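
The breakdown above can be sketched as a small helper. The function name `extract_tail` and the `require_digit` switch are hypothetical, introduced only for illustration, and `perl = TRUE` is an assumption made here so that the two alternatives are tried strictly left to right (the answer itself uses the default engine):

```r
# Hypothetical helper; not part of the original answer.
extract_tail <- function(x, require_digit = FALSE) {
  # \\d* allows a bare trailing letter, \\d+ demands at least one digit after it
  digits  <- if (require_digit) "\\d+" else "\\d*"
  pattern <- paste0("^(?:WER.*([a-zA-Z]", digits, ")|.*)$")
  # sub() is enough: the pattern is anchored and matches the whole string once
  sub(pattern, "\\1", x, perl = TRUE)
}

test <- c("abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123")
extract_tail(test)
# [1] ""        "H987654" "G789456" "F12"     ""

extract_tail("WERF")                        # "F"
extract_tail("WERF", require_digit = TRUE)  # ""
```

The last two calls show the difference between \\d* and \\d+: a string that ends in a letter with no digits after it is kept in the first case and blanked in the second.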

With str_match from stringr, it is even tidier:

> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA        "H987654" "G789456" "F12"     NA       

If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").
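
The effect of (?s) can be checked without stringr. The sketch below is a base-R analogue that assumes the PCRE engine (`perl = TRUE`), where the dot also skips newlines by default; the input string `x` is a made-up example, not data from the question:

```r
# Hypothetical input containing a newline in the middle
x <- "WER00\n04H987654"

# Same pattern with and without the (?s) "dot matches newline" flag
plain  <- regmatches(x, regexec("^WER.*([a-zA-Z]\\d*)$", x, perl = TRUE))[[1]]
dotall <- regmatches(x, regexec("(?s)^WER.*([a-zA-Z]\\d*)$", x, perl = TRUE))[[1]]

length(plain)  # 0: the dot cannot cross the newline, so nothing matches
dotall[2]      # "H987654": element 1 is the full match, element 2 is group 1
```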

1 Comment

Just FYI: "^.*H.*?" matches the whole chunk of a string from its beginning till the last H, and stops right there as .*? at the end does not consume/return any characters (because it is a lazy subpattern and can match an empty string, so it matches the empty location after H and calls it a day). This matched chunk is replaced with "H" by gsub.
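
A quick way to see this is to look at what the pattern actually matches. This is only a sketch, and `perl = TRUE` is an assumption added here to pin down the lazy-quantifier behaviour (the question's code uses the default engine):

```r
test <- c("abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123")
x <- test[2]

# The matched chunk runs from the start of the string to the last H
chunk <- regmatches(x, regexpr("^.*H.*?", x, perl = TRUE))
chunk
# [1] "WER0004H"

# Strings without an H are not matched, so gsub returns them unchanged
res <- ifelse(substr(test, 1, 3) == "WER", gsub("^.*H.*?", "H", test, perl = TRUE), "")
res
# [1] ""                "H987654"         "WER12400G789456" "WERF12"          ""
```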
If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:

sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12" 
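
One thing to keep in mind with this pattern is that the letters of the "WER" prefix can themselves match [A-Z]: for a string such as "WER123" it returns "R123". A possible variant, sketched below, strips the prefix first and then pulls out the trailing letter-plus-digits part with regmatches. The function name `tail_after_prefix` is hypothetical, and restricting the tail to digits is an assumption based on the sample data:

```r
# Hypothetical helper; not part of the original answer.
tail_after_prefix <- function(x, prefix = "WER") {
  keep <- x[startsWith(x, prefix)]                  # only strings with the prefix
  rest <- substring(keep, nchar(prefix) + 1)        # drop the prefix itself
  # regmatches() silently drops elements that have no match
  regmatches(rest, regexpr("[A-Za-z][0-9]*$", rest))
}

test <- c("abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123")
tail_after_prefix(test)
# [1] "H987654" "G789456" "F12"

tail_after_prefix("WER123")   # character(0): no letter after the prefix
```

As with the answer's approach, the result can be shorter than the input, so it cannot be assigned back to a column of the same length without keeping track of which elements matched.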
