I have the following string, from which I want to extract the content between the second and third colons (ES5206956802492 in the example):

"20160607181026_0000005:0607181026000000501:ES5206956802492:479"

I am using R and specifically the stringr package to manipulate strings. The command I attempted to use is:

str_extract("20160607181026_0000005:0607181026000000501:ES5206956802492:479", ":(.*):")

where the regex pattern is expressed at the end of the command. This produces the following result:

":0607181026000000501:ES5206956802492:"

I know that there is a way of grouping results and back-referencing them, which would allow me to select only the part I am interested in, but I can't figure out the right syntax.

How can I achieve this?

2 Answers

3

You can also use word from stringr, treating : as the separator and taking the third field:

library(stringr)
v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"
word(v1, 3, sep=':')
#[1] "ES5206956802492"

4 Comments

Hmm, the word again :-)
hehehe...this time I got the hint from your strsplit approach :)
@ColonelBeauvel it is trending :)
Haha that is a neat solution indeed.
2

If the field you want is the only one that begins with an uppercase letter, we can use a compact regex. Here, we use a regex lookbehind ((?<=:)) to require a preceding :, match one uppercase letter ([A-Z]), and then match one or more characters that are not a : ([^:]+).

str_extract(v1, "(?<=:)[A-Z][^:]+")
#[1] "ES5206956802492"
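
One caveat worth illustrating: str_extract returns the first match, so this pattern depends on no earlier field starting with an uppercase letter. The sketch below uses a made-up second input, v2, and a hypothetical helper name, first_upper_field; neither comes from the question.

```r
library(stringr)

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"
# Hypothetical input (an assumption for illustration): the second field
# also starts with uppercase letters.
v2 <- "20160607181026_0000005:AB0607181026:ES5206956802492:479"

# Hypothetical helper wrapping the lookbehind pattern from the answer.
first_upper_field <- function(x) str_extract(x, "(?<=:)[A-Z][^:]+")

first_upper_field(v1)  # the third field, as intended
first_upper_field(v2)  # the second field, because it is the first match
```

If the position of the field matters more than its content, the position-based patterns are likely the safer choice.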

Or, if the field is identified by its position (the text after the second :), a base R option is sub. The pattern matches zero or more non-: characters ([^:]*) followed by the first :, then zero or more non-: characters followed by the second :. It then captures the following run of non-: characters in a group ((...)) and matches the rest of the string (.*). In the replacement, we use the backreference \\1 (the first capture group), so the whole string is replaced by the captured field.

sub("[^:]*:[^:]*:([^:]+).*", "\\1", v1)
#[1] "ES5206956802492"

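Or the repeated part can be grouped and quantified to make the pattern more compact. The field we want is then the second capture group, so the replacement is \\2.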

sub("([^:]*:){2}([^:]+).*", "\\2", v1)
#[1] "ES5206956802492"
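
Since the question asks about capture groups in stringr specifically, the same idea can be sketched with str_match, which returns the capture groups alongside the full match. The original attempt, ":(.*):", returned too much because .* is greedy and runs from the first colon to the last one. The variable names m and third_field are only illustrative.

```r
library(stringr)

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

# str_match returns a character matrix: column 1 holds the full match,
# and each later column holds one capture group.
# (?:[^:]*:){2} skips the first two colon-terminated fields without
# capturing them; ([^:]+) captures the field that follows.
m <- str_match(v1, "^(?:[^:]*:){2}([^:]+)")

third_field <- m[, 2]
third_field
```

Because the skipped part uses a non-capturing group, the wanted field is the first capture group and sits in column 2 of the matrix.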

Or with strsplit, we split at the delimiter : and extract the third element.

strsplit(v1, ":")[[1]][3]
#[1] "ES5206956802492"
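
If there are several strings rather than one, the strsplit approach might be extended as in the sketch below. The extra inputs in x are made up for illustration, and fixed = TRUE is an optional choice that treats ":" as a literal string rather than a regex.

```r
v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

# Hypothetical extra inputs, including one with fewer than three fields.
x <- c(v1, "a:b:c:d", "only:two")

# strsplit returns one character vector per input string; vapply then
# takes the third element of each and guarantees a character result.
third <- vapply(
  strsplit(x, ":", fixed = TRUE),
  function(parts) parts[3],
  character(1)
)

third
```

Indexing past the end of a character vector yields NA, so inputs with fewer than three fields produce NA instead of an error.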

data

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

2 Comments

That's great @akrun. Could you explain the logic behind this?
@thepule I added some explanations.
