regex select multiple groups

Question

I have the following string, from which I want to extract the content between the second and third colons (ES5206956802492 in the example):

"20160607181026_0000005:0607181026000000501:ES5206956802492:479"

I am using R and specifically the stringr package to manipulate strings. The command I attempted to use is:

str_extract("20160607181026_0000005:0607181026000000501:ES5206956802492:479", ":(.*):")

where the last argument is the regex pattern. This produces the following result:

":0607181026000000501:ES5206956802492:"

I know that there is a way of grouping results and back-referencing them, which would allow me to select only the part I am interested in, but I can't figure out the right syntax.

How can I achieve this?

Sotos · Accepted Answer · 2016-06-08 09:39:48Z

3

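You can also use word() from stringr, which splits the string on a separator and returns the requested field:

library(stringr)
v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"
# take the 3rd field, using ':' as the separator
word(v1, 3, sep=':')
#[1] "ES5206956802492"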

answered Jun 8, 2016 at 9:39


4 Comments

akrun Over a year ago

Hmm, the word again :-)

Sotos Over a year ago

hehehe...this time I got the hint from your strsplit approach :)

Sotos Over a year ago

@ColonelBeauvel it is trending :)

thepule Over a year ago

Haha that is a neat solution indeed.

akrun · Accepted Answer · 2016-06-08 09:39:32Z

2

If the field we want starts with an uppercase letter, we can use a compact regex. Here we use a lookbehind ((?<=:)) to require a preceding : without including it in the match, then match an uppercase letter ([A-Z]) followed by one or more characters that are not a : ([^:]+).

str_extract(v1, "(?<=:)[A-Z][^:]+")
#[1] "ES5206956802492"
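As a rough illustration of when the letter-based pattern applies, here is a minimal base R sketch of the same lookbehind using regexpr() and regmatches() with perl = TRUE. The second input, v2, is a hypothetical string made up for this example. It has no field starting with an uppercase letter, so the pattern finds nothing.

```r
v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"
v2 <- "a:1:2:3"  # hypothetical input: no field starts with an uppercase letter

pattern <- "(?<=:)[A-Z][^:]+"

# perl = TRUE is needed because lookbehind is a PCRE feature
hit  <- regmatches(v1, regexpr(pattern, v1, perl = TRUE))
miss <- regmatches(v2, regexpr(pattern, v2, perl = TRUE))

print(hit)           # the field that starts with an uppercase letter
print(length(miss))  # regmatches() drops non-matches, so this is empty
```

If the target field may start with a digit, the position-based approaches are likely the safer choice.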

Or, if the extraction is by position (the third colon-separated field), a base R option is sub(). The pattern matches zero or more non-: characters ([^:]*), then the first :, then zero or more non-: characters, then the second :. It then captures the following run of non-: characters in a group ((...)) and matches the rest of the string (.*). In the replacement, the backreference \\1 stands for the first capture group, so the whole string is replaced by the captured field.

sub("[^:]*:[^:]*:([^:]+).*", "\\1", v1)
#[1] "ES5206956802492"

Or the repeated part can be grouped and quantified to make the pattern more compact. The target is then the second capture group (\\2):

sub("([^:]*:){2}([^:]+).*", "\\2", v1)
#[1] "ES5206956802492"
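The question asked about reading a capture group directly rather than substituting. One way to do that in base R, sketched here under the assumption that the third colon-separated field is the target, is regexec() together with regmatches(). This returns the full match followed by each captured group.

```r
v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

# regexec() records where the full match and each capture group start;
# regmatches() converts those positions into the matched strings.
m <- regmatches(v1, regexec("^[^:]*:[^:]*:([^:]*)", v1))

# m is a list with one element per input string;
# element 1 of that vector is the full match, element 2 is the first group
third_field <- m[[1]][2]
print(third_field)
```

With stringr, str_match() plays a similar role: it returns a matrix whose first column is the full match and whose later columns are the capture groups.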

Or with strsplit, we split at delimiter : and extract the 3rd element.

strsplit(v1, ":")[[1]][3]
#[1] "ES5206956802492"
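The strsplit() call above handles a single string. For a vector of strings, a sketch along the following lines may be useful. The inputs other than the first are hypothetical, and fixed = TRUE is used here so that the separator is treated as a literal character rather than a regex.

```r
ids <- c(
  "20160607181026_0000005:0607181026000000501:ES5206956802492:479",
  "a:b:c:d",  # hypothetical input
  "x:y"       # hypothetical input with fewer than three fields
)

# strsplit() returns a list with one character vector per input string
parts <- strsplit(ids, ":", fixed = TRUE)

# take the 3rd piece of each; indexing past the end yields NA
third <- vapply(parts, function(p) p[3], character(1))
print(third)
```

Strings with fewer than three fields give NA rather than an error, which makes missing values easy to spot afterwards.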

data

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

edited Jun 8, 2016 at 9:39

answered Jun 8, 2016 at 9:30


2 Comments

thepule Over a year ago

That's great @akrun. Could you explain the logic behind this?

akrun Over a year ago

@thepule I added some explanations.
