Splitting a string with regex in R

Question

I'm trying to split strings based on specific patterns. I have data nested within curly brackets. What I'm trying to do is split the string at the double curly bracket. I've figured out how to do this with "separate" within a data frame, but for future reference I'd love to know why this doesn't work.

I've provided an example below on a single string:

library(stringr)

pattern_test <- "[^\\}{2,2}]*\\}{2,2}"
teststring <- "{the {dog} is {hot}},{the {cat} is {lazy}}"
tmp <- unlist(str_extract_all(teststring, pattern_test))
tmp

tmp evaluates to ("hot}}", "lazy}}").

In words, what I'm trying to do in "pattern_test" is to define a class that includes all characters that are not exactly "}}": [^\\}{2,2}] and find as many characters in that class: *, followed by "}}" (outside the square brackets: \\}{2,2}). I suspect I'm making a fundamental error but most of the examples I've found online haven't helped me figure out what the error is. What I want tmp to evaluate to is:

("{the {dog} is {hot}}", ",{the {cat} is {lazy}}"). Why is the substring cutting off at the open bracket?

I can't answer your question, but I would have tried a pattern like "\\{.*?\\}{2}".
– Martin Gal, Commented Aug 9, 2021 at 18:47
Thanks @MartinGal, your solution worked. To answer your question, I created a simpler example where splitting on "," would've worked. It doesn't work in my real data. Would you mind explaining to me what your pattern is doing?
– William Dowd, Commented Aug 9, 2021 at 18:58
This pattern looks for a { followed by }} and extracts both, plus everything in between. The .*? is "non-greedy", so it takes as little as possible. Without the ? it returns the whole string, since the last characters are also }}. Not a very good explanation but I hope the idea is clear.
– Martin Gal, Commented Aug 9, 2021 at 19:01

Wiktor Stribiżew · Accepted Answer · 2021-08-09 19:04:04Z

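The problem is that a negated character class cannot express "any text except a certain multi-character substring": a character class matches one character at a time, never a sequence of characters. Inside the brackets, {2,2} is not a quantifier either, so [^\\}{2,2}] means "any single character other than }, {, 2 or ,". That is why your matches are cut off at every {.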
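
As a small illustration of the point above, here is a sketch that uses base R with perl = TRUE instead of stringr, so it runs without extra packages. The sample string "a2,b{c}" is made up for this example. It shows that the bracket expression from the question behaves as a set of four single characters:

```r
# The class [^\}{2,2}] is a negated set of single characters: } { 2 and the comma.
# The quantifier-looking {2,2} is literal inside the brackets.
char_class <- "[^\\}{2,2}]"

# Delete everything the class matches; what remains is what the class excludes.
excluded <- gsub(char_class, "", "a2,b{c}", perl = TRUE)
print(excluded)
# [1] "2,{}"

# The question's pattern therefore cannot cross a "{", a "2" or a ",".
pattern_test <- "[^\\}{2,2}]*\\}{2,2}"
teststring <- "{the {dog} is {hot}},{the {cat} is {lazy}}"
matches <- regmatches(teststring, gregexpr(pattern_test, teststring, perl = TRUE))[[1]]
print(matches)
# [1] "hot}}"  "lazy}}"
```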

Another issue is that you are trying to match a recursive (nested) pattern, and the stringr / stringi packages use the ICU regex library, which does not support recursion in regex.

To match what you want in R, you need the PCRE regex engine, which the base R regex functions use when perl=TRUE is set:

pattern_test<-"\\{(?:[^{}]++|(?R))*}"
teststring <- "{the {dog} is {hot}},{the {cat} is {lazy}}"
unlist(regmatches(teststring, gregexpr(pattern_test, teststring, perl=TRUE)))
## => [1] "{the {dog} is {hot}}"  "{the {cat} is {lazy}}"

Pattern details:

\{ - match a {
(?:[^{}]++|(?R))* - zero or more occurrences of one or more chars other than { and } or the whole regex pattern (recursed)
} - a } char.
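
To see the recursion at work, here is a sketch that applies the same pattern to a string with one more level of nesting. The input string and the variable names nested and blocks are invented for illustration. Each top-level block comes back whole, however deep the braces inside it go:

```r
# Same recursive pattern as above; (?R) re-enters the whole pattern,
# so every inner {...} is consumed as one balanced unit.
pattern_test <- "\\{(?:[^{}]++|(?R))*}"

# Hypothetical input with three levels of nesting in the first block.
nested <- "{a {b {c}} d},{e}"

blocks <- regmatches(nested, gregexpr(pattern_test, nested, perl = TRUE))[[1]]
print(blocks)
# [1] "{a {b {c}} d}" "{e}"
```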

3 Comments

Martin Gal Over a year ago

I learnt something new: I had never heard of PCRE or ICU, and I'm still struggling with your code. However, you stated that the only way is to use the PCRE regex library. But (at least with the example) str_extract_all(teststring, "\\{.*?\\}{2}") did return the expected result. Is this just a coincidence? Or are there any flaws in this pattern?

Wiktor Stribiżew Over a year ago

@MartinGal Your regex matches {, then any zero or more chars other than line break chars as few as possible, up to the leftmost }} substring. It will match {aaa{...}....}....}....}}. If you need this, fine.
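
To illustrate the difference described in this comment, here is a sketch. The input is a made-up variation of the question's string in which the first block ends in a single }. Both patterns are run through base R with perl = TRUE so that they use the same engine:

```r
# Hypothetical input: the first block closes with one "}", not "}}".
input <- "{the {dog} barks},{the {cat} is {lazy}}"

# Lazy pattern: runs from the first "{" to the leftmost "}}",
# which here only occurs at the very end of the string.
lazy <- regmatches(input, gregexpr("\\{.*?\\}{2}", input, perl = TRUE))[[1]]
print(lazy)
# [1] "{the {dog} barks},{the {cat} is {lazy}}"

# Recursive pattern: tracks balanced braces, so each block is returned separately.
recursive <- regmatches(input, gregexpr("\\{(?:[^{}]++|(?R))*}", input, perl = TRUE))[[1]]
print(recursive)
# [1] "{the {dog} barks}"     "{the {cat} is {lazy}}"
```

The lazy pattern is fine when every block is known to end in }} and no }} occurs earlier inside a block. The recursive one does not depend on that assumption.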

William Dowd Over a year ago

Thank you both. I'll need to digest this, but I appreciate the help!
