
I'm trying to split strings based on specific patterns. I have data nested within curly brackets. What I'm trying to do is split the string at the double curly bracket. I've figured out how to do this with "separate" within a data frame, but for future reference I'd love to know why this doesn't work.

I've provided an example below on a single string:

library(stringr)

pattern_test <- "[^\\}{2,2}]*\\}{2,2}"
teststring <- "{the {dog} is {hot}},{the {cat} is {lazy}}"
tmp <- unlist(str_extract_all(teststring, pattern_test))
tmp

tmp evaluates to ("hot}}", "lazy}}").

In words, what I'm trying to do in "pattern_test" is to define a class that includes all characters that are not exactly "}}": [^\\}{2,2}] and find as many characters in that class: *, followed by "}}" (outside the square brackets: \\}{2,2}). I suspect I'm making a fundamental error but most of the examples I've found online haven't helped me figure out what the error is. What I want tmp to evaluate to is:

("{the {dog} is {hot}}", ",{the {cat} is {lazy}}"). Why is the substring cutting off at the open bracket?

  • Why can't you simply split at ,? Commented Aug 9, 2021 at 18:40
  • I can't answer your question, but I would have tried a pattern like "\\{.*?\\}{2}". Commented Aug 9, 2021 at 18:47
  • Thanks @MartinGal, your solution worked. To answer your question, I created a simpler example where splitting on "," would've worked. It doesn't work in my real data. Would you mind explaining to me what your pattern is doing? Commented Aug 9, 2021 at 18:58
  • This pattern looks for a { followed by }} and extracts everything in between. The .*? is "non-greedy", so it takes as little as possible. Without the ? it returns the whole string, since the last characters are also }}. Not a very good explanation but I hope the idea is clear. Commented Aug 9, 2021 at 19:01
  • Consider accepting Wiktor Stribiżew's answer. Commented Aug 9, 2021 at 20:21

1 Answer

The problem is that you cannot match any text but a certain multichar substring with a negated character class, as character classes are meant to match single characters as separate chars, not as sequences of chars.
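A minimal sketch of this first point, written with base R so it does not depend on stringr. Inside square brackets, {2,2} is not a quantifier, so the class from the question simply excludes the single characters }, {, 2 and the comma. The variable names below are illustrative only.

```r
teststring <- "{the {dog} is {hot}},{the {cat} is {lazy}}"

# Pattern from the question: the class negates the characters }, {, 2 and ,
original <- "[^\\}{2,2}]*\\}{2,2}"
# The same class written out as a plain set of single characters
equivalent <- "[^}{2,]*\\}{2}"

m_original <- regmatches(teststring, gregexpr(original, teststring, perl = TRUE))[[1]]
m_equivalent <- regmatches(teststring, gregexpr(equivalent, teststring, perl = TRUE))[[1]]

print(m_original)
# Both patterns should return the same matches
print(identical(m_original, m_equivalent))
```

Because { is itself one of the excluded characters, the run of class characters can never extend back past an opening bracket, which is why the matches start right after the last {.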

Another issue is that you are trying to match a recursive pattern, and the stringr / stringi packages use the ICU regex library, which does not support recursion in regex.

To match what you want in R, you need the PCRE regex library, which base R functions such as gregexpr() use when perl=TRUE:

pattern_test <- "\\{(?:[^{}]++|(?R))*}"
teststring <- "{the {dog} is {hot}},{the {cat} is {lazy}}"
unlist(regmatches(teststring, gregexpr(pattern_test, teststring, perl = TRUE)))
## => [1] "{the {dog} is {hot}}"  "{the {cat} is {lazy}}"

See the R demo online. Pattern details:

  • \{ - match a {
  • (?:[^{}]++|(?R))* - zero or more occurrences of one or more chars other than { and } or the whole regex pattern (recursed)
  • } - a } char.
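
As a rough illustration of the breakdown above, the recursive pattern can be wrapped in a small helper and tried on input nested more deeply than the original example. The function name extract_balanced and the sample string are hypothetical, chosen only for this sketch.

```r
# Hypothetical helper (not part of any package): returns the top-level
# balanced {...} groups found in each element of x
extract_balanced <- function(x) {
  pattern <- "\\{(?:[^{}]++|(?R))*}"
  # perl = TRUE selects PCRE, which supports the (?R) recursion construct
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

deep <- "{x {y {z}}} and {w}"
res <- extract_balanced(deep)[[1]]
print(res)
```

Since (?R) re-enters the whole pattern at every nested {, the nesting depth does not need to be known in advance.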

3 Comments

I learnt something new, never heard of PCRE or ICU and I'm still struggling with your code. However, you stated that the only way is to use the PCRE regex library. But (at least with the example) str_extract_all(teststring, "\\{.*?\\}{2}") did return the expected result. Is this just a coincidence? Or are there any flaws in this pattern?
@MartinGal Your regex matches {, then any zero or more chars other than line break chars as few as possible, up to the leftmost }} substring. It will match {aaa{...}....}....}....}}. If you need this, fine.
Thank you both. I'll need to digest this, but I appreciate the help!
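
To make the difference discussed in these comments concrete, here is a small sketch comparing the lazy pattern with the recursive one on a string where a }} occurs before the outer group closes. The input string is made up for illustration, and base R with perl = TRUE is used for both patterns so the comparison runs on a single regex engine.

```r
x <- "{a {b {c}} d},{e}"

lazy <- "\\{.*?\\}{2}"                  # stops at the leftmost }} after a {
recursive <- "\\{(?:[^{}]++|(?R))*}"    # tracks balanced brackets

lazy_res <- regmatches(x, gregexpr(lazy, x, perl = TRUE))[[1]]
recursive_res <- regmatches(x, gregexpr(recursive, x, perl = TRUE))[[1]]

print(lazy_res)
print(recursive_res)
```

On this input the lazy pattern should cut the first group short at the inner }} and miss {e} altogether, while the recursive pattern returns both balanced groups. Which behaviour is wanted depends on the real data.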
