I'm trying to match markdown-style tags with a recursive regex.

Input Syntax

(TYPE: VALUE ATTR_KEY: ATTR_VALUE)

Note that a tag must start with: [a-z0-9_-]+:

Sample Inputs:

(image: sky.jpg)
(image: sky.jpg caption: Sky (Issue This) View)
(link: https://stackoverflow.com text: Stack Overflow)
(link: https://stackoverflow.com text: Stack Overflow rel=nofollow)
(video: http://www.youtube.com/watch?v=49Kh1mS4Fhs)

I'm currently using the following regex:

(?=[^\]])\([a-z0-9_-]+:.*?\)

But this is where the issue comes from, because it matches:

(image: sky.jpg caption: Sky (Issue This)

Expected match:

(image: sky.jpg caption: Sky (Issue This) View)

If parentheses are nested inside the tag's parentheses, the match stops at the first closing parenthesis instead of the one that closes the tag.
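
The truncation can be reproduced with a minimal PHP sketch, assuming preg_match and the "~" delimiter (the variable names are only for illustration):

```php
<?php
// Current (non-recursive) regex from above, wrapped in "~" delimiters.
$pattern = '~(?=[^\]])\([a-z0-9_-]+:.*?\)~';

// One of the sample inputs, with nested parentheses in the caption.
$text = '(image: sky.jpg caption: Sky (Issue This) View)';

preg_match($pattern, $text, $m);

// The lazy .*? stops at the first ")" it can reach.
echo $m[0], PHP_EOL; // (image: sky.jpg caption: Sky (Issue This)
```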

I tried the following recursive patterns and they work, but I need to restrict the match to tags that start with the characters above.

(?s)\((?:[^()]+|(?R))*+\)
\((?:[^)(]+|(?R))*+\)
  • Can't you just match whole tag and then process it with PHP splitting string? Commented Mar 4, 2020 at 13:37
  • @Justinas I process only the tag ones, not inside every parenthesis. Commented Mar 4, 2020 at 13:39

1 Answer

You should use a positive lookahead to make sure the match starts with that pattern, but you will have to wrap the whole parentheses-matching pattern within a capturing group and use a (?1) subroutine call instead of (?R) so that only that group is recursed, not the whole regex. With (?R) the lookahead would be recursed too, so every nested pair of parentheses would also have to start with the TYPE: prefix:

(?=\([a-z0-9_-]+:)(\((?:[^()]+|(?1))*+\))
^^^^^^^^^^^^^^^^^^^            ^^^^     ^

See the regex demo.

Details

  • (?=\([a-z0-9_-]+:) - a positive lookahead that requires (, then 1+ lowercase ASCII letters, digits, underscores or hyphens, followed by : immediately to the right of the current location
  • (\((?:[^()]+|(?1))*+\)) - Capturing group 1 (it will be recursed later):
    • \( - (
    • (?:[^()]+|(?1))*+ - 0 or more repetitions (matched possessively, i.e. without backtracking) of either 1+ chars other than ( and ), or the whole Group 1 pattern (recursed)
    • \) - )
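
As a rough illustration, here is how the pattern might be used from PHP. This is only a sketch: it assumes preg_match_all with "~" as the delimiter, two tags are taken from the question, and the surrounding prose plus the unbalanced "(video: clip.mp4 (oops" tag are made up for the demo.

```php
<?php
// Pattern from the answer; "~" is used as the delimiter so nothing needs extra escaping.
$pattern = '~(?=\([a-z0-9_-]+:)(\((?:[^()]+|(?1))*+\))~';

// Two tags from the question plus invented filler text and an invented broken tag.
$text = 'Intro (not a tag) (image: sky.jpg caption: Sky (Issue This) View) '
      . 'then (link: https://stackoverflow.com text: Stack Overflow) '
      . 'and a broken one (video: clip.mp4 (oops';

preg_match_all($pattern, $text, $matches);

// Group 1 holds each balanced tag, including its nested parentheses.
$tags = $matches[1];
print_r($tags);
// [0] => (image: sky.jpg caption: Sky (Issue This) View)
// [1] => (link: https://stackoverflow.com text: Stack Overflow)
```

"(not a tag)" is skipped because it fails the lookahead, and the unbalanced tag at the end is skipped because its parentheses never close.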

In case you want to also support smileys, you may add their specific patterns in the alternation group where the regex subroutine resides, as the first alternative:

(?=\([a-z0-9_-]+:)(\((?::[)(]|[^()]|(?1))*+\))
                        ^^^^^  

I added :[)(], which matches :) or :(, and removed the + after [^()] so that the string inside nested parentheses is checked character by character and the smiley alternative gets a chance to match at every position.
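
To see the difference, here is a small PHP comparison of the two patterns. It is only a sketch: the caption text with the ":)" smiley is a made-up input, and the variable names are illustrative.

```php
<?php
// Basic pattern vs. the variant with the smiley alternative, both from this answer.
$basic  = '~(?=\([a-z0-9_-]+:)(\((?:[^()]+|(?1))*+\))~';
$smiley = '~(?=\([a-z0-9_-]+:)(\((?::[)(]|[^()]|(?1))*+\))~';

// Hypothetical tag whose caption contains a ":)" smiley.
$text = '(image: sky.jpg caption: Nice view :) today)';

preg_match($basic, $text, $m1);
preg_match($smiley, $text, $m2);

// The basic pattern treats the ")" of the smiley as the closing parenthesis.
echo $m1[0], PHP_EOL; // (image: sky.jpg caption: Nice view :)

// The smiley variant consumes ":)" as a unit and continues to the real close.
echo $m2[0], PHP_EOL; // (image: sky.jpg caption: Nice view :) today)
```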

Feel free to adjust it to your needs, or add more smiley patterns.

See this regex demo with the (?=\([a-z0-9_-]+:)(\((?::(?:[()pPDd*oO]|'\()|<3|;\)|[^()]|(?1))*+\)) regex.

5 Comments

Naturally, tags with missing parentheses should not match. But it also doesn't match when we use emoji, though that's not very important. Do you think there is a simple solution to this? regex101.com/r/1Tti14/4
Thank you for your great help. This was also very good. But it's true that it didn't match when brackets were actually missing. It should be possible to use emoji inside the parentheses, but it should not match when parentheses are missing. Is this hard?
Hmm, it's still matching when a bracket is missing :( An error can occur when I try to parse it.
I misunderstood. So, a missing bracket should not match and emoji should? Like in (?=\([a-z0-9_-]+:)(\((?::[)(]|[^()]|(?1))*+\)) (demo)?
That's exactly it. How can you write so well? I really thank you again.
