
I'm creating a dictionary application in PHP and MariaDB, and trying to simulate some basic markdown. When I have a definition like this:

This is an example definition. Here is a link to [foo]. This is an [aliased link|bar].

Then [foo] will be translated into a link to the 'foo' definition page, and [aliased link|bar] will translate into a link to the 'bar' definition page. If there's a pipe then whatever's before the pipe (|) will become the link text, and after the pipe becomes the link destination. If there's no pipe, then the expression in brackets becomes the link text and destination.

So I would translate this to the following HTML:

This is an example definition. Here is a link to <a href="foo">foo</a>. This is an <a href="bar">aliased link</a>.

The easiest way I could think of to do this was through two regex replaces. So let's say my example string is called $def, here is the code I've tried to make these replacements:

$pattern1 = '/\[(.*?)?\]/m';
$replace1 = '<a href="$1">$1</a>';
$def = preg_replace($pattern1, $replace1, $def);

$pattern2 = '/\[([^]]*?)(?:\|([^]]*?))\]/m';
$replace2 = '<a href="$2">$1</a>';
$def = preg_replace($pattern2, $replace2, $def);

(I assumed it would be easier to do it using two regexes, but if there's a simpler one-regex solution I'd love to know.)

However, I've clearly got something wrong with the regex, as this is what happens when I echo $def (the links are just illustrative for now, the destination isn't important):

This is an example definition. Here is a link to foo. This is an aliased link|bar.

And the HTML:

"This is an example definition. Here is a link to "
<a href="foo">foo</a>
". This is an" 
<a href="aliased link|bar">aliased link|bar</a>
"."

Can anyone advise what I need to do to fix the regex to get my desired result? I'm especially confused because when I test this regex on www.regex101.com, it seems to do exactly what I think it should do.

I'm using PHP 7.4.6 with XAMPP and Apache, and viewing the page in Google Chrome.

  • Your second regex isn't wrong but it doesn't do anything because the first preg_replace has already replaced both links. Commented Mar 27, 2021 at 18:53
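
As a rough illustration of this comment, here is a minimal sketch that keeps the two patterns from the question unchanged and only swaps the order, so the pipe-aware pattern runs before the catch-all one. It assumes the definition contains no other square brackets.

```php
<?php
$def = 'This is an example definition. Here is a link to [foo]. This is an [aliased link|bar].';

// Run the aliased-link pattern first: it requires a pipe, so [foo] is left alone.
$pattern2 = '/\[([^]]*?)(?:\|([^]]*?))\]/m';
$replace2 = '<a href="$2">$1</a>';
$def = preg_replace($pattern2, $replace2, $def);

// Only plain [target] expressions are left for the catch-all pattern.
$pattern1 = '/\[(.*?)?\]/m';
$replace1 = '<a href="$1">$1</a>';
$def = preg_replace($pattern1, $replace1, $def);

echo $def, PHP_EOL;
// This is an example definition. Here is a link to <a href="foo">foo</a>. This is an <a href="bar">aliased link</a>.
```

With the original order, the first pattern consumes [aliased link|bar] as a whole, so nothing is left for the second pattern to match.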

1 Answer


In the pattern you used, you can add | to the first negated character class so that the first group cannot match a pipe, which prevents some backtracking. The quantifier on a negated character class also does not need to be lazy (*?), because the class cannot match past the closing ].

You could use 2 capture groups where the second group is in an optional part and check for the presence of group 2 using preg_replace_callback.

\[([^][|]+)(?:\|([^][]+))?]

The pattern matches:

  • \[ Match [
  • ([^][|]+) Capture group 1, match 1+ times any char except [, ] and |
  • (?:\|([^][]+))? Optional non-capturing group: match | and then capture in group 2 one or more chars other than [ and ]
  • ] Match closing ]
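
To make the breakdown above concrete, this small sketch runs the pattern against each link form with preg_match and inspects the groups. The variable names $plain and $aliased are only illustrative. When the optional part does not take part in the match, PHP leaves the trailing group 2 out of the match array, which is what makes an array_key_exists check on index 2 possible.

```php
<?php
$pattern = '/\[([^][|]+)(?:\|([^][]+))?\]/';

// Plain link: the optional part is skipped, so there is no index 2.
preg_match($pattern, '[foo]', $plain);

// Aliased link: group 1 is the link text, group 2 the destination.
preg_match($pattern, '[aliased link|bar]', $aliased);

var_export($plain);   // full match and group 1 only
echo PHP_EOL;
var_export($aliased); // full match, group 1 and group 2
echo PHP_EOL;
```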


$pattern = "/\[([^][|]+)(?:\|([^][]+))?\]/";
$s = "This is an example definition. Here is a link to [foo]. This is an [aliased link|bar].";
$s = preg_replace_callback($pattern, function($match){
    $template = '<a href="%s">%s</a>';
    return sprintf($template, array_key_exists(2, $match) ? $match[2] : $match[1], $match[1]);
}, $s);

echo $s;

Output

This is an example definition. Here is a link to <a href="foo">foo</a>. This is an <a href="bar">aliased link</a>.
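
If the definitions come from user input, one possible variation is to wrap the same callback in a function and escape both values before building the tag. This is only a sketch: the function name renderLinks and the escaping choices (rawurlencode for the destination, htmlspecialchars for the link text) are assumptions, not part of the answer above. It also uses the ?? operator in place of array_key_exists.

```php
<?php
// Hypothetical helper; the name and the escaping policy are assumptions.
function renderLinks(string $text): string
{
    $pattern = '/\[([^][|]+)(?:\|([^][]+))?\]/';

    return preg_replace_callback($pattern, function (array $m): string {
        $label  = $m[1];
        // Fall back to the label when no |destination part was given.
        $target = $m[2] ?? $m[1];

        return sprintf(
            '<a href="%s">%s</a>',
            rawurlencode($target),
            htmlspecialchars($label, ENT_QUOTES, 'UTF-8')
        );
    }, $text);
}

echo renderLinks('See [foo] and [aliased link|bar].'), PHP_EOL;
// See <a href="foo">foo</a> and <a href="bar">aliased link</a>.
```

Note that the text outside the brackets is passed through unchanged here; whether it also needs escaping depends on where the definitions come from.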

4 Comments

That's fantastic, the answer works perfectly, thank you! Just one question - why do you use greedy matches for the two capture groups, i.e. + instead of +?? I always thought it was more performant to use lazy matches, especially as the capture groups in question will still capture everything until they reach a ] character in this case.
@Lou With a lazy quantifier the engine has to try the rest of the pattern after every character and extend the match each time that fails, so there is backtracking. You don't need a lazy quantifier in this case, because a greedy negated character class cannot match past the ].
Ah okay, so + would actually perform better than +?, potentially?
@Lou In this case it does. But take a scenario where you want to match up to qq and you cannot use a negated character class [^q] (it would stop at the first q and never reach the qq). There, according to the step counts in the regex101 tool, a lazy match can take fewer steps, depending on the length of the string, when the qq is located early in the string.
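
As a quick sanity check on the comments above, the following sketch compares a greedy and a lazy version of the answer's pattern on the example string. It only shows that both find the same matches; it does not measure steps or timing, and the variable names are illustrative.

```php
<?php
$s = 'Here is a link to [foo]. This is an [aliased link|bar].';

$greedy = '/\[([^][|]+)(?:\|([^][]+))?\]/';
$lazy   = '/\[([^][|]+?)(?:\|([^][]+?))?\]/';

preg_match_all($greedy, $s, $greedyMatches, PREG_SET_ORDER);
preg_match_all($lazy, $s, $lazyMatches, PREG_SET_ORDER);

// Both variants must stop at the same ] (or |), so the captures are identical.
var_dump($greedyMatches === $lazyMatches); // bool(true)
```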
