Vim / sed regex backreference in search pattern

Question

Vim help says that:

\1      Matches the same string that was matched by     */\1* *E65*
        the first sub-expression in \( and \). {not in Vi}
        Example: "\([a-z]\).\1" matches "ata", "ehe", "tot", etc.

It looks like a backreference can be used in the search pattern. I started playing with it and noticed behavior that I can't explain. This is my file:

<paper-input label="Input label"> Some text </paper-input>
<paper-input label="Input label"> Some text </paper-inputa>
<aza> Some text </az>
<az> Some text </az>
<az> Some text </aza>

I wanted to match the lines where the opening and closing tags match, i.e.:

<paper-input label="Input label"> Some text </paper-input>
<az> Some text </az>

And my test regex is:

%s,<\([^ >]\+\).*<\/\1>,,gn

But this matches lines: 1, 3 and 4. Same thing with sed:

$ sed -ne 's,<\([^ >]\+\).*<\/\1>,\0,p' file
<paper-input label="Input label"> Some text </paper-input>
<aza> Some text </az>
<az> Some text </az>

This: <\([^ >]\+\) should be greedy, and when I try the pattern without \1 at the end, all the groups are correct. But when I add \1, it seems that <\([^ >]\+\) stops being greedy and the engine forces a match on the 3rd line. Can someone explain why it matches the 3rd line:

<aza> Some text </az>

There is also a regex101 demo.

NOTE: This is not about the regex itself (there is probably another way to do it) but about the behavior of that regex.

You should take a look at backtracking engines. If the engine doesn't find a match, it backtracks and tries something different. For instance, \1 equals az on line three after all of the backtracking, since you never added anchors.
– FDinoff, Commented Sep 8, 2016 at 1:43
To add to @FDinoff's point, you can require a space or > after the tag name as an anchor: <\([^ >]\+\)[ >].*<\/\1>
– Sundeep, Commented Sep 8, 2016 at 2:21
@spasic Yes, I understand how backtracking works, and anchoring on a space or > seems to be the best idea here.
– dbosky, Commented Sep 8, 2016 at 7:34

FDinoff · Accepted Answer · 2016-09-08 20:26:46Z

4

To understand why your regex behaves the way it does you need to understand what a backtracking regex engine does.

The engine will greedily match and consume as many characters as it can. But if it doesn't find a match it goes back and tries to find a different match that still satisfies the pattern.

%s,<\([^ >]\+\).*<\/\1>,,gn

For line three, <aza> Some text </az>:

The regex engine first tries \1 = aza and checks whether .*</aza> matches the rest of the string. It doesn't, so the engine chooses something else for \1. Next it tries \1 = az and checks whether .*</az> matches the rest of the string, and it does. So the line matches.

(This is a simplified version. I skipped over the fact that .* can potentially do a lot of backtracking itself)
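The backtracking described above can be observed directly. Below is a minimal sketch that uses Python's re module as a stand-in backtracking engine. That is an assumption on my part: the question is about Vim and sed, so the pattern is translated to Python syntax, where groups are written ( ) and + needs no backslash. The helper name captured_names is made up for this illustration. It reports what \1 ends up holding on each sample line.

```python
import re

# The five sample lines from the question.
LINES = [
    '<paper-input label="Input label"> Some text </paper-input>',
    '<paper-input label="Input label"> Some text </paper-inputa>',
    '<aza> Some text </az>',
    '<az> Some text </az>',
    '<az> Some text </aza>',
]

# Python spelling of the Vim pattern <\([^ >]\+\).*<\/\1>
PATTERN = re.compile(r'<([^ >]+).*</\1>')


def captured_names(lines):
    """Map 1-based line number -> text captured by group 1, for matching lines."""
    result = {}
    for number, line in enumerate(lines, start=1):
        match = PATTERN.search(line)
        if match:
            result[number] = match.group(1)
    return result


if __name__ == '__main__':
    for number, name in captured_names(LINES).items():
        print(number, name)
```

Lines 1, 3 and 4 match, and on line 3 the group holds az rather than aza: the engine gave one character back so that the backreference could succeed.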

The fix is to add an anchor right after the group, which stops the engine from trying shorter values for \1. In this case, requiring a space or > after the tag name is sufficient.
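As a rough illustration of the anchor idea, here is the pattern from Sundeep's comment, <\([^ >]\+\)[ >].*<\/\1>, translated to Python's re syntax. Python is only a stand-in for Vim/sed here, and matching_lines is a hypothetical helper name. With [ >] required right after the group, a shortened tag name such as az inside <aza> can no longer be accepted.

```python
import re

# The five sample lines from the question.
LINES = [
    '<paper-input label="Input label"> Some text </paper-input>',
    '<paper-input label="Input label"> Some text </paper-inputa>',
    '<aza> Some text </az>',
    '<az> Some text </az>',
    '<az> Some text </aza>',
]

# Python spelling of <\([^ >]\+\)[ >].*<\/\1>
# The [ >] right after the group pins down where the tag name ends.
ANCHORED = re.compile(r'<([^ >]+)[ >].*</\1>')


def matching_lines(pattern, lines):
    """Return the 1-based numbers of the lines the pattern matches."""
    return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]


if __name__ == '__main__':
    print(matching_lines(ANCHORED, LINES))
```

Only lines 1 and 4 match now, which is the result the question was after.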


1 Comment

dbosky Over a year ago

This is a very good explanation. Ending the word with \>, i.e. <\([^ >]\+\)\>.*</\1> as suggested by @LucHermitte, will also work.

Luc Hermitte · Answer · 2016-09-08 01:42:39Z

2

You need to add \> to indicate the end of the word. There may be other solutions with 0-width patterns, but they'll complicate things.

Also, your separator is , (comma), not /, so the / does not need to be escaped.

Which gives:

%s,<\([^ >]\+\)\>.*</\1>,,gn
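For comparison, here is a hedged sketch of the same idea in Python's re module. Python has no \> atom, so this uses \b, which matches at either edge of a word rather than only at its end, and Python's notion of a word character is fixed instead of following Vim's iskeyword option. Treat it as an approximation of the Vim pattern, not an equivalent; matching_lines is a hypothetical helper name.

```python
import re

# The five sample lines from the question.
LINES = [
    '<paper-input label="Input label"> Some text </paper-input>',
    '<paper-input label="Input label"> Some text </paper-inputa>',
    '<aza> Some text </az>',
    '<az> Some text </az>',
    '<az> Some text </aza>',
]

# Approximation of <\([^ >]\+\)\>.*</\1> with \b standing in for \>
WORD_END = re.compile(r'<([^ >]+)\b.*</\1>')


def matching_lines(pattern, lines):
    """Return the 1-based numbers of the lines the pattern matches."""
    return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]


if __name__ == '__main__':
    print(matching_lines(WORD_END, LINES))
```

On the sample file this selects lines 1 and 4. The group can no longer stop at az inside aza, because there is no word boundary between z and a.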


5 Comments

dbosky Over a year ago

This won't match the 1st line. Besides, as I mentioned in the question, I want to understand why my regex is not working.

Luc Hermitte Over a year ago

I've just checked: this does match the first line (tested with gvim 7.4-2207 and vim 7.4-2181, which I have at hand). Regex101 doesn't handle it well, though. Regarding the explanation, @FDinoff already gave it.

Luc Hermitte Over a year ago

@DawidGrabowski, This works as expected with vim 7.3-429 as well. Could it be that you've altered &isk definition?

dbosky Over a year ago

Instead of copying the regex I typed it by hand and made a mistake. It works (in both sed and Vim).

Luc Hermitte Over a year ago

That happens :)

Tim Biegeleisen · Answer · 2016-09-08 00:54:33Z

0

Currently the reason why line 3 (<aza>) is showing up as a match is that the .* term in your regex can match across multiple lines. So line 3 matches because line 5 has the closing tag. To correct this, force the regex to find a matching closing tag on the same line only:

%s,<\([^ >]\+\)[^\n]*?<\/\1>,,gn
               ^^^^^ use [^\n]* instead of .*

edited Sep 8, 2016 at 0:54

answered Sep 8, 2016 at 0:49


4 Comments

dbosky Over a year ago

Why do you think .* is matching across multiple lines? It matches any character except a newline.

Tim Biegeleisen Over a year ago

@DawidGrabowski Then how do you explain line 3 is showing up as a match?

dbosky Over a year ago

I don't know. This is why I asked this question. I know that .* is definitely not matching new line. I added regex101 demo.

Luc Hermitte Over a year ago

In Vim, \_. can match a newline; a plain . cannot.
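The point made in these comments can be checked with a small sketch. It uses Python's re module as an analogue, which is an assumption: Python's re.DOTALL flag plays roughly the role that \_. plays in Vim, and the helper names are invented for this example. The first check shows that line 3 matches entirely on its own, so no other line is needed to explain it. The second shows that a plain . does not reach a closing tag on the following line.

```python
import re

# Python spelling of the original pattern <\([^ >]\+\).*<\/\1>
PATTERN = r'<([^ >]+).*</\1>'

LINE_3 = '<aza> Some text </az>'
TWO_LINES = '<az> Some text\n</az>'  # closing tag sits on the next line


def matches_alone(line):
    """True if the single line matches with no other text around it."""
    return re.search(PATTERN, line) is not None


def matches_across_newline(text, dot_matches_newline):
    """True if the pattern matches text, optionally letting . match a newline."""
    flags = re.DOTALL if dot_matches_newline else 0
    return re.search(PATTERN, text, flags) is not None


if __name__ == '__main__':
    print(matches_alone(LINE_3))
    print(matches_across_newline(TWO_LINES, dot_matches_newline=False))
    print(matches_across_newline(TWO_LINES, dot_matches_newline=True))
```

The match on line 3 therefore comes from backtracking inside that one line, not from .* spilling over to line 5.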
