Unexpected RegEx behavior in Delphi XE

Question

Delphi XE, using Delphi's own RegularExpressions unit.

I'm attempting to correct some bad RTF code, where 'bookmark' tags cross the boundaries of a table cell. Seems simple enough. The code I'm using is below. Here's the general idea.

Given this text

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}

Look for a match to this pattern (there should be exactly one in the given text):

{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}

When found, replace it with this (non-RegEx) string:

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}

The expected result is that the first string is replaced with the last string, e.g.:

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell} *becomes*
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}

However, the result I'm actually getting is this:

{\*\bkmkstart BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 ^{\*\bkmkend BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 \cell}

It looks as if the RegEx parser is getting horribly confused somehow, but I can't even characterize what is happening. It's not a mere double replacement, or an insertion instead of replacement. The 'ReplaceWith' string does seem to be the source of the confusion, though. If I use a nice simple 'XXXX' for the ReplaceWith string, instead of the RTF, it works exactly as it should.

So, any ideas how/why the RegEx search/replace is breaking so strangely here?

Here is the code I'm using:

procedure TfrmMain.btnProcessClick(Sender: TObject);
const
  SourceString = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}';
  RegExFind = '{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}';
  ReplaceWith = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}';
var
  ResultStr: string;
  MyRegEx: TRegEx;
begin
  MyRegEx := TRegEx.Create (RegExFind);
  ResultStr := MyRegEx.Replace (SourceString, ReplaceWith);
  ShowMessage (ResultStr);
end;

Aside: I hate the Emba dev that decided that TRegEx.Create would make sense considering that TRegEx is a record.
– David Heffernan, Commented Nov 18, 2013 at 17:17
I assumed the .Create was done here to illustrate the expanded record capabilities, so that schmucks like me would maybe give it a whirl.
– Eric S., Commented Nov 18, 2013 at 17:29
It gets exciting when you pass MyRegEx to FreeAndNil ...
– David Heffernan, Commented Nov 18, 2013 at 17:32
@OGHaza The class method is implemented by calling TRegEx.Create and assigning the result to a local variable of type TRegEx, and then calling Replace() on that local variable. In other words, your now deleted answer simply expands to the code in the question. Generally, when a library offers two alternative ways to do the same thing, it is unlikely that the answer to any question is to switch from one alternative to the other. For that to be the answer you'd need to know that the library in question had a broken implementation of one of the alternatives.
– David Heffernan, Commented Nov 18, 2013 at 17:36

David Heffernan · Accepted Answer · 2013-11-18 17:57:20Z

You need to escape the \ characters in your replacement string:

ReplaceWith = '{\\*\\bkmkstart BM0}\\plain\\f0\\fs24\\cf0 ^{\\*\\bkmkend BM0}\\plain\\f0\\fs24\\cf0 \\cell}';

When you make this change the output is:

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}

In fact, for your replacement string, you only need to escape the backslash in \f0 which, as it happens, appears twice. Personally I think it's just easier to escape the backslash indiscriminately.
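
If the replacement text is meant to be a pure literal, one way to escape the backslashes indiscriminately is to double them in code rather than by hand. The following is only a minimal sketch under stated assumptions: EscapeBackslashes is a hypothetical helper name (not part of the RegularExpressions unit), and it deals with backslashes only, so any other character the replacement syntax treats specially would need separate handling. The unit names are those of Delphi XE; later versions use System.SysUtils and System.RegularExpressions.

```delphi
program EscapeReplacementDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, RegularExpressions; // Delphi XE unit names (assumption: no unit scope prefixes)

// Hypothetical helper: doubles every backslash so that the regex engine
// inserts the text as written instead of interpreting sequences like \f0.
function EscapeBackslashes(const S: string): string;
begin
  Result := StringReplace(S, '\', '\\', [rfReplaceAll]);
end;

const
  // Same strings as in the question.
  SourceString = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}';
  RegExFind = '{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}';
  LiteralReplacement = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}';
var
  MyRegEx: TRegEx;
begin
  MyRegEx := TRegEx.Create(RegExFind);
  // The escaped replacement is character-for-character the hand-escaped
  // ReplaceWith string shown in this answer.
  Writeln(MyRegEx.Replace(SourceString, EscapeBackslashes(LiteralReplacement)));
end.
```

Keeping the RTF literal unescaped in the source and escaping it at the call site means the constant stays readable as RTF, at the cost of one extra function call per replacement.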

By combining regular expressions and RTF you've mixed your own special backslash soup, so tread carefully. Just be thankful you aren't using C or older versions of C++ that do not support raw strings. That backslash soup would be completely unpalatable!

edited Nov 18, 2013 at 17:57

answered Nov 18, 2013 at 17:31


5 Comments

Eric S. Over a year ago

Well, boggle. That does the trick, David, thanks. It seems really weird to me that the replacement string is being treated as anything other than a literal, but I'll take it. Seems like that bit of information should be in the documentation, in bold, italic and underline.

David Heffernan Over a year ago

If you did not have to escape a backslash then how would the engine interpret \1? Is that the literal \1, or is it the value of a capture? In any case, the documentation can be found here: regular-expressions.info which is not where you might expect it!!

Eric S. Over a year ago

My mistake was that I didn't expect the engine to have to interpret the replacement string at all. It seemed to me that the engine would be concerned only with the search string. Once found, I anticipated that it would just stuff the replacement string in there, without regard for the contents of the replacement string. The utility in question does a fair bit of similar manipulations on RTF documents, and we hadn't encountered anything like this before. Sheer dumb luck, apparently. Thanks again!

David Heffernan Over a year ago

In the replacement string, \f0 appears to mean the source string, not that I can explain why!

Eric S. Over a year ago

I took some time to dig through the link you (David) provided. Apparently the \F0 you identified as the culprit relates to case conversion and inserting found text into replacement text... "Insert the whole regex match or the 1st through 99th backreference with the first letter in the matched text converted to uppercase and the remaining letters converted to lowercase." My whole notion of the replacement string being a literal has been upended. Regex is a lot more powerful and dangerous than I knew. Much to learn.
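
The point running through these comments, that the replacement string is interpreted rather than inserted verbatim, can be seen with a much smaller example than the RTF one. This is a hedged sketch, not code from the thread: the input text and pattern are made up for illustration, and it assumes the \1 backreference form mentioned in the comment above is accepted in replacement strings, alongside the backslash escaping described in the answer.

```delphi
program ReplacementSyntaxDemo;

{$APPTYPE CONSOLE}

uses
  RegularExpressions; // Delphi XE unit name; System.RegularExpressions in later versions

var
  MyRegEx: TRegEx;
begin
  // Two capture groups, each matching one word.
  MyRegEx := TRegEx.Create('(\w+) (\w+)');

  // Interpreted: \2 and \1 refer to the captured words,
  // so the two words should come out swapped.
  Writeln(MyRegEx.Replace('hello world', '\2 \1'));

  // Escaped: each \\ stands for one literal backslash,
  // so the text \2 \1 itself should be inserted.
  Writeln(MyRegEx.Replace('hello world', '\\2 \\1'));
end.
```

The two calls differ only in whether the backslashes are doubled, which is the same distinction that separates the broken and the working ReplaceWith strings in the question.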
