Delphi XE, using Delphi's own RegularExpressions unit.

I'm attempting to correct some bad RTF code, where 'bookmark' tags cross the boundaries of a table cell. Seems simple enough. The code I'm using is below. Here's the general idea.

Given this text

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}

Look for a match to this pattern (there should be exactly one in the given text):

{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}

When found, replace it with this (non-RegEx) string:

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}

The expected result is that the first string is replaced with the last string, e.g.:

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell} *becomes*
{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}

However, the result I'm actually getting is this:

{\*\bkmkstart BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 ^{\*\bkmkend BM0}\plain{\*\bkmkstart bm0}\plain\f0\fs24\cf0 ^\cell}\fs24\cf0 \cell}

It looks as if the RegEx parser is getting horribly confused somehow, but I can't even characterize what is happening. It's not a mere double replacement, or an insertion instead of replacement. The 'ReplaceWith' string does seem to be the source of the confusion, though. If I use a nice simple 'XXXX' for the ReplaceWith string, instead of the RTF, it works exactly as it should.

So, any ideas how/why the RegEx search/replace is breaking so strangely here?

Here is the code I'm using:

procedure TfrmMain.btnProcessClick(Sender: TObject);
const
  SourceString = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}';
  RegExFind = '{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}';
  ReplaceWith = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}';
var
  ResultStr: string;
  MyRegEx: TRegEx;
begin
  MyRegEx := TRegEx.Create (RegExFind);
  ResultStr := MyRegEx.Replace (SourceString, ReplaceWith);
  ShowMessage (ResultStr);
end;
  • Aside: I hate the Emba dev that decided that TRegEx.Create would make sense considering that TRegEx is a record. Commented Nov 18, 2013 at 17:17
  • I assumed the .create was done here to illustrate the expanded record capabilities, so that schmucks like me would maybe give it a whirl. Commented Nov 18, 2013 at 17:29
  • It gets exciting when you pass MyRegEx to FreeAndNil ..... Commented Nov 18, 2013 at 17:32
  • @OGHaza The class method is implemented by calling TRegEx.Create, assigning the result to a local variable of type TRegEx, and then calling Replace() on that local variable. In other words, your now-deleted answer simply expands to the code in the question. Generally, when a library offers two alternative ways to do the same thing, it is unlikely that the answer to any question is to switch from one alternative to the other. For that to be the answer you'd need to know that the library in question had a broken implementation of one of the alternatives. Commented Nov 18, 2013 at 17:36

1 Answer

You need to escape the \ characters in your replacement string:

ReplaceWith = '{\\*\\bkmkstart BM0}\\plain\\f0\\fs24\\cf0 ^{\\*\\bkmkend BM0}\\plain\\f0\\fs24\\cf0 \\cell}';

When you make this change the output is:

{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}

In fact, for your replacement string, you only need to escape the backslash in \f0 which, as it happens, appears twice. Personally I think it's just easier to escape the backslash indiscriminately.
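
If you would rather keep the replacement text readable as plain RTF, one option is to escape it in code instead of by hand. The following is only a minimal sketch: `EscapeBackslashes` is a hypothetical helper name (not part of the RTL), and it assumes the replacement text contains no `$` characters, which can also be special in replacement strings. The unit names are the unscoped ones used by Delphi XE.

```delphi
program EscapeReplacementDemo;

{$APPTYPE CONSOLE}

uses
  // Unscoped unit names as in Delphi XE; newer versions may need
  // System.SysUtils and System.RegularExpressions instead.
  SysUtils, RegularExpressions;

// Hypothetical helper: doubles every backslash so that the regex engine
// treats each one as a literal character in a replacement string.
// Assumption: the text contains no '$', which is not handled here.
function EscapeBackslashes(const S: string): string;
begin
  Result := StringReplace(S, '\', '\\', [rfReplaceAll]);
end;

const
  SourceString = '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^\cell}';
  RegExFind = '{\\\*\\bkmkstart BM0}\\plain\\f[0-9]\\fs[0-9]+\\cf[0-9] \^\\cell}';
  // Written exactly as the RTF should appear in the result.
  LiteralReplacement =
    '{\*\bkmkstart BM0}\plain\f0\fs24\cf0 ^{\*\bkmkend BM0}\plain\f0\fs24\cf0 \cell}';
var
  MyRegEx: TRegEx;
begin
  MyRegEx := TRegEx.Create(RegExFind);
  // Escape at the call site so the constant stays readable RTF.
  Writeln(MyRegEx.Replace(SourceString, EscapeBackslashes(LiteralReplacement)));
end.
```

This amounts to the same thing as escaping every backslash by hand; it just moves the escaping to one place, which may help if the utility performs many such replacements on RTF.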

By combining regular expressions and RTF you've mixed your own special backslash soup — tread carefully. Just be thankful you aren't using C or older versions of C++ that do not support raw strings. That backslash soup would be completely unpalatable!

5 Comments

Well, boggle. That does the trick, David, thanks. It seems really weird to me that the replacement string is being treated as anything other than a literal, but I'll take it. Seems like that bit of information should be in the documentation, in bold, italic and underline.
If you did not have to escape a backslash then how would the engine interpret \1? Is that the literal \1, or is it the value of a capture? In any case, the documentation can be found here: regular-expressions.info which is not where you might expect it!!
My mistake was that I didn't expect the engine to have to interpret the replacement string at all. It seemed to me that the engine would be concerned only with the search string. Once found, I anticipated that it would just stuff the replacement string in there, without regard for the contents of the replacement string. The utility in question does a fair bit of similar manipulations on RTF documents, and we hadn't encountered anything like this before. Sheer dumb luck, apparently. Thanks again!
In the replacement string, \f0 appears to mean the source string, not that I can explain why!
I took some time to dig through the link you (David) provided. Apparently the \F0 you identified as the culprit relates to case conversion and inserting found text into replacement text... "Insert the whole regex match or the 1st through 99th backreference with the first letter in the matched text converted to uppercase and the remaining letters converted to lowercase." My whole notion of the replacement string being a literal has been upended. Regex is a lot more powerful and dangerous than I knew. Much to learn.
