4

Let's say I have XML in the form of a string. I wish to remove the content between two tags within the XML string, say <tagName>. I have tried:

String newString = oldString.replaceFirst("\\<tagName>.*?\\<//tagName>",
                                          "Content Removed");

but it does not work. Any pointers as to what I am doing wrong?

    If you have anything other than the most simple, non-nested XML, a regex isn't going to work. Commented Jun 27, 2011 at 14:32

3 Answers

12

OK, apart from the obvious answer (don't parse XML with regex), maybe we can fix this:

String newString = oldString.replaceFirst("(?s)<tagName[^>]*>.*?</tagName>",
                                          "Content Removed");

Explanation:

(?s)             # turn single-line (DOTALL) mode on (otherwise '.' won't match '\n')
<tagName         # remove unnecessary (and perhaps erroneous) escapes
[^>]*            # allow optional attributes
>.*?</tagName>   # match as little as possible up to the first closing tag

Are you sure you're matching the tag case correctly? Perhaps you also want to add the i flag to the pattern: (?si)
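
As a rough illustration of the pattern above, here is a minimal sketch that sets the same two flags through Pattern.compile instead of the inline (?si) form. The class name TagContentRemover, the method removeFirst and the sample input are made up for this example; only the regex and the replacement text come from the answer.

```java
import java.util.regex.Pattern;

public class TagContentRemover {
    // DOTALL is the compiled-flag equivalent of (?s): '.' also matches line terminators.
    // CASE_INSENSITIVE is the equivalent of (?i): <TagName> and <tagname> match too.
    private static final Pattern TAG = Pattern.compile(
            "<tagName[^>]*>.*?</tagName>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Replaces the first matching element (tags included) with a fixed marker.
    static String removeFirst(String xml) {
        return TAG.matcher(xml).replaceFirst("Content Removed");
    }

    public static void main(String[] args) {
        // Hypothetical input: an attribute, different tag case, and a newline in the content.
        String oldString = "<root><TagName id=\"1\">line one\nline two</TagName></root>";
        System.out.println(removeFirst(oldString)); // prints <root>Content Removed</root>
    }
}
```

Note that [^>]* is loose: it would also accept a longer tag name that merely starts with tagName, which is one more reason this only suits simple, well-known input.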


1 Comment

In the end, simply using string.replaceFirst("<tagName>.*</tagName>", "Content Removed"); worked fine, I don't know why I was making it so complicated. Thanks for explaining the regex attributes in Java though, pretty helpful!
0

Probably the problem lies here:

<//tagName>

Try changing it to

<\/tagName>
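
A quick way to check this in Java is sketched below. The class SlashEscapeCheck and its helper are hypothetical names for illustration. The doubled slash in <//tagName> is two literal slashes, so it cannot match a normal closing tag, while the escaped and the plain forms both can.

```java
public class SlashEscapeCheck {
    // Returns true if the whole closing tag matches the given regex.
    static boolean matchesClosingTag(String regex) {
        return "</tagName>".matches(regex);
    }

    public static void main(String[] args) {
        // "<//tagName>" requires two slashes in the input, so this prints false.
        System.out.println(matchesClosingTag("<//tagName>"));
        // The Java literal "<\\/tagName>" is the regex <\/tagName>; prints true.
        System.out.println(matchesClosingTag("<\\/tagName>"));
        // No escape at all also works in Java; prints true.
        System.out.println(matchesClosingTag("</tagName>"));
    }
}
```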

5 Comments

In Java, </tagName> will do nicely without any escapes.
@Pable yes, but that doesn't use a Java Regex engine, it's flex / flash
@Pable no, it works, it's just not necessary: "A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct." ( source )
All right so no harm done then. Thanks for the info (and BTW it's Pablo not Pable :) )
@Pablo Grrr, the same typo twice. I knew it was Pablo all along, but somehow my fingers didn't agree. Sorry!!!
0

XML is a grammar; regular expressions are not the best tools to work with grammars.

My advice would be to use a real parser and work with the DOM instead of doing pattern matches.

For example, if you have:

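<xml>
 <items>
  <myItem>
     <tagToRemove>something1</tagToRemove>
  </myItem>
  <myItem>
     <tagToRemove>something2</tagToRemove>
  </myItem>
 </items>
</xml>

A greedy regex such as <tagToRemove>.*</tagToRemove> (in a mode where '.' also matches newlines) would match from the first opening tag to the last closing tag, collapsing both items into one:

<xml>
 <items>
  <myItem>
     matchString
  </myItem>
 </items>
</xml>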
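
To make the greedy behaviour concrete, here is a small sketch. The class GreedyMatchDemo, its two helper methods and the single-line sample document are assumptions for illustration (one line is used so that no DOTALL flag is needed). It contrasts the greedy quantifier .* with the reluctant .*? on two sibling elements.

```java
public class GreedyMatchDemo {
    // Hypothetical sample: two sibling elements on a single line.
    static final String XML =
            "<items><myItem><tagToRemove>something1</tagToRemove></myItem>"
          + "<myItem><tagToRemove>something2</tagToRemove></myItem></items>";

    // Greedy: '.*' runs to the LAST closing tag, swallowing the second item.
    static String greedy(String xml) {
        return xml.replaceFirst("<tagToRemove>.*</tagToRemove>", "matchString");
    }

    // Reluctant: '.*?' stops at the FIRST closing tag.
    static String reluctant(String xml) {
        return xml.replaceFirst("<tagToRemove>.*?</tagToRemove>", "matchString");
    }

    public static void main(String[] args) {
        System.out.println(greedy(XML));
        // prints <items><myItem>matchString</myItem></items>
        System.out.println(reluctant(XML));
        // prints <items><myItem>matchString</myItem><myItem><tagToRemove>something2</tagToRemove></myItem></items>
    }
}
```

The reluctant form avoids this particular collapse, but it still breaks down once the same tag can be nested inside itself.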

Also, forms that a DTD may allow, such as the self-closing <tagToRemove/> or a tag with attributes like <tagToRemove attr="value">, make catching tags with a regex more difficult.

Unless it is very clear to you that none of the above may occur (now or in the future), I would go with a parser.
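
For comparison, here is a minimal sketch of the parser route using the DOM API that ships with the JDK (javax.xml.parsers and javax.xml.transform). The class DomTagCleaner and the method replaceContent are hypothetical names, and checked exceptions are wrapped only to keep the sketch short. Unlike the regex versions, it keeps the tags themselves and replaces only what is between them, and it handles attributes and self-closing tags without special cases.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomTagCleaner {
    // Replaces the content of every element named 'tag' and returns the XML as a string.
    static String replaceContent(String xml, String tag, String replacement) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // setTextContent drops all child nodes and leaves a single text node.
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                nodes.item(i).setTextContent(replacement);
            }

            // Serialize the modified DOM back to a string, without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped for brevity; real code would handle parse errors explicitly.
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><myItem><tagToRemove attr=\"value\">something1</tagToRemove></myItem>"
                   + "<myItem><tagToRemove/></myItem></items>";
        System.out.println(replaceContent(xml, "tagToRemove", "Content Removed"));
    }
}
```

To drop the element entirely rather than blank its content, one could call getParentNode().removeChild(...) on each node instead, iterating the NodeList backwards because it is live.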
