Java: String.replace(regex, string) to remove content from XML

Question

Let's say I have XML in the form of a string. I wish to remove the content between two tags within the XML string, say <tagName>. I have tried:

String newString = oldString.replaceFirst("\\<tagName>.*?\\<//tagName>",
                                                              "Content Removed");

but it does not work. Any pointers as to what I am doing wrong?

If you have anything other than the simplest, non-nested XML, a regex isn't going to work.
– Richard H, Commented Jun 27, 2011 at 14:32

Community · Accepted Answer · 2017-05-23 10:27:14Z

12

OK, apart from the obvious answer (don't parse XML with regex), maybe we can fix this:

String newString = oldString.replaceFirst("(?s)<tagName[^>]*>.*?</tagName>",
                                          "Content Removed");

Explanation:

(?s)             # turn on DOTALL ("single-line") mode; otherwise '.' won't match '\n'
<tagName         # remove unnecessary (and perhaps erroneous) escapes
[^>]*            # allow optional attributes
>.*?</tagName>   # reluctantly match up to the first closing tag

Are you sure you're matching the tag case correctly? Perhaps you also want to add the i flag to the pattern: (?si)
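To make the pattern above concrete, here is a minimal sketch that applies it with both flags. The class name TagContentRegexDemo, the method removeFirst, and the sample input are made up for illustration. Note that the whole element, tags included, is replaced, and that [^>]* would also let a tag such as <tagNameFoo> match.

```java
class TagContentRegexDemo {

    // (?s): '.' also matches line terminators; (?i): the tag name is matched case-insensitively.
    static final String ELEMENT_REGEX = "(?si)<tagName[^>]*>.*?</tagName>";

    // Replaces the first <tagName ...>...</tagName> element (tags included) with a fixed text.
    static String removeFirst(String oldString) {
        return oldString.replaceFirst(ELEMENT_REGEX, "Content Removed");
    }

    public static void main(String[] args) {
        // Hypothetical input: mixed-case tag, an attribute, and content spanning two lines.
        String oldString = "<root>\n<TagName id=\"1\">first\nsecond</TagName>\n<other/>\n</root>";
        System.out.println(removeFirst(oldString));
    }
}
```

Without (?s) the two-line content would not match at all, and without (?i) the mixed-case <TagName> would be skipped.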

edited May 23, 2017 at 10:27

CommunityBot


answered Jun 27, 2011 at 14:30

Sean Patrick Floyd



1 Comment

TookTheRook Over a year ago

In the end, simply using string.replaceFirst("<tagName>.*</tagName>", "Content Removed"); worked fine, I don't know why I was making it so complicated. Thanks for explaining the regex attributes in Java though, pretty helpful!
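For readers wondering when the simpler pattern from this comment is enough, the following sketch (class and method names are hypothetical) suggests it works as long as the element sits on a single line and has no attributes, and silently does nothing once the content contains a line break.

```java
class SimplePatternLimitsDemo {

    // The pattern from the comment: no (?s) flag, greedy quantifier, no attribute support.
    static String simple(String xml) {
        return xml.replaceFirst("<tagName>.*</tagName>", "Content Removed");
    }

    public static void main(String[] args) {
        // Single-line content: the element is replaced.
        System.out.println(simple("<a><tagName>x</tagName></a>"));
        // Content with a line break: '.' does not match '\n', so the input comes back unchanged.
        System.out.println(simple("<a><tagName>x\ny</tagName></a>"));
    }
}
```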

Pablo Fernandez · Accepted Answer · 2011-06-27 14:29:33Z

0

Probably the problem lies here:

<//tagName>

Try changing it to

<\/tagName>
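As a quick check of this suggestion, here is a small sketch (the class ClosingTagEscapeDemo and its helper are invented for the example). It assumes the pattern is written as a Java string literal, where the escaped slash has to be spelled "\\/" because "\/" on its own is not a valid Java escape sequence and will not compile.

```java
import java.util.regex.Pattern;

class ClosingTagEscapeDemo {

    // Returns true if the regex finds a match anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String xml = "<tagName>secret</tagName>";

        // Doubled slash, as in the question: looks for the literal text "<//tagName>".
        System.out.println(finds("<//tagName>", xml));   // false
        // Escaped slash: the Java literal "\\/" produces the regex \/ which matches '/'.
        System.out.println(finds("<\\/tagName>", xml));  // true
        // No escape at all: '/' has no special meaning in java.util.regex.
        System.out.println(finds("</tagName>", xml));    // true
    }
}
```

So the doubled slash is what prevents the original pattern from matching; the escape itself is optional in Java.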

answered Jun 27, 2011 at 14:29

Pablo Fernandez


5 Comments

Sean Patrick Floyd Over a year ago

In Java, </tagName> will do nicely without any escapes.

Sean Patrick Floyd Over a year ago

@Pable yes, but that doesn't use a Java Regex engine, it's flex / flash

Sean Patrick Floyd Over a year ago

@Pable no, it works, it's just not necessary: "A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct." (source)

Pablo Fernandez Over a year ago

All right so no harm done then. Thanks for the info (and BTW it's Pablo not Pable :) )

Sean Patrick Floyd Over a year ago

@Pablo Grrr, the same typo twice. I knew it was Pablo all along, but somehow my fingers didn't agree. Sorry!!!

SJuan76 · Accepted Answer · 2011-06-27 14:40:20Z

XML is a grammar; regular expressions are not the best tools to work with grammars.

My advice would be to use a real parser and work with the DOM instead of doing string matches.

For example, if you have:

<xml>
 <items>
  <myItem>
     <tagToRemove>something1</tagToRemove>
  </myItem>
  <myItem>
     <tagToRemove>something2</tagToRemove>
  </myItem>
 </items>
</xml>

A regex could match from the first opening tag to the last closing tag (quantifiers are greedy by default) and leave you with:

<xml>
 <items>
  <myItem>
     matchString
  </myItem>
 </items>
</xml>
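The over-matching described here can be reproduced with a short sketch. The class GreedyVsReluctantDemo and its method names are invented for the example, and the sample XML is written on one line for brevity.

```java
class GreedyVsReluctantDemo {

    static final String XML =
        "<xml><items>"
        + "<myItem><tagToRemove>something1</tagToRemove></myItem>"
        + "<myItem><tagToRemove>something2</tagToRemove></myItem>"
        + "</items></xml>";

    // Greedy .* runs to the LAST closing tag and swallows everything in between.
    static String greedy(String xml) {
        return xml.replaceAll("(?s)<tagToRemove>.*</tagToRemove>", "matchString");
    }

    // Reluctant .*? stops at the FIRST closing tag, so each element is replaced separately.
    static String reluctant(String xml) {
        return xml.replaceAll("(?s)<tagToRemove>.*?</tagToRemove>", "matchString");
    }

    public static void main(String[] args) {
        System.out.println(greedy(XML));
        System.out.println(reluctant(XML));
    }
}
```

The reluctant form keeps both myItem elements, but it would still stop too early if a tagToRemove element were nested inside another one, which a regex cannot track.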

Also, some forms that a DTD may allow (such as <tagToRemove/> or <tagToRemove attr="value">) make catching tags with a regex more difficult.

Unless it is very clear to you that none of the above may occur (now or in the future), I would go with a parser.
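One possible shape of the parser approach, using only the JDK's built-in DOM and transformer classes, is sketched below. The class DomContentRemover, the method replaceContent, and the choice to wrap checked exceptions in an IllegalArgumentException are assumptions made for the example, not part of the original answer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomContentRemover {

    // Parses the XML, replaces the content of every <tagName> element, and serializes it back.
    static String replaceContent(String xml, String tagName, String replacement) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList matches = doc.getElementsByTagName(tagName);
            for (int i = 0; i < matches.getLength(); i++) {
                // setTextContent drops all child nodes and inserts a single text node;
                // attributes on the element itself are left alone.
                matches.item(i).setTextContent(replacement);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xml><items>"
            + "<myItem><tagToRemove attr=\"value\">something1</tagToRemove></myItem>"
            + "<myItem><tagToRemove/></myItem>"
            + "</items></xml>";
        System.out.println(replaceContent(xml, "tagToRemove", "Content Removed"));
    }
}
```

Because the parser understands the document structure, attributes, self-closing tags, and multi-line content need no special handling here. If the input comes from an untrusted source, the parser factory should additionally be configured to restrict DOCTYPE and external entity processing.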
