RegEx for HTML replace

Question

Hi, I am trying to find a regex that lets me replace words in HTML. The problem occurs when the word I am trying to replace also appears inside an HTML tag.

Example: <img class="TEST">asd TEST asd dsa asd </img>
and I need to match only the second "TEST".

The regex I am looking for should look like >[^<]*TEST, but this regex also matches the characters before the word TEST. Is it possible to select only the word TEST? Other combinations need to work as well (I don't think " TEST " is a good solution, since the text could contain other characters around the word).

This is a job for a parser. Do a search for "java html parser" and you will be on your way.
– ridgerunner, Commented Apr 21, 2011 at 15:33

Community · Accepted Answer · 2017-05-23 12:13:17Z

2

First of all, regex is not a good option for HTML parsing. There are plenty of capable HTML parsers that you can use.

But if you insist on using regex, here is the pattern:

(?<=>.*)TEST(?=.*<)

For Java, where the lookbehind needs a bounded length:

(?<=>.{0,100000})TEST(?=.{0,100000}<)

For more information on why * or + cannot be used inside a lookbehind in Java, see "Regex look-behind without obvious maximum length in Java".
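A minimal sketch of how the bounded pattern could be used for the replacement in Java. The class name `BoundedLookbehindDemo`, the method `replaceWord`, and the replacement text `DONE` are made up for illustration. The `(?s)` flag is an addition that is not part of the pattern above; it lets `.` match line breaks as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedLookbehindDemo {
    // Bounded form of the pattern from the answer.
    // (?s) is an assumption added here so that '.' also matches line breaks.
    static final Pattern WORD =
            Pattern.compile("(?s)(?<=>.{0,100000})TEST(?=.{0,100000}<)");

    // Hypothetical helper: replaces every match of WORD with the literal replacement.
    static String replaceWord(String html, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal
        return WORD.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<img class=\"TEST\">asd TEST asd dsa asd </img>";
        System.out.println(replaceWord(html, "DONE"));
    }
}
```

Note that the pattern only checks that some `>` occurs before the word and some `<` occurs after it. In a longer document, an attribute value that comes after an earlier tag and before a later one satisfies both conditions too, so this is only suitable for short fragments like the example.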

edited May 23, 2017 at 12:13

CommunityBot


answered Apr 21, 2011 at 13:44

Gursel Koca


2 Comments

rhorvath Over a year ago

I am not parsing the whole HTML; for that I use Jericho. I just wanted an easy way of replacing some words. I can't get your regex to work... testing here: myregexp.com

rhorvath Over a year ago

I like your solution, but it does not work for code like this: <p> [newLine here] TEST [newLine here] </p>

Joeri Hendrickx · Accepted Answer · 2011-04-21 13:45:22Z

1

First of all, as has been said and will be said again, using regex for XML is usually a bad idea. But for really simple cases it can work, especially if you can live with sub-optimal results.

So, just put TEST in a capturing group and replace only that group.

Something like:

Pattern replacePattern = Pattern.compile(">[^<]*(TEST)");
Matcher matcher = replacePattern.matcher(theString);
// find() must succeed before start(1)/end(1) may be called
String result = matcher.find()
        ? theString.substring(0, matcher.start(1)) + replacement + theString.substring(matcher.end(1))
        : theString;

Disclaimer: Not tested, might have some off-by-ones. But the concept should be clear.
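A self-contained sketch of the same group-based idea, extended to walk over every match instead of only one. The names `GroupReplaceDemo` and `replaceInText` are hypothetical, and the reluctant quantifier `*?` is a choice made here so that the first occurrence in each run of text is the one captured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupReplaceDemo {
    // Replaces the first occurrence of `word` in each run of text that follows a '>'.
    static String replaceInText(String html, String word, String replacement) {
        // Pattern.quote keeps any regex metacharacters in `word` literal
        Pattern p = Pattern.compile(">[^<]*?(" + Pattern.quote(word) + ")");
        Matcher m = p.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start(1)); // copy everything up to the captured word
            out.append(replacement);            // write the replacement instead of group 1
            last = m.end(1);
        }
        out.append(html, last, html.length());  // copy the remainder unchanged
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<img class=\"TEST\">asd TEST asd dsa asd </img>";
        System.out.println(replaceInText(html, "TEST", "DONE"));
    }
}
```

Because each match has to start at a `>`, only one occurrence per run of text is replaced; a second TEST in the same run is left alone. Text before the first tag is never matched either.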

answered Apr 21, 2011 at 13:45

Joeri Hendrickx


Dude Dawg Homie · Accepted Answer · 2011-04-21 15:20:10Z

0

What if "TEST" is inside a different tag, say the body tag, or for that matter the html tag?

answered Apr 21, 2011 at 15:20

Dude Dawg Homie


1 Comment

rhorvath Over a year ago

Ah, maybe I put it the wrong way. I mean between '<' and '>'. It is OK if the word is in the content of a tag, <> here </>; it is not OK if it is < here>.
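The rule described in this comment (replace outside `<...>`, leave everything inside `<...>` alone) can also be sketched without lookarounds, by splitting the input into tags and runs of text. This is a different technique from the ones in the answers, and the names `TextOnlyReplaceDemo` and `replaceOutsideTags` are made up for illustration. It assumes no `>` appears inside attribute values and ignores comments and scripts, so a real parser is still the safer route for arbitrary HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextOnlyReplaceDemo {
    // A token is either a whole tag "<...>" (group 1) or a run of text up to the next '<' (group 2).
    private static final Pattern TOKEN = Pattern.compile("(<[^>]*>)|([^<]+)");

    static String replaceOutsideTags(String html, String word, String replacement) {
        Matcher m = TOKEN.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());      // keep anything skipped, e.g. a stray '<'
            if (m.group(1) != null) {
                out.append(m.group(1));             // tag: copy unchanged
            } else {
                // text: plain literal replacement, no regex involved
                out.append(m.group(2).replace(word, replacement));
            }
            last = m.end();
        }
        out.append(html, last, html.length());      // copy any unmatched tail
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<img class=\"TEST\">asd TEST asd dsa asd </img>";
        System.out.println(replaceOutsideTags(html, "TEST", "DONE"));
    }
}
```

Since text runs are matched with `[^<]+`, line breaks around the word and several occurrences in one run are handled the same way as a single occurrence.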
