3

I am trying to replace text inside HTML content using a regular expression.

from

<A HREF="ZZZ">test test ZZZ<SPAN>ZZZ test test</SPAN></A>

to

<A HREF="ZZZ">test test AAA<SPAN>AAA test test</SPAN></A>

Note that only occurrences outside the HTML tags are replaced from ZZZ to AAA; the one inside the HREF attribute stays as it is.

Any ideas? Thanks a lot in advance.

3
  • 2
    please read the first answer to this question: RegEx match open tags except XHTML self-contained tags Commented May 18, 2011 at 7:31
  • 1
    Thanks Mat for the referral. After reading the link, I've simplified the question, since I know the HTML will be "regular" type of HTML. Commented May 18, 2011 at 7:40
  • 1
    Then you misread that link. Don't use regex to parse HTML, it's too complex. Use an (X)HTML parser. Commented May 18, 2011 at 7:42

5 Answers

7

You could walk all nodes and replace the text only in text nodes (nodeType === 3):

Something like:

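// Visit the child nodes of every descendant element whose text contains ZZZ
element.find('*:contains(ZZZ)').contents().each(function () {
    // nodeType 3 is a text node
    if (this.nodeType === 3)
        this.nodeValue = this.nodeValue.replace(/ZZZ/g, 'AAA');
});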

Or same without jQuery:

function replaceText(element, from, to) {
    for (var child = element.firstChild; child !== null; child = child.nextSibling) {
        if (child.nodeType === 3)        // text node: replace in place
            child.nodeValue = child.nodeValue.replace(from, to);
        else if (child.nodeType === 1)   // element node: recurse into its children
            replaceText(child, from, to);
    }
}

replaceText(element, /ZZZ/g, 'AAA');

3 Comments

textContent isn't universally supported for text nodes. Use either nodeValue or data properties instead. Also, if the first parameter passed to a string's replace() method is a string, only the first occurrence of that string will be replaced. Use a regular expression with the global flag (e.g. /ZZZ/g) instead to replace all occurrences.
Hi Suor, your JS function answers the question I described. But I just realized that I oversimplified the actual issue. I need to change ZZZ to something like <span class="highlight">AAA</span>, and the above function will render the <span> tag as text instead of HTML.
Then you should create that span node and insert it there. You can do var html = this.nodeValue.replace(from, to); $(this).replaceWith(html) instead of simple nodeValue assignment. And it would be trickier but possible without jQuery.
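A jQuery-free version of that idea could look roughly like the sketch below. The names `splitSegments` and `highlightTextNode` and the `className` parameter are made up for illustration; the only things assumed are the standard DOM calls `createDocumentFragment`, `createElement`, `createTextNode` and `replaceChild`. The text is first split into matching and non-matching pieces, then each match becomes a real span element rather than markup inside a text node.

```javascript
// Split a string into plain and matching segments.
// `pattern` must be a RegExp with the global flag (e.g. /ZZZ/g).
function splitSegments(text, pattern) {
    if (!pattern.global) throw new Error('pattern needs the g flag');
    var segments = [];
    var last = 0;
    var m;
    pattern.lastIndex = 0;
    while ((m = pattern.exec(text)) !== null) {
        // Skip empty matches so the loop always advances
        if (m[0] === '') { pattern.lastIndex++; continue; }
        if (m.index > last)
            segments.push({ text: text.slice(last, m.index), match: false });
        segments.push({ text: m[0], match: true });
        last = m.index + m[0].length;
    }
    if (last < text.length)
        segments.push({ text: text.slice(last), match: false });
    return segments;
}

// Replace one text node with a mix of text nodes and <span> elements.
// (Hypothetical helper; needs a browser or another DOM implementation.)
function highlightTextNode(textNode, pattern, replacement, className) {
    var doc = textNode.ownerDocument;
    var fragment = doc.createDocumentFragment();
    splitSegments(textNode.nodeValue, pattern).forEach(function (seg) {
        if (seg.match) {
            var span = doc.createElement('span');
            span.className = className;
            span.appendChild(doc.createTextNode(replacement));
            fragment.appendChild(span);
        } else {
            fragment.appendChild(doc.createTextNode(seg.text));
        }
    });
    textNode.parentNode.replaceChild(fragment, textNode);
}
```

If this is called from a loop like the one in replaceText, save child.nextSibling before the call, because the original text node is detached from the tree once it has been replaced.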
2

The best idea in this case is most certainly not to use regular expressions for this, at least not on their own. JavaScript surely has an HTML parser somewhere?

If you really must use regular expressions, you could try to look for every instance of ZZZ that is followed by a "<" before any ">". That would look like

ZZZ(?=[^>]*<)

This might break horribly if the code contains HTML comments or script blocks, or is not well formed.
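As a quick check, here is that lookahead applied to the string from the question, plus one fragment where it misses a match. The variable names are only for illustration.

```javascript
// ZZZ only matches when a "<" appears before the next ">",
// i.e. when the ZZZ is not sitting inside a tag.
var pattern = /ZZZ(?=[^>]*<)/g;

var input = '<A HREF="ZZZ">test test ZZZ<SPAN>ZZZ test test</SPAN></A>';
var result = input.replace(pattern, 'AAA');
console.log(result);

// Limitation: text after the last tag has no "<" following it,
// so the trailing ZZZ in this fragment is left untouched.
var fragment = 'ZZZ <b>x</b> ZZZ';
var fragmentResult = fragment.replace(pattern, 'AAA');
console.log(fragmentResult);
```

The first call leaves the HREF value alone and replaces the two occurrences in the text. The second shows that a fragment without an enclosing tag is only partly handled.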

1 Comment

yep, that makes sense - HTMLcomments, script blocks, or any other xhtml CDATA would mess things up no matter what regexp you come up with
0

Assuming a well-formed html document with outer/enclosing tags like <html>, I would think the easiest way would be to look for the > and < signs:

/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/$1AAA$2/

If you're dealing with HTML fragments that may not have enclosing tags, it gets a little more complicated: you'd have to allow for the start of the string and the end of the string as well.

Example JS (sorry, missed the tag):

alert('<A HREF="ZZZ">test test ZZZ<SPAN>ZZZ test test</SPAN></A>'.replace(/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/g, "$1AAA$2"));

Explanation: for each match that

  • starts with >: \>
  • follows with any number of characters that are neither > nor <: [^\>\<]*
  • then has "ZZZ"
  • follows with any number of characters that are neither > nor <: [^\>\<]*
  • and ends with <: \<

Replace with

  • everything before the ZZZ, marked with the first capture group (parentheses): $1
  • AAA
  • everything after the ZZZ, marked with the second capture group (parentheses): $2

Using the "g" (global) option to ensure that all possible matches are replaced.
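For the fragment case mentioned earlier, one possible variant allows the start of the string in place of the leading > and the end of the string in place of the trailing <. This is only a sketch: `replaceOutsideTags` is a made-up name, and `word` is assumed to be a plain literal with no regex metacharacters.

```javascript
// Replace `word` in text that lies between tags, at the start of the
// string, or at the end of the string. Attribute values are not touched.
function replaceOutsideTags(html, word, replacement) {
    var pattern = new RegExp(
        '((?:^|>)[^<>]*)' + word + '([^<>]*(?:<|$))', 'g');
    // A replacer function avoids surprises with "$" in the replacement
    return html.replace(pattern, function (match, before, after) {
        return before + replacement + after;
    });
}

console.log(replaceOutsideTags('ZZZ <b>ZZZ</b> ZZZ', 'ZZZ', 'AAA'));
// Each match consumes its whole text run, so only the last ZZZ in a run
// that contains several occurrences is replaced:
console.log(replaceOutsideTags('ZZZ and ZZZ', 'ZZZ', 'AAA'));
```

The second call shows a limitation this approach shares with the original pattern: when one stretch of text contains ZZZ more than once, a single pass replaces only the last occurrence. Walking the DOM text nodes does not have this problem.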

2 Comments

Thanks TAO, you're great.. if you have brief explanation about the regex it will be helpful, thanks again...
Done, hope that helps. I would recommend you use a DOM-traversal method if possible though, as outlined in @Suor's answer and @Tim Down's comment; this type of solution will always be more reliable. As @Jens noted, the regexp solution in this answer will probably break under some circumstances.
0

Try this:

var str = '<DIV>ZZZ test test</DIV><A HREF="ZZZ">test test ZZZ</A>';
var rpl = str.match(/href=\"(\w*)\"/i)[1];
console.log(str.replace(new RegExp(rpl + "(?=[^>]*<)", "gi"), "XXX"));

Comments

0

have you tried this:

replace:

>([^<>]*)(ZZZ)([^<>]*)<

with:

>$1AAA$3<

but beware all the savvy suggestions in the post linked in the first comment to your question!

3 Comments

beware!? i did not mean it, sorry... take into account all the savvy suggestions....
hi sergio, thank you for your suggestion, i like your idea.. it almost works, but not perfect yet , it gave me -- < A HREF="ZZZ"/>test test AAA</S P A N>ZZZ test test</SPAN >< /A >
try this link: regexr.com?2tpq1, it is flash-based regex engine initialized with my suggestion... it seems to work ok... are you using the "g" flag (for global replace)?
