Regexp to search/replace only text, not in HTML attribute

Question

I'm using JavaScript regular expressions. Assuming I'm working with well-formed source, I want to remove any space before [,.] and keep exactly one space after [,.], except when the [,.] is part of a number. So I use:

text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');

The problem is that this also replaces text inside HTML tag attributes. For example, my text is (always wrapped in a tag):

<p>Test,and test . Again <img src="xyz.jpg"> ...</p>

Now it adds a space, giving src="xyz. jpg", which is not what I want. How can I rewrite my regular expression? What I want is:

<p>Test, and test. Again <img src="xyz.jpg"> ...</p>

Thanks!

This isn't something regexes are good at, as HTML isn't a regular language. There is too much scope/nesting/context.
– CaffGeek, Commented Aug 11, 2010 at 15:26
Yes, I think so, even though I haven't tried. I wanted to write it as a CKEditor plugin, which is why I said "well-formed" (well, I meant XHTML anyway). I have the source code, but I think I can get it as DOM elements.
– jcisio, Commented Aug 13, 2010 at 8:10

Alan Moore · Accepted Answer · 2010-08-14 04:00:43Z

4

You can use a lookahead to make sure the match isn't occurring inside a tag:

text = text.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');

The usual warnings apply regarding CDATA sections, SGML comments, SCRIPT elements, and angle brackets in attribute values. But I suspect your real problems will arise from the vagaries of "plain" text; HTML's not even in the same league. :D
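To see what the lookahead does: `(?![^<>]*>)` rejects any position from which a `>` can be reached without first crossing a `<` or another `>`, which is the situation inside a tag. Below is a minimal sketch that runs in Node.js or a browser console. The helper name `fixPunctuationInSource` and the sample string are illustrative assumptions, not part of the answer.

```javascript
// Hypothetical helper wrapping the regex from this answer.
function fixPunctuationInSource(text) {
  // (?![^<>]*>) : fail if a '>' follows with no '<' or '>' in between,
  //               i.e. the current position is inside a tag
  //  *([.,]) *  : a comma or period with optional spaces on either side
  // ([^ \d])    : the next character is neither a space nor a digit
  return text.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');
}

const input = '<p>Test,and test . Again <img src="xyz.jpg"> more text</p>';
const output = fixPunctuationInSource(input);
console.log(output);
// → <p>Test, and test. Again <img src="xyz.jpg"> more text</p>
```

Note that `<` also counts as "neither a space nor a digit", so a period directly before a tag, as in `end.</p>`, comes out as `end. </p>`, and a run such as `...` gets spaces inserted between its dots. These are examples of the plain-text quirks mentioned above.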

edited Aug 14, 2010 at 4:00

answered Aug 11, 2010 at 22:40


4 Comments

jcisio Over a year ago

It doesn't work. "Test,and" should become "Test, and". I was thinking of lookafter too, but I couldn't get it. Something like looking for "...> anything but < (text to find/replace)". And I think the [^<>]* part above is not necessary.

Alan Moore Over a year ago

There were more asterisks in there when I tested it, but they disappeared. Try it now.

jcisio Over a year ago

I was using another solution. But this one is much better :) Thanks.

Rusty Over a year ago

@AlanMoore can you kindly provide a regex to also take care of any Search pattern provided....

scy won't contribute anymore · Accepted Answer · 2010-08-11 15:30:19Z

1

Do not try to rewrite your expression to do this. You won’t succeed and will almost certainly forget about some corner cases. In the best case, this will lead to nasty bugs and in the worst case you will introduce security problems.

Instead, when you’re already using JavaScript and have well-formed code, use a genuine XML parser to loop over the text nodes and only apply your regex to them.
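One way to follow this advice when the input is a string of XHTML source is to parse it, rewrite only the text nodes, and serialize the result. The sketch below assumes a browser environment that provides `DOMParser` and `XMLSerializer`. The function names `fixTextNodes` and `fixXhtmlSource` are made up for illustration, and the source is assumed to have a single root element and self-closed empty tags such as `<img src="xyz.jpg" />`.

```javascript
// Node-type values from the DOM standard (Node.ELEMENT_NODE, Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Apply the question's regex to every text node below `node`.
function fixTextNodes(node) {
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      child.nodeValue = child.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
    } else if (child.nodeType === ELEMENT_NODE) {
      fixTextNodes(child); // attributes are never visited, only child nodes
    }
  }
}

// Hypothetical helper: parse, fix the text, serialize back to a string.
// The parser and serializer default to the browser's own implementations.
function fixXhtmlSource(source, parser = new DOMParser(), serializer = new XMLSerializer()) {
  const doc = parser.parseFromString(source, 'application/xhtml+xml');
  // XML parse failures are reported as a <parsererror> element, not an exception.
  if (doc.getElementsByTagName('parsererror').length > 0) {
    throw new Error('Source is not well-formed XML');
  }
  fixTextNodes(doc.documentElement);
  return serializer.serializeToString(doc.documentElement);
}
```

Because the regex only ever sees `nodeValue` of text nodes, attribute values like `src="xyz.jpg"` cannot be affected. The serialized string may differ cosmetically from the input (for example in how empty elements are written), so treat the round trip as an assumption to check against your editor's output.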

answered Aug 11, 2010 at 15:30


Gumbo · Accepted Answer · 2010-08-11 16:30:56Z

1

If you can access that text through the DOM, you can do this:

function fixPunctuation(elem) {
    // check if parameter is an ELEMENT_NODE
    if (!(elem instanceof Node) || elem.nodeType !== Node.ELEMENT_NODE) return;
    var children = elem.childNodes, node;
    // iterate the child nodes of the element node
    for (var i=0; children[i]; ++i) {
        node = children[i];
        // check the child’s node type
        switch (node.nodeType) {
        case Node.ELEMENT_NODE:
            // call fixPunctuation if it’s also an ELEMENT_NODE
            fixPunctuation(node);
            break;
        case Node.TEXT_NODE:
            // fix punctuation if it’s a TEXT_NODE
            node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
            break;
        }
    }
}

Now just pass the DOM node to that function like this:

fixPunctuation(document.body);
fixPunctuation(document.getElementById("foobar"));
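When the function is applied to a large subtree such as `document.body`, it also rewrites text inside elements like `<script>`, `<style>` or `<pre>`, where changing spaces around commas and periods is usually unwanted. A possible variant that skips such elements is sketched below. The name `fixPunctuationSkipping` and the contents of the skip list are assumptions for illustration, not part of the answer.

```javascript
// Elements whose text is left untouched; this list is an assumption, adjust as needed.
const SKIPPED_ELEMENTS = new Set(['script', 'style', 'pre', 'code', 'textarea']);

function fixPunctuationSkipping(elem) {
  // nodeType 1 is ELEMENT_NODE, nodeType 3 is TEXT_NODE
  if (!elem || elem.nodeType !== 1) return;
  // nodeName is upper case in HTML documents and lower case in XHTML ones
  if (SKIPPED_ELEMENTS.has(elem.nodeName.toLowerCase())) return;
  for (const node of Array.from(elem.childNodes)) {
    if (node.nodeType === 1) {
      fixPunctuationSkipping(node);
    } else if (node.nodeType === 3) {
      node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
    }
  }
}
```

It is called the same way, for example `fixPunctuationSkipping(document.body);`.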

edited Aug 11, 2010 at 16:30

answered Aug 11, 2010 at 15:44


1 Comment

Richard JP Le Guen Over a year ago

You mis-spelt the function name fixPunctuation as fixPunctutation a few times ;)

Doug · Accepted Answer · 2010-08-11 15:29:11Z

0

HTML is not a "regular language", so regex is not the best tool for parsing it. You might be better served by an HTML parser like this one to get at the attribute, and then apply a regex to do something with the value.

Enjoy!

answered Aug 11, 2010 at 15:29


1 Comment

BalusC Over a year ago

That's a Java HTML parser. He wants to do this in JavaScript.

Richard JP Le Guen · Accepted Answer · 2010-08-11 15:50:46Z

As stated above and many times before, HTML is not a regular language and thus cannot be parsed with regular expressions.

You will have to do this recursively; I'd suggest crawling the DOM object.

Try something like this...

function regexReplaceInnerText(curr_element) {
    if (curr_element.childNodes.length <= 0) { // termination case:
                                               // no children; this is a "leaf node"
        if (curr_element.nodeName == "#text" || curr_element.nodeType == 3) { // node is text; not an empty tag like <br />
            if (curr_element.data.replace(/^\s*|\s*$/g, '') != "") { // node isn't just white space
                                                                     // (you can skip this check if you want)
                var text = curr_element.data;
                text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                curr_element.data = text;
            }
        }
    } else {
        // recursive case:
        // this isn't a leaf node, so we iterate over all children and recurse
        for (var i = 0; curr_element.childNodes[i]; i++) {
            regexReplaceInnerText(curr_element.childNodes[i]);
        }
    }
}
// then get the element whose children's text nodes you want to be regex'd
regexReplaceInnerText(document.getElementsByTagName("body")[0]);
// or if you don't want to do the whole document...
regexReplaceInnerText(document.getElementById("ElementToRegEx"));

Community · Accepted Answer · 2017-05-23 12:01:52Z

0

Don't parse ~~regex~~HTML with ~~HTML~~regex. If you know your HTML is well-formed, use an HTML/XML parser. Otherwise, run it through Tidy first and then use an XML parser.

edited May 23, 2017 at 12:01

answered Aug 11, 2010 at 15:29

Vivin Paliath

3 Comments

scy won't contribute anymore Over a year ago

You probably mean “don’t parse HTML with regex”, not the other way around. ;)

Richard JP Le Guen Over a year ago

@Scytale - He's just being thorough; so long as we're on the subject, though, people shouldn't parse RegEx with HTML either! ;)

Vivin Paliath Over a year ago

@Scytale @Richard hahaha I didn't even see that. My bad - will fix :)
