Possible Duplicate:
How to remove HTML tag in Java
RegEx match open tags except XHTML self-contained tags

I want to remove a specific HTML tag along with its content.

For example, if the HTML is:

<span style='font-family:Verdana;mso-bidi-font-family:
"Times New Roman";display:none;mso-hide:all'>contents</span>

If the tag contains "mso-*", it must remove the whole tag (opening, closing and content).

1 Answer

As Dave Newton pointed out in his comment, an HTML parser is the way to go here. If you really want to do it the hard way, here's a regex that works for simple, non-nested markup like the example in the question:

    String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:"
        + "\"Times New Roman\";display:none;mso-hide:all'>contents</span>BAR";
    // regex matches every opening tag that contains 'mso-' in an attribute name
    // or value, the contents and the corresponding closing tag
    String regex = "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>";
    String replacement = "";
    System.out.println(html.replaceAll(regex, replacement)); // prints FOOBAR
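
One thing to watch for: in Java, `.` does not match line terminators by default, so the regex above stops working as soon as the content between the tags spans more than one line. Below is a minimal sketch of one possible variation that compiles the pattern with `Pattern.DOTALL` and `Pattern.CASE_INSENSITIVE`. It assumes that matching elements are never nested inside another element of the same name and that no attribute value contains `>`. The names `MsoTagStripper` and `stripMsoElements` are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; a variation on the regex above, not a substitute for an HTML parser.
class MsoTagStripper {

    // <(\w+)\b      opening tag name, captured as group 1
    // [^>]*?mso-    any attribute text up to the first "mso-" inside the same tag
    // [^>]*>        rest of the opening tag
    // .*?</\1\s*>   shortest possible content, then the closing tag with the same name
    private static final Pattern MSO_ELEMENT = Pattern.compile(
            "<(\\w+)\\b[^>]*?mso-[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Removes every element whose opening tag contains "mso-", including its content.
    static String stripMsoElements(String html) {
        return MSO_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:\n"
                + "\"Times New Roman\";display:none;mso-hide:all'>line one\nline two</span>BAR";
        System.out.println(stripMsoElements(html)); // prints FOOBAR
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which is what `String.replaceAll` does. Nested same-name elements would still be cut off at the first closing tag, which is one more reason to prefer a parser for real-world input.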

4 Comments

And if the style attribute doesn't contain any mso- directive... maybe a more generalized regexp would be in order.
@pap let me quote the OP: If the tag contains "mso-*", it must remove the whole tag (opening, closing and content). My post answers his question, and I don't understand your comment.
Indeed you are correct. Shame on me for not reading the question properly :) And I think you underestimate yourself, you seem to have understood my comment just fine, just that I was incorrect ;)
@pap it was my polite way of saying, I think your comment is wrong ;)
