remove html tags from string using java [duplicate]

Question

I am writing a program that reads emails and separates spam from ham. I read the input with Java's BufferedReader class, and I can already remove unwanted characters such as '(' or '.' using the replaceAll() method. I also want to remove HTML tags, including entities such as &amp;. How can I achieve this?

thanks

EDIT: Thanks for the responses, but I already have a regex. How can I combine both needs into one? Here's the regex I am using now:

lines.replaceAll("[^a-zA-Z]", " ")

Note: I am reading the lines from a .txt file. Any other suggestions?

I tried Jsoup, but it's not working. There is no compile error; it simply doesn't work.
– Maverick, Commented Dec 13, 2010 at 19:47
Similar topics: stackoverflow.com/questions/1699313/… stackoverflow.com/questions/240546/…
– user467871, Commented Dec 13, 2010 at 20:49

mishmash · Accepted Answer · 2016-07-16 18:43:24Z

37

Maybe this will work:

String noHTMLString = htmlString.replaceAll("\\<.*?>","");

It uses regular expressions to remove all HTML tags in a string.

More specifically, it removes every XML-like tag from a string, so <1234> will be removed even though it is not a valid HTML tag. It is good enough for most purposes, though.

Hope this helps.
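To address the EDIT in the question, here is a minimal sketch of chaining this tag pattern with the asker's letters-only regex. The class name StripHtmlDemo, the method name clean, and the entity pattern "&[a-zA-Z0-9#]+;" are assumptions made for illustration; they do not come from the question or the answer. The order matters: tags and entities have to go first, otherwise the letters-only pass would turn "<b>" into " b " and leave the tag name behind.

```java
class StripHtmlDemo {

    // Hypothetical helper: strips tags, then entity references, then everything
    // that is not an ASCII letter. Each match is replaced with a space so that
    // neighbouring words do not get glued together.
    static String clean(String line) {
        return line
                .replaceAll("<.*?>", " ")            // anything that looks like a tag
                .replaceAll("&[a-zA-Z0-9#]+;", " ")  // entity references (assumed pattern)
                .replaceAll("[^a-zA-Z]", " ");       // the asker's original regex
    }

    public static void main(String[] args) {
        String cleaned = clean("<p>Hello &amp; <b>world</b>!</p>");
        // Collapse the runs of spaces left behind by the replacements.
        System.out.println(cleaned.trim().replaceAll("\\s+", " "));
    }
}
```

Keep in mind that replaceAll() returns a new string and leaves the original untouched, so the result has to be assigned, for example line = clean(line), inside the read loop.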

edited Jul 16, 2016 at 18:43

answered Dec 13, 2010 at 19:16

mishmash


4 Comments

Dave Costa Over a year ago

A second call, like replaceAll("&.*?;","") would take out entity references. Although it seems odd to me that one would just want to remove these, rather than translating them back into the characters they represent.

Vincent Jia Over a year ago

Very useful and powerful solution.

Jan Hruby Over a year ago

Be careful with this regex; it is not a full solution. It also removes text that is not a tag: applying it to a string like "a < b only when c > d" results in "a  d".

mishmash Over a year ago

True, it's a quick and dirty way of doing it and should be used with caution.
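The pitfall discussed in these comments can be reproduced with a short sketch. The second pattern is an assumption, not something proposed in the thread: it only treats "<" or "</" as the start of a tag when a letter follows immediately. That avoids this particular false positive, but it is still not a real HTML parser (for example, "a <b and c> d" would still be stripped).

```java
class TagRegexPitfallDemo {

    // The pattern from the answer: "<", as few characters as possible, then ">".
    static String stripAnyAngleBrackets(String s) {
        return s.replaceAll("<.*?>", "");
    }

    // Assumed stricter variant: "<" or "</" must be followed directly by a letter.
    static String stripTagLike(String s) {
        return s.replaceAll("</?[a-zA-Z][^>]*>", "");
    }

    public static void main(String[] args) {
        String text = "a < b only when c > d";
        System.out.println(stripAnyAngleBrackets(text));       // the comparison is swallowed
        System.out.println(stripTagLike(text));                // the text is left unchanged
        System.out.println(stripTagLike("<b>bold</b> text"));  // real tags are still removed
    }
}
```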

Program-Me-Rev · Accepted Answer · 2015-11-15 08:22:15Z

11

JSOUP

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

answered Nov 15, 2015 at 8:22

Program-Me-Rev


1 Comment

Ben Over a year ago

For Maven users, here is the Maven repository link for the latest version: mvnrepository.com/artifact/org.jsoup/jsoup

user473395 · Accepted Answer · 2010-12-13 19:20:29Z

You will want to do some lightweight parsing to strip the HTML:

import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

String extractText(String html) throws IOException {
    final StringBuilder text = new StringBuilder();

    // ParserCallback's handlers do nothing by default, so only handleText
    // needs to be overridden to collect the text between tags.
    ParserCallback parserCallback = new ParserCallback() {
        @Override
        public void handleText(final char[] data, final int pos) {
            // Every text run is prefixed with a space, so the result starts with one.
            text.append(' ').append(data);
        }
    };

    // The last argument (ignoreCharSet) tells the parser to ignore any
    // charset declared inside the document.
    new ParserDelegator().parse(new StringReader(html), parserCallback, true);

    return text.toString();
}

jsingh · Accepted Answer · 2017-01-21 05:49:31Z

-1

import java.io.*;

public class Html2TextWithRegExp {

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
                // or, to keep the line breaks:
                // sb.append(line).append(System.lineSeparator());
            }
        }
        // Remove everything that looks like a tag: "<", as few characters as possible, then ">"
        String nohtml = sb.toString().replaceAll("\\<.*?>", "");
        System.out.println(nohtml);
    }
}

edited Jan 21, 2017 at 5:49

answered Jan 21, 2017 at 5:28

jsingh


1 Comment

Bálint Over a year ago

Why did you create an empty constructor?
