
I am writing a program that reads emails and separates spam from ham. I currently read the input with Java's BufferedReader class, and I can already remove unwanted characters such as '(' or '.' using the replaceAll() method. I also want to remove HTML tags, including entities such as &amp;. How can I achieve this?

thanks

EDIT: Thanks for the responses, but I already have a regex. How can I combine both needs into one expression? Here is the regex I am using now:

lines.replaceAll("[^a-zA-Z]", " ")

Note: I am reading the lines from a .txt file. Any other suggestions?


4 Answers

37

Maybe this will work:

String noHTMLString = htmlString.replaceAll("\\<.*?>","");

It uses a regular expression to remove every HTML tag from the string.

More precisely, it removes anything that looks like an XML tag, so <1234> is removed even though it is not a valid HTML tag. It is good enough for most purposes.

Hope this helps.
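
To combine this with the character filter from the question, one option is to chain the replacements so that tags and entities are removed before the letters-only filter runs. If the [^a-zA-Z] filter ran first, it would delete the angle brackets and leave the tag names behind as words. The following is a minimal sketch, not a complete HTML cleaner: the class name StripHtmlDemo and the method clean are made up for illustration, and the entity pattern is an assumption that only covers references of the form &name; or &#123;.

```java
class StripHtmlDemo {

    // Hypothetical helper: strips tags, then entities, then every non-letter.
    static String clean(String line) {
        return line
            .replaceAll("<[^>]*>", " ")            // anything that looks like a tag
            .replaceAll("&#?[a-zA-Z0-9]+;", " ")   // entity references such as &amp; or &#160;
            .replaceAll("[^a-zA-Z]", " ")          // the filter from the question
            .replaceAll("\\s+", " ")               // collapse the runs of spaces left behind
            .trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("<p>Buy <b>now</b> &amp; save 50%!</p>"));
    }
}
```

Called on the sample line in main, this returns "Buy now save". Tags are replaced with a space rather than an empty string so that words separated only by markup, as in adjacent table cells, do not get glued together.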


4 Comments

A second call, such as replaceAll("&.*?;",""), would strip entity references, although it seems odd to me to just remove them rather than translate them back into the characters they represent.
Very useful and powerful solution.
Be careful with regex; this is not a complete solution. It also removes text you want to keep: applied to a string like "a < b only when c > d", this expression leaves only "a  d".
True, it's a quick and dirty way of doing it and should be used with caution.
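
One way to reduce the false matches described in the comments above is to require a letter (optionally preceded by a slash) right after the '<', so that a bare less-than sign in ordinary text is left alone. This is still a heuristic sketch under that assumption, not a parser. The names TagPatternDemo and stripTags are hypothetical, and comments or doctype declarations (<!-- ... -->, <!DOCTYPE ...>) are deliberately not handled.

```java
import java.util.regex.Pattern;

class TagPatternDemo {

    // '<', an optional '/', a letter, then everything up to the next '>'.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    // Hypothetical helper: removes only sequences that start like a real tag.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // No letter directly after '<', so nothing is removed here.
        System.out.println(stripTags("a < b only when c > d"));
        // <b> and </b> are removed; <1234> no longer counts as a tag.
        System.out.println(stripTags("<b>bold</b> and <1234>"));
    }
}
```

Text such as "x <y and z> w", where a letter directly follows the '<', is still treated as a tag, so a real HTML parser remains the safer choice when the input is untrusted or messy.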
11

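Use jsoup, an HTML parser library:

// Requires: import org.jsoup.Jsoup;
public static String html2text(String html) {
    // Parse the markup and return only its text content.
    return Jsoup.parse(html).text();
}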

1 Comment

For Maven users, the latest version is listed in the Maven repository: mvnrepository.com/artifact/org.jsoup/jsoup
3

You will want to do some lightweight parsing to strip the HTML:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

String extractText(String html) throws IOException {
    final List<String> list = new ArrayList<String>();

    // ParserCallback's methods do nothing by default, so only handleText needs overriding.
    ParserCallback parserCallback = new ParserCallback() {
        @Override
        public void handleText(final char[] data, final int pos) {
            list.add(new String(data));
        }
    };
    // The third argument (ignoreCharSet = true) makes the parser ignore charset declarations in the document.
    new ParserDelegator().parse(new StringReader(html), parserCallback, true);

    // Join the collected text fragments with single spaces.
    return String.join(" ", list);
}


-1

import java.io.BufferedReader;
import java.io.FileReader;

public class Html2TextWithRegExp {

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Lines are joined without a separator. To keep line breaks, use
                // sb.append(line).append(System.lineSeparator()) instead.
                sb.append(line);
            }
        }
        // Remove everything from each '<' up to the next '>'.
        String nohtml = sb.toString().replaceAll("\\<.*?>", "");
        System.out.println(nohtml);
    }
}
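
Whether the lines are joined with or without a separator matters for the regex: by default '.' does not match line terminators, so if line breaks are kept, a tag that spans two lines is not removed by "<.*?>". One possible workaround, sketched here with illustrative class and method names, is to compile the pattern with Pattern.DOTALL.

```java
import java.util.regex.Pattern;

class MultiLineTagDemo {

    // DOTALL lets '.' match line terminators, so a tag may span several lines.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    // Hypothetical helper: removes tags even when they contain line breaks.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a\n   href=\"x.html\">link</a> text";
        // Default mode: the opening tag contains a newline and is left in place.
        System.out.println(html.replaceAll("<.*?>", ""));
        // DOTALL mode: both the opening and the closing tag are removed.
        System.out.println(stripTags(html));
    }
}
```

Compiling the pattern once and reusing it also avoids recompiling the regex on every call, which replaceAll on a String does internally.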

1 Comment

Why did you create an empty constructor?
