89

I am trying to parse HTML from a webpage in Android, and since the webpage is not well formed, I get a SAXException.

Is there a way to parse HTML in Android?
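The failure described above can be reproduced with a small sketch. It assumes the parser in use is the standard javax.xml.parsers SAX parser; the class name StrictParseDemo, the helper isWellFormed, and the sample markup are hypothetical and only illustrate that a strict XML parser rejects HTML with unclosed tags.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows a strict SAX parser rejecting real-world HTML.
class StrictParseDemo {

    // Returns true if the SAX parser accepts the markup, false if it
    // reports a SAXException (for example, because a tag is never closed).
    static boolean isWellFormed(String markup) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for in-memory input; surfaced as an unchecked error.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed markup parses cleanly.
        System.out.println(isWellFormed("<html><body><p>ok</p></body></html>")); // true
        // <br> is never closed, which is legal HTML but not well-formed XML.
        System.out.println(isWellFormed("<html><body><p>unclosed<br></body></html>")); // false
    }
}
```

This is why a lenient, HTML-aware parser is needed for pages like this rather than an XML parser.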

2
  • I suspect the Rhino dependency will make htmlunit hell to compile on Android, but you could try... Also, some other non-strict HTML parser such as soup might work. Commented Feb 2, 2010 at 22:04
  • I wonder if webkit can be used here. Commented Feb 2, 2010 at 22:20

5 Answers 5

76

I just encountered this problem. I tried a few things, but settled on using JSoup. The jar is about 132 KB, which is a bit big, but if you download the source and remove the methods you will not be using, it gets smaller.
A good thing about it is that it handles badly formed HTML.

Here's a good example from their site.

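import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parse a local file; the third argument is the base URI used to resolve relative links.
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

// Alternatively, load the document straight from a URL:
//http://jsoup.org/cookbook/input/load-document-from-url
//Document doc = Jsoup.connect("http://example.com/").get();

// getElementById returns null if the page has no element with id "content".
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}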

4 Comments

You could try including the full jar, and run ProGuard on your app in your production release to strip out unused code.
CAUTION: JSoup is very very slow.
@kevin a source for that claim? You may have some debugging enabled.
What about content that is loaded dynamically by JavaScript while the HTML page is rendered on the client side? Will JSoup show this content as well?
57

Have you tried using Html.fromHtml(source)?

I think that class is pretty liberal with respect to source quality (it uses TagSoup internally, which was designed with real-life, bad HTML in mind). It doesn't support all HTML tags, but it does come with a handler you can implement to react to tags it doesn't understand.

2 Comments

This is very simple, but I cannot search for exact things (as with XPath).
Attention, please: this can trigger "Suspending all threads". I ran into it when I received JSON that contained HTML-formatted text. The HTML text displayed correctly, but after using Html.fromHtml() I hit this problem.
25
String tmpHtml = "<html>a whole bunch of html stuff</html>";
String htmlTextStr = Html.fromHtml(tmpHtml).toString();

4 Comments

nice and simple, no plugins, love it! tnxs
As a note: calling toString() on the Spanned object returned from Html.fromHtml(str) will make many of the HTML tags not work (including <i> <u> <b>). So if you're setting a textview just do: myTextView.setText(Html.fromHtml(str))
@Sakiboy You are right. In addition to this, there are many other tags that do not work with Html.fromHtml(). Check this out: stackoverflow.com/a/3150456/1987045
Awesome, exactly what I wanted. My server-side dev was sending me HTML; now I can easily convert it to a raw string. Thanks.
3

We all know that programming has endless possibilities. There are a number of solutions available for a single problem, so I think all of the above solutions are fine and may be helpful for someone, but for me this one saved my day.

The code goes like this:

  // Fetches the page on a background thread (Android does not allow network
  // I/O on the main thread), then posts the text back to the UI thread.
  private void getWebsite() {
    new Thread(new Runnable() {
      @Override
      public void run() {
        final StringBuilder builder = new StringBuilder();

        try {
          Document doc = Jsoup.connect("http://www.ssaurel.com/blog").get();
          String title = doc.title();
          // Select every anchor element that has an href attribute.
          Elements links = doc.select("a[href]");

          builder.append(title).append("\n");

          for (Element link : links) {
            builder.append("\n").append("Link : ").append(link.attr("href"))
                .append("\n").append("Text : ").append(link.text());
          }
        } catch (IOException e) {
          builder.append("Error : ").append(e.getMessage()).append("\n");
        }

        // result is not declared in this snippet; presumably a TextView field of the activity.
        runOnUiThread(new Runnable() {
          @Override
          public void run() {
            result.setText(builder.toString());
          }
        });
      }
    }).start();
  }

You just have to call the above method from the onCreate method of your MainActivity.

I hope this one is also helpful for you guys.

Also read the original blog at Medium


1

Maybe you can use WebView, but as you can see in the docs, WebView doesn't support JavaScript and other stuff like widgets by default.

http://developer.android.com/reference/android/webkit/WebView.html

I think that you can enable JavaScript if you need it.

2 Comments

Yes, you can enable JS easily. But there is no need to use WebView for HTML parsing.
That doesn't answer the question
