4

I need to parse a lot of HTML web pages (100+) for specific content (a few lines of text that are almost the same).

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both methods are slow, and with jsoup I get the following error: java.net.SocketTimeoutException: Read timed out (on multiple computers with different connections).

Is there anything better?

EDIT:

Now that I've gotten jsoup to work, I think a better question is: how do I speed it up?

    Jsoup supports both DOM traversal and [CSS] selectors, no? (Why use regular expressions? :-/) Commented Jul 14, 2011 at 3:13

3 Answers

5

Did you try lengthening the timeout in jsoup? I believe the default is only 3 seconds. See e.g. this.
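
On the follow-up about speed: with 100+ pages, most of the running time is probably spent waiting on the network rather than parsing, so fetching several pages at once may help more than switching parsers. Below is a minimal sketch of that idea using only the JDK. The class name ParallelFetch, the method fetchAll, and the pool size are made up for illustration. The fetcher is passed in as a function; with jsoup it would presumably be something along the lines of Jsoup.connect(url).timeout(10000).get() followed by a selector, which is also where the longer timeout goes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFetch {

    // Runs `fetcher` once per URL on a fixed-size thread pool and returns
    // url -> result in the order the URLs were given.
    // A fetch that throws is recorded as null instead of aborting the batch.
    public static Map<String, String> fetchAll(List<String> urls,
                                               Function<String, String> fetcher,
                                               int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the downloads overlap.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Then collect the results in input order.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), null);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher (an assumption for the demo): replace the lambda
        // with the real download-and-extract step, e.g. a jsoup call.
        Function<String, String> fakeFetcher = url -> "content of " + url;
        List<String> urls = List.of("http://example.com/a", "http://example.com/b");
        fetchAll(urls, fakeFetcher, 4).forEach((u, text) -> System.out.println(u + " -> " + text));
    }
}
```

A small pool (somewhere around 4 to 10 threads) is usually enough for this, and it keeps you from hammering the target site. Failed pages come back as null so they can be retried separately.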


1 Comment

Thanks. I got the jsoup code working. It has a running time of 2 minutes.
2

I would suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library. It uses Lucene under the hood, and I find it to be a very reliable crawler.

1 Comment

Jericho is a good alternative too. I've used Nutch and Jericho, but I have no experience with jsoup, so I can't comment on why it would be taking so long.
0

A great skill to learn would be XPath. It would be perfect for that job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.

Here's a nice link since you are interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also a good thing to know when you're not using Java, so that's why I would choose that route.
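
To give an idea of what this looks like in Java, here is a minimal sketch using the javax.xml.xpath API that ships with the JDK. It assumes the input is already well-formed XML (XHTML-style markup with no namespace and no DOCTYPE); ordinary HTML would first have to go through a cleanup library that produces such a document. The class name XPathSketch, the helper selectText, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Evaluates an XPath expression against a well-formed XML string and
    // returns the trimmed text content of every matching node.
    public static List<String> selectText(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, already well-formed page fragment.
        String page = "<html><body>"
                + "<p class=\"price\">10</p>"
                + "<p>other</p>"
                + "<p class=\"price\">20</p>"
                + "</body></html>";
        // Select only the paragraphs carrying the class we care about.
        System.out.println(selectText(page, "//p[@class='price']"));
    }
}
```

This throws an exception on markup that is not well-formed (unclosed tags, unquoted attributes), so on real-world pages it only works after a tidy-up step.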

8 Comments

Except ... HTML is not XML. I suspect this post wouldn't have received a down-vote (not mine) if a link to a library that exposed HTML via XPath was also included. (Such tools, which are capable of treating HTML "as" an XML DOM, are definitely worth talking about.)
XPath is for XML, and won't work on any HTML that isn't XML compatible.
@Mr. Wanta Yes, so what Java library parses HTML (not just XML) and exposes XPath over it? :) This answer isn't bad, but it is missing some important pieces of the puzzle. (Note that jsoup, with which the question is tagged, supports CSS selectors, but not XPath -- it looks like this feature has been requested.)
Here's an example of using XOM and TagSoup to find elements in HTML - stackoverflow.com/questions/773340/…