Parsing HTML webpages in Java

Question

I need to parse/read a lot of HTML webpages (100+) for specific content (a few lines of text that are almost the same).

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both methods are slow, and with jsoup I get the following error: java.net.SocketTimeoutException: Read timed out (this happens on multiple computers with different connections).

Is there anything better?

EDIT:

Now that I've gotten jsoup to work, I think a better question is how do I speed it up?

Jsoup supports both DOM traversal and [CSS] selectors, no? (Why use regular expressions? :-/)
– user166390, Commented Jul 14, 2011 at 3:13

Ed Staub · Accepted Answer · 2011-07-14 02:55:34Z

5

Did you try lengthening the timeout in jsoup? It's only 3 seconds by default, I believe.
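In jsoup the timeout is set on the connection, for example Jsoup.connect(url).timeout(10000).get(), where the argument is in milliseconds. The sketch below shows the same idea using only the JDK's java.net.http client (Java 11+), so it runs without extra dependencies; the class name TimeoutFetch, its method names, and the 10 and 30 second values are illustrative assumptions, not something from this answer.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper; names and timeout values are illustrative only.
final class TimeoutFetch {

    // The connect timeout bounds how long establishing the connection may take.
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // The per-request timeout bounds how long to wait for the response.
    static HttpRequest buildRequest(String url, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .GET()
                .build();
    }

    // Fetches the page body as a string; throws HttpTimeoutException on timeout.
    static String fetch(HttpClient client, String url, Duration timeout)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url, timeout), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TimeoutFetch <url>");
            return;
        }
        HttpClient client = newClient(Duration.ofSeconds(10));
        String html = fetch(client, args[0], Duration.ofSeconds(30));
        System.out.println(html.length() + " characters fetched");
    }
}
```

The returned string can then be handed to whichever parser you prefer, e.g. Jsoup.parse(html).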

1 Comment

samwise Over a year ago

Thanks. I got the jsoup code working. It has a running time of 2 minutes.
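On the running time: with 100+ pages fetched one after another, much of the two minutes is probably spent waiting on the network rather than parsing. One possible way to cut that down is to fetch several pages concurrently. The sketch below uses a fixed thread pool; ParallelFetch, fetchAll, and the stand-in fetcher are hypothetical names, and in real use the fetcher would be whatever call downloads and parses a page (for example a jsoup call).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; not part of jsoup or of the original thread.
final class ParallelFetch {

    // Runs the fetcher for every URL on a pool of `threads` workers and
    // returns the results keyed by URL, in the order the URLs were given.
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(pool.submit(() -> fetcher.apply(url)));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                // get() blocks until that page is done and rethrows its failure, if any.
                results.put(urls.get(i), pending.get(i).get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher that simulates network latency instead of downloading.
        Function<String, String> fakeFetcher = url -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "<html>" + url + "</html>";
        };
        List<String> urls = List.of(
                "http://example.com/a", "http://example.com/b", "http://example.com/c");
        Map<String, String> pages = fetchAll(urls, fakeFetcher, 3);
        pages.forEach((url, html) -> System.out.println(url + " -> " + html.length() + " chars"));
    }
}
```

Keep the pool small (a handful of threads) if all pages come from the same site, so the server is not flooded with simultaneous requests.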

MD Sayem Ahmed · Accepted Answer · 2011-07-14 03:00:53Z

2

I would suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library. It uses Lucene under the hood, and I find it to be a very reliable crawler.

answered Jul 14, 2011 at 2:54 by billygoat; edited Jul 14, 2011 at 3:00 by MD Sayem Ahmed

1 Comment

jkraybill Over a year ago

Jericho is a good alternative too. I've used Nutch and Jericho, but have no experience with JSoup so can't comment on why it would be taking so long.

JustBeingHelpful · Accepted Answer · 2011-07-14 02:56:37Z

0

A great skill to learn would be XPath. It would be perfect for that job! I just started learning it myself for automation testing. If you have questions, send me a message. I'd be glad to help you out, even though I'm not an expert.

Here's a nice link since you are interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also useful to know when you're not using Java, which is why I would choose that route.
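As a minimal sketch of what this looks like with the XPath API that ships with the JDK (javax.xml.xpath), assuming the page is well-formed XML (XHTML): the class name XPathOverXhtml, the sample markup, and the expression are made up for illustration. The JDK's XML parser rejects markup that is not well-formed, which much real-world HTML is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; sample markup and expression are illustrative only.
final class XPathOverXhtml {

    // Parses well-formed markup into a DOM and evaluates the XPath expression
    // against it, returning the string value of the result.
    static String evaluate(String wellFormedMarkup, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"price\">42 USD</div><p>other</p></body></html>";
        // Prints: 42 USD
        System.out.println(evaluate(page, "//div[@id='price']/text()"));
    }
}
```

Feeding it something like `<p>unclosed<br>` makes the parser throw a SAXParseException, so HTML that is not XML-compatible needs to be cleaned up by an HTML-aware parser first.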

8 Comments

user166390 Over a year ago

Except ... HTML is not XML. I suspect this post wouldn't have received a down-vote (not mine) if a link to a library that exposed HTML via XPath was also included. (Such tools, which are capable of treating HTML "as" an XML DOM, are definitely worth talking about.)

Ed Staub Over a year ago

XPath is for XML, and won't work on any HTML that isn't XML compatible.

JustBeingHelpful Over a year ago

It's used for both HTML and XML. tech-read.com/2011/03/09/extract-html-content-using-xpath

user166390 Over a year ago

@Mr. Wanta Yes, so what Java library parses HTML (not just XML) and exposes XPath over it? :) This answer isn't bad, but it is missing some important pieces of the puzzle. (Note that jsoup, with which the question is tagged, supports CSS selectors but not XPath -- it looks like this feature has been requested.)

laz Over a year ago

Here's an example of using XOM and TagSoup to find elements in HTML - stackoverflow.com/questions/773340/…
