Java: extract all resource links from HTML

Question

I am looking for a way to extract all resource links (URLs, links to files, and so on) from an HTML page in Java.

I first thought of extracting everything inside the src and href attributes, but that list would not be exhaustive. There is example code here: Jsoup, extract links, images, from website. Exception on runtime.

As a tricky example, I want to be able to detect links hidden inside JavaScript, which can itself appear anywhere in the HTML DOM:

<IMG onmouseover="window.open('http://www.evil.com/image.jpg')">

EDIT:

1) I am not looking for a regex-based solution, because regular expressions are not reliable for processing HTML documents.

2) I have tried HTML DOM parsers such as Jsoup. They allow the extraction of tags and their attributes quite well. However, I have not found a way to detect links inside JavaScript with them.

3) Maybe there is an API that tries to render the page and detects which resources need to be loaded?

Do you have any thoughts?

Thanks.

Considering there is no limit to the potential complexity of JavaScript code (just think about how many ways you can compose a string, for instance), I don't think it's feasible to detect every resource obtained via JavaScript. You'll have to come up with a heuristic limit on evaluation of the document's JavaScript. A highly simplified example would be "Look for window.open calls in elements' script attributes."
– VGR, Commented Jul 14, 2014 at 20:34
Thank you for your answer. Yes, as you said, JavaScript is very complex, so it will be hard to cover all the possible ways a resource can be requested. That is why I was wondering whether an API could test-render an HTML page and detect which resources need to be loaded.
– Yannickv, Commented Jul 14, 2014 at 21:18
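
The heuristic VGR describes can be sketched in plain Java. This is only an illustration, under the assumption that the script text (for example the value of an onmouseover attribute) has already been pulled out of the document by a DOM parser such as Jsoup. The class name ScriptUrlHeuristic and the method extractQuotedUrls are hypothetical, not part of any library. The regular expression is applied to the JavaScript snippet, not to the HTML, and it only finds absolute http or https URLs written as complete quoted string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jsoup or any other library.
class ScriptUrlHeuristic {

    // A single- or double-quoted string literal that starts with http:// or https://.
    // The back-reference \1 requires the closing quote to match the opening one.
    private static final Pattern QUOTED_URL =
            Pattern.compile("(['\"])(https?://[^'\"\\s]+)\\1");

    // Returns every absolute URL that appears as a complete quoted literal in the script.
    static List<String> extractQuotedUrls(String script) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = QUOTED_URL.matcher(script);
        while (matcher.find()) {
            urls.add(matcher.group(2)); // group 2 is the URL without its quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        // Attribute value taken from the example in the question.
        String onmouseover = "window.open('http://www.evil.com/image.jpg')";
        System.out.println(extractQuotedUrls(onmouseover));
    }
}
```

As VGR warns, a URL assembled at runtime (for instance 'http://' + host + '/image.jpg') will not be found this way, so the result is a lower bound on the resources a page can load.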

SerhatCan · Accepted Answer · 2014-07-14 18:20:41Z

If you are open to using PHP and have a bit of programming knowledge, here is a library:

http://simplehtmldom.sourceforge.net/

I used this library to extract information from tags, and even from their attributes. It should let you do what you want without writing complicated code.

3 Comments

Yannickv Over a year ago

Thank you for your answer. Yes, I have already tried an HTML DOM parser (I used Jsoup, which is quite similar but for Java). The problem is that it allows the extraction of tags and their attributes, but offers no way to get all the resource links in the DOM document. It is easy to query all the a tags to get the URL links, for example, but it is harder to detect hidden links like the one in the example in the question.

SerhatCan Over a year ago

Well, if you look at the manual at simplehtmldom.sourceforge.net/manual.htm, you will see that you can ask for the content of onmouseover. After that, you can ask for the content between the ' ' quotes. Maybe you could even do it in a single expression. I recommend looking at the library's features a little more deeply. If you have to, you could look for other tutorials about this library, because I know it is capable of such operations. Hope it helps :)

Yannickv Over a year ago

Thank you again :). Yes, I have seen that it is possible to apply regular expressions to the attributes, but it will be hard to cover all the possibilities. There are a lot of HTML event attributes (w3schools.com/tags/ref_eventattributes.asp), and there is also an unlimited number of file extension types. There must be a faster way to get these resources.
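
One way to avoid listing every event attribute by hand is to rely on the fact that the standard HTML event-handler attributes all start with "on". The sketch below is only an illustration of that idea. It assumes the attributes of one element have already been read into a name-to-value map by a DOM parser such as Jsoup; the class AttributeResourceScanner and the method candidateUrls are hypothetical names. Values of src and href are taken as they are, while the values of on* attributes are searched for quoted absolute URLs.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the attribute map stands in for
// whatever a DOM parser returns for a single element.
class AttributeResourceScanner {

    // A quoted string literal that starts with http:// or https://.
    private static final Pattern QUOTED_URL =
            Pattern.compile("(['\"])(https?://[^'\"\\s]+)\\1");

    // Collects candidate resource URLs from one element's attributes,
    // in encounter order and without duplicates.
    static Set<String> candidateUrls(Map<String, String> attributes) {
        Set<String> found = new LinkedHashSet<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String name = entry.getKey().toLowerCase(Locale.ROOT);
            String value = entry.getValue().trim();
            if (name.equals("src") || name.equals("href")) {
                // Direct link attributes: the value itself is the (possibly relative) URL.
                if (!value.isEmpty()) {
                    found.add(value);
                }
            } else if (name.startsWith("on")) {
                // Event-handler attributes: the value is script text, so apply the heuristic.
                Matcher matcher = QUOTED_URL.matcher(value);
                while (matcher.find()) {
                    found.add(matcher.group(2));
                }
            }
        }
        return found;
    }
}
```

This removes the need to enumerate event attributes, but it does nothing about the open-ended set of file extensions or about URLs that scripts build dynamically; for those, rendering the page and observing its requests, as suggested in the question, is likely the more complete approach.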
