Parsing href out of an HTML document with XPath throws a NullPointerException

Question

I want to extract the URLs from a specific part of a website, so I wrote a simple program in Java. However, the program throws a NullPointerException. It seems that getNamedItem("href") returns null. I suspect I am using getNamedItem the wrong way to extract the URLs from the "href" attribute.

DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream("clean.html"));

//Evaluate XPath against Document itself
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList)xPath.evaluate(".//*[@class='r_news_box']",
        doc.getDocumentElement(), XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    Element e = (Element) nodes.item(i);
    System.out.println(e.getAttributes().getNamedItem("href").getTextContent());
}

P.S: here is one of the nodes that should be selected by this xpath:

<div class="r_news_box">
<a class="picLink" target="_blank" href="/fa/news/427583/test">
<img class="r_news_img" width="50" height="65" src="/files/fa/news/1393/5/29/411217_553.jpg" alt="test"/>
</a>

har07 · Accepted Answer · 2014-08-23 11:52:29Z

1

This is probably because not all of the selected nodes have an href attribute; getNamedItem("href") returns null for a node without one. You may want to change your XPath so that only elements that have an href attribute are returned:

.//*[@class='r_news_box' and @href]
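To illustrate why the original loop fails, here is a minimal sketch using only the JDK's DOM API. The class name MissingAttributeDemo, the helper attributeOrNull, and the inline XML string are assumptions made up for this example, not part of the question's code. It shows that getNamedItem returns null when the element lacks the attribute, so the chained getTextContent() call is what throws.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class MissingAttributeDemo {

    // Hypothetical helper: returns the attribute value, or null when the
    // element does not carry an attribute with that name.
    static String attributeOrNull(Element e, String name) {
        Node attr = e.getAttributes().getNamedItem(name); // null if absent
        return attr == null ? null : attr.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for the node posted in the question.
        String xml = "<div class=\"r_news_box\"><a href=\"/fa/news/427583/test\"/></div>";
        Element div = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        Element a = (Element) div.getElementsByTagName("a").item(0);

        System.out.println(attributeOrNull(div, "href")); // the div has no href
        System.out.println(attributeOrNull(a, "href"));   // the nested <a> does
        System.out.println(div.hasAttribute("href"));     // explicit presence check
    }
}
```

Guarding with a null check (or Element.hasAttribute) avoids the exception, but it only hides the mismatch between the selected node and the node that actually carries href.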

UPDATE :

According to your update, href is an attribute of the <a> node inside the element whose class attribute equals r_news_box, so here is the corrected XPath:

.//*[@class='r_news_box']/a[@href]
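A minimal sketch of how the corrected XPath might be plugged into the question's loop. The class name HrefExtractor, the method extractHrefs, and the inline XML (the question's sample node wrapped in assumed html/body elements, with the div closed) are illustrative assumptions. Because the expression now selects the <a> elements themselves, Element.getAttribute("href") can be read directly.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class HrefExtractor {

    // Collects the href of every <a> that is a direct child of an element
    // whose class attribute equals "r_news_box".
    static List<String> extractHrefs(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList anchors = (NodeList) xPath.evaluate(
                ".//*[@class='r_news_box']/a[@href]",
                doc.getDocumentElement(), XPathConstants.NODESET);

        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); ++i) {
            Element a = (Element) anchors.item(i);
            hrefs.add(a.getAttribute("href")); // present by construction of the XPath
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Assumed wrapper around the sample node from the question.
        String xml = "<html><body>"
                + "<div class=\"r_news_box\">"
                + "<a class=\"picLink\" target=\"_blank\" href=\"/fa/news/427583/test\">"
                + "<img class=\"r_news_img\" width=\"50\" height=\"65\""
                + " src=\"/files/fa/news/1393/5/29/411217_553.jpg\" alt=\"test\"/>"
                + "</a></div>"
                + "</body></html>";
        for (String href : extractHrefs(xml)) {
            System.out.println(href);
        }
    }
}
```

Note that the expression is evaluated relative to the document element, so it only matches r_news_box elements below the root, which is the case when they sit inside an html/body wrapper as assumed here.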

edited Aug 23, 2014 at 11:52

answered Aug 23, 2014 at 10:26


2 Comments

har07 Over a year ago

That simply means no node has both that class and an href attribute. Post the node you want to select so we can help correct the XPath.

Ali Over a year ago

Please check the main question, I added one sample node.

Lars · Accepted Answer · 2014-08-23 10:20:35Z

0

Writing an HTML parser with XML parser libraries is not a good idea, because most HTML pages are not valid XML documents. It is better to use an HTML parser such as jsoup, which is easy to use and largely self-explanatory.

answered Aug 23, 2014 at 10:20


1 Comment

Ali Over a year ago

I already converted the HTML to XML with HTMLCleaner, so clean.html is actually a clean XML document.
