
I want to extract URLs from a specific part of a website, so I wrote a simple Java program for this. The program throws a NullPointerException: it seems that getNamedItem("href") returns null. I suspect I am using getNamedItem incorrectly to read the URL from the "href" attribute.

DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream("clean.html"));

//Evaluate XPath against Document itself
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList)xPath.evaluate(".//*[@class='r_news_box']",
        doc.getDocumentElement(), XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    Element e = (Element) nodes.item(i);
    System.out.println(e.getAttributes().getNamedItem("href").getTextContent());
}

P.S.: Here is one of the nodes that this XPath should select:

<div class="r_news_box">
<a class="picLink" target="_blank" href="/fa/news/427583/test">
<img class="r_news_img" width="50" height="65" src="/files/fa/news/1393/5/29/411217_553.jpg" alt="test"/>
</a>

2 Answers

This is probably because not all of the selected nodes have an href attribute; getNamedItem("href") returns null for those. You may want to change your XPath so that only elements with an href attribute are returned:

.//*[@class='r_news_box' and @href]
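To see why the original loop throws, here is a minimal sketch of how getNamedItem behaves when the attribute is missing. The class name HrefNullCheck and the helpers parseRoot and hrefOrNull are made up for illustration, and the XML is an inline stand-in for clean.html.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class HrefNullCheck {

    // Hypothetical helper: parses an XML string and returns its root element.
    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    // getNamedItem returns null when the element has no such attribute,
    // so the result must be checked before calling any method on it.
    public static String hrefOrNull(Element e) {
        Node attr = e.getAttributes().getNamedItem("href");
        return attr == null ? null : attr.getNodeValue();
    }

    public static void main(String[] args) {
        // Inline stand-in for one node of clean.html (an assumption, not the real file).
        Element div = parseRoot(
                "<div class=\"r_news_box\"><a href=\"/fa/news/427583/test\"/></div>");
        Element a = (Element) div.getFirstChild();

        System.out.println("div: " + hrefOrNull(div)); // the div itself has no href
        System.out.println("a:   " + hrefOrNull(a));   // the nested <a> does
    }
}
```

The div matched by the original XPath carries only a class attribute, so calling getTextContent() on the null result is what raises the exception.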

UPDATE:

According to your update, href is an attribute of the <a> node inside the element whose class attribute equals r_news_box, so here is the corrected XPath:

.//*[@class='r_news_box']/a[@href]
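
As a rough sketch of how the corrected XPath could be plugged into the question's loop: the class name NewsLinkExtractor and the method extractHrefs are hypothetical, and the document is an inline string modelled on the sample node rather than the real clean.html.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NewsLinkExtractor {

    // Collects the href of every <a> that is a direct child of an
    // element with class="r_news_box" (hypothetical helper).
    public static List<String> extractHrefs(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(
                    ".//*[@class='r_news_box']/a[@href]",
                    doc.getDocumentElement(), XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Every selected node is an <a> that has an href, so this is never empty.
                Element a = (Element) nodes.item(i);
                hrefs.add(a.getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Inline stand-in for clean.html; the second box has no link on purpose.
        String xml = "<html><body>"
                + "<div class=\"r_news_box\">"
                + "<a class=\"picLink\" target=\"_blank\" href=\"/fa/news/427583/test\">"
                + "<img class=\"r_news_img\" src=\"/files/fa/news/1393/5/29/411217_553.jpg\" alt=\"test\"/>"
                + "</a></div>"
                + "<div class=\"r_news_box\">no link here</div>"
                + "</body></html>";

        System.out.println(extractHrefs(xml)); // prints [/fa/news/427583/test]
    }
}
```

Because the XPath already filters on @href, the loop needs no null check; Element.getAttribute is used here instead of getAttributes().getNamedItem(...) only because it is shorter.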

2 Comments

That simply means no node has both that class and an href attribute. Post the node you want to select so we can help correct the XPath.
Please check the main question, I added one sample node.

Parsing HTML with XML parser libraries is not a good idea: most HTML pages are not valid XML documents. You are better off using an HTML parser such as jsoup, which is easy to use and largely self-explanatory.

1 Comment

I already converted the HTML to XML with HTMLCleaner, so clean.html is actually a well-formed XML document.
