I am trying to extract the text inside an HTML tag, but it fails:

HTML - Text to be extracted (http://www.alexa.com/siteinfo/google.com)

<span class="font-4 box1-r">3,757,209</span>

PHP

$data = frontend::file_get_contents_curl('http://www.alexa.com/siteinfo/'.$domain); // custom function that returns the HTML string
$dom = new DOMDocument();
$dom->loadHTML(htmlentities($data));
$xpath = new DOMXpath($dom);
$backlinks = $xpath->query('//span[@class="font-4 box1-r"]/text()');
var_dump($backlinks); // returns null
  • Check what you actually get in $data. Some elements may not exist in the initial HTML (they may be generated dynamically by JS). Commented May 10, 2016 at 10:19
  • @har07 I checked for font-4 box1-r in view-source:alexa.com/siteinfo/google.com and it is there all right. Commented May 10, 2016 at 10:21
  • @har07 Yes, I did this as well, with the same result: the span appears in the source code. It is not dynamically generated by JavaScript. Commented May 10, 2016 at 10:32
  • Your XPath looks correct (tested with lxml in Python). Could it be that DOMXpath does not return text nodes but only elements? (I don't know PHP and DOMXpath.) Commented May 10, 2016 at 10:47
  • @paultrmbrth It seems that isn't the case (eval.in/567895). Commented May 10, 2016 at 10:54

2 Answers

The actual problem is that htmlentities() escapes all tag delimiters (< and >), so DOMDocument ends up loading one long text string that contains no elements or attributes:

$data = <<<HTML
<div><span class="font-4 box1-r">3,757,209</span></div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML(htmlentities($data));
echo $doc->saveXML();

eval.in demo (problem) eval.in demo (solution)

Output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&lt;div&gt;&lt;span class="font-4 box1-r"&gt;3,757,209&lt;/span&gt;&lt;/div&gt;</p></body></html>
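
The fix itself is not spelled out above, so here is a minimal sketch of it, assuming the only change needed is to pass the raw HTML to loadHTML() instead of the htmlentities() result. The inline $data string stands in for the page fetched by the asker's custom function, and the libxml_use_internal_errors() call is an addition that is not in the original code; it only keeps warnings about malformed real-world markup from being printed.

```php
<?php
// Stand-in for the fetched page; the real code would call the asker's
// frontend::file_get_contents_curl() helper here instead.
$data = <<<HTML
<div><span class="font-4 box1-r">3,757,209</span></div>
HTML;

$dom = new DOMDocument();

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($data); // raw HTML, no htmlentities()
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//span[@class="font-4 box1-r"]/text()');

// query() returns a DOMNodeList; read the first match if there is one.
$backlinks = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $backlinks, PHP_EOL; // prints 3,757,209
```

Note that var_dump() on the query result shows a DOMNodeList object rather than the text, so the value has to be read from an item of that list as above.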

2 Comments

Actually, htmlentities() converts &lt; to < in my case. It does the opposite; I checked with var_dump() on htmlentities($data).
I don't have PHP locally to test. Is it possible that your browser actually does that? (var_dump() prints &lt; and your browser then displays it as <. Care to check view source?) Or ultimately, as I've suggested before, have you tried looking at the output of $dom->saveXML()?

You can use the simplehtmldom library (http://simplehtmldom.sourceforge.net/) for this purpose, for example:

require_once 'simplehtmldom/simple_html_dom.php';
$html = file_get_html('http://www.alexa.com/siteinfo/google.com');
echo $html->find('span.box1-r', 0)->plaintext;

1 Comment

It is a solution, but I wanted one that uses native DOM and XPath.
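
For a native counterpart to the span.box1-r selector, one common XPath 1.0 idiom is to match a single class token rather than the whole class attribute. The sketch below assumes that approach; hasClass() is a hypothetical helper written for this example, not part of PHP's DOM extension, and the inline $html string stands in for the real page.

```php
<?php
// Hypothetical helper: builds an XPath predicate that matches one
// whitespace-separated token of the class attribute.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Stand-in for the fetched page.
$html = '<div><span class="font-4 box1-r">3,757,209</span></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parse warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$span  = $xpath->query('//span[' . hasClass('box1-r') . ']')->item(0);

// item(0) is null when nothing matched.
$text = $span !== null ? $span->textContent : null;

echo $text, PHP_EOL; // prints 3,757,209
```

Unlike @class="font-4 box1-r", this predicate still matches if the page adds, removes, or reorders other classes on the span.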
