Extracting text between html tags with multiple classes with DOM and XPATH

Question

I am trying to extract the text inside an HTML tag, but the query below fails:

HTML - Text to be extracted (http://www.alexa.com/siteinfo/google.com)

<span class="font-4 box1-r">3,757,209</span>

PHP

$data = frontend::file_get_contents_curl('http://www.alexa.com/siteinfo/'.$domain); // custom function that return the HTML string
$dom = new DOMDocument();
$dom->loadHTML(htmlentities($data));
$xpath = new DOMXpath($dom);
$backlinks = $xpath->query('//span[@class="font-4 box1-r"]/text()');
var_dump($backlinks); // returns null

Check what you actually get in $data. Some elements may not exist in the initial HTML (they can be generated dynamically by JS).
– har07, Commented May 10, 2016 at 10:19
@har07 Checked: font-4 box1-r is in view-source:alexa.com/siteinfo/google.com, so it's there all right.
– Adrian, Commented May 10, 2016 at 10:21
@har07 Yes, I did this too, and it also appears in the source code. It is not dynamically generated by JavaScript.
– Adrian, Commented May 10, 2016 at 10:32
Your XPath looks correct (tested with lxml in Python). Could it be that DOMXpath does not return text nodes but only elements? (I don't know PHP or DOMXpath.)
– paul trmbrth, Commented May 10, 2016 at 10:47
@paultrmbrth It seems that isn't the case (eval.in/567895).
– har07, Commented May 10, 2016 at 10:54

har07 · Accepted Answer · 2016-05-10 12:29:48Z

2

The actual problem is that htmlentities() escapes all tag delimiters (< and >), so DOMDocument ends up loading one long text string that contains no elements or attributes:

$data = <<<HTML
<div><span class="font-4 box1-r">3,757,209</span></div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML(htmlentities($data));
echo $doc->saveXML();

eval.in demo (problem) eval.in demo (solution)

Output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&lt;div&gt;&lt;span class="font-4 box1-r"&gt;3,757,209&lt;/span&gt;&lt;/div&gt;</p></body></html>
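
A minimal sketch of the corrected flow, assuming the only change needed is to drop the htmlentities() call. The inline $data string stands in for the fetched page, and the libxml_use_internal_errors() call is an addition not present in the question; it only keeps parser warnings from real-world markup out of the output.

```php
<?php
// Sample markup standing in for the fetched page (assumption: the real
// response contains the same span).
$data = '<div><span class="font-4 box1-r">3,757,209</span></div>';

// Collect parser warnings instead of printing them; live pages are
// rarely valid HTML.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($data); // no htmlentities(): the parser needs real tags
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//span[@class="font-4 box1-r"]/text()');

// query() returns a DOMNodeList; read the first text node if there is one.
$backlinks = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $backlinks, PHP_EOL; // prints 3,757,209
```

Note that var_dump() on the result of query() shows a DOMNodeList object rather than the text, so reading nodeValue from an item is the more useful check.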

edited May 10, 2016 at 12:29

answered May 10, 2016 at 11:26

har07


2 Comments

Adrian Over a year ago

Actually, htmlentities converts &lt; to < in my case. It does the opposite; I checked with var_dump on htmlentities($data).

har07 Over a year ago

I don't have PHP locally to test. Is it possible that your browser is actually doing that? (var_dump emits &lt; and your browser then displays it as <; care to check view source?) Or ultimately, as I've suggested before, have you tried looking at the output of $dom->saveXML()?
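
One way to settle the question raised in these comments without relying on how a browser renders the output is to test the escaped string directly. This is a small sketch using a sample string rather than the real page; the variable names are illustrative only.

```php
<?php
// Sample markup (assumption: stands in for the fetched page).
$data = '<span class="font-4 box1-r">3,757,209</span>';
$escaped = htmlentities($data);

// Inspect the raw string instead of a browser rendering of it.
$hasEntity = strpos($escaped, '&lt;') !== false; // escaped delimiter present?
$hasTag    = strpos($escaped, '<') !== false;    // any literal "<" left?

var_dump($hasEntity, $hasTag);
```

If the first value is true and the second is false, the string handed to loadHTML() contains no tags at all, which is consistent with the saveXML() output shown in the answer.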

AhsanBilal · Answer · 2016-05-10 11:05:03Z

1

You can use the simplehtmldom library (http://simplehtmldom.sourceforge.net/) for this purpose, with code like:

require_once 'simplehtmldom/simple_html_dom.php';
$html = file_get_html('http://www.alexa.com/siteinfo/google.com');
echo $html->find('span.box1-r', 0)->plaintext;

answered May 10, 2016 at 11:05

AhsanBilal


1 Comment

Adrian Over a year ago

It is a solution but wanted with native DOM and Xpath.
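
For reference, a rough native equivalent of the span.box1-r selector can be written with DOMXPath alone. This is a sketch, not part of either answer: it uses the common XPath 1.0 idiom of padding the class attribute with spaces so that a single class token matches regardless of which other classes are present or in what order. The sample markup is an assumption standing in for the real page.

```php
<?php
// Sample markup (assumption: stands in for the fetched page).
$data = '<div><span class="font-4 box1-r">3,757,209</span></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Match any span whose class list contains the token "box1-r".
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " box1-r ")]';
$span  = $xpath->query($query)->item(0);

// item(0) is null when nothing matched.
$text = $span !== null ? trim($span->textContent) : null;

echo $text, PHP_EOL; // prints 3,757,209
```

An exact comparison such as @class="font-4 box1-r" also works here, but it breaks as soon as the site reorders or adds a class, which is why the token-based test is often preferred.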
