Extracting DOM elements from HTML using PHP Simple HTML DOM Parser

Question

I'm trying to extract the links to the articles, including their text, from this site using PHP Simple HTML DOM Parser.

I want to extract all h2 tags for the articles on the main page, and I'm trying to do it this way:

    $html = file_get_html('http://www.winbeta.org');
    $articles = $html->getElementsByTagName('article');
    $a = null;

    foreach ($articles->find('h2') as $header) {
        $a[] = $header;
    }

    print_r($a);

According to the manual, this should first get all the content inside the article tags, then extract the h2 from each article and save it in an array. Instead, it gives me:

EDIT

trincot · Accepted Answer · 2016-01-05 22:08:30Z


There are several problems:

getElementsByTagName apparently returns a single node, not an array, so it will not work when the page has more than one article tag. Use find instead, which does return an array.
Once you make that switch, you cannot call find on the result of find, because that result is an array rather than a node. Either call find on each matched article tag individually or, better, pass a combined selector to find.
Main issue: you must retrieve the text content of a node explicitly with ->plaintext; otherwise you get the object representation of the node, with all its attributes and internals.
Some of the text contains HTML entities like &rsquo;. These can be decoded with html_entity_decode.
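To illustrate the last point in isolation, here is a minimal sketch of decoding such an entity. The input string is a hypothetical stand-in for what ->plaintext might return, and passing the flags and character set explicitly is a choice made for this example rather than something the answer requires.

```php
<?php
// Hypothetical headline text, with the apostrophe still encoded as an
// HTML entity (an assumption standing in for a real ->plaintext value).
$raw = 'Microsoft back in China&rsquo;s';

// ENT_QUOTES also decodes &#039; and &quot;; ENT_HTML5 selects the HTML5
// entity table. The third argument is the target character set.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

Note that &rsquo; decodes to the typographic right single quotation mark (U+2019), not the plain ASCII apostrophe. Before PHP 8.1 the default flags did not include ENT_QUOTES, so numeric entities such as &#039; were left untouched unless the flag was passed explicitly.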

So this code should work:

    $a = array();
    foreach ($html->find('article h2') as $h2) { // any h2 within article
        $a[] = html_entity_decode($h2->plaintext);
    }

Using array_map, you could also do it like this:

    $a = array_map(function ($h2) { return html_entity_decode($h2->plaintext); },
                   $html->find('article h2'));

If you need to retrieve other tags within articles as well, to store their texts in different arrays, then you could do as follows:

    $a = array();
    $b = array();
    foreach ($html->find('article') as $article) {
        foreach ($article->find('h2') as $h2) {
            $a[] = html_entity_decode($h2->plaintext);
        }
        foreach ($article->find('h3') as $h3) {
            $b[] = html_entity_decode($h3->plaintext);
        }
    }
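For comparison, the same visit-each-article-once pattern can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM. This is a different library from the one used above. The sample markup, the variable names, and the assumption that each headline wraps a link are illustrative only; the real page may be structured differently.

```php
<?php
// Sample markup standing in for the fetched page (an assumption; the
// real site's structure may differ).
$markup = <<<'MARKUP'
<html><body>
<article>
  <h2><a href="/news/one">First headline</a></h2>
  <h3>First subtitle</h3>
</article>
<article>
  <h2><a href="/news/two">Second headline</a></h2>
  <h3>Second subtitle</h3>
</article>
</body></html>
MARKUP;

// libxml's HTML parser may report HTML5 tags such as <article> as
// errors, so collect them internally instead of emitting warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$titles = [];
$subtitles = [];
$links = [];

// Visit each article once and pull everything needed from it.
foreach ($xpath->query('//article') as $article) {
    foreach ($xpath->query('.//h2', $article) as $h2) {
        $titles[] = trim($h2->textContent);
    }
    foreach ($xpath->query('.//h3', $article) as $h3) {
        $subtitles[] = trim($h3->textContent);
    }
    // Links nested inside the headline, if any.
    foreach ($xpath->query('.//h2//a[@href]', $article) as $a) {
        $links[] = $a->getAttribute('href');
    }
}

print_r($titles);
print_r($links);
```

The second argument to query() scopes each search to the current article, which plays the same role as calling find on an individual article node in the loop above.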

edited Jan 5, 2016 at 22:08

answered Jan 5, 2016 at 20:32

trincot


4 Comments

Vahid Amiri Over a year ago

It definitely works, but the strings contain some encoded characters. How do I deal with those? (snapshot in the edit)

Vahid Amiri Over a year ago

Microsoft back in China&rsquo;s

Vahid Amiri Over a year ago

That should be Microsoft back in China's

Vahid Amiri Over a year ago

And there is one more thing: what if I wanted to extract several elements from each article and save them in different arrays? Of course, I could run the same code again with h2 replaced by a different element, but then the articles would be extracted several times, which is wasteful. Is there a way to get all the articles once and then operate on them?
