Simple HTML file parser with PHP

Question

I have a particular problem which I can't crack. I searched through every tutorial and forum entry I could find, but had no luck getting it to work. So, my HTML file:

<html>
 <head>**SOMETHING HERE**</head>
 <body>
  <div>
   <table>
    <thead>
  <tr><th>TEXT/NUM IS HERE</th><th>TEXT/NUM IS HERE</th><th>TEXT/NUM IS HERE</th></tr>
    </thead><tbody>**SOMETHING HERE**</tbody></tfoot>**SOMETHING HERE**</tfoot>
   </table>
  </div>
 </body>
</html>

What I need is to go through every "th" tag inside "thead => tr" and store the value between the "th" tags in an array.

For this I was planning to use DOMDocument and DOMXPath.

I tried many ways to solve this issue, but the approach I found most often online was:

$file = "index.html";
$dom = new DOMDocument();
$dom->loadHTMLfile($file);
$thead = $dom->getElementsByTagName('thead');
$thead->parentNode;
$th = $thead->getElementsByTagName('th')
echo $th->nodeValue . "\n";

But I'm still getting many errors and can't find a way to do this. Is there a nice and simple way of doing this, ideally with a foreach over each element in the parent element?

Thank you.

getElementsByTagName. Elements. Not element, but elements. It returns a DOMNodeList, as specified by the manual. You need to iterate through it.
– h2ooooooo, Commented Dec 4, 2013 at 11:27

Community · Accepted Answer · 2023-11-17 20:24:57Z

3

Use DOMXPath:

$html = <<<EOL
<html>
    <head>**SOMETHING HERE**</head>
    <body>
        <div>
            <table>
                <thead>
                    <tr>
                        <th>TEXT/NUM IS HERE</th>
                        <th>TEXT/NUM IS HERE</th>
                        <th>TEXT/NUM IS HERE</th>
                    </tr>
                </thead>
                <tbody>**SOMETHING HERE**</tbody>
                <tfoot>**SOMETHING HERE**</tfoot>
            </table>
        </div>
    </body>
</html>
EOL;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$nodes = $xpath->query('//table/thead/tr/th');

$data = array();

foreach ($nodes as $node) {
    $data[] = $node->textContent;
}

print_r($data);
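
As a variation on the query above, DOMXPath::query() also accepts a context node as its second argument, so the lookup can be done in two steps: find each thead first, then fetch its th cells relative to it. The following is only a minimal sketch; the cell texts ("Name", "Qty", "Price") are made up for illustration, and the trim() call is an optional extra that is not part of the original answer.

```php
<?php
// Illustrative markup; the cell texts "Name", "Qty", "Price" are made up.
$html = '<html><body><table>'
      . '<thead><tr><th> Name </th><th>Qty</th><th>Price</th></tr></thead>'
      . '<tbody><tr><td>Apple</td><td>3</td><td>1.20</td></tr></tbody>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$headers = array();

// Loop over each <thead>, then query its <th> cells relative to it.
foreach ($xpath->query('//table/thead') as $thead) {
    // The second argument is the context node; "./tr/th" is resolved from it.
    foreach ($xpath->query('./tr/th', $thead) as $th) {
        // trim() removes whitespace surrounding the cell text.
        $headers[] = trim($th->textContent);
    }
}

print_r($headers);
```

This form may be handy when a page contains several tables and the headers need to be grouped per thead rather than collected in one flat query.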

edited Nov 17, 2023 at 20:24 by CommunityBot

answered Dec 4, 2013 at 11:34 by bagonyi

Ali · Accepted Answer · 2013-12-04 11:24:19Z

1

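<?php
// Requires the Simple HTML DOM Parser library; adjust the path as needed.
include 'simple_html_dom.php';

// file_get_html() is a function, not a class, so it is called without "new".
$html = file_get_html('file.html');
$array = array();

// find() returns an array of matching elements; read each one's inner text.
foreach ($html->find('thead th') as $th) {
    $array[] = $th->innertext;
}
?>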

This uses the Simple HTML DOM Parser, a third-party library that has to be downloaded and included separately.

answered Dec 4, 2013 at 11:24 by Ali

Rob Baillie · Accepted Answer · 2013-12-04 11:42:07Z

If you want to keep it in the same style as what you have (and therefore learn what you did wrong) try this:

$file = "index.html";
$dom = new DOMDocument();
$dom->loadHTMLFile($file);

$oTHeadList = $dom->getElementsByTagName('thead');

foreach( $oTHeadList as $oThisTHead ){

    $oThList = $oThisTHead->getElementsByTagName('th');

    foreach( $oThList as $oThisTh ) {

        echo $oThisTh->nodeValue . "\n";
    }
}

Basically, "getElementsByTagName" returns a DOMNodeList rather than a single DOMNode, so you have to loop over the list to get to the individual nodes.
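
To make the list-versus-node distinction concrete, here is a small sketch that inspects the return value directly. The markup and the cell texts ("A", "B") are made up for illustration. When only the first match is of interest, item(0) is an alternative to the outer loop; it returns null if nothing matched, so the check below guards against that.

```php
<?php
// Illustrative markup; the cell texts "A" and "B" are made up.
$dom = new DOMDocument();
$dom->loadHTML('<table><thead><tr><th>A</th><th>B</th></tr></thead></table>');

$theadList = $dom->getElementsByTagName('thead');

// The return value is a list, even when only one element matches.
var_dump($theadList instanceof DOMNodeList); // bool(true)
echo $theadList->length . "\n";               // 1

// item(0) returns the first node, or null if the list is empty.
$firstThead = $theadList->item(0);

$values = array();
if ($firstThead !== null) {
    foreach ($firstThead->getElementsByTagName('th') as $th) {
        $values[] = $th->nodeValue;
    }
}

print_r($values);
```

Calling a node property such as nodeValue or parentNode on the list itself, as in the question's code, does not work because those properties belong to the individual nodes.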

Additionally, in your HTML you have a closing tfoot instead of an opening one, and if you test using the html document you provided then the **SOMETHING HERE** inside your head tag will cause warnings to be thrown (as will any other invalid HTML).

If you want to suppress the warnings on loading, you can prefix the call with an '@', but it's not a good idea to pepper that symbol around your code too much.

@$dom->loadHTMLfile($file);
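
As an alternative to '@' (not part of the original answer), PHP's libxml functions can collect parser errors internally so they can be inspected or discarded. This is only a sketch: the markup is made up and merely imitates the stray closing tfoot from the question, and the exact messages reported depend on the installed libxml version.

```php
<?php
// Illustrative markup with a stray closing tag, similar to the question's HTML.
$html = '<table><thead><tr><th>A</th><th>B</th></tr></thead>'
      . '</tfoot>oops</tfoot></table>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
// The function returns the previous setting, which is restored further down.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect what the parser complained about, then clear the error buffer.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message) . " (line {$error->line})\n";
}
libxml_clear_errors();

// Restore the earlier setting so other code is unaffected.
libxml_use_internal_errors($previous);

$values = array();
foreach ($dom->getElementsByTagName('th') as $th) {
    $values[] = $th->nodeValue;
}

print_r($values);
```

Unlike '@', this only affects libxml parse errors, so unrelated problems such as a missing file would still be reported when loading with loadHTMLFile().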
