Trouble extracting td values from php xpath parsed table html string

Question

I have the following HTML snippet from a Wikipedia page:

<table class="wikitable">
<tbody>
 <tr>
     <td>mod_access</td>
     <td>Versions older than 2.1</td>
     <td>Included by Default</td>
 </tr>
 <tr>
     <td>mod_actions</td>
     <td>Versions 1.1 and later</td>
     <td>Included by Default</td>
 </tr>
 <tr>
    <td>mod_alias</td>
    <td>Versions 1.1 and later</td>
    <td>Included by Default</td>
 </tr>
</tbody>
</table>

I have the following PHP code:

ini_set('display_errors','On');
$url="https://en.wikipedia.org/wiki/List_of_Apache_modules";
$dom=new DomDocument();
$dom->preserveWhiteSpace=false;
$dom->loadHtmlFile($url);
$xpath=new DomXpath($dom);
$elements=$xpath->query('//*[@id="mw-content-text"]/div/table/tbody/tr/td');
foreach($elements as $i=>$row){
    $tds=$xpath->query('td',$row);
    foreach($tds as $td){
       echo "Td($i):", $td->nodeValue,"\n";
    }
}

What I'd like in return is a numerically indexed array in which each element is one table row, filled with that row's td values.

I'm not sure what to do next.

So you're essentially trying to replicate the table? Shouldn't you initially query the table rows (tr) instead of the individual cells? Your initial $elements contains all of the cells, not all of the rows.
– Obsidian Age, Commented Jan 16, 2018 at 23:17

astrangeloop · Accepted Answer · 2018-01-16 23:36:07Z


If you remove both tbody and td from your first XPath query, it will match all of the tr elements:

$elements = $xpath->query('//*[@id="mw-content-text"]/div/table/tr');

Then you can loop through each row, use your existing code to find td elements, and add them to an array:

$data = array();
foreach ($elements as $y => $row) {
    $tds = $xpath->query('td', $row);
    foreach($tds as $x => $td) {
        $data[$y][$x] = $td->nodeValue;
    }
}
var_dump($data);

Tested with PHP 5.6, this gives the following output:

array(157) {
  [1]=>
  array(6) {
    [0]=>
    string(10) "mod_access"
    [1]=>
    string(23) "Versions older than 2.1"
    [2]=>
    string(19) "Included by Default"
    [3]=>
    string(26) "Apache Software Foundation"
    [4]=>
    string(27) "Apache License, Version 2.0"
    [5]=>
    string(71) "Provides access control based on the client and the client's request[2]"
  }
  [2]=>
  array(6) {
    [0]=>
    string(11) "mod_actions"
    [1]=>
    string(22) "Versions 1.1 and later"
    [2]=>
    string(19) "Included by Default"
    [3]=>
    string(26) "Apache Software Foundation"
    [4]=>
    string(27) "Apache License, Version 2.0"
    [5]=>
    string(62) "Provides CGI ability based on request method and media type[3]"
  }
// etc ...
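
The question asks for a plain numerically indexed array, and the dump above starts at key 1, presumably because the first row holds only th header cells and so contributes no td values. A minimal sketch of one way to get keys starting at 0, assuming a trimmed inline sample in place of the live page (the $html fragment and its header row are made up for illustration):

```php
<?php
// Hypothetical stand-in for the fetched page, trimmed to two data rows.
$html = <<<EOT
<div id="mw-content-text"><div>
<table class="wikitable">
 <tr><th>Name</th><th>Compatibility</th><th>Status</th></tr>
 <tr><td>mod_access</td><td>Versions older than 2.1</td><td>Included by Default</td></tr>
 <tr><td>mod_actions</td><td>Versions 1.1 and later</td><td>Included by Default</td></tr>
</table>
</div></div>
EOT;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$data = array();
// First query: rows only (no tbody, no td), as in the answer above.
foreach ($xpath->query('//*[@id="mw-content-text"]/div/table/tr') as $row) {
    $cells = array();
    // Second query: the td cells relative to the current row.
    foreach ($xpath->query('td', $row) as $td) {
        $cells[] = trim($td->nodeValue);
    }
    if ($cells) {
        // Rows with no td (such as a th-only header row) are skipped;
        // appending with [] keeps the keys consecutive from 0.
        $data[] = $cells;
    }
}
print_r($data);
```

Appending with $data[] instead of assigning to $data[$y] is the only real change from the loop above; calling array_values($data) on the original result should have the same effect.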


2 Comments

somejkuser Over a year ago

Works perfectly. I'm not sure, though, why it was important to remove those parts of the query.

astrangeloop Over a year ago

The tbody didn’t appear to be in the HTML source of the page, so I had to remove it to be able to find any rows at all. The td was removed because your first query only needs to find rows, so that your second query can then find the cells inside each row.
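
To illustrate the tbody point, here is a small sketch comparing a strict path with a more tolerant one on two made-up fragments, one without tbody (as in the served source) and one with it (as a browser's inspector typically shows it). The countRows helper and both fragments are hypothetical, and the tolerant query is just one possible alternative, not the query used in the answer:

```php
<?php
// Hypothetical fragments: the same row with and without a tbody wrapper.
$withoutTbody = '<table class="wikitable"><tr><td>mod_alias</td></tr></table>';
$withTbody    = '<table class="wikitable"><tbody><tr><td>mod_alias</td></tr></tbody></table>';

// Hypothetical helper: parse a fragment and count the nodes a query matches.
function countRows($html, $query)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    return $xpath->query($query)->length;
}

// Requires tbody to be a direct child of table.
$strict = '//table/tbody/tr';
// Matches any descendant tr that has at least one td, with or without tbody.
$tolerant = '//table[contains(@class, "wikitable")]//tr[td]';

$strictOnSource    = countRows($withoutTbody, $strict);
$strictOnInspector = countRows($withTbody, $strict);
$tolerantOnSource    = countRows($withoutTbody, $tolerant);
$tolerantOnInspector = countRows($withTbody, $tolerant);

echo "strict:   $strictOnSource / $strictOnInspector\n";
echo "tolerant: $tolerantOnSource / $tolerantOnInspector\n";
```

DOMDocument does not add a tbody that is missing from the markup, so the strict path should find nothing in the first fragment, while the // form matches the row in both. The [td] predicate also filters out header rows that contain only th cells.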
