I have the following HTML snippet from a Wikipedia page:

<table class="wikitable">
<tbody>
 <tr>
     <td>mod_access</td>
     <td>Versions older than 2.1</td>
     <td>Included by Default</td>
 </tr>
 <tr>
     <td>mod_actions</td>
     <td>Versions 1.1 and later</td>
     <td>Included by Default</td>
 </tr>
 <tr>
    <td>mod_alias</td>
    <td>Versions 1.1 and later</td>
    <td>Included by Default</td>
 </tr>
</tbody>
</table>

I have the following PHP code:

ini_set('display_errors','On');
$url="https://en.wikipedia.org/wiki/List_of_Apache_modules";
$dom=new DomDocument();
$dom->preserveWhiteSpace=false;
$dom->loadHtmlFile($url);
$xpath=new DomXpath($dom);
$elements=$xpath->query('//*[@id="mw-content-text"]/div/table/tbody/tr/td');
foreach($elements as $i=>$row){
    $tds=$xpath->query('td',$row);
    foreach($tds as $td){
       echo "Td($i):", $td->nodeValue,"\n";
    }
}

What I'd like in return is a numerically indexed array in which each element is one table row, holding that row's td values.

I'm not sure what to do next.

    So you're essentially trying to replicate the table? Shouldn't you initially query the table rows (tr) instead of the individual cells? Your initial $elements contains all of the cells, not all of the rows. Commented Jan 16, 2018 at 23:17

1 Answer

If you remove both tbody and td from your first xpath query, it will find all of the tr elements:

$elements = $xpath->query('//*[@id="mw-content-text"]/div/table/tr');

Then you can loop through each row, use your existing code to find td elements, and add them to an array:

$data = array();
foreach ($elements as $y => $row) {
    $tds = $xpath->query('td', $row);
    foreach($tds as $x => $td) {
        $data[$y][$x] = $td->nodeValue;
    }
}
var_dump($data);

Tested with PHP 5.6; it gives this output:

array(157) {
  [1]=>
  array(6) {
    [0]=>
    string(10) "mod_access"
    [1]=>
    string(23) "Versions older than 2.1"
    [2]=>
    string(19) "Included by Default"
    [3]=>
    string(26) "Apache Software Foundation"
    [4]=>
    string(27) "Apache License, Version 2.0"
    [5]=>
    string(71) "Provides access control based on the client and the client's request[2]"
  }
  [2]=>
  array(6) {
    [0]=>
    string(11) "mod_actions"
    [1]=>
    string(22) "Versions 1.1 and later"
    [2]=>
    string(19) "Included by Default"
    [3]=>
    string(26) "Apache Software Foundation"
    [4]=>
    string(27) "Apache License, Version 2.0"
    [5]=>
    string(62) "Provides CGI ability based on request method and media type[3]"
  }
// etc ...
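
Note that the result above starts at index 1. Presumably the header row holds th cells rather than td cells, so nothing is stored for row 0. If a gap-free, zero-based array is wanted, one option is to skip rows that have no td cells and append with $data[]. A minimal sketch, assuming a small made-up HTML string in place of the live page (the sample markup is hypothetical, not the real article):

```php
<?php
// Hypothetical inline sample standing in for the Wikipedia page.
$html = '<table class="wikitable">
<tr><th>Name</th><th>Versions</th></tr>
<tr><td>mod_access</td><td>Versions older than 2.1</td></tr>
<tr><td>mod_actions</td><td>Versions 1.1 and later</td></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$data = array();
foreach ($xpath->query('//table/tr') as $row) {
    $cells = array();
    foreach ($xpath->query('td', $row) as $td) {
        $cells[] = trim($td->nodeValue);
    }
    // Header rows have no td cells; skip them so the indexes stay sequential.
    if ($cells) {
        $data[] = $cells;
    }
}
print_r($data);
```

With this variant the first data row lands at $data[0] instead of $data[1].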

2 Comments

Works perfectly. I'm not sure, though, why it was important to remove those parts of the query?
The tbody didn’t appear to be in the HTML source of the page, I had to remove it to be able to find any rows at all. The td was removed as your first query only wants to find rows, so that your second query can then find the cells inside each row.
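
This mismatch is common: browsers insert a tbody element when they build the DOM, so a path copied from the browser's developer tools includes it, while the parser behind DOMDocument keeps the markup as served. One way to sidestep the difference is the descendant axis (table//tr), which matches rows with or without tbody. A rough sketch, assuming two made-up inline tables and a hypothetical "content" id:

```php
<?php
// Two hypothetical tables: one with tbody, one without.
$html = '<div id="content">
<table><tbody><tr><td>a</td></tr></tbody></table>
<table><tr><td>b</td></tr></table>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "table//tr" matches rows at any depth below each table.
$rows = $xpath->query('//*[@id="content"]/table//tr');

$values = array();
foreach ($rows as $row) {
    $values[] = trim($row->nodeValue);
}
print_r($values);
```

The trade-off is that table//tr also picks up rows of any nested tables, so it is best suited to pages where the tables are flat.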
