1

I am trying to parse HTML data into arrays using the classes of the a tags, but I could not get the desired format. Below is my data:

$text ='<div class="result results_links results_links_deep web-result ">
  <div class="links_main links_deep result__body">
    <h2 class="result__title">
      <a rel="nofollow" class="result__a" href="">Text1</a> 
    </h2>
    <a class="result__snippet" href="">Text1</a> 
    <a class="result__url" href="">
    example.com
    </a>
  </div>
</div>

<div class="result results_links results_links_deep web-result ">
  <div class="links_main links_deep result__body">
    <h2 class="result__title">
      <a rel="nofollow" class="result__a" href="">text3</a> 
    </h2>
    <a class="result__snippet" href="">text23</a> 
    <a class="result__url" href="">
    text.com
    </a>
  </div>
</div>';

I am trying to get the result using the code below:

$lines = explode("\n", $text);
$out = array();
foreach ($lines as $line) {
    $parts = explode(" > ", $line);
    $ref = &$out;
    while (count($parts) > 0) {
        if (isset($ref[$parts[0]]) === false) {
            $ref[$parts[0]] = array();
        }
        $ref = &$ref[$parts[0]];
        array_shift($parts);
    }
}
print_r($out);

But I need the result to look exactly like this:

array:2 [
  0 => array:3 [
    0 => "Text1"
    1 => "Text1"
    2 => "example.com"
  ]
  1 => array:3 [
    0 => "text3"
    1 => "text23"
    2 => "text.com"
  ]
]

Demo : https://eval.in/746170

I also tried DOMDocument in Laravel, like below:

$dom = new DOMDocument;
$dom->loadHTML($text);
foreach($dom->getElementsByTagName('a') as $node)
{
    $array[] = $dom->saveHTML($node);
}

print_r($array);

So how can I use the classes to separate the data the way I want? Any suggestions, please. Thank you.

10
  • You should use SimpleXML or XmlReader/XmlParser or DOM parsing classes. Exploding > is not reliable. Commented Mar 2, 2017 at 12:41
  • I tried that as well, but did not get exactly what I need: Commented Mar 2, 2017 at 12:42
  • $dom = new DOMDocument; $dom->loadHTML($text); foreach($dom->getElementsByTagName('a') as $node) { $array[] = $dom->saveHTML($node); } print_r($array); Commented Mar 2, 2017 at 12:42
  • Your html example doesn't reproduce the structure of your real html content. Commented Mar 2, 2017 at 15:46
  • @CasimiretHippolyte: please check the link below with my actual data [eval.in/746302]. But when I try the same approach after removing the \n from the code with preg_replace("/\r\n|\r|\n|'/", ' ', $text);, it gives me a "Trying to get property of non-object" error. Commented Mar 2, 2017 at 17:18

2 Answers

3

Here you go, try this and tell me if you need any more help:

<?php
$test = <<<EOS
<div class="result results_links results_links_deep web-result ">
  <div class="links_main links_deep result__body">
    <h2 class="result__title">
      <a rel="nofollow" class="result__a" href="">Text1</a>
    </h2>
    <a class="result__snippet" href="">Text1</a>
    <a class="result__url" href="">
    example.com
    </a>
  </div>
</div>

<div class="result results_links results_links_deep web-result ">
  <div class="links_main links_deep result__body">
    <h2 class="result__title">
      <a rel="nofollow" class="result__a" href="">text3</a>
    </h2>
    <a class="result__snippet" href="">text23</a>
    <a class="result__url" href="">
    text.com
    </a>
  </div>
</div>
EOS;

$document = new DOMDocument();
$document->loadHTML($test);

// first extract all the divs with the links_main class
$divs = [];
foreach ($document->getElementsByTagName('div') as $div) {
    // getAttribute() returns '' when the attribute is missing, so divs without a class are skipped safely
    $classes = $div->getAttribute('class');
    if (!$classes) continue;

    $classes = explode(' ', $classes);

    if (in_array('links_main', $classes)) {
        $divs[] = $div;
    }
}

// now iterate through them and retrieve all the links in order
$results = [];
foreach ($divs as $div) {
    $temp = [];
    foreach ($div->getElementsByTagName('a') as $link) {
        // trim() drops the newlines and indentation around text such as "example.com"
        $temp[] = trim($link->nodeValue);
    }
    $results[] = $temp;
}

var_dump($results);

Working version - http://sandbox.onlinephpfunctions.com/code/e7ed2615ea32c5b9f0a89e3460da28a2702343f1


10 Comments

I am getting an error like this: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 22
Can you post the entire code you're using and which php version? You can see my example working here: sandbox.onlinephpfunctions.com/code/…
Do I need to add <<<EOS at the top and bottom of the HTML data? I am getting the data dynamically, so I cannot add that. What should I do? I am using Laravel.
No, you don't... that's the heredoc syntax for PHP strings - php.net/manual/en/language.types.string.php. Can you make sure that the data you're retrieving is correctly formatted as HTML?
I am using Laravel 5.4, which requires PHP >= 5.6.4. Any idea why I am getting that error?
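
A note on the htmlParseEntityRef message in the comments above: libxml typically reports it when the markup contains a bare & that is not part of an entity, which is common in scraped pages. The sketch below is only an illustration, assuming that is the cause here; the sample markup and variable names are hypothetical. It collects the parser warnings with libxml_use_internal_errors() instead of letting them surface as PHP errors, then reads the link text as usual.

```php
<?php
// Hypothetical input: the bare "&" in the link text is the kind of markup
// that can trigger "htmlParseEntityRef: no name".
$html = '<div class="links_main"><a class="result__a" href="">Tom & Jerry</a></div>';

$previous = libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
$warnings = libxml_get_errors();              // keep them for inspection or logging
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

// Show whatever the parser complained about (may be empty on some libxml versions).
foreach ($warnings as $warning) {
    echo trim($warning->message), ' (line ', $warning->line, ")\n";
}

// The document is still usable after the warning.
$texts = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $texts[] = trim($link->nodeValue);
}
print_r($texts);
```

Whether to log or discard the collected warnings is a design choice; discarding them silently can hide genuinely broken input.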
1

I would use DOMDocument and DOMXPath to target the interesting parts more easily. To be more precise, I register a function that checks whether a class attribute contains a given set of classes:

// Returns true only if every class in $requiredClasses appears in the class attribute value.
function hasClasses($attrValue, $requiredClasses) {
    $requiredClasses = explode(' ', $requiredClasses);
    $classes = preg_split('~\s+~', $attrValue, -1, PREG_SPLIT_NO_EMPTY);
    return !array_diff($requiredClasses, $classes);
}

$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('hasClasses');

$mainDivClasses = 'result results_links results_links_deep web-result';
$childDivClasses = 'links_main links_deep result__body';

$divNodeList = $xp->query('//div[php:functionString("hasClasses", @class, "' . $mainDivClasses . '")]
                           /div[php:functionString("hasClasses", @class, "' . $childDivClasses . '")]');

$results = [];
foreach ($divNodeList as $divNode) {
    $results[] = [
        trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
    ];
}

print_r($results);

Without registering a function, you can also use the XPath function contains in your predicates. It is less precise, since it only checks whether a substring occurs in a larger string (not whether the class attribute has a specific class, as the hasClasses function does), but it should be enough:

$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);

$divNodeList = $xp->query('//div[contains(@class, "results_links_deep")]
                                [contains(@class, "web-result")]
                           /div[contains(@class, "links_main")]
                               [contains(@class, "links_deep")]
                               [contains(@class, "result__body")]');

$results = [];
foreach ($divNodeList as $divNode) {
    $results[] = [
        trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
        trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
    ];
}

print_r($results);
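
If the substring behaviour of contains is a concern, a common pure-XPath workaround is to pad the normalized class list with spaces and search for the class name surrounded by spaces. This is a swapped-in technique, not part of the answer above, and the sample markup is hypothetical; it is a minimal sketch comparing the two predicates.

```php
<?php
// Hypothetical markup: the second div's class only contains "links_main" as a substring.
$html = '<div class="links_main links_deep"><a href="">kept</a></div>'
      . '<div class="links_main_extra"><a href="">skipped</a></div>';

$dom = new DOMDocument();
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);

// Pad the normalized class list with spaces so that only whole class names match.
$token = 'contains(concat(" ", normalize-space(@class), " "), " links_main ")';

$loose  = $xp->query('//div[contains(@class, "links_main")]'); // substring match
$strict = $xp->query('//div[' . $token . ']');                 // whole-class match

echo $loose->length, ' vs ', $strict->length, "\n"; // prints "2 vs 1"
```

The padded form is more verbose, but it needs no registered PHP function.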

12 Comments

Wow, it works. Thank you so much @Casimir et Hippolyte.
How do I call the hasClasses function through registerPhpFunctions in Laravel? I am getting this error: DOMXPath::query(): Unable to call handler hasClasses()
I tried passing the function like this: $xp->query('//div[php:functionString('.$this->hasClasses($attrValue,$mainDivClasses).')] /div[php:functionString('.$this->hasClasses($attrValue,$childDivClasses).')]'); Then it started giving me another error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 68719476720 bytes). Any suggestions? @Casimir
Can you please help me out? I am stuck at this part.
@5367683: I don't know the context, but it is possible to use static methods with DOMXPath::registerPhpFunctions. Something like $xp->registerPhpFunctions('YourClass::staticMethod'); and then you use it the same way in XPath queries: /div[php:functionString('YourClass::staticMethod', arg1, arg2...)].
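
The static-method suggestion in the last comment could look roughly like the sketch below. The class name ClassMatcher and the sample markup are hypothetical; in a namespaced Laravel class, the fully qualified class name would presumably be needed both when registering and inside the query.

```php
<?php
// Hypothetical helper class holding the check as a static method.
class ClassMatcher
{
    // True only if every class in $requiredClasses appears in the attribute value.
    public static function hasClasses($attrValue, $requiredClasses)
    {
        $required = explode(' ', $requiredClasses);
        $classes  = preg_split('~\s+~', $attrValue, -1, PREG_SPLIT_NO_EMPTY);
        return !array_diff($required, $classes);
    }
}

$html = '<div class="links_main links_deep result__body">'
      . '<a class="result__url" href=""> example.com </a></div>';

$dom = new DOMDocument();
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('ClassMatcher::hasClasses'); // register the static method by name

// The same "Class::method" string is used as the handler name inside the query.
$nodes = $xp->query(
    '//div[php:functionString("ClassMatcher::hasClasses", @class, "links_main result__body")]'
);

$urls = [];
foreach ($nodes as $node) {
    $urls[] = trim($xp->evaluate('string(.//a[@class="result__url"])', $node));
}
print_r($urls);
```

The handler name stays a string literal in the XPath expression. Calling the method from PHP and concatenating its result into the query, as in the attempt quoted above, does not pass a handler at all.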
