I need to capture specific tags from an HTML page using PHP.

A single HTML document can contain multiple matches, and a match may span several lines. I only need to match tags that have a data-uid attribute. For each match, I need to capture:

  • The tag name (div, span, etc.)
  • The value of data-uid
  • The child nodes

So far, I have been able to capture the tag name and the data-uid value, but not the child nodes.

<div class="testClassOne" data-uid="123456">
    <div class="testClassTwo">Content</div>
    <!-- More nodes -->
</div>

Result: { tag: "div", data-uid: 123456, children: '<div class="testClassTwo">Content</div>' }

or

<div class="testClassOne" data-uid="123456"></div>

Result: { tag: "div", data-uid: 123456, children: " " }

My current regex and callback function are as follows:

$regex = '/<(.*) (?:.*?)data-uid="([^"]*?)"(?:.*?)>(.*?)<\/\1>/';
$content = preg_replace_callback($regex, 'test', $content);

function test($arg){
    print_r($arg);
}

Does anyone know how to resolve this, i.e. how to capture the children as a string as well?

  • You'd be far better off doing this with DOM parsing; using regex for this kind of task gets complicated and ends up being rather brittle. Commented Jun 1, 2018 at 20:39
  • Do not parse HTML with regex. Commented Jun 1, 2018 at 20:42
  • @landru27 I tried to do this with DOMDocument as well, but failed and have not got this far with it. Any suggestions for capturing the tag name, data-uid and children efficiently? Commented Jun 1, 2018 at 20:43
  • @stackminu: if you have fully researched, tried, and failed with DOM parsing, you'd be far better off posting an SO question detailing what is not working with your DOM parsing, rather than giving up, switching to regex, failing there too, and posting to SO about your regex attempts. In other words, go back to DOM parsing; future you will thank you greatly. Commented Jun 1, 2018 at 20:54

1 Answer

As others have said, use a DOM parser with XPath expressions instead.
The following expression

$items = $xpath->query("//*[@data-uid]");

queries the DOM for every element that has a data-uid attribute and returns the matches as a list. You can then read tagName and call getAttribute() on each item.


In PHP:

<?php

$data = <<<DATA
<div class="testClassOne" data-uid="123456">
    <div class="testClassTwo">Content</div>
    <!-- More nodes -->
</div>
DATA;

$dom = new DOMDocument();

# suppress warnings
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();

# set up an xpath expression
$xpath = new DOMXPath($dom);
$items = $xpath->query("//*[@data-uid]");

foreach ($items as $item) {
    echo "tagname: " . $item->tagName . "\n";
    echo "uid: " . $item->getAttribute("data-uid") . "\n";
    # getElementsByTagName('*') returns every descendant element, not only direct children
    foreach ($item->getElementsByTagName('*') as $child) {
        print_r($child);
    }
}

?>


This yields

tagname: div
uid: 123456
DOMElement Object
(
    [tagName] => div
    [schemaTypeInfo] => 
    [nodeName] => div
    [nodeValue] => Content
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => (object value omitted)
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => div
    [baseURI] => 
    [textContent] => Content
)
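
The loop above prints each descendant as an object, whereas the question asks for the children as a string. A minimal sketch of one way to get that is to serialize each direct child with DOMDocument::saveHTML() and concatenate the results. The function name collectUidElements and the keys of the returned array are assumptions made for illustration; they are not part of the original answer.

```php
<?php

// Hypothetical helper: returns tag name, data-uid value and inner HTML
// for every element that carries a data-uid attribute.
function collectUidElements(string $html): array
{
    $dom = new DOMDocument();

    // Suppress libxml warnings caused by imperfect markup.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($dom);
    $results = [];

    foreach ($xpath->query('//*[@data-uid]') as $item) {
        // Serialize each direct child node (elements, text, comments) in order.
        $inner = '';
        foreach ($item->childNodes as $child) {
            $inner .= $dom->saveHTML($child);
        }

        $results[] = [
            'tag'      => $item->tagName,
            'data-uid' => $item->getAttribute('data-uid'),
            'children' => trim($inner),
        ];
    }

    return $results;
}

// Sample input modelled on the question's two cases.
$html = '<div class="testClassOne" data-uid="123456">'
      . '<div class="testClassTwo">Content</div>'
      . '</div>'
      . '<span data-uid="789"></span>';

print_r(collectUidElements($html));
```

For an element without children, such as the span here, the children entry should come back as an empty string. Note that an element nested inside another data-uid element would appear twice: once inside its ancestor's children string and once as its own entry.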