Regex PHP find and match HTML tags with specific data-attributes

Question

I'm trying to parse a CTP file (CakePHP template with HTML and PHP tags in it) and want to match all the HTML tags with specific data-attributes (data-edit="true"). Each tag with data-edit="true" MUST have a data-type="..." and data-name="..." attribute. I would like to capture these attributes in (named) groups, so I can use them in my code. So far I have the following regex:

\<(?<tagname>\w+).*?(?>data\-edit="true").*?\>(?<content>.*?)\<\/(?&tagname)\>

Here are some samples of the tags it should match:

<h4 data-type="text" data-edit="true" data-name="SomeName">Some content, with or without newlines.</h4>

and

<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text 
with <strong>tags</strong> and newlines in it that 
should not break the parser.</span>

From the above examples I would like the regex to return the values of the data-type and data-name attributes, and of course the content (between the tags) itself.

The data-attributes can occur in any order, and other attributes (such as classes) may be present in the tags. So far I've managed to get the content of only the tags with a data-edit="true" attribute, but when the content has a newline, the match breaks. Also, I can't capture the other data-attributes.

Is it even possible to achieve what I want? I know regex isn't the preferred way to parse HTML, but as this is a CTP file with all kinds of other tags in it, I can't use an XML parser.

Edit: sample code: https://regex101.com/r/nF6a96/2

"as this is a CTP file with all kinds of other tags in it, I can't use an XML parser" ... An XML parser would be absolutely fine with the two examples you've given. Where it might fall over is where HTML tags haven't been closed in an XML way (e.g. <br>), though DOMDocument can cope with that, or where there's embedded PHP (<?php ... ?>).
– CD001, Commented Nov 9, 2018 at 11:41
@mickmackusa: updated the question, included a link to the sample code
– Fabian van Schevikhoven, Commented Nov 9, 2018 at 12:55

mickmackusa · Accepted Answer · 2018-11-09 13:13:20Z

2

XPath is such a fantastic and versatile tool. Your logic transfers seamlessly to an XPath query, which is easy to construct, read, and maintain in the future.

Furthermore, XPath is superior to regex because it will successfully match qualifying elements no matter the order of the attributes. Regex will struggle to do the same with just one preg_ call.

The following runs just one query that validates the required attributes, then loops over the matched elements to extract and store the results.

Code: (Demo)

libxml_use_internal_errors(true);     // suppress warnings about malformed HTML
$dom = new DOMDocument;
$dom->loadHTML($text, LIBXML_NOENT);  // $text is the markup to parse
//libxml_clear_errors();              // optionally discard the collected warnings
$xpath = new DOMXPath($dom);

$results = [];
// One query: data-edit must be "true"; data-type and data-name must be present
foreach ($xpath->query("//*[@data-edit='true' and @data-type and @data-name]") as $node) {
    $results[] = [
        'type' => $node->getAttribute('data-type'),
        'name' => $node->getAttribute('data-name'),
        'text' => $node->textContent,
    ];
}
var_export($results);

Output:

array (
  0 => 
  array (
    'type' => 'wysiwyg',
    'name' => 'Beoordeling',
    'text' => 'We beoordelen uw aanvraag en                                        berichten u over de acceptatie daarvan.',
  ),
  1 => 
  array (
    'type' => 'text',
    'name' => 'Bellen',
    'text' => 'We bellen u voor een afspraak.',
  ),
  2 => 
  array (
    'type' => 'text',
    'name' => 'Technisch specialist',
    'text' => 'Technisch specialist neemt bij u alles nog even door.',
  ),
)
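
To illustrate the point about attribute order, here is a minimal sketch that runs the same query against the two sample tags from the question. The third `<p>` element is an invented counter-example (it is not in the original post) that lacks data-name and should therefore be skipped.

```php
<?php
// Sample markup from the question; the attribute order differs between the tags.
// The <p> element is a made-up counter-example without data-name.
$text = <<<'HTML'
<h4 data-type="text" data-edit="true" data-name="SomeName">Some content, with or without newlines.</h4>
<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text
with <strong>tags</strong> and newlines in it that
should not break the parser.</span>
<p data-edit="true" data-type="text">No data-name here.</p>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($text);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = [];
foreach ($xpath->query("//*[@data-edit='true' and @data-type and @data-name]") as $node) {
    $results[] = [
        'tag'  => $node->nodeName,
        'type' => $node->getAttribute('data-type'),
        'name' => $node->getAttribute('data-name'),
    ];
}
var_export($results);
```

Both the `<h4>` and the `<span>` should be reported even though their attributes appear in a different order, while the `<p>` is left out.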

edited Nov 9, 2018 at 13:13

answered Nov 9, 2018 at 13:00

mickmackusa♦


5 Comments

Fabian van Schevikhoven Over a year ago

Wow @mickmackusa, this works so much better, thanks!

Max Over a year ago

Late to the party, but it would be just great to understand as well how you can get the HTML inside the found elements and not just the plain text. In any case this answer already helped me in 2022. :-) Thanks!

mickmackusa Over a year ago

@MLGS Maybe you want $dom->saveXML($node)

Max Over a year ago

Very nice, close - I want only the insides of the query, not the total element. In any case, thanks a lot already! You're a fast boii

Max Over a year ago

What I did to get the result I wanted was to iterate over childNodes and concat it to one string - same result with saveXML. Works like a charm. Thank you!
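
The childNodes approach described in the comment above could look roughly like the following sketch. `innerHtml` is a hypothetical helper name, not part of the DOM extension, and the sample markup is a shortened version of the tag from the question.

```php
<?php
/**
 * Hypothetical helper: returns the markup inside $node,
 * without the element's own start and end tags.
 */
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes only that child node
        $html .= $node->ownerDocument->saveXML($child);
    }
    return $html;
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text with <strong>tags</strong> in it.</span>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$node = $xpath->query("//*[@data-edit='true' and @data-type and @data-name]")->item(0);

echo innerHtml($node), PHP_EOL;
```

Swapping `saveXML` for `saveHTML` is an option if HTML-style serialization of void elements such as `<br>` is preferred.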

Pushpesh Kumar Rajwanshi · Answer · 2018-11-09 12:55:23Z

2

You should generally avoid parsing HTML with regex, but since this is a case of attribute lookup within a single tag and not a nested-tag scenario, you can use regex for a quick validation here.

You need lookaheads to ensure that the tag contains all three attributes you are looking for. You can use this regex:

<(\w+)(?=.*?data-edit="true")(?=.*?data-type="[^"]*")(?=.*?data-name="[^"]*")[^>]*?>.*?<\/\1>

Explanation:

<(\w+) --> matches the start of an opening tag and captures the tag name in group 1, so it can be reused for the closing tag
(?=.*?data-edit="true") --> lookahead that ensures the data-edit="true" attribute is present
(?=.*?data-type="[^"]*") --> lookahead that ensures the data-type attribute is present
(?=.*?data-name="[^"]*") --> lookahead that ensures the data-name attribute is present
[^>]*?> --> matches the rest of the opening tag, up to and including its >
.*? --> matches whatever text is between the opening and closing tag (the s modifier is needed for . to match newlines)
<\/\1> --> matches the closing tag with the same name as group 1
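
A minimal sketch of applying this pattern in PHP follows. It deviates from the regex above in three ways, all of which are assumptions of this example rather than part of the original answer: named groups are added for the values, the `s` modifier is set so that `.` also matches newlines, and the lookaheads use `[^>]*?` instead of `.*?` so they cannot look past the end of the opening tag. The first `<p>` line in the sample is invented to show a non-matching tag.

```php
<?php
$text = <<<'HTML'
<p class="intro">Not editable.</p>
<h4 data-type="text" data-edit="true" data-name="SomeName">Some content.</h4>
<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text
with <strong>tags</strong> in it.</span>
HTML;

// "~" is used as delimiter so "/" needs no escaping; "s" lets "." match newlines.
$pattern = '~<(\w+)'
    . '(?=[^>]*?data-edit="true")'
    . '(?=[^>]*?data-type="(?<type>[^"]*)")'
    . '(?=[^>]*?data-name="(?<name>[^"]*)")'
    . '[^>]*?>(?<content>.*?)</\1>~s';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    printf("%s | %s | %s\n", $m['type'], $m['name'], $m['content']);
}
```

Note that the lazy `.*?` stops at the first closing tag with the captured name, so an element that contains a nested element of the same name would be cut short.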

Demo

edited Nov 9, 2018 at 12:55

answered Nov 9, 2018 at 12:20

Pushpesh Kumar Rajwanshi


3 Comments

Fabian van Schevikhoven Over a year ago

Thanks @Pushpesh! I've altered the regex to this: <(\w+)(?=.*?data-edit="true")(?=.*?data-type="(?<type>[^"]*)")(?=.*?data-name="(?<name>[^"]*)")[^>]*?>(?<content>.*?)<\/\1> This does exactly what I want!

Pushpesh Kumar Rajwanshi Over a year ago

@mickmackusa: The regex I gave was intended for validating whether a tag meets the expected criteria. Capturing the individual parts may take one or more regexes, depending on how the attributes appear in the tag. If they always occur in the same sequence, a single regex can capture them; otherwise multiple regexes are needed.
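
One way to read the "multiple regexes" idea is a two-step pass: first find the opening tags that carry data-edit="true", then pull the data-* attributes out of each one. The sketch below is only an illustration under that assumption; the patterns and variable names are not from the thread.

```php
<?php
$text = '<h4 class="title" data-name="SomeName" data-edit="true" data-type="text">Some content.</h4>';

$attributes = [];
// Step 1: capture the attribute string of each opening tag containing data-edit="true".
if (preg_match_all('~<\w+\b([^>]*\bdata-edit="true"[^>]*)>~', $text, $tags)) {
    foreach ($tags[1] as $attributeString) {
        // Step 2: collect every data-* attribute of that tag, whatever the order.
        preg_match_all('~\bdata-([\w-]+)="([^"]*)"~', $attributeString, $pairs);
        $attributes[] = array_combine($pairs[1], $pairs[2]);
    }
}
var_export($attributes);
```

Each entry maps the attribute name without its data- prefix to its value, so the required keys (type, name) can be checked afterwards with isset().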

xate Over a year ago

@mickmackusa You should stop trying to make this answer less valuable by advertising your own.
