
I'm trying to parse a CTP file (CakePHP template with HTML and PHP tags in it) and want to match all the HTML tags with specific data-attributes (data-edit="true"). Each tag with data-edit="true" MUST have a data-type="..." and data-name="..." attribute. I would like to capture these attributes in (named) groups, so I can use them in my code. So far I have the following regex:

\<(?<tagname>\w+).*?(?>data\-edit="true").*?\>(?<content>.*?)\<\/(?&tagname)\>

Here are some samples of the tags it should match:

<h4 data-type="text" data-edit="true" data-name="SomeName">Some content, with or without newlines.</h4>

and

<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text 
with <strong>tags</strong> and newlines in it that 
should not break the parser.</span>

From the above examples I would like the regex to return the values of the data-type and data-name attributes, and of course the content (between the tags) itself.

The data-attributes can occur in any order, and other attributes (such as classes) may be present in the tags. So far I've managed to get the content of only the tags with a data-edit="true" attribute, but when the content contains a newline, the match breaks. I also can't capture the other data-attributes.

Is what I want to achieve even possible? I know regex isn't the preferred way to parse HTML, but as this is a CTP file with all kinds of other tags in it, I can't use an XML parser.

Edit: sample code: https://regex101.com/r/nF6a96/2

  • "as this is a CTP file with all kinds of other tags in it, I can't use an XML parser" ... an XML parser would be absolutely fine with the two examples you've given. Where it might fall over is where HTML tags haven't been closed in an XML way (e.g. <br>), though DOMDocument can cope with that, or where there's embedded PHP (<?php ... ?>). Commented Nov 9, 2018 at 11:41
  • @mickmackusa: updated the question, included a link to the sample code Commented Nov 9, 2018 at 12:55

2 Answers


XPath is such a fantastic and versatile tool. Your logic transfers seamlessly to an XPath query, which is easy to construct, read, and maintain in the future.

Furthermore, XPath is superior to regex because it will successfully match qualifying elements no matter the order of the attributes. Regex will struggle to do the same with just one preg_ call.

The following uses just one query to validate the elements, then loops over the results to extract and store the data.

Code:

$dom = new DOMDocument;
libxml_use_internal_errors(true);  // collect malformed-HTML warnings instead of printing them
$dom->loadHTML($text, LIBXML_NOENT);
libxml_clear_errors();             // discard the collected warnings
$xpath = new DOMXPath($dom);

$results = [];
// Select every element that has data-edit="true" plus a data-type and a data-name attribute.
foreach ($xpath->query("//*[@data-edit='true' and @data-type and @data-name]") as $node) {
    $results[] = [
        'type' => $node->getAttribute('data-type'),
        'name' => $node->getAttribute('data-name'),
        'text' => $node->textContent,  // plain text only; nested tags are stripped
    ];
}
var_export($results);

Output:

array (
  0 => 
  array (
    'type' => 'wysiwyg',
    'name' => 'Beoordeling',
    'text' => 'We beoordelen uw aanvraag en                                        berichten u over de acceptatie daarvan.',
  ),
  1 => 
  array (
    'type' => 'text',
    'name' => 'Bellen',
    'text' => 'We bellen u voor een afspraak.',
  ),
  2 => 
  array (
    'type' => 'text',
    'name' => 'Technisch specialist',
    'text' => 'Technisch specialist neemt bij u alles nog even door.',
  ),
)

5 Comments

Wow @mickmackusa, this works so much better, thanks!
Late to the party, but it would be just great to understand as well how you can get the HTML inside the found elements and not just the plain text. In any case this answer already helped me in 2022. :-) Thanks!
@MLGS Maybe you want $dom->saveXML($node)
Very nice, close - I want only the insides of the query, not the total element. In any case, thanks a lot already! You're a fast boii
What I did to get the result I wanted was to iterate over childNodes and concat it to one string - same result with saveXML. Works like a charm. Thank you!
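
The approach described in these comments (concatenating the serialized child nodes to get the markup inside a matched element) might look roughly like the sketch below. The helper name innerHtml and the sample $text are assumptions made for illustration, not part of the original answer; saveHTML is called per child here, and saveXML would be the XML-flavoured alternative mentioned above.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// inside $node, without the element's own opening and closing tags.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes a single child, nested tags included.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Assumed sample input, modeled on the <span> example in the question.
$text = '<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">'
      . 'Some text with <strong>tags</strong> in it.</span>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);  // collect parser warnings silently
$dom->loadHTML($text);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Map each data-name to the inner markup of its element.
$inner = [];
foreach ($xpath->query("//*[@data-edit='true' and @data-type and @data-name]") as $node) {
    $inner[$node->getAttribute('data-name')] = innerHtml($node);
}
var_export($inner);
```

Unlike textContent, this keeps nested tags such as <strong> in the captured string, which is probably what a wysiwyg field needs.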

You should generally avoid parsing HTML with regex, but since this is an attribute lookup within a single tag rather than a nested-tag scenario, you can use regex for a quick validation here.

You need lookaheads to ensure that the tag contains all three attributes you are looking for. You can use this regex:

<(\w+)(?=.*?data-edit="true")(?=.*?data-type="[^"]*")(?=.*?data-name="[^"]*")[^>]*?>.*?<\/\1>

Explanation:

  • <(\w+) --> matches the start of an opening tag and captures the tag name in group 1, so it can be reused to match the closing tag
  • (?=.*?data-edit="true") --> looks ahead and ensures the data-edit="true" attribute is present
  • (?=.*?data-type="[^"]*") --> looks ahead and ensures the data-type attribute is present
  • (?=.*?data-name="[^"]*") --> looks ahead and ensures the data-name attribute is present
  • [^>]*?> --> matches the rest of the opening tag, up to and including its closing >
  • .*? --> matches whatever text is between the opening and closing tags
  • <\/\1> --> matches the closing tag, using the tag name captured in group 1
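
As a rough sketch of how the breakdown above could be applied in PHP, the variant below adds named groups and makes two changes that are assumptions on my part rather than part of the regex above: the lookaheads use [^>]* instead of .*? so they cannot wander past the opening tag, and the s flag lets the content span newlines. The sample $text and the group names (tag, type, name, content) are illustrative only.

```php
<?php
// Assumed sample input, based on the two examples in the question.
$text = <<<'HTML'
<h4 data-type="text" data-edit="true" data-name="SomeName">Some content.</h4>
<p class="plain">Not editable.</p>
<span data-edit="true" data-type="wysiwyg" data-name="Beoordeling">Some text
with <strong>tags</strong> and newlines.</span>
HTML;

// x flag: whitespace in the pattern is ignored, so it can be laid out per step.
// s flag: the dot also matches newlines, so multi-line content is captured.
$pattern = '~
    <(?<tag>\w+)\b                           # opening tag name
    (?=[^>]*\sdata-edit="true")              # must carry data-edit="true"
    (?=[^>]*\sdata-type="(?<type>[^"]*)")    # capture data-type, any position
    (?=[^>]*\sdata-name="(?<name>[^"]*)")    # capture data-name, any position
    [^>]*>                                   # rest of the opening tag
    (?<content>.*?)                          # content, as short as possible
    </\k<tag>>                               # closing tag with the same name
~sx';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Keep only the named groups of each match.
$results = [];
foreach ($matches as $m) {
    $results[] = [
        'type'    => $m['type'],
        'name'    => $m['name'],
        'content' => $m['content'],
    ];
}
var_export($results);
```

This still has the usual regex limitations: an element nested inside another element of the same tag name ends the match at the first closing tag, and a > inside an attribute value breaks the opening-tag match.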


3 Comments

Thanks @Pushpesh! I've altered the regex to this: <(\w+)(?=.*?data-edit="true")(?=.*?data-type="(?<type>[^"]*)")(?=.*?data-name="(?<name>[^"]*)")[^>]*?>(?<content>.*?)<\/\1> This does exactly what I want!
@mickmackusa: The regex I gave was intended for validating whether the tag meets the expected criteria. Capturing the individual parts may require one or more regexes, depending on how the attributes appear in the tag. If they always occur in the same sequence, a single regex can capture them; otherwise multiple regexes are needed.
@mickmackusa You should stop trying to make this answer less valuable by advertising your own.
