PHP regex html data attributes with fixed markup

Question

I have the following fixed pattern markup scenarios

<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>

I'm trying to parse the following values out

id123
bar
qux (if it ever exists)

I was able to figure out how to match the different scenarios individually, but I'm having trouble coming up with one final rule that would work for all scenarios.

/<div class="myclass" id="(.*)" data-foo="(.*)"(data-baz="(.*)")?>/

I seem to be missing some basic regex principle. I tried anchoring the start and end and allowing for whitespace, but no luck.

For any HTML reading task, I prefer to use DOMDocument; you can also leverage DOMXPath to search. Basically, regex is not an easy solution for HTML parsing. Sorry I'm not providing an exact solution/example for your challenge.
– Scuzzy, Commented Apr 12, 2021 at 4:10
Correct, Scuzzy. However, I'm avoiding DOMDocument on purpose due to performance.
– seesoe, Commented Apr 12, 2021 at 4:14

mickmackusa · Accepted Answer · 2021-04-12 05:06:28Z


I do not endorse using regex to parse HTML, but you say that you are optimizing for speed and that the markup is predictably structured.
You just need to use lazy quantifiers with those dots and take a little more care with the optional spaces: your pattern has no space before data-baz and does not allow for the optional space before the closing >.

Code:

$text = <<<TEXT
<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>
TEXT;

preg_match_all('~<div class="myclass" id="(.*?)" data-foo="(.*?)" ?(?:data-baz="(.*?)" ?)?>~', $text, $matches);
var_export(array_slice($matches, 1));

Output:

array (
  0 => 
  array (
    0 => 'id123',
    1 => 'id123',
    2 => 'id123',
    3 => 'id123',
  ),
  1 => 
  array (
    0 => 'bar',
    1 => 'bar',
    2 => 'bar',
    3 => 'bar',
  ),
  2 => 
  array (
    0 => '',
    1 => '',
    2 => 'qux',
    3 => 'qux',
  ),
)

You can improve the regex efficiency by not using lazy quantifiers. If you know that the attribute values will not contain double quotes, then you can use a negated character class with a greedy quantifier: [^"]*.
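
As a minimal sketch of that variant, assuming the attribute values never contain double quotes, the same pattern can be written with [^"]* in place of each .*?. The PREG_SET_ORDER flag and the $rows array with its id/foo/baz keys are illustrative choices here, not part of the code above; they just group the captures per tag.

```php
<?php
// Sample markup copied from the question.
$text = <<<TEXT
<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>
TEXT;

// [^"]* greedily consumes everything up to the next double quote,
// so the engine never has to expand the match one character at a time.
$pattern = '~<div class="myclass" id="([^"]*)" data-foo="([^"]*)" ?(?:data-baz="([^"]*)" ?)?>~';

// PREG_SET_ORDER returns one array per matched tag instead of one array per capture group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Hypothetical row shape (id/foo/baz), chosen only for this illustration.
$rows = [];
foreach ($matches as $m) {
    $baz = $m[3] ?? '';                 // group 3 may be missing when data-baz is absent
    $rows[] = [
        'id'  => $m[1],
        'foo' => $m[2],
        'baz' => $baz === '' ? null : $baz,
    ];
}

var_export($rows);
```

Each entry in $rows then holds the values for one div, with baz set to null where data-baz does not appear. If the values are known to be limited to alphanumerics and hyphens, a tighter class such as [A-Za-z0-9-]+ could be swapped in the same way.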


1 Comment

seesoe Over a year ago

Thank you for the thorough answer and explanation. It's perfect. The attribute values will only ever contain alphanumeric characters and possibly hyphens.
