I have the following fixed-pattern markup scenarios:

<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>

I'm trying to parse out the following values:

id123
bar
qux (if it ever exists)

I was able to figure out how to match each scenario individually, but I'm having trouble coming up with one final rule that works for all of them.

/<div class="myclass" id="(.*)" data-foo="(.*)"(data-baz="(.*)")?>/

I seem to be missing some basic regex principle. I tried anchoring the start and end of the pattern and allowing for whitespace, but no luck.

  • For any HTML reading task, I prefer to use DomDocument; you can also leverage DOMXpath to search. Basically, regex is not an easy solution for HTML parsing. Sorry I'm not providing an exact solution/example for your challenge. Commented Apr 12, 2021 at 4:10
  • Correct, Scuzzy. However, I'm avoiding DomDocument on purpose due to performance. Commented Apr 12, 2021 at 4:14
  • That is a good call :) Commented Apr 12, 2021 at 21:09

1 Answer

  1. I do not endorse using regex to parse HTML, but you say that you are optimizing for speed and that the markup is predictably structured.
  2. You just need to use lazy quantifiers with those dots and take a little more care with the optional spaces: there is a space before data-baz, and there may be one before the closing >.

Code:

$text = <<<TEXT
<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>
TEXT;

preg_match_all('~<div class="myclass" id="(.*?)" data-foo="(.*?)" ?(?:data-baz="(.*?)" ?)?>~', $text, $matches);
var_export(array_slice($matches, 1));

Output:

array (
  0 => 
  array (
    0 => 'id123',
    1 => 'id123',
    2 => 'id123',
    3 => 'id123',
  ),
  1 => 
  array (
    0 => 'bar',
    1 => 'bar',
    2 => 'bar',
    3 => 'bar',
  ),
  2 => 
  array (
    0 => '',
    1 => '',
    2 => 'qux',
    3 => 'qux',
  ),
)

You can improve the regex efficiency by not using lazy quantifiers. If you know that the attribute values will not contain double quotes, then you can use a negated character class with a greedy quantifier: [^"]*.
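
As a minimal sketch of that suggestion, here is the same call with each (.*?) swapped for ([^"]*). It assumes, as stated above, that no attribute value contains a double quote; the $pattern variable is only introduced here for readability.

```php
<?php
// Same four sample tags as in the question.
$text = <<<TEXT
<div class="myclass" id="id123" data-foo="bar">content</div>
<div class="myclass" id="id123" data-foo="bar" >content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux">content</div>
<div class="myclass" id="id123" data-foo="bar" data-baz="qux" >content</div>
TEXT;

// Assumption: attribute values never contain a double quote.
// [^"]* is greedy, but it cannot run past the closing quote of the attribute,
// so each capture group stays inside its own attribute value.
$pattern = '~<div class="myclass" id="([^"]*)" data-foo="([^"]*)" ?(?:data-baz="([^"]*)" ?)?>~';

preg_match_all($pattern, $text, $matches);

// Drop the full-match element and keep only the three capture groups.
var_export(array_slice($matches, 1));
```

This should produce the same three arrays as the lazy version. For the tags without data-baz, the third group is reported as an empty string.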

1 Comment

Thank you for the thorough answer and explanation. It's perfect. The attribute values will only ever contain alphanumeric characters and possibly hyphens.
