Finding HTML tags in string

Question

I know this question is around on SO, but I can't find the right one and I still suck at regex :/

I have a string, and that string is valid HTML. Now I want to find all the tags with a certain name and attribute.

I tried this regex (i.e. div with type): /(<div type="my_special_type" src="(.*?)<\/div>)/.

Example string:

<div>Do not match me</div>
<div type="special_type" src="bla"> match me</div>
<a>not me</a>
<div src="blaw" type="special_type" > match me too</div>

If I use preg_match then I only get <div type="special_type" src="bla"> match me</div>, which makes sense because the other one has the attributes in a different order.

What regex do I need to get the following array when using preg_match on the example string?

array(0 => '<div type="special_type" src="bla"> match me</div>',
      1 => '<div src="blaw" type="special_type" > match me too</div>')

If it is valid HTML, can't you use PHP DOM? I don't recommend using preg_* for HTML.
– bansi, Commented Sep 14, 2013 at 10:37
How to parse the DOM in PHP: wiki.selfhtml.org/wiki/PHP/Tutorials/DOMDocument and codingreflections.com/php-parse-html. The latter says: "We will do the following jobs with our sample HTML: select element by id, get elements by tag name, find elements by class, find all links in a page, insert an HTML element, delete an element, deal with attributes."
– And, Commented Jul 12, 2021 at 16:32

hek2mgl · Accepted Answer · 2013-09-14 12:06:45Z

20

General advice: don't use regex to parse HTML. It will get messy if the HTML changes.

Use DOMDocument instead:

$str = <<<EOF
<div>Do not match me</div>
<div type="special_type" src="bla"> match me</div>
<a>not me</a>
<div src="blaw" type="special_type" > match me too</div>
EOF;

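// parse the HTML string into a DOM tree
$doc = new DOMDocument();
$doc->loadHTML($str);
$selector = new DOMXPath($doc);

// select every div whose type attribute is "special_type", regardless of attribute order
$result = $selector->query('//div[@type="special_type"]');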

// loop through all found items
foreach($result as $node) {
    echo $node->getAttribute('src');
}
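
The question asks for the full markup of each matching div rather than only its src attribute. Here is a minimal sketch of one way to collect that, assuming PHP 7+ with the DOM extension. The name find_special_divs is a hypothetical helper, not something from the answer above. The serializer may normalize whitespace inside tags, so the returned strings need not be byte-identical to the input.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Returns the outer HTML of every <div> whose type attribute is "special_type".
function find_special_divs(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $matches = [];

    foreach ($xpath->query('//div[@type="special_type"]') as $node) {
        // Passing a node to saveHTML() serializes only that node and its children
        $matches[] = $doc->saveHTML($node);
    }

    return $matches;
}

$html = <<<EOF
<div>Do not match me</div>
<div type="special_type" src="bla"> match me</div>
<a>not me</a>
<div src="blaw" type="special_type" > match me too</div>
EOF;

$matches = find_special_divs($html);
print_r($matches);
```

The XPath expression does the filtering, so the loop only has to serialize the nodes that already matched.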

edited Sep 14, 2013 at 12:06

answered Sep 14, 2013 at 10:37

hek2mgl


6 Comments

Sven van Zoelen Over a year ago

Thanks for the quick answer! I accepted the other answer because that one looped through the found nodes.

hek2mgl Over a year ago

@Sven Know that looping in PHP is slow. Using XPath is definitely the way to go here, especially if the HTML is larger than shown in the question. Test it.

Sven van Zoelen Over a year ago

Can you give me an example of how to loop through the found items from $result? I can't find a method like items() or something similar, and I have multiple tags in the HTML that I want to find.

hek2mgl Over a year ago

@Sven OHHH. This means I've ruined another post of mine O: (edited the wrong answer accidentally) ...

hek2mgl Over a year ago

@Sven No problem! :) I'm happy that you've finally accepted the XPath solution. Not because of the 15 rep I got, but because XPath is the way to go here. If you get fluent with XPath you'll see that you can select everything that comes to mind, and more! ;) But there are also situations where XPath has disadvantages from a performance point of view, especially if you select by node name or id attribute. In those cases pure DOM would perform better.


Kilise · Answer · 2013-09-14 10:55:21Z

5

As hek2mgl said, you'd better use DOMDocument:

$html = '
<div>Do not match me</div>
<div type="special_type" src="bla"> match me</div>
<a>not me</a>
<div src="blaw" type="special_type" > match me too</div>';

$matches = get_matched($html);


function get_matched($html){
    $matched = array();

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses parser warnings about imperfect markup

    // fetch the list of divs once instead of re-querying it on every iteration
    foreach ($dom->getElementsByTagName('div') as $div) {
        // skip divs that do not carry the wanted type attribute
        if ($div->getAttribute('type') != 'special_type')
            continue;

        $matched[] = $div->getAttribute('src');
        // or: $matched[] = $div->nodeValue; // text content only, child tags are dropped
    }

    return $matched;
}

edited Sep 14, 2013 at 10:55

answered Sep 14, 2013 at 10:47

Kilise


3 Comments

hek2mgl Over a year ago

That's bollocks. Are you really suggesting iterating over nodes instead of using XPath? It seems you didn't understand what hek2mgl said.

Kilise Over a year ago

I never said "instead of using XPath". Of course he could use XPath.

Musaddiq Khan Over a year ago

If the HTML is like this: $html = '<div>Do not <strong>match</strong> me</div>'; then nodeValue removes the strong tag.
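
The tag disappears because nodeValue returns only the text content of a node. Here is a rough sketch of one way to keep the nested markup, assuming PHP 7+ with the DOM extension. The name inner_html is a hypothetical helper, not a built-in DOM method. It serializes each child node with saveHTML() and joins the results.

```php
<?php
// Hypothetical helper (not a DOM built-in): returns the markup inside a node,
// keeping nested tags such as <strong>.
function inner_html(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Serialize each child (text nodes and elements alike) and join them
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div type="special_type" src="bla">Do <strong>match</strong> me</div>');
$div = $doc->getElementsByTagName('div')->item(0);

echo $div->nodeValue, "\n";   // text only, the <strong> tag is gone
echo inner_html($div), "\n";  // nested markup is preserved
```

To get the element's own opening and closing tag as well, pass the element itself to saveHTML() instead of its children.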
