I'm looking for a regular expression in PHP that matches anchors carrying a specific target="_parent" attribute. I would like to match anchors like this:

preg_match_all('<a href="http://" target="_parent">Text here</a>', subject, matches, PREG_SET_ORDER);

HTML:

<a href="http://" target="_parent">

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Text</B> - Text </DIV>
    </FONT>

</a>

</DIV>

2 Answers

To be honest, the best way is not to use a regular expression at all. A regex will miss all kinds of valid links, especially if you can't be sure that the links are always generated in exactly the same way.
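To illustrate the point, here is a minimal sketch of how a pattern modelled on the literal markup in the question behaves. The sample markup and the pattern are assumptions made up for this illustration; they are not taken from the asker's real pages.

```php
<?php
// Hypothetical sample markup: three equivalent anchors written three different ways.
$html = <<<HTML
<a href="http://a.example/" target="_parent">One</a>
<a target='_parent' href='http://b.example/'>Two</a>
<A HREF="http://c.example/"
   TARGET="_parent">Three</A>
HTML;

// A naive pattern that expects exactly the attribute order, quoting and case of the question.
$pattern = '~<a href="[^"]*" target="_parent">(.*?)</a>~';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Collect the captured anchor texts.
$found = array();
foreach ($matches as $match) {
    $found[] = $match[1];
}

print_r($found); // only "One" is found; "Two" and "Three" are silently skipped
```

Each variation (attribute order, quote style, letter case, line breaks) can be patched into the pattern, but every patch makes the regex harder to read and there is always another variation left.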

The best way is to use an HTML/XML parser such as DOMDocument.

<?php

function extractTags($html) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // DOMDocument would otherwise complain about badly formatted HTML
    $dom->loadHTML($html);
    $sxe = simplexml_import_dom($dom);
    $nodes = $sxe->xpath("//a[@target='_parent']");

    // Map each anchor's trimmed text to its href.
    $anchors = array();
    foreach ($nodes as $node) {
        $anchor = trim((string)dom_import_simplexml($node)->textContent);
        $attribs = $node->attributes();
        $anchors[$anchor] = (string)$attribs->href;
    }

    return $anchors;
}

$html = '<a href="http://" target="_parent">Text here</a>';
print_r(extractTags($html));

This will output:

Array
(
    [Text here] => http://
)

Even using it on your example:

$html = '<a href="http://" target="_parent">

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Text</B> - Text </DIV>
    </FONT>

</a>

</DIV>
';
print_r(extractTags($html));

will output:

Array
(
    [Text - Text] => http://
)

If you feel that the HTML is still not clean enough for DOMDocument, I would recommend a project such as HTML Purifier (see http://htmlpurifier.org/) to clean the HTML up first (and remove unneeded markup), then load its output into DOMDocument.
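As a variation on extractTags() above, the same XPath idea can be run through DOMXPath without the SimpleXML round trip, and for more than one target value. This is only a sketch: extractAnchorsByTarget() and the sample URLs are hypothetical names invented for the example, and it assumes the target values contain no single quotes.

```php
<?php
// Hypothetical helper: returns every anchor whose target is in $targets.
function extractAnchorsByTarget($html, array $targets) {
    if (!$targets) {
        return array(); // nothing to look for
    }

    $dom = new DOMDocument;
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the caller's setting

    // Build "@target='_parent' or @target='_blank'" from the list.
    // Assumption: no target value contains a single quote.
    $conditions = array();
    foreach ($targets as $target) {
        $conditions[] = "@target='" . $target . "'";
    }
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//a[' . implode(' or ', $conditions) . ']');

    // One entry per anchor, so anchors with identical text do not overwrite each other.
    $anchors = array();
    foreach ($nodes as $node) {
        $anchors[] = array(
            'text'   => trim($node->textContent),
            'href'   => $node->getAttribute('href'),
            'target' => $node->getAttribute('target'),
        );
    }
    return $anchors;
}

$sample = '<a href="http://one/" target="_parent">One</a>'
        . '<a href="http://two/" target="_blank">Two</a>'
        . '<a href="http://three/">Three</a>';
print_r(extractAnchorsByTarget($sample, array('_parent', '_blank')));
```

Returning a list of rows rather than an array keyed by anchor text is a deliberate choice here: two links with the same text would otherwise collapse into a single entry.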

8 Comments

I already know about DOM, but I need a regex to handle these matches; the HTML markup is not in good enough shape to be handled by DOM.
@user1218948 even with your example, this code still works. You are going to need to provide a larger example of this failing before regex even thinks of becoming an acceptable solution :P
It doesn't only grab target='_parent'; it grabs target='_blank' as well.
@user1218948 then change the xpath query to: $nodes = $sxe->xpath("//a[@target='_blank' or @target='_parent']");
$nodes = $sxe->xpath('//a[@target="_parent"]'); I used double quotes and that fixed it. So there are no longer any parsing issues, however badly the HTML is formatted?

You should be using the DOMDocument class instead of a regex. You will get a lot of false positives if you handle HTML with regex.

<?php

$html='<a href="http://" target="_parent">Text here</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    if ($tag->getAttribute('target') === '_parent') {
       echo $tag->nodeValue;
    }
}

Output:

Text here
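For messier markup, the same loop can be combined with libxml's internal error handling, so that parser warnings are collected rather than printed. The snippet below is a sketch under assumptions: the malformed sample string is made up (loosely based on the HTML in the question), and how many warnings libxml reports depends on the installed libxml version.

```php
<?php
// Hypothetical malformed markup: unquoted attributes, uppercase tags, a stray closing tag.
$html = '<a href="http://" target=_parent><FONT color=#000000><B>Text</B> - Text</FONT></a></DIV>';

libxml_use_internal_errors(true);  // keep libxml warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
$problems = libxml_get_errors();   // the warnings stay available for inspection
libxml_clear_errors();

// Same filter as above, collecting the trimmed text of each matching anchor.
$texts = array();
foreach ($dom->getElementsByTagName('a') as $tag) {
    if ($tag->getAttribute('target') === '_parent') {
        $texts[] = trim($tag->nodeValue);
    }
}

echo count($problems), " parser warning(s)\n";
print_r($texts);
```

Inspecting $problems (each entry is a LibXMLError object with a message and line number) is one way to see what the parser had to repair before deciding the markup is beyond DOM.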

3 Comments

The HTML markup is too ugly, and DOM is not capable of handling the misleading tags.
You need to post your HTML source so that everyone can have a look. If you think DOM is bad for this situation, regex would be even worse!
@user1218948, what is your expected output?
