1

I have several HTML pages that contain a comment like this:

<!-- ID: 123456 -->

What I need is a PHP script that can pull that ID number. I have tried the following:

if (preg_match('#^<!--(.*?)-->#i', $output)) {
    echo "A match was found.";
} else {
    echo array_flip(get_defined_constants(true)['pcre'])[preg_last_error()];
    echo "No match found.";
}

That always gives "No match found", with no error reported. I have also tried preg_match_all, with the same result. The only thing I have found to work is splitting the page into an array on spaces, but that is very time-consuming and a waste of processor power.

For reference, I have read and tried just about every suggestion on these pages:

Explode string by one or more spaces or tabs

http://php.net/manual/en/function.preg-split.php

How to extract html comments and all html contained by node?

5 Comments

  • Maybe this is because - is a special symbol and should be escaped? Commented Sep 22, 2015 at 20:38
  • How is the ID generated? Why can't you intercept that? Commented Sep 22, 2015 at 20:39
  • Remove the ^ from the pattern. Otherwise, it will match only at the start of the string. Commented Sep 22, 2015 at 20:40
  • Is $output the string with <!-- ID: 123456 -->, or the ID you want captured? It works here: eval.in/437735. You might need the m modifier if you want the <! to match only at the start of each line. Commented Sep 22, 2015 at 20:51
  • @u_mulder - is not a special symbol, except inside square brackets. Commented Sep 22, 2015 at 21:06
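The anchor issue raised in the comments can be checked directly. This is a minimal sketch, assuming $output holds a whole HTML page so that the comment does not sit at the very start of the string; the sample page below is made up for illustration.

```php
<?php
// Hypothetical sample page: the comment sits in the middle of the markup.
$output = "<html>\n<body>\n<p>text</p> <!-- ID: 123456 -->\n</body>\n</html>";

// With ^, the pattern can only match at the very start of the string.
$anchored = preg_match('#^<!--(.*?)-->#', $output);

// Without ^, the comment is found wherever it occurs.
$unanchored = preg_match('#<!--(.*?)-->#', $output, $m);

var_dump($anchored);   // int(0)
var_dump($unanchored); // int(1)
var_dump(trim($m[1])); // string(10) "ID: 123456"
```

If the comment always begins a line, adding the m modifier would let ^ match at the start of each line instead of only at the start of the string.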

3 Answers

1

How about trying this:

<!-- ID: ([\w ]+) -->

This matches the literal parts of your example and captures the ID, which may be any run of word characters and spaces. You can retrieve it from the first numbered group.

PS: Escape the pattern as the string syntax of your language requires.
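In PHP the pattern might be applied as follows. This is a sketch rather than a definitive implementation: the function name extractId is made up for illustration, and \d+ is swapped in for [\w ]+ on the assumption that the ID is always numeric.

```php
<?php
// Hypothetical helper: returns the ID from the first "<!-- ID: ... -->"
// comment, or null when no such comment exists.
function extractId(string $html): ?string
{
    // No ^ anchor, so the comment may appear anywhere in the page.
    // \s* tolerates varying whitespace; (\d+) captures the digits
    // (assumption: IDs are numeric).
    if (preg_match('/<!--\s*ID:\s*(\d+)\s*-->/', $html, $matches) === 1) {
        return $matches[1]; // first numbered group
    }
    return null;
}

var_dump(extractId('<p>blah</p><!-- ID: 123456 --><p>blah</p>')); // string(6) "123456"
var_dump(extractId('<p>no comment here</p>'));                     // NULL
```

Because the pattern is written in a single-quoted PHP string, \s and \d reach the regex engine unchanged and need no extra escaping.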


2 Comments

Here, only \w must be escaped.
Thanks, I have updated it. I was testing the regex in a Java environment and forgot to remove the escape characters.
1

To extract information from structured data (such as HTML, XML, JSON...), use the appropriate parser (here, DOMDocument and DOMXPath to query the DOM tree):

$html = <<<'EOD'
<script>var a='<!-- ID: avoid_this --> and that <!-- ID: 666 -->';</script>
blahblah<!-- ID: 123456 -->blahblah
EOD;

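// Select only real comment nodes whose text starts with " ID: ".
// The look-alike comments inside the <script> string are not comment nodes.
$query = '//comment()[starts-with(., " ID: ")]';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$nodeList = $xp->query($query);

foreach ($nodeList as $node) {
    // Drop the leading " ID: " (5 characters) and the trailing space.
    echo substr($node->textContent, 5, -1);
}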

Feel free to validate the result afterwards with is_numeric or a regex. You can also register your own PHP function and use it in the XPath query: http://php.net/manual/en/domxpath.registerphpfunctions.php
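The validation step could look like this. It is a sketch only: the function name extractNumericIds is hypothetical, ctype_digit stands in for is_numeric so that only plain digit strings pass, and libxml_use_internal_errors is added on the assumption that parser warnings about imperfect HTML should be suppressed.

```php
<?php
// Hypothetical helper: collects the IDs from all " ID: ..." comment nodes
// and keeps only those made up entirely of digits.
function extractNumericIds(string $html): array
{
    // Assumption: warnings about sloppy markup are not wanted in the output.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $ids = [];
    foreach ($xp->query('//comment()[starts-with(., " ID: ")]') as $node) {
        // Skip the leading " ID: " and trim surrounding whitespace.
        $id = trim(substr($node->textContent, 5));
        if (ctype_digit($id)) {
            $ids[] = $id;
        }
    }
    return $ids;
}

print_r(extractNumericIds('<p>a</p><!-- ID: 123456 --><!-- ID: not_a_number -->'));
```

Returning an array keeps the helper usable on pages with several ID comments; a caller that expects exactly one can check the count before using the first element.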


-1

First, treat the HTML file as a text file, because you only want to read some text from it.

test.html

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
<p>This is a test HTML page</p>
<!-- ID: 123456 -->
</body>
</html>

PHP script that fetches the ID from the HTML file

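<?php

$fileName = 'test.html';

$content = file_get_contents($fileName);
$start = '<!-- ID:';
$end   = '-->';

// Return the text between the first $start marker and the next $end marker,
// or an empty string if $start does not occur in $content.
function getBetween($content, $start, $end) {
    $r = explode($start, $content);

    if (isset($r[1])) {
        $r = explode($end, $r[1]);
        return $r[0];
    }
    return '';
}

// Strip the spaces around the ID before printing it.
echo str_replace(' ', '', getBetween($content, $start, $end));

?>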

1 Comment

This is a very... original approach :) But it's really far better to use a proper XML/HTML parser as shown by Casimir above.
