
This is driving me insane...

I have the following code:

    # open pdf
    $pdf = file_get_contents('myfile.pdf');

    echo("RE 1:\n");
    preg_match('/^[0-9]+ 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
    var_dump($m);

    echo("\nRE 2:\n");
    preg_match('/^8 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
    var_dump($m);

The file myfile.pdf contains the following text:

...
8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]
>>
endobj
...

The only difference between the two regular expressions is the object number at the start of the pattern: RE 1 matches any number with [0-9]+, while RE 2 hard-codes 8. Yet I get the following output:

RE 1:
array(0) {
}

RE 2:
array(2) {
  [0]=>
  string(78) "8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]"
  [1]=>
  string(3) "5 0"
}

I would expect both regular expressions to return similar results, but the regular expression with the numeric range at the start (RE 1) doesn't return any results. Is this a bug or am I doing something wrong?

Update

After adding preg_last_error(), I am getting PREG_BACKTRACK_LIMIT_ERROR. How can I fix that?
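
A minimal sketch of such a check, to make the failure visible instead of getting an empty array. The helper name pcre_error_name() is made up for this illustration, and the sample string stands in for the real file, so this small input is not expected to hit the limit:

```php
<?php
// Hypothetical helper (not in the original code): turns a preg_last_error()
// code into the name of the matching PREG_* constant.
function pcre_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? "unknown PCRE error ($code)";
}

// Sample object copied from the question; a real PDF would be read with
// file_get_contents() instead.
$subject = "8 0 obj\n<<\n/Type /Page\n/Parent 2 0 R\n/Resources 6 0 R\n"
         . "/Contents [ 5 0 R ]\n>>\nendobj\n";

$result = preg_match(
    '/^[0-9]+ 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \]/msU',
    $subject,
    $m
);

if ($result === false) {
    // preg_match() returns false on failure, which is not the same as 0 (no match).
    echo 'PCRE error: ', pcre_error_name(preg_last_error()), "\n";
} elseif ($result === 0) {
    echo "No match\n";
} else {
    echo 'Captured: ', $m[1], "\n";
}
```

Comparing the return value with === false matters here, because a plain truthiness check cannot tell "no match" apart from "the engine gave up".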

  • @Emma Yes, that is what I'm trying to capture. It works perfectly on regex101.com, but not in my code. Commented Jul 29, 2019 at 18:39
  • Both of your regexes work fine at sandbox.onlinephpfunctions.com so it could be that your PHP or PCRE version is causing a headache? Commented Jul 29, 2019 at 18:42
  • Try using preg_last_error() to see if it gives you any hints. Commented Jul 29, 2019 at 18:44
  • @MonkeyZeus Good call! I am getting PREG_BACKTRACK_LIMIT_ERROR. Commented Jul 29, 2019 at 18:52
  • Check your php.ini file and see what pcre.backtrack_limit is set to or use echo ini_get( 'pcre.backtrack_limit' ); if you don't have access to php.ini Commented Jul 29, 2019 at 18:54
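
The limit mentioned in the last comment can be inspected and, as one possible workaround, raised for the current script only. This is a sketch under assumptions: the new value below is arbitrary, and raising the limit only gives the engine more room; it does not make the pattern cheaper.

```php
<?php
// Read the current limit; ini_get() returns it as a string.
// The stock PHP default is 1000000.
$oldLimit = ini_get('pcre.backtrack_limit');
echo "pcre.backtrack_limit was: $oldLimit\n";

// Raise the limit for this request only. The value is an assumption chosen
// for illustration; pick one that suits the size of your input.
ini_set('pcre.backtrack_limit', '10000000');

$newLimit = ini_get('pcre.backtrack_limit');
echo "pcre.backtrack_limit is now: $newLimit\n";
```

The same setting can be made permanent with a pcre.backtrack_limit line in php.ini, if you have access to it.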

1 Answer

I'm guessing that you want an expression that looks something like

[0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]

with the s (DOTALL) modifier, so that . also matches newlines.

Test

$re = '/[0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]/s';
$str = '8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]
>>
endobj

8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]
>>
endobj';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it. There you can also watch how it matches against some sample inputs.


Comments

OP is using /msU so their . matches everything including newlines.
Yes, but that RE does work. And my output of preg_last_error() is PREG_BACKTRACK_LIMIT_ERROR. So that's why mine doesn't work I guess. But I'm not sure what causes that...
Yours works with /msU. I'm wondering if it's the word boundary you used... Testing more things now.
Ok, yours works because you added the .*? quantifier. I am already using the /U modifier, which means Ungreedy. But then your .*? reverses that. But I need it to be ungreedy, as there are many instances of this string I'm trying to capture.
I see that. It will fail in regex101.com. But both are acceptable in PHP. I have tried both and they yield the exact same result.
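
The point raised in the comments about /U and .*? can be checked in isolation. The short sketch below uses a made-up test string (not taken from the PDF) to show that the U modifier inverts greediness, so .* becomes lazy and .*? becomes greedy:

```php
<?php
// Made-up sample: there are two possible end points for a match starting at "a".
$s = 'a1b2b';

preg_match('/a.*b/', $s, $greedy);         // default: .* is greedy
preg_match('/a.*?b/', $s, $lazy);          // default: .*? is lazy
preg_match('/a.*b/U', $s, $ungreedy);      // with U: .* becomes lazy
preg_match('/a.*?b/U', $s, $ungreedyLazy); // with U: .*? becomes greedy again

var_dump($greedy[0], $lazy[0], $ungreedy[0], $ungreedyLazy[0]);
```

So a pattern written with .*? should be run without /U, and a pattern written with a plain .* should keep /U, if a lazy match is what you want.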