Regex behaves differently for the same input string

Question

I am trying to find the PDF page that contains a particular string. The string is:

"statement of profit or loss"

and I'm trying to accomplish this using the following regex:

re.search('statement of profit or loss', text, re.IGNORECASE)

But even though the page contained the string "statement of profit or loss", the regex returned None. On further investigation of the document, I found that the characters 'fi' in "profit", as written in the document, look more congested. When I copied the string from the document and pasted it into my code, it worked fine.

So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works fine. But if I type "statement of profit or loss" manually in my code, re.search() returns None. How can I avoid this behavior?

PDF is not a strict text document. Hence you are having difficulty in matching pattern.
– Rahul, Commented Feb 9, 2020 at 7:42
It is so common with whitespaces. Replace each space with \s+ or \s in the pattern.
– Wiktor Stribiżew, Commented Feb 9, 2020 at 7:57
Just FYI It is most likely you have a hard space in the input string.
– Wiktor Stribiżew, Commented Feb 9, 2020 at 10:50

Jongware · Accepted Answer · 2020-02-09 11:14:48Z

The 'congested' characters copied from your PDF are actually a single character: the 'fi ligature' U+FB01: ﬁ.
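To confirm this on your own data, it can help to print the code points of the extracted text. The following is a minimal sketch; the variable `extracted` is a hypothetical stand-in for whatever your PDF library returns, written here with an explicit `\ufb01` escape.

```python
import unicodedata

typed = "profit"          # typed on a keyboard: p, r, o, f, i, t
extracted = "pro\ufb01t"  # assumed extraction result: p, r, o, ﬁ, t

# The two strings look alike but differ in length and content.
print(len(typed), len(extracted))  # 6 5
print(typed == extracted)          # False

# List every character with its code point and Unicode name.
for ch in extracted:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The fourth line of the loop output names the character as LATIN SMALL LIGATURE FI, which is why a pattern typed with a separate f and i does not match.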

Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f+i with ﬁ.

Combining two or more characters into a single glyph is a fairly common operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) Actually, you can have any number of ligatures; some Adobe fonts use a single ligature for Th.

Usually this is not a problem with text extraction, because in the PDF it can be specified that certain glyphs must be decoded as a string of characters – the original characters. So, possibly your PDF does not contain such a definition, or the typesetting engine did not bother because the single character ﬁ is a valid Unicode character in its own right (although it is highly advised not to use it).

You can work around this by explicitly cleaning up your text strings before processing any further:

text = text.replace('ﬁ', 'fi')

– repeat this for other problematic ligatures which have a Unicode codepoint: ﬂ, ﬀ, ﬃ, ﬄ (I possibly missed some more).
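The repeated replace calls can be folded into one translation table. The sketch below shows that, plus an alternative based on Unicode NFKC normalization from the standard library. The helper names `expand_ligatures` and `normalize_text` and the sample string `raw` are illustrative assumptions, not part of the original code. Note that NFKC is broader than a ligature table: it also rewrites other compatibility characters, for example turning a no-break space (the "hard space" mentioned in the comments) into a plain space, so check that this is acceptable for your text.

```python
import re
import unicodedata

# The five f-ligatures listed above, mapped to their plain letters.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text):
    """Replace only the ligatures in the table; leave everything else alone."""
    return text.translate(LIGATURES)


def normalize_text(text):
    """Broader alternative: NFKC also expands other compatibility characters."""
    return unicodedata.normalize("NFKC", text)


pattern = "statement of profit or loss"
raw = "Consolidated Statement of Pro\ufb01t or Loss"  # assumed extracted text

print(re.search(pattern, raw, re.IGNORECASE))  # None
print(re.search(pattern, expand_ligatures(raw), re.IGNORECASE) is not None)
print(re.search(pattern, normalize_text(raw), re.IGNORECASE) is not None)
```

The table approach changes nothing but the listed ligatures, which keeps surprises to a minimum; the NFKC approach needs no table to maintain but alters more of the text.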
