
I am trying to find the PDF page that contains a particular string. The string is:

"statement of profit or loss"

and I'm trying to accomplish this using the following regex:

re.search('statement of profit or loss', text, re.IGNORECASE)

But even though the page contains the string "statement of profit or loss", the regex returned None. On investigating the document further, I found that the characters 'fi' in "profit", as written in the document, are set closer together than usual. When I copied the string from the document and pasted it into my code, the search worked.

So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works. But if I type "statement of profit or loss" manually in my code, re.search() returns None. How can I avoid this behavior?

  • PDF is not a strict text document. Hence you are having difficulty in matching pattern. Commented Feb 9, 2020 at 7:42
  • It is so common with whitespaces. Replace each space with \s+ or \s in the pattern. Commented Feb 9, 2020 at 7:57
  • Just FYI It is most likely you have a hard space in the input string. Commented Feb 9, 2020 at 10:50

1 Answer


The 'congested' characters copied from your PDF are actually a single character: the 'fi ligature', U+FB01 (ﬁ).

Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f+i with the single character ﬁ.

Combining two or more characters into a single glyph is a fairly common operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) Actually, a font can have any number of ligatures; some Adobe fonts use a single ligature for Th.

Usually this is not a problem for text extraction, because the PDF can specify that certain glyphs must be decoded as a string of characters, namely the original characters. So, possibly your PDF does not contain such a definition, or the typesetting engine did not bother because the single character is a valid Unicode character in its own right (although using it is strongly discouraged).
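To check whether this is what is happening in your case, it can help to list the non-ASCII characters in the extracted text together with their Unicode names. The following is only a sketch: `describe_non_ascii` is a helper name made up for this illustration, and the `extracted` string is a stand-in for whatever your PDF library actually returns.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# Assumed sample: the phrase as it might come out of the PDF (with the ligature)
extracted = "statement of pro\ufb01t or loss"
# The same phrase as typed on a keyboard (separate 'f' and 'i')
typed = "statement of profit or loss"

# The two strings look alike on screen but differ in length and content
print(len(extracted), len(typed))
print(describe_non_ascii(extracted))
```

If the ligature shows up in that list, the typed pattern cannot match, because 'f' followed by 'i' is two characters and the ligature is one.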

You can work around this by explicitly cleaning up your text strings before processing any further:

text = text.replace('\ufb01', 'fi')  # U+FB01 is the single-character fi ligature

Repeat this for the other problematic ligatures that have a Unicode code point: ﬂ (U+FB02), ﬀ (U+FB00), ﬃ (U+FB03), ﬄ (U+FB04) (I possibly missed some more).
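Rather than chaining several replace() calls, the clean-up can be done in one pass with a translation table. This is a minimal sketch, not the only way to do it: `LIGATURES`, `expand_ligatures` and `page_text` are names invented for the example, and the table covers only the Latin ligatures in the range U+FB00 to U+FB06.

```python
import re
import unicodedata

# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}
LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace each single-character ligature with its plain letters."""
    return text.translate(LIGATURE_TABLE)


# Assumed sample standing in for text extracted from a PDF page
page_text = "Consolidated Statement of Pro\ufb01t or Loss"

cleaned = expand_ligatures(page_text)
match = re.search("statement of profit or loss", cleaned, re.IGNORECASE)
print(match is not None)

# Alternative: Unicode compatibility normalization also expands these ligatures
print(unicodedata.normalize("NFKC", page_text) == cleaned)
```

Note that NFKC normalization is broader than the explicit table: it also rewrites other compatibility characters, for example turning a no-break space into a regular space. That may be welcome for searching, but the explicit table is the safer choice if the rest of the text must stay untouched.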
