I read a PDF into Python and would like to extract specific paragraphs from it using a regex. To illustrate the case, here is an example of the extracted text.

INTERNATIONAL MONETARY FUND            7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.      
Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8

Each paragraph starts with a one- or two-digit number, followed by a dot and three to seven spaces. It ends at the next double newline \n\n that is followed by a one- or two-digit number and a dot. Note that this number also acts as the start of the next paragraph. In the example above, I should find three paragraphs:

first paragraph:

  1. The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n

second paragraph:

  1. The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n

and finally the third:

  1. Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n

I've tried to use the following regex: r'(?m)[0-99].*[.] {3,7} (.*?) \n\n with the reasoning to select everything from the start to the end

  1. (?m)[0-99].*[.] {3,7}: To identify the beginning, for each line separately.
  2. \n\n specifying the end.

However, it doesn't match anything.

  • If you think [0-99] matches numbers from 0 to 99, you are wrong. You may replace that with \d\d?. re.M ((?m)) modifies ^ and $, but you do not have them in the pattern. You must have wanted to use (?s). Try r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n|\Z)', see the regex demo. Commented Nov 20, 2018 at 13:41
  • Can you provide the raw input? Commented Nov 20, 2018 at 13:42

1 Answer

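The [0-99] pattern is erroneous: it is a character class containing the range 0-9 plus a redundant 9, so it matches exactly one digit from 0 to 9. See Why doesn't [01-12] range work as expected?. The re.M ((?m)) flag only modifies the ^ and $ anchors, but you have neither in the pattern. To let . match across line breaks you need re.S ((?s)) instead.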
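Both points can be checked quickly. This is only a minimal sketch; the sample strings and variable names are made up for illustration:

```python
import re

# [0-99] is a character class (range 0-9 plus a redundant 9): one digit only.
one_digit = re.fullmatch(r'[0-99]', '7')     # matches
two_digits = re.fullmatch(r'[0-99]', '42')   # None: the class cannot match two chars
fixed = re.fullmatch(r'\d\d?', '42')         # matches: 1 or 2 digits

text = "intro\n7.     first line\nsecond line"

# Without re.M, ^ only matches at the very start of the string.
without_m = re.findall(r'^\d\d?\.', text)        # []
# With (?m), ^ also matches right after each newline.
with_m = re.findall(r'(?m)^\d\d?\.', text)       # ['7.']

# Without re.S, . stops at a newline; with (?s) it crosses line breaks.
without_s = re.search(r'first.*line', text).group()
with_s = re.search(r'(?s)first.*line', text).group()

print(bool(one_digit), bool(two_digits), bool(fixed))
print(without_m, with_m)
print(repr(without_s), repr(with_s))
```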

You may use

r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'

See the regex demo.

Details

  • (?sm) - re.DOTALL and re.MULTILINE options enabled
  • ^ - start of a line
  • \d\d? - 1 or 2 digits (0 to 99)
  • \. - a dot
  •  {3,7} (a literal space followed by {3,7}) - 3 to 7 spaces (replace with [^\S\r\n]{3,7} to match any horizontal whitespace)
  • (.*?) - Group 1: any 0+ chars as few as possible
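  • (?=\n\n\d\d?\. |\Z) - a positive lookahead: the current position must be immediately followed by either two newline chars (\n\n), 1 or 2 digits (\d\d?), a dot and a space, or (|) the end of the whole string (\Z). The lookahead does not consume that text, so the next match can start at the following number.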
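Why the end condition is a lookahead and not a plain group can be seen on a shortened input. This is a sketch only: the sample string is a made-up miniature of the text in the question, and the variable names are illustrative.

```python
import re

# Miniature stand-in for the PDF text: three numbered paragraphs,
# the first one containing a bullet after a blank line.
sample = "7.     alpha\n\n\uf0b7   bullet\n\n8.      beta\n\n9.      gamma\n\n\n8"

lookahead = r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'
consuming = r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)'

# The lookahead only checks what follows, so "8. " is still available
# as the start of the next match.
with_lookahead = re.findall(lookahead, sample)

# The non-capturing group consumes "\n\n8. ", so paragraph 8 can no
# longer be matched and is skipped.
with_consuming = re.findall(consuming, sample)

print(len(with_lookahead), len(with_consuming))
```

With the lookahead all three paragraphs are returned; with the consuming group only the first and the last are.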

Python demo:

import re
s="INTERNATIONAL MONETARY FUND            7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.      
Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8"
for r in re.findall(r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)', s):
    print(r, "\n---------")

Output:

The current recovery is an opportunity to strengthen the resilience and growth
potential of the Belgian economy. The government's ability to deal with future shocks will depend
on whether it implements the right policies now while the economy continues to recover.

   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still
    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will
    require following through on plans to gradually move toward structural balance.

   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,
    further labor and product market reforms are needed to increase productivity growth, raise
    potential output, and integrate vulnerable groups into the labor market.

   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical
    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance
    and proactive policies.3 
---------
The government agreed last summer on a new package of measures related to
taxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was
a reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be
phased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in
2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was
modified to apply only to incremental corporate equity rather than to the total stock, and new anti-
tax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the
measures are designed to enhance Belgium's competitiveness while preserving revenue neutrality. 
---------
Policy discussions focused on the importance of maintaining the reform momentum
and not yielding to complacency. Achieving the balanced budget goal will require efforts at all
levels of government to make spending more efficient and safeguard revenues (Section A).
A combination of policies and reforms could help raise productivity growth, including increasing
investment in infrastructure and enhancing competition in services (Section B). To fully realize
Belgium's employment potential, it will be critical to address the severe fragmentation of the labor
market (Section C). To preserve financial stability, the authorities should address vulnerabilities in the
mortgage market and carefully navigate the transition toward a European Banking Union (Section D).




3
 A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector
Assessment Program (FSAP).
4
  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with
a deduction that is the product of corporate equity and a notional interest rate.


8 
---------
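If the paragraph numbers are needed as well, one possible variation is to capture them in a group of their own and use re.finditer. The sketch below is not part of the demo above: the helper name split_numbered_paragraphs, the PARAGRAPH_RE constant and the sample string are assumptions made for illustration, and it uses the horizontal-whitespace class mentioned in the details.

```python
import re

# Verbose version of the pattern, with the paragraph number captured too.
PARAGRAPH_RE = re.compile(
    r'''
    ^(\d\d?)\.                    # group 1: 1 or 2 digits at the start of a line, then a dot
    [^\S\r\n]{3,7}                # 3 to 7 horizontal whitespace chars
    (.*?)                         # group 2: the paragraph body, as short as possible
    (?=\n\n\d\d?\.[^\S\r\n]|\Z)   # stop before the next numbered paragraph or at the end
    ''',
    re.S | re.M | re.X,
)


def split_numbered_paragraphs(text):
    """Return {paragraph number: body} for each numbered paragraph in text."""
    return {
        int(m.group(1)): m.group(2).strip()
        for m in PARAGRAPH_RE.finditer(text)
    }


# Made-up miniature input; paragraph 8 uses tabs instead of spaces.
sample = "HEADER 7\n7.     alpha\n\n\uf0b7   bullet\n\n8.\t\t\tbeta\n\n9.      gamma\n\n\n8"
paragraphs = split_numbered_paragraphs(sample)
print(sorted(paragraphs))
```

As in the output above, the last paragraph still includes everything up to the end of the string (footnotes and the page number), so that tail would need separate handling.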
9 Comments

Many thanks for your answer. However, if I paste it there, I don't get any match; see the demo on regex101.com
@math I already did it for you - see this demo. And here is a Python demo.
I've just noticed that my ending condition was not correct. I will change it above. It should be \n\n followed by a one- or two-digit number and a dot. Otherwise we don't select the complete first paragraph, as it contains sequences like \n\n\uf0b7. I tried to change your solution to r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)' but then the last paragraph is not selected
Because otherwise it will end at .\n\n\uf0b7 in the first paragraph, which is not correct
@math \uf0b7 is a private-use character (Unicode category Co, "other"), it is not a digit, so \d does not match it. If you need to match ASCII digits only, use [0-9] instead of \d.
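The difference between \d and [0-9] can be checked as follows. This is a small sketch; the Arabic-Indic digit is just an arbitrary example of a non-ASCII digit and the variable names are illustrative.

```python
import re
import unicodedata

bullet = '\uf0b7'        # the bullet character from the PDF text
arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

bullet_category = unicodedata.category(bullet)        # 'Co' (private use)
digit_category = unicodedata.category(arabic_three)   # 'Nd' (decimal digit)

# For str patterns in Python 3, \d matches any Unicode decimal digit.
bullet_is_digit = re.fullmatch(r'\d', bullet) is not None          # False
unicode_digit = re.fullmatch(r'\d', arabic_three) is not None      # True

# [0-9] (or \d with the ASCII flag) is limited to ASCII digits.
ascii_class = re.fullmatch(r'[0-9]', arabic_three) is not None     # False
ascii_flag = re.fullmatch(r'(?a)\d', arabic_three) is not None     # False

print(bullet_category, digit_category)
print(bullet_is_digit, unicode_digit, ascii_class, ascii_flag)
```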