Python regex matching multiline string

Question

my_str :

PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'

my code

regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(my_str))

my output :

[('Applicants:', ' ', 'Silixa Ltd.')]

What I need is the text between 'Applicants:' and '\nInventors:', i.e.:

'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'

Thanks in advance for your help

If you want the regex to stop at Inventors, why are you using .*?
– John Gordon, Commented Jun 29, 2020 at 15:43
Should Inventors: and Applicants: always be at the start of the line?
– The fourth bird, Commented Jun 29, 2020 at 15:46

xpqz · Accepted Answer · 2020-06-29 16:06:15Z

2

Try using re.DOTALL instead:

import re

text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''

regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))

gives me

$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']

This works because re.DOTALL lets the dot (.) match newlines. re.MULTILINE does not: it only changes where ^ and $ match, and this pattern uses neither, so that flag isn't needed here.
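The captured group still holds both applicants in a single string. Here is a minimal sketch for separating them, assuming (as in the sample text) that a blank line divides one applicant from the next while a single newline is only a wrapped line. The names block and applicants are just illustrative:

```python
import re

# Same layout as the sample: a blank line between entries, a single newline inside a wrapped entry.
text = ("PCT Filing Date: 2 December 2015\n\n"
        "Applicants: Silixa Ltd.\n\n"
        "Chevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n"
        "Inventors: Farhadiroushan,\nMahmoud\n\n"
        "Gillies, Arran\nParker, Tom")

# Grab everything between the two labels; DOTALL lets the dot cross newlines.
block = re.search(r'Applicants:(.*?)Inventors:', text, re.DOTALL).group(1)

# Assumption: a blank line separates applicants. Drop empty pieces and trim whitespace.
applicants = [part.strip() for part in re.split(r'\n\s*\n', block) if part.strip()]
print(applicants)
# ['Silixa Ltd.', 'Chevron U.S.A. Inc. (Incorporated\nin USA - California)']
```

If your real data does not use blank lines as separators, the split step would need a different rule.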

edited Jun 29, 2020 at 16:06

answered Jun 29, 2020 at 15:42

2 Comments

Greg Ward Over a year ago

You might want to clarify in your answer that MULTILINE is not needed here. It's not in your code, but you might explain why in the text. (Spoiler alert: MULTILINE affects the behaviour of ^ and $ around newlines, but the OP's regex does not use ^ or $.)
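A small sketch of the difference between the two flags, using a made-up three-line sample string rather than the OP's data:

```python
import re

sample = "Applicants: A\nB\nInventors: C"

# Without MULTILINE, ^ matches only at the very start of the string.
start_only = re.findall(r'^\w+:', sample)
print(start_only)        # ['Applicants:']

# With MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r'^\w+:', sample, re.MULTILINE)
print(every_line)        # ['Applicants:', 'Inventors:']

# MULTILINE leaves the dot alone: it still stops at the first newline.
dot_multiline = re.findall(r'Applicants:(.*)', sample, re.MULTILINE)
print(dot_multiline)     # [' A']

# DOTALL is the flag that lets the dot cross newlines.
dot_dotall = re.findall(r'Applicants:(.*?)Inventors:', sample, re.DOTALL)
print(dot_dotall)        # [' A\nB\n']
```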

Houssam Hsm Over a year ago

Thank you so much for your help

André Santos · Accepted Answer · 2020-06-29 15:57:32Z

1

If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:

>>> regex = re.compile(r'Applicants: (.*)Inventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']

re.S is the "dot matches all" option, so our (.*) will also match newlines. Note that this is different from re.MULTILINE: re.MULTILINE only makes ^ and $ match at the start and end of every line; it doesn't change the fact that . will not match newlines. If . doesn't match newlines, a group like (.*) will still stop at the first newline, so it cannot span the lines you want.
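One thing to watch with (.*) under re.S: it is greedy, so if the input ever held more than one Applicants/Inventors pair it would run to the last Inventors:. The sketch below assumes such a two-record input (made up for illustration, not from the question) and compares it with the lazy (.*?):

```python
import re

# Hypothetical input with two records, to show where greedy and lazy differ.
sample = "Applicants: A\nInventors: B\nApplicants: C\nInventors: D"

# Greedy: the group swallows everything up to the LAST "Inventors:".
greedy = re.findall(r'Applicants: (.*)Inventors:', sample, re.S)
print(greedy)   # ['A\nInventors: B\nApplicants: C\n']

# Lazy: the group stops at the FIRST "Inventors:" after each "Applicants:".
lazy = re.findall(r'Applicants: (.*?)Inventors:', sample, re.S)
print(lazy)     # ['A\n', 'C\n']
```

With a single record, as in the question, both forms return the same text.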

Also note that if you are not interested in capturing Applicants: or Inventors: themselves, you should not wrap them in (), as you did with (Applicants:) in your regex, because each pair of parentheses creates a capturing group. That's why you got 3 elements in your output instead of just 1.
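To see the effect of the parentheses on findall, here is a short sketch on a one-line sample. With several capturing groups findall returns tuples; with a single group it returns plain strings:

```python
import re

sample = "Applicants: Silixa Ltd."

# Three capturing groups: findall returns one 3-tuple per match.
three_groups = re.findall(r'(Applicants:)( )?(.*)', sample)
print(three_groups)    # [('Applicants:', ' ', 'Silixa Ltd.')]

# Non-capturing groups (?:...) keep the grouping but are left out of the result.
non_capturing = re.findall(r'(?:Applicants:)(?: )?(.*)', sample)
print(non_capturing)   # ['Silixa Ltd.']

# Simplest form: no parentheses at all around the parts you don't need.
one_group = re.findall(r'Applicants: ?(.*)', sample)
print(one_group)       # ['Silixa Ltd.']
```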

edited Jun 29, 2020 at 15:57

answered Jun 29, 2020 at 15:44

1 Comment

Houssam Hsm Over a year ago

Thank you so much for the solution and explanations. You made it look simple!

The fourth bird · Accepted Answer · 2020-06-29 16:16:56Z

1

If you want to match all the text between \nApplicants: and \nInventors:, you could also get the match without using re.DOTALL, which avoids unnecessary backtracking.

Match Applicants: and capture in group 1 the rest of that same line and all lines that follow that do not start with Inventors:

Then match Inventors.

^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:

^ Start of a line, given re.MULTILINE (or use \b if it does not have to be at the start)
Applicants: Match literally, including the trailing space
( Capture group 1
- .* Match the rest of the line
- (?:\r?\n(?!Inventors:).*)* Match all following lines that do not start with Inventors:
) Close group
\r?\nInventors: Match a newline and Inventors:
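The same breakdown can be kept next to the pattern with re.VERBOSE. This is only a sketch of an alternative way to write it; note that in verbose mode a literal space has to be written as [ ] (or escaped), because plain whitespace in the pattern is ignored:

```python
import re

pattern = re.compile(r'''
    ^Applicants:[ ]         # the label at the start of a line, then one space
    (                       # group 1: the value we want
        .*                  #   the rest of that line
        (?:                 #   then zero or more further lines:
            \r?\n           #     a line break
            (?!Inventors:)  #     the next line must not start with Inventors:
            .*              #     take that whole line
        )*
    )
    \r?\nInventors:         # the line break and label that end the value
''', re.MULTILINE | re.VERBOSE)

text = ("PCT Filing Date: 2 December 2015\n"
        "Applicants: Silixa Ltd.\n"
        "Chevron U.S.A. Inc. (Incorporated\n"
        "in USA - California)\n"
        "Inventors: Farhadiroushan,\n"
        "Mahmoud")

result = pattern.findall(text)
print(result)
# ['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']
```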

Example code

import re
text = ("PCT Filing Date: 2 December 2015\n"
    "Applicants: Silixa Ltd.\n"
    "Chevron U.S.A. Inc. (Incorporated\n"
    "in USA - California)\n"
    "Inventors: Farhadiroushan,\n"
    "Mahmoud\n"
    "Gillies, Arran\n"
    "Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))

Output

['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']

edited Jun 29, 2020 at 16:16

answered Jun 29, 2020 at 16:03

2 Comments

Houssam Hsm Over a year ago

Thank you so much. I just had to remove ^ and then it works perfectly: regex = re.compile(r'Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:')

The fourth bird Over a year ago

You can do that if it does not have to be at the start of the string. I would suggest prepending a word boundary \b in that case \bApplicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors: regex101.com/r/r7W0Ig/1
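A quick sketch of what \b buys you compared with ^ or with no anchor at all. The sample strings are made up for illustration (the label sits mid-line in the first one, and CoApplicants is a hypothetical longer word containing the label):

```python
import re

body = r'Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:'

# The label is not at the start of a line here.
mid_line = "Filed 2015. Applicants: Silixa Ltd.\nChevron U.S.A. Inc.\nInventors: Gillies, Arran"

anchored = re.findall('^' + body, mid_line, re.MULTILINE)
print(anchored)     # []  (^ requires the label at the start of a line)

bounded = re.findall(r'\b' + body, mid_line)
print(bounded)      # ['Silixa Ltd.\nChevron U.S.A. Inc.']

# \b also stops the label from matching inside a longer word.
longer_word = "CoApplicants: X\nInventors: Y"
no_anchor = re.findall(body, longer_word)
with_boundary = re.findall(r'\b' + body, longer_word)
print(no_anchor)        # ['X']
print(with_boundary)    # []
```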

dawg · Accepted Answer · 2020-06-29 16:13:29Z

Here is a more general approach that parses a string like that into a dict of all the keys and values in it (i.e., any text at the start of a line followed by a : is a key, and the text following that key is its value):

import re 

txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""

pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}

Result:

>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}

>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'
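The values keep their internal and trailing newlines. If you would rather have each value on one line, a possible follow-up step is sketched below. It assumes the line breaks inside a value are only wrapping and can be replaced by single spaces; clean is just an illustrative name, and the sample is a shortened version of the text above:

```python
import re

txt = ("PCT Filing Date: 2 December 2015\n"
       "Applicants: Silixa Ltd.\n"
       "Chevron U.S.A. Inc. (Incorporated\n"
       "in USA - California)\n"
       "Inventors: Farhadiroushan,\n"
       "Mahmoud")

# Same pattern as above: a key at the start of a line, then everything up to the next key or the end.
pat = re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data = {m.group(1): m.group(2) for m in pat.finditer(txt)}

# Assumption: newlines inside a value are just wrapping.
# str.split() with no arguments splits on any whitespace run and drops leading/trailing whitespace.
clean = {key: ' '.join(value.split()) for key, value in data.items()}
print(clean['Applicants'])
# Silixa Ltd. Chevron U.S.A. Inc. (Incorporated in USA - California)
```

This flattening loses the boundary between the two applicants, so skip it if you need them separately.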
