my_str :

PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'

my code

regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(text))

my output :

[('Applicants:', ' ', 'Silixa Ltd.')]

What I need is the text between 'Applicants:' and '\nInventors:', that is:

'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'

Thanks in advance for your help

  • If you want the regex to stop at Inventors, why are you using .*? Commented Jun 29, 2020 at 15:43
  • Should Inventors: and Applicants: always be at the start of the line? Commented Jun 29, 2020 at 15:46

4 Answers

Try using re.DOTALL instead:

import re

text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''

regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))

gives me

$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']

This works because re.DOTALL lets the dot (.) match newlines. re.MULTILINE does not: it only changes where ^ and $ match, so with it alone .* still stops at the end of the line.
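To see the difference between the flags side by side, here is a minimal sketch. The shortened `sample` string and the variable names are made up for illustration; only the pattern comes from the answer.

```python
import re

# Shortened, hypothetical sample with the same shape as the question's text.
sample = "Applicants: Silixa Ltd.\nChevron U.S.A. Inc.\nInventors: Parker, Tom"
pattern = r'Applicants:(.*?)Inventors:'

# No flags: '.' cannot cross the newline, so 'Inventors:' is never reached.
no_flag = re.findall(pattern, sample)

# MULTILINE only changes '^' and '$'; '.' still stops at the newline.
multiline = re.findall(pattern, sample, re.MULTILINE)

# DOTALL lets '.' match '\n', so the group can span several lines.
dotall = re.findall(pattern, sample, re.DOTALL)

print(no_flag)    # []
print(multiline)  # []
print(dotall)     # [' Silixa Ltd.\nChevron U.S.A. Inc.\n']
```

The leading space and trailing newline in the last result can be dropped with `.strip()` if they are not wanted.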


2 Comments

You might want to clarify in your answer that MULTILINE is not needed here. It's not in your code, but you might explain why in the text. (Spoiler alert: MULTILINE affects the behaviour of ^ and $ around newlines, but the OP's regex does not use ^ or $.)
Thank you so much for your help

If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:

>>> regex = re.compile(r'Applicants: (.*)Inventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']

re.S (short for re.DOTALL) is the "dot matches all" option, so our (.*) will also match newlines. Note that this is different from re.MULTILINE: re.MULTILINE only makes ^ and $ match at the start and end of every line, and doesn't change the fact that . will not match newlines. If . doesn't match newlines, a match like (.*) will still stop at the first newline, not achieving the multiline effect you want.

Also note that if you are not interested in capturing Applicants: or Inventors: themselves, you should not wrap them in (), as in (Applicants:) in your regex, because every pair of parentheses creates a capturing group and findall returns one element per group. That's the reason you got 3 elements in your output instead of just 1.
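The effect of the parentheses on findall can be checked with a short sketch. The one-line `line` string and the variable names are assumptions for illustration; the first pattern is the one from the question.

```python
import re

# Hypothetical one-line input, enough to show how groups shape the result.
line = "Applicants: Silixa Ltd."

# Three capturing groups: findall returns one 3-tuple per match.
three_groups = re.findall(r'(Applicants:)( )?(.*)', line)

# One capturing group: findall returns plain strings.
one_group = re.findall(r'Applicants: ?(.*)', line)

# (?:...) groups without capturing, so only the last group is returned.
non_capturing = re.findall(r'(?:Applicants:)(?: )?(.*)', line)

print(three_groups)   # [('Applicants:', ' ', 'Silixa Ltd.')]
print(one_group)      # ['Silixa Ltd.']
print(non_capturing)  # ['Silixa Ltd.']
```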

1 Comment

Thank you so much for the solution and explanations. You made it look simple!

If you want to match all the text between \nApplicants: and \nInventors:, you could also get the match without re.DOTALL, which avoids unnecessary backtracking.

Match Applicants: and capture in group 1 the rest of that same line and all lines that follow that do not start with Inventors:

Then match Inventors.

^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:
  • ^ Start of the line, since re.MULTILINE is used (or use \b if it does not have to be at the start)
  • Applicants: Match literally
  • ( Capture group 1
    • .* Match the rest of the line
    • (?:\r?\n(?!Inventors:).*)* Match all lines that do not start with Inventors:
  • ) Close group
  • \r?\nInventors: Match a newline and Inventors:


Example code

import re
text = ("PCT Filing Date: 2 December 2015\n"
    "Applicants: Silixa Ltd.\n"
    "Chevron U.S.A. Inc. (Incorporated\n"
    "in USA - California)\n"
    "Inventors: Farhadiroushan,\n"
    "Mahmoud\n"
    "Gillies, Arran\n"
    "Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))

Output

['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']
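The explanation above mentions \b as an alternative to ^ when the label does not have to be at the start. A minimal sketch of that difference follows; `inline_text` is a made-up input in which Applicants: is preceded by other text on the same line, and the variable names are illustrative only.

```python
import re

# Shared tail of the pattern from the answer.
body = r'Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:'

anchored = re.compile('^' + body, re.MULTILINE)  # label must start a line
bounded = re.compile(r'\b' + body)               # label may appear mid-line

# Hypothetical input: the label is not at the start of its line.
inline_text = ("Details - Applicants: Silixa Ltd.\n"
               "Chevron U.S.A. Inc.\n"
               "Inventors: Parker, Tom")

anchored_result = anchored.findall(inline_text)
bounded_result = bounded.findall(inline_text)

print(anchored_result)  # []
print(bounded_result)   # ['Silixa Ltd.\nChevron U.S.A. Inc.']
```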

2 Comments

Thank you so much. I just had to remove ^ and then it works perfectly: regex = re.compile(r'Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:')
You can do that if it does not have to be at the start of the string. I would suggest prepending a word boundary \b in that case \bApplicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors: regex101.com/r/r7W0Ig/1

Here is a more general approach that parses a string like that into a dict of all the keys and values in it (i.e., any text at the start of a line followed by a : is a key, and the text following that key is its data):

import re 

txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""

pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}

Result:

>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}

>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'
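The values keep their trailing newline. If that is unwanted, one possible follow-up is to strip each value while building the dict. This is a sketch on a shortened, made-up `txt`; only the pattern is taken from the answer.

```python
import re

pat = re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)

# Shortened, hypothetical input with the same key/value layout.
txt = ("PCT Filing Date: 2 December 2015\n"
       "Applicants: Silixa Ltd.\n"
       "Chevron U.S.A. Inc.\n"
       "Inventors: Parker, Tom\n")

# Same comprehension as above, with surrounding whitespace stripped per value.
data = {m.group(1): m.group(2).strip() for m in pat.finditer(txt)}

# One list entry per line of the value.
applicants = data['Applicants'].splitlines()

print(data['Applicants'])  # 'Silixa Ltd.\nChevron U.S.A. Inc.' without the trailing newline
print(applicants)          # ['Silixa Ltd.', 'Chevron U.S.A. Inc.']
```

Splitting by line only works when each entry sits on its own line. An entry that wraps, such as the "(Incorporated / in USA - California)" applicant in the question, would be split in two and needs a different separator.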

