How to build a list in Python from text using a regex?

Question

I have the following bunch of text:

text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""

I want to split it as follows:

import re

RE = r'(SECTION.*?SECTION)'
m = re.findall(RE, text, re.DOTALL)
sections = []
if m:
    for match in m:
        sections.append(match)

I was hoping this would give me a list with 4 elements, but I ended up with only 2:

['SECTION 1. .....', 'SECTION 3. .....']  # only showing the first letters of each element

Afterwards, I would like to do the same for chapters and articles.

Any ideas?

Hmm... you only have two matches for that regex in your input. Note that your matches must start AND end with 'SECTION'.
– user2390182, Commented Nov 28, 2015 at 21:24
@schwobaseggl your suggestion only returns [SECTION, SECTION, SECTION, SECTION]
– nanounanue, Commented Nov 28, 2015 at 21:30

Marco Bonelli · Accepted Answer · 2015-11-28 21:22:44Z

Assuming that the word SECTION only appears where a new "section" starts in your string, you can use the built-in str.split method, which is much simpler than using a regex.

Here's an example:

text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""

delimiter = 'SECTION'
sections = [delimiter + s for s in text.split(delimiter)[1:]]

The result will be:

>>> sections
['SECTION 1. ...', 'SECTION 2. ...', 'SECTION 3. ...', 'SECTION 4. ...']
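Since the question also asks about chapters and articles, here is a minimal sketch of how the same split idea might be nested. The helper name split_keep and the shortened sample text are assumptions made for illustration and are not part of the original answer. Splitting on the bare string 'Art' only works as long as that string never occurs inside the article text itself, and the sketch drops the section and chapter titles.

```python
# Shortened sample text (an assumption, not the full text from the question)
text = ("SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum. "
        "Art 2.- More text. CHAPTER 2. Another chapter. Art 3.- Text. "
        "SECTION 2. CHAPTER 1. Last chapter. Art. 4.- The last text.")


def split_keep(s, delimiter):
    """Split s on delimiter and put the delimiter back in front of each piece.

    The piece before the first delimiter is dropped, as in the answer above.
    """
    return [(delimiter + part).strip() for part in s.split(delimiter)[1:]]


# sections -> chapters -> articles, as nested lists
nested = [
    [split_keep(chapter, "Art") for chapter in split_keep(section, "CHAPTER")]
    for section in split_keep(text, "SECTION")
]

for section in nested:
    print(section)
```

If the delimiters can also appear inside ordinary text, a regex that checks for the number after the keyword is likely the safer choice.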

Wiktor Stribiżew · Accepted Answer · 2015-11-29 00:08:53Z

The problem you have with your regex is that you consume the second SECTION. Once the first SECTION is found, the lazy dot matching construct consumes as few characters as possible up to the next SECTION, and the match returned contains the two words and all in between. Thus, having 4 SECTIONs, you can only get two matches.
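The effect is easy to reproduce on a small string. This is only an illustrative sketch; the shortened sample text is an assumption and stands in for the longer text from the question.

```python
import re

# Four sections, shortened for readability (assumed sample text)
text = "SECTION 1. aaa SECTION 2. bbb SECTION 3. ccc SECTION 4. ddd"

# Each match swallows the SECTION that should have started the next match
matches = re.findall(r"(SECTION.*?SECTION)", text, re.DOTALL)
print(matches)
# ['SECTION 1. aaa SECTION', 'SECTION 3. ccc SECTION']
```

Sections 2 and 4 never start a match of their own: the word SECTION in front of section 2 is already part of the first match, and section 4 has no closing SECTION after it.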

This can be solved with a regex in two ways (see demo of all 3 regexps below at IDEONE).

Lazy dot matching with a lookahead (less efficient, not recommended)

print(re.findall(r"SECTION.*?(?=$|SECTION)", text, re.DOTALL))

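When the regex engine finds the first SECTION, it consumes characters one at a time, checking at each position whether the end of the string ($) or the next SECTION follows. Because the lookahead does not consume anything, the next match can start at that SECTION.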
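A small runnable sketch of the lookahead variant, using a shortened sample text (an assumption) instead of the full text from the question:

```python
import re

# Shortened sample text (assumed for illustration)
text = "SECTION 1. aaa SECTION 2. bbb SECTION 3. ccc"

# The lookahead checks for the next SECTION (or the end) without consuming it
sections = re.findall(r"SECTION.*?(?=$|SECTION)", text, re.DOTALL)
print(sections)
# ['SECTION 1. aaa ', 'SECTION 2. bbb ', 'SECTION 3. ccc']

# Optional: drop the trailing whitespace left in front of each next SECTION
cleaned = [s.strip() for s in sections]
```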

Unroll-the-loop method to replace the lazy quantifier (much more efficient, requires no DOTALL modifier to match newline symbols)

print(re.findall(r"SECTION[^S]*(?:S(?!ECTION)[^S]*)*", text))

Here, neither a lazy quantifier nor a lookahead with alternatives is necessary: the literal SECTION consumes the first SECTION substring, and then [^S]*(?:S(?!ECTION)[^S]*)* matches any text that does not contain SECTION (up to the next SECTION if present, or up to the end of the string otherwise).
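To check that the unrolled pattern behaves like the lazy one, both can be run on a text that contains a newline and a few ordinary capital S letters. The sample text is an assumption chosen to exercise those cases.

```python
import re

# Assumed sample: a newline between sections and capital S inside the text
text = "SECTION 1. Some text.\nSECTION 2. See also Stuff."

lazy = re.findall(r"SECTION.*?(?=$|SECTION)", text, re.DOTALL)
unrolled = re.findall(r"SECTION[^S]*(?:S(?!ECTION)[^S]*)*", text)

print(unrolled)
# ['SECTION 1. Some text.\n', 'SECTION 2. See also Stuff.']
print(lazy == unrolled)
# True
```

The negated class [^S] matches newlines as well, which is why the unrolled version needs no re.DOTALL flag.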

A safer similar expression that makes sure there is whitespace and digits followed by a dot after SECTION:

print(re.findall(r"SECTION\s+\d+\.[^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*", text))

A regex explanation:

SECTION - matches SECTION literally
\s+ - 1 or more whitespace
\d+ - 1 or more digits
\. - literal dot
[^S]* - 0 or more characters other than S
(?:S(?!ECTION\s+\d+\.)[^S]*)* - 0 or more sequences of:
- S(?!ECTION\s+\d+\.) - an S that is not followed by ECTION + 1 or more whitespace + 1 or more digits + a dot
- [^S]* - 0 or more characters other than S
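The breakdown above can also be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is a sketch only; the name SECTION_RE and the sample text are assumptions for illustration.

```python
import re

SECTION_RE = re.compile(r"""
    SECTION \s+ \d+ \.        # header: SECTION, whitespace, digits, a dot
    [^S]*                     # a run of characters other than S
    (?:                       # then, zero or more times:
        S(?!ECTION\s+\d+\.)   #   an S that does not start a new header
        [^S]*                 #   followed by more non-S characters
    )*
""", re.VERBOSE)

# The word SECTION inside the text is not followed by a number,
# so it is not treated as a boundary (assumed sample text)
text = "SECTION 1. See SECTION notes. SECTION 2. End."

print(SECTION_RE.findall(text))
# ['SECTION 1. See SECTION notes. ', 'SECTION 2. End.']
```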

UPDATE

To obtain a dictionary in the form of {'SECTION 1' : '...', 'SECTION 2' : '...'}, add 2 capturing groups around the key and value patterns, and then pass the result to the dict constructor. This works because re.findall returns tuples of captured texts if capturing groups (i.e. parentheses) are specified in the regex pattern. As the documentation puts it: "If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."

print(dict(re.findall(r"(SECTION\s+\d+)\.\s*([^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*)", text)))

See IDEONE demo
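A short runnable sketch of the dictionary variant, showing the intermediate list of tuples. The sample text and the names pattern, pairs and by_section are assumptions for illustration.

```python
import re

# Shortened sample text (assumed for illustration)
text = "SECTION 1. CHAPTER 1. Intro. SECTION 2. CHAPTER 1. Summary."

pattern = r"(SECTION\s+\d+)\.\s*([^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*)"

# With two capturing groups, findall returns a list of 2-tuples
pairs = re.findall(pattern, text)
print(pairs)
# [('SECTION 1', 'CHAPTER 1. Intro. '), ('SECTION 2', 'CHAPTER 1. Summary.')]

# A list of (key, value) tuples converts directly into a dict;
# strip() removes the whitespace left in front of the next header
by_section = {key: value.strip() for key, value in pairs}
print(by_section)
# {'SECTION 1': 'CHAPTER 1. Intro.', 'SECTION 2': 'CHAPTER 1. Summary.'}
```

If two sections share the same number, the later one overwrites the earlier one in the dictionary, so this form only suits texts with unique section numbers.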

edited Nov 29, 2015 at 0:08

answered Nov 28, 2015 at 21:57


5 Comments

nanounanue Over a year ago

This is very elegant, indeed!

nanounanue Over a year ago

Just one question: what if I want it in a dictionary instead? I mean: `{'SECTION 1' : '.....', 'SECTION 2' : '.....'}`. Is it possible?

Wiktor Stribiżew Over a year ago

Add capturing groups around the key-values, and use dict: print(dict(re.findall(r"(SECTION\s+\d+)\.\s*([^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*)", text))).

nanounanue Over a year ago

Could you explain why this last instruction works? Why did re.findall return a list before, and now returns key-value pairs?

nanounanue Over a year ago

Oh! I see! You put a parenthesis...Neat!
