I have the following bunch of text:

text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""

I want to split it as follows:

import re

RE = r'(SECTION.*?SECTION)'
m = re.findall(RE, text, re.DOTALL)
sections = []
if m:
    for match in m:
        sections.append(match)

hoping that it will result in a list with 4 elements, but I ended up with only 2 elements.

['SECTION 1. .....', 'SECTION 3. .....']  # only showing the first letters of each element

Afterwards, I would like to do the same for chapters and articles.

Any ideas?

  • Hmm.. you only have two matches for that regex in your input. Note that your matches must start AND end with 'SECTION'. Commented Nov 28, 2015 at 21:24
  • @schwobaseggl your suggestion only returns [SECTION, SECTION, SECTION, SECTION] Commented Nov 28, 2015 at 21:30
  • I just saw it myself. It is not trivial! Commented Nov 28, 2015 at 21:33

2 Answers

Assuming that the word SECTION only appears where a new "section" starts in your string, you can use the built-in str.split method, which is much simpler than using regexps.

Here's an example:

text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""

delimiter = 'SECTION'
sections = [delimiter + s for s in text.split(delimiter)[1:]]

The result will be:

>>> sections
['SECTION 1. ...', 'SECTION 2. ...', 'SECTION 3. ...', 'SECTION 4. ...']
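
The `[1:]` slice matters because `split` also returns whatever precedes the first delimiter, which is an empty string when the text starts with SECTION. A minimal sketch on a shortened stand-in text (the `short_text` and `parts` names are only for illustration):

```python
# Shortened stand-in for the text above, used only for illustration.
short_text = "SECTION 1. aaa SECTION 2. bbb"
delimiter = "SECTION"

# split() removes the delimiter and keeps the (empty) piece before the first one.
parts = short_text.split(delimiter)
print(parts)

# Drop that leading piece and put the delimiter back in front of each section.
sections = [delimiter + s for s in parts[1:]]
print(sections)
```

If the text had a preamble before the first SECTION, that preamble would be the element dropped by `[1:]`.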

The problem you have with your regex is that you consume the second SECTION. Once the first SECTION is found, the lazy dot matching construct consumes as few characters as possible up to the next SECTION, and the match returned contains the two words and all in between. Thus, having 4 SECTIONs, you can only get two matches.
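
To see the consumption happen, here is a small sketch on a shortened stand-in text (the text and the `matches` name are assumptions for illustration, not the original input):

```python
import re

# Shortened stand-in for the question's text.
text = "SECTION 1. aaa SECTION 2. bbb SECTION 3. ccc SECTION 4. ddd"

# Each match starts at one SECTION and also swallows the next one,
# so the scan resumes after that second SECTION and cannot start a match on it.
matches = [m.group() for m in re.finditer(r"SECTION.*?SECTION", text, re.DOTALL)]
for piece in matches:
    print(repr(piece))
```

The second and fourth SECTION words only ever appear as the tail of a match, never as its start.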

This can be solved with a regex in two ways (see the demo of all 3 regexps below at IDEONE).

  1. Lazy dot matching with a lookahead (less efficient, not recommended)

    print(re.findall(r"SECTION.*?(?=$|SECTION)", text, re.DOTALL))

Once the regex engine finds a SECTION, it consumes characters one at a time, checking at each position whether the end of the string ($) or the next SECTION follows.

  2. Unroll-the-loop method to replace the lazy quantifier (much more efficient, requires no DOTALL modifier to match newline symbols)

    print(re.findall(r"SECTION[^S]*(?:S(?!ECTION)[^S]*)*", text))

Here, no lazy quantifier or lookahead with alternatives is necessary: the literal SECTION consumes the first SECTION substring, and then [^S]*(?:S(?!ECTION)[^S]*)* matches any text that does not contain SECTION (up to the next SECTION if present, or otherwise up to the end of the string).

A safer similar expression that makes sure there is whitespace and digits followed by a dot after SECTION:

print(re.findall(r"SECTION\s+\d+\.[^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*", text))

A regex explanation:

  • SECTION - matches SECTION literally
  • \s+ - 1 or more whitespace
  • \d+ - 1 or more digits
  • \. - literal dot
  • [^S]* - 0 or more characters other than S
  • (?:S(?!ECTION\s+\d+\.)[^S]*)* - 0 or more sequences of....
    • S(?!ECTION\s+\d+\.) - S that is not followed by ECTION + 1 or more whitespaces + 1 or more digits + a dot
    • [^S]* - 0 or more characters other than S
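
To illustrate why the stricter lookahead can matter, here is a sketch on a made-up text in which the word SECTION also occurs inside the body (the text and the `simple`/`safer` names are assumptions for the example):

```python
import re

# Hypothetical text where SECTION also appears mid-sentence, without a number.
text = "SECTION 1. See the SECTION list. SECTION 2. Same Story."

# The simple unrolled pattern stops at every SECTION, numbered or not.
simple = re.findall(r"SECTION[^S]*(?:S(?!ECTION)[^S]*)*", text)

# The safer pattern only treats "SECTION <digits>." as a boundary.
safer = re.findall(r"SECTION\s+\d+\.[^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*", text)

print(len(simple), simple)
print(len(safer), safer)
```

On this input the simple pattern yields three pieces, while the safer one keeps "SECTION list." inside the first section and yields two.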

UPDATE

To obtain a dictionary in the form of {'SECTION 1' : '...', 'SECTION 2' : '...'}, you need to add 2 capturing groups around the key and value patterns, and then pass the result to the dict constructor. This works because re.findall returns tuples of captured texts if capturing groups (i.e. parentheses) are specified in the regex pattern. As the documentation puts it: "If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."

print(dict(re.findall(r"(SECTION\s+\d+)\.\s*([^S]*(?:S(?!ECTION\s+\d+\.)[^S]*)*)", text)))

See IDEONE demo
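
As for doing the same for chapters and articles, one possible sketch reuses the lookahead idea from the first regex with a different heading at each level. The `split_on` helper and the three heading patterns are hypothetical, guessed from the sample text (including the optional dot after "Art"), and would need adjusting for real data:

```python
import re

# Hypothetical heading patterns, guessed from the sample text in the question.
SECTION = r"SECTION\s+\d+\."
CHAPTER = r"CHAPTER\s+\d+\."
ARTICLE = r"Art\.?\s+\d+\.-"


def split_on(heading, text):
    """Return the chunks of text that each start with a match of heading."""
    pattern = rf"(?:{heading}).*?(?=(?:{heading})|\Z)"
    return [chunk.strip() for chunk in re.findall(pattern, text, re.DOTALL)]


# Shortened stand-in for the question's text.
sample = (
    "SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum. "
    "Art 2.- More text. CHAPTER 2. Another chapter. Art 3.- Text. "
    "SECTION 2. CHAPTER 1. Another chapter. Art. 4.- The last text."
)

# section heading -> list of chapters, each chapter being a list of its articles
outline = {}
for section in split_on(SECTION, sample):
    key = re.match(SECTION, section).group()
    outline[key] = [split_on(ARTICLE, chapter) for chapter in split_on(CHAPTER, section)]

print(outline)
```

This uses the lazy version for readability; note that the sketch discards chapter titles, so keep the chapter chunk itself if you need them.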

5 Comments

This is very elegant, indeed!
Just one question: what if I want it in a dictionary instead? I mean: {'SECTION 1' : '.....', 'SECTION 2' : '.....'}. Is it possible?
Could you explain why this last instruction works? Why did re.findall return a list before, and now it returns key-value pairs?
Oh! I see! You put a parenthesis...Neat!
