Extract text between digits - Python

Question

(Using Python 2.7)

Imagine a contract that has, among other text, text blocks separated by section numbers. I am trying to extract each section's text and put it into a new document. So, if a two-hundred-page contract had thirty sections separated by section numbers, I want these thirty sections in a new document.

I looked at the answer to "Extracting parts of text between specific delimiters from a large text file with custom delimiters and writing it to another file using Python", but it didn't seem to do what I want.

An example of what I'm trying to extract would be the text between the numbered sections (the section header adjacent to the numbered section would be a great bonus), i.e.:

1.2.3.4. A section

Some text. Some other text, too. And stuff. And even more text on the next line.

1.2.3.5. The Next section

So much more text, with commas and stuff. Even newlines and whatnot.

1.2.3.6. Some sections are really great

Welcome to this section. Which is probably better than others. And I can't even begin to explain how great it is.

1.2.3.7. What? A new section?

Dang right it's a new section! Aren't you even ready for it? So many new sections can be used for text you'll never read.

Ideally, I will read in a single file and output a single file. Thus far I have tried variations of the code below to no avail. I realize this lacks the write-to-output part (haven't gotten there yet):

import codecs
import re

regex = r'\D(?!\d)'

# read a contract in
with codecs.open("/Users/someuser/x/y/blah.txt", "r","utf-8") as ins:
    text = ins.read()

# perform magics
output = re.findall(regex, text)

output

Couldn't you just read the file line by line and, if a line starts with a match for r'(\d\.){4}', replace that piece of text with an empty string and move on?
– Maurice Reeves, Commented Jun 9, 2016 at 19:10
@MauriceReeves So the contracts have lots of other text not bracketed by numbered sections. Think about something like a lease: you have lots of text describing the arrangement, parties, etc., but very specific, numbered-section language as well (I just want the latter). I think if I took the replacement option you describe I'd end up with every bit of text in the document, which isn't what I'm aiming for.
– nacc, Commented Jun 9, 2016 at 19:15
Okay, fair enough, but after you hit the last numbered section, you're still going to get everything that follows it, regardless. You might be better off making two passes over the document.
– Maurice Reeves, Commented Jun 9, 2016 at 19:21

sajattack · Accepted Answer · 2016-06-09 20:47:18Z

1

Ok, so if I understand correctly, you want to capture everything between the section numbers.

Here's the regex string I came up with: regex = r'(?:\d\.){4}.(.+?)(?:\d\.){4}'

Let's break that down a little:

(?:\d\.){4} This is a single digit followed by a period, repeated 4 times, which matches a section number such as 1.2.3.4. The (?:) makes it a non-capturing group, so we can repeat the pattern 4 times without adding it to our matches. The lone . right after it consumes the space that follows the number.

(.+?) This is the part we want to capture. When parentheses are used without ?:, they make a capture group, and its contents are what we get back. .+? means one or more of any character, non-greedy. The question mark is the non-greedy part: instead of matching as many characters as possible, we stop as soon as the next part of the expression can match.

(?:\d\.){4} We end with our section pattern again because we want to capture the text between two section numbers.
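To make the breakdown concrete, here is a minimal sketch comparing a capturing group, the non-greedy pattern, and a greedy variant. The sample string and the variable names are made up for illustration; only the regex pieces come from the answer.

```python
import re

# Hypothetical one-line sample with three numbered sections.
sample = "1.2.3.4. Alpha 1.2.3.5. Beta 1.2.3.6. Gamma"

# With a capturing group around the repeated part, findall returns only the
# last repetition of that group for each match, not the section text.
capturing = re.findall(r'(\d\.){4}', sample)

# Non-capturing number groups plus a lazy capture: the capture stops at the
# first section number that follows.
lazy = re.findall(r'(?:\d\.){4}.(.+?)(?:\d\.){4}', sample)

# Same pattern with a greedy capture: it runs on to the last section number.
greedy = re.findall(r'(?:\d\.){4}.(.+)(?:\d\.){4}', sample)

print(capturing)  # ['4.', '5.', '6.']
print(lazy)       # ['Alpha ']
print(greedy)     # ['Alpha 1.2.3.5. Beta ']
```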

Here is the code we use to grab what we want:

p = re.compile(regex, flags=re.DOTALL)

The DOTALL flag makes . match newlines as well, so the captured text can span several lines. By default, . matches any character except a newline.
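A quick way to see the effect of the flag, using a made-up two-line string (the variable names are only for illustration):

```python
import re

text = "first line\nsecond line"

# Without DOTALL, . stops at the newline.
without_flag = re.findall(r'first.+', text)

# With DOTALL, . also matches the newline, so the match runs to the end.
with_flag = re.findall(r'first.+', text, flags=re.DOTALL)

print(without_flag)  # ['first line']
print(with_flag)     # ['first line\nsecond line']
```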

sections = p.findall(text)

Here, text is the string you want to search through.

The findall method returns a list of the capturing groups we matched.

['A section\n\nSome text. Some other text, too. And stuff. And even more text on the next line.\n\n', "Some sections are really great\n\nWelcome to this section. Which is probably better than others. And I can't even begin to explain how great it is.\n\n"]
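Note that the list above holds only the first and third sections. Each match consumes the section number that follows it, so the next search starts after that number and skips a section, and the final section has no following number at all. One possible variation, not part of the original answer, is to put the closing section number in a lookahead so it is not consumed. The sketch below makes some assumptions: section numbers start a line, each component may have more than one digit (\d+ instead of \d), and the last section simply runs to the end of the text. The names SECTION, pattern, and the shortened sample text are made up for illustration.

```python
import re

# Assumption: a section number is four dot-terminated components at the
# start of a line. \d+ is used so that something like 1.2.3.10. also matches.
SECTION = r'^(?:\d+\.){4}'

# Capture the number, then lazily capture everything up to (but not
# including) the next section number or the end of the text. The lookahead
# does not consume the next number, so the following match can start there.
pattern = re.compile(
    r'(' + SECTION + r')[ \t]+(.*?)(?=' + SECTION + r'|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)

# Shortened, hypothetical version of the sample from the question.
text = (
    "Preamble that is not part of any numbered section.\n\n"
    "1.2.3.4. A section\n\n"
    "Some text. Some other text, too.\n\n"
    "1.2.3.5. The Next section\n\n"
    "So much more text.\n\n"
    "1.2.3.6. Some sections are really great\n\n"
    "Welcome to this section.\n"
)

# findall returns (number, heading-plus-body) tuples because there are two
# capture groups; strip() drops the trailing blank lines.
sections = [(num, body.strip()) for num, body in pattern.findall(text)]

# One string that could be written to the new document.
output = "\n\n".join(num + " " + body for num, body in sections)
print(output)
```

To produce a single output file, output could then be written with codecs.open in "w" mode, the same way the question reads the input. As the comments below point out, anything that follows the last numbered section is still swept into that section, so a real contract would need some extra rule for where the last section ends.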

edited Jun 9, 2016 at 20:47

answered Jun 9, 2016 at 20:10

sajattack


7 Comments

Maurice Reeves Over a year ago

It looks like your solution drops the last section: "1.2.3.7. What? A new section? Dang right it's a new section! Aren't you even ready for it? So many new sections can be used for text you'll never read." It appears that he only wants the parts that have section headers, and he wants the numbers stripped. Unfortunately, it doesn't look like the documents clearly define what will have a section header and what won't.

sajattack Over a year ago

Oh yeah I didn't think of that.

Maurice Reeves Over a year ago

I asked him for more details just to see if we could get to a better solution, but haven't heard back yet. Your solution is very close, so long as we can figure out what the next piece is after this particular section. Maybe there's a clean break, and then with a small modification your solution would be complete.

sajattack Over a year ago

Perhaps if sections end with 2 newlines or some such we can match that at the end instead of the next section number.

Maurice Reeves Over a year ago

That's what I'm wondering too. If he needs more help or can't figure it out, he'll be back. :-D

Maurice Reeves · Accepted Answer · 2016-06-09 19:13:37Z

1

Wouldn't this just work?

import codecs
import re

# find anything that matches the header number pattern
regex = r'\d\.\d\.\d\.\d\.\s'

# read a contract in
with codecs.open("/Users/someuser/x/y/blah.txt", "r","utf-8") as ins:
    text = ins.read()

# perform magics, replace with empty string
output = re.sub(regex, '', text)

# output
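For comparison, here is a small sketch of what this substitution does on a made-up sample string (the sample is an assumption, the regex is the one above). It removes the section numbers but keeps every other piece of text, including text outside the numbered sections, which is the concern raised in the comments on the question.

```python
import re

# Same header-number pattern as in the answer.
regex = r'\d\.\d\.\d\.\d\.\s'

# Hypothetical sample: one unnumbered line followed by a numbered section.
sample = "Intro text.\n1.2.3.4. A section\nBody text.\n"

# The number and the whitespace after it are replaced with an empty string.
output = re.sub(regex, '', sample)

print(repr(output))  # 'Intro text.\nA section\nBody text.\n'
```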

answered Jun 9, 2016 at 19:13

Maurice Reeves

