(Using Python 2.7)

Imagine a contract that has, among other text, text blocks separated by section numbers. I am trying to extract each section's text and put it into a new document. So, if a two-hundred-page contract had thirty sections separated by section numbers, I want those thirty sections in a new document.

I looked at the answer to "Extracting parts of text between specific delimiters from a large text file with custom delimiters and writing it to another file using Python", but it didn't seem to do what I want.

An example of what I'm trying to extract would be the text between the numbered sections (the section header adjacent to the numbered section would be a great bonus), i.e.:

1.2.3.4. A section

Some text. Some other text, too. And stuff. And even more text on the next line.

1.2.3.5. The Next section

So much more text, with commas and stuff. Even newlines and whatnot.

1.2.3.6. Some sections are really great

Welcome to this section. Which is probably better than others. And I can't even begin to explain how great it is.

1.2.3.7. What? A new section?

Dang right it's a new section! Aren't you even ready for it? So many new sections can be used for text you'll never read.

Ideally, I will read in a single file and output a single file. Thus far I have tried variations of the code below to no avail. I realize this lacks the write-to-output part (haven't gotten there yet):

import codecs
import re

regex = r'\D(?!\d)'

# read a contract in
with codecs.open("/Users/someuser/x/y/blah.txt", "r","utf-8") as ins:
    text = ins.read()

# perform magics
output = re.findall(regex, text)

output
  • Couldn't you just read the file line by line, and if the line starts with r'(\d\.){4}', replace that piece of text with an empty string and move on? Commented Jun 9, 2016 at 19:10
  • @MauriceReeves So the contracts have lots of other text not bracketed by numbered sections. Think about something like a lease...you have lots of text describing the arrangement, parties, etc. but very specific, numbered-section language as well (I just want the latter). I think if I took the replacement option you describe I'd end up with every bit of text in the document, which isn't what I'm aiming for. Commented Jun 9, 2016 at 19:15
  • Okay, fair enough, but after you hit the last numbered section, you're still going to get everything that follows it, regardless. You might be better off making two passes over the document. Commented Jun 9, 2016 at 19:21

2 Answers

Ok, so if I understand correctly, you want to capture everything between the section numbers.

Here's the regex string I came up with: regex = r'(?:\d\.){4}.(.+?)(?:\d\.){4}'

Let's break that down a little:

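(?:\d\.){4} This matches a single digit followed by a period, repeated four times. The (?:) makes it a non-capturing group, so we can repeat the pattern four times without adding it to our matches. The lone . right after it consumes the single character (here, a space) between the section number and the heading.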

(.+?) This is the part we want to capture. Parentheses without ?: form a capture group, and the capture group is what findall returns. .+? means one or more of any character, non-greedy. The question mark is the non-greedy part: instead of matching as many characters as possible, the match stops as soon as the next part of the expression can match.

(?:\d\.){4} We end with our section pattern again because we want to capture the text between two section numbers.
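To make the greedy versus non-greedy difference concrete, here is a minimal sketch on a shortened, single-line sample. The sample string and the variable names are illustrative assumptions, not taken from a real contract.

```python
import re

# Shortened, single-line stand-in for the contract text (illustrative only).
sample = "1.2.3.4. A section 1.2.3.5. The Next section 1.2.3.6. Third"

# A digit followed by a period, four times, without capturing.
header = r'(?:\d\.){4}'

# Greedy: .+ runs as far as it can, so the capture swallows the middle header.
greedy = re.findall(header + r'.(.+)' + header, sample)

# Non-greedy: .+? stops at the first place the closing header can match.
lazy = re.findall(header + r'.(.+?)' + header, sample)

print(greedy)  # ['A section 1.2.3.5. The Next section ']
print(lazy)    # ['A section ']
```

Note that the non-greedy version returns only the first section here: the closing header is consumed as part of the match, so it is no longer available to open the following section.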

Here is the code we use to grab what we want:

p = re.compile(regex, flags=re.DOTALL)

The DOTALL flag lets . match newlines as well. By default, . matches any character except a newline, so without the flag a match could never span more than one line.
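As a quick check of what the flag changes, this small sketch runs the same pattern with and without re.DOTALL. The two-header snippet is a made-up stand-in for the contract text.

```python
import re

# Made-up two-header snippet with the section body on separate lines.
snippet = "1.2.3.4. A section\n\nSome text.\n\n1.2.3.5. Next"

regex = r'(?:\d\.){4}.(.+?)(?:\d\.){4}'

# Without DOTALL, . cannot cross the newlines between the two headers.
without_flag = re.findall(regex, snippet)

# With DOTALL, . also matches newlines, so the capture can span lines.
with_flag = re.findall(regex, snippet, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['A section\n\nSome text.\n\n']
```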

sections = p.findall(text), where text is the string to search through.

The findall method returns a list of the capture groups we matched. For the sample text in the question, that gives:

['A section\n\nSome text. Some other text, too. And stuff. And even more text on the next line.\n\n', "Some sections are really great\n\nWelcome to this section. Which is probably better than others. And I can't even begin to explain how great it is.\n\n"]
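The result above contains only two of the four sample sections, because each match consumes the header that would start the next section, and the last section has no header after it. One possible variation, sketched here under assumptions, is to use a lookahead for the closing header and to accept the end of the string as a terminator. The line-start anchoring, the \Z alternative, and the sample text with a preamble are my additions, not part of the original answer.

```python
import re

# Abbreviated version of the question's sample, plus an unnumbered preamble.
sample = (
    "Preamble that is not part of any numbered section.\n\n"
    "1.2.3.4. A section\n\n"
    "Some text. Some other text, too.\n\n"
    "1.2.3.5. The Next section\n\n"
    "So much more text, with commas and stuff.\n\n"
    "1.2.3.6. Some sections are really great\n\n"
    "Welcome to this section.\n\n"
    "1.2.3.7. What? A new section?\n\n"
    "Dang right it's a new section!\n"
)

# Section number at the start of a line, followed by a space or tab.
header = r'^(?:\d\.){4}[ \t]'

# The lookahead (?=...) checks for the next header (or end of text)
# without consuming it, so every section can start its own match.
pattern = re.compile(
    header + r'(.+?)(?=' + header + r'|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)

sections = [s.strip() for s in pattern.findall(sample)]

for section in sections:
    print(section)
    print('-' * 20)
```

With this sketch the last section runs to the end of the text, so any unnumbered material after the final section would be included with it. Where that boundary should fall depends on how the real contracts are laid out.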


7 Comments

It looks like your solution drops the last section: "1.2.3.7. What? A new section? Dang right it's a new section! Aren't you even ready for it? So many new sections can be used for text you'll never read." It appears he only wants things that have section headers, and he wants them stripped. Unfortunately, what will have a section header and what won't doesn't seem to be well defined in the documents.
Oh yeah I didn't think of that.
I asked him for more details just to see if we could get to a better solution, but haven't heard back yet. Your solution is very close, so long as we can figure out what the next piece is after this particular section. Maybe there's a clean break, and then with a small modification your solution would be complete.
Perhaps if sections end with 2 newlines or some such we can match that at the end instead of the next section number.
That's what I'm wondering too. If he needs more help or can't figure it out, he'll be back. :-D

Wouldn't this just work?

import codecs
import re

# find anything that matches the header number pattern
regex = r'\d\.\d\.\d\.\d\.\s'

# read a contract in
with codecs.open("/Users/someuser/x/y/blah.txt", "r","utf-8") as ins:
    text = ins.read()

# perform magics, replace with empty string
output = re.sub(regex, '', text)

# output now holds the full text with the section numbers stripped
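The question notes that the write-to-output step was still missing. A minimal sketch of that step follows, assuming the extracted sections are already in a list of unicode strings. The helper name write_sections, the output file name, and the use of io.open (swapped in for codecs.open because it behaves the same on Python 2.7 and 3) are all assumptions for illustration.

```python
import io
import os
import tempfile


def write_sections(sections, path):
    """Write each section to `path`, separated by a blank line.

    Hypothetical helper, not part of the original answer.
    """
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(u"\n\n".join(sections))
        out.write(u"\n")


# Stand-in data; in practice this would be the list of extracted sections.
sections = [
    u"A section\n\nSome text.",
    u"The Next section\n\nSo much more text.",
]

# A temporary directory keeps the example self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "sections.txt")
write_sections(sections, out_path)

# Read the file back to confirm what was written.
with io.open(out_path, "r", encoding="utf-8") as check:
    written = check.read()

print(written)
```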
