(Using Python 2.7)
Imagine a contract that contains, among other text, blocks of text separated by section numbers. I am trying to extract each section's text and put it into a new document. So, if a two-hundred-page contract had thirty numbered sections, I want those thirty sections in a new document.
I looked at the answer to "Extracting parts of text between specific delimiters from a large text file with custom delimiters and writing it to another file using Python", but it didn't seem to do what I want.
What I'm trying to extract is the text between the section numbers (capturing the section header next to each number would be a great bonus), e.g.:
1.2.3.4. A section
Some text. Some other text, too. And stuff. And even more text on the next line.
1.2.3.5. The Next section
So much more text, with commas and stuff. Even newlines and whatnot.
1.2.3.6. Some sections are really great
Welcome to this section. Which is probably better than others. And I can't even begin to explain how great it is.
1.2.3.7. What? A new section?
Dang right it's a new section! Aren't you even ready for it? So many new sections can be used for text you'll never read.
Ideally, I will read in a single file and output a single file. Thus far I have tried variations of the code below to no avail. I realize this lacks the write-to-output part (haven't gotten there yet):
import codecs
import re
regex = r'\D(?!\d)'
# read a contract in
with codecs.open("/Users/someuser/x/y/blah.txt", "r", "utf-8") as ins:
    text = ins.read()
# perform magics
output = re.findall(regex, text)
output
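For comparison, here is a minimal sketch of one way the goal described above might be approached. It is not a tested solution for real contracts. It assumes every section header starts a line with a dotted number such as 1.2.3.4. followed by whitespace and the header text. The names SECTION_RE, split_sections, and write_sections, and the output layout, are made up for illustration.

```python
# -*- coding: utf-8 -*-
import codecs
import re

# Assumption: a header line looks like "1.2.3.4. Some header text".
# Group 1 is the dotted number, group 2 is the header text.
SECTION_RE = re.compile(r'^((?:\d+\.)+)[ \t]+(.*)$', re.MULTILINE)


def split_sections(text):
    """Return a list of (number, header, body) tuples in document order."""
    matches = list(SECTION_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # The body runs from the end of this header line to the start
        # of the next header line (or to the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        sections.append((m.group(1), m.group(2).strip(), body))
    return sections


def write_sections(in_path, out_path):
    """Hypothetical helper: read one file, write the sections to another."""
    with codecs.open(in_path, "r", "utf-8") as ins:
        text = ins.read()
    with codecs.open(out_path, "w", "utf-8") as out:
        for number, header, body in split_sections(text):
            out.write(u"%s %s\n%s\n\n" % (number, header, body))


sample = u"""Preamble text that belongs to no section.
1.2.3.4. A section
Some text. Some other text, too.
And even more text on the next line.
1.2.3.5. The Next section
So much more text, with commas and stuff.
"""

sections = split_sections(sample)
for number, header, body in sections:
    print(u"%s | %s | %d characters of body text" % (number, header, len(body)))
```

Text before the first header is ignored. A body line that itself begins with something like "3. " would be mistaken for a header, so the pattern would likely need tightening for a real contract (for example, requiring a fixed number of dotted levels).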
r(\d\.)[4], you replace that piece of text with an empty string and move on?