1

I am working on text files like this:

Chapter 01

Lorem ipsum

dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt

Chapter 02

consectetur adipiscing

sed do eiusmod tempor

Chapter 03

et dolore magna aliqua.

with delimiters like "chapter", "Chapter", "CHAPTER", etc... and 1 or 2 digits ("Chapter 1" or "Chapter 01").

I managed to open and read the file in Python, with open() and .read():

mytext = myfile.read()

Now I need to split my string in order to get the text of a given "Chapter XX".

For Chapter 02, that would be:

consectetur adipiscing

sed do eiusmod tempor

I'm new to Python, I read about regex, match, map, or split, but... well...

(I'm writing a Gimp Python-fu plugin, so I use the Python version bundled with Gimp, which is 2.7.15.)

3 Answers

2

You can use regular expressions like so:

import re

split_text = re.split("Chapter [0-9]+\n",  # splits on "Chapter " + numbers + newline
                      mytext, 
                      flags=re.IGNORECASE) # splits on "CHAPTER"/"chapter"/"Chapter" etc
>>> split_text
['', '\nLorem ipsum\n\ndolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt\n\n', '\nconsectetur adipiscing\n\nsed do eiusmod tempor\n\n', '\net dolore magna aliqua.']

You can now select each chapter's text by its index in split_text, e.g.:

print(split_text[2])

>>> 
consectetur adipiscing

sed do eiusmod tempor
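
Indexing by list position assumes the chapters are numbered consecutively starting at 1. As a minimal sketch, assuming you would rather look a chapter up by its number: wrapping the digits in a capturing group makes re.split() return the numbers along with the bodies, and the two can be paired into a dict. The names `sample`, `parts` and `chapters` are illustrative and not from the question, and the heading pattern assumes each heading sits on its own line.

```python
import re

# Stand-in for the result of myfile.read(); the content is illustrative.
sample = (
    "Chapter 01\n\nLorem ipsum\n\n"
    "CHAPTER 2\n\nconsectetur adipiscing\n\nsed do eiusmod tempor\n\n"
    "chapter 03\n\net dolore magna aliqua.\n"
)

# Assumption: a heading is a whole line made of "chapter" (any case),
# some spaces or tabs, and 1 or 2 digits.
heading = r"^chapter[ \t]+([0-9]{1,2})[ \t]*$"

# Because the digits are in a capturing group, re.split() keeps them:
# [text before first heading, number, body, number, body, ...]
parts = re.split(heading, sample, flags=re.IGNORECASE | re.MULTILINE)

# Pair each captured number with the body that follows it.
chapters = {}
for number, body in zip(parts[1::2], parts[2::2]):
    chapters[int(number)] = body.strip()

print(chapters[2])
```

Converting the captured digits with int() means "2" and "02" end up under the same key, so the lookup does not depend on how the heading was written.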

0

You can try the following:

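chapter = [""]  # placeholder so that chapter[i] is the text of chapter i
for i in range(1, 4):
    # positions of this chapter's heading and of the next one
    nb1 = text.find("Chapter %02d" % i)
    nb2 = text.find("Chapter %02d" % (i + 1))
    if nb2 == -1:
        # no next heading: take everything up to the end of the text
        nb2 = len(text)
    chapter.append(text[nb1:nb2])

for i in range(1, 4):
    print(chapter[i])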

Or with regular expressions:

import re

# Split on "Chapter " + one or more digits + newline.
chapter = re.split("Chapter [0-9]+\n", text)

for i in range(1,4):
    print(chapter[i])

1 Comment

The question says: "with delimiters like chapter, Chapter, CHAPTER, etc... and 1 or 2 digits (Chapter 1 or Chapter 01)". This answer doesn't account for the variability in case in "Chapter", for chapter numbers outside the example's range, or for numbers below 10 without a leading zero (that last case applies to the first code block; the regular expression does capture it).
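
One possible way to cover the points raised in this comment, sketched under some assumptions: a single regular expression that ignores case, accepts an optional leading zero, and stops at the next heading or at the end of the text. The helper name `get_chapter` and the `sample` string are hypothetical, and the pattern assumes each heading occupies its own line.

```python
import re

def get_chapter(text, number):
    """Return the body of chapter `number`, or None if it is not found."""
    pattern = (
        # The requested heading; 0? accepts "Chapter 2" and "Chapter 02".
        r"^chapter[ \t]+0?%d[ \t]*$"
        # The body, matched lazily so it stops as early as possible.
        r"(.*?)"
        # Stop before the next heading, or at the end of the text.
        r"(?=^chapter[ \t]+[0-9]{1,2}[ \t]*$|\Z)"
    ) % number
    flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
    match = re.search(pattern, text, flags=flags)
    return match.group(1).strip() if match else None

# Illustrative input mixing the heading styles mentioned in the question.
sample = (
    "Chapter 01\n\nLorem ipsum\n\n"
    "CHAPTER 2\n\nconsectetur adipiscing\n\nsed do eiusmod tempor\n\n"
    "chapter 03\n\net dolore magna aliqua.\n"
)

print(get_chapter(sample, 2))
```

Returning None for a missing chapter lets the caller decide how to react, rather than slicing with a -1 index as a bare find() would.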
0
import re

# Split on the chapter headings, then drop the empty strings.
splitted_str = list(filter(lambda x: x != '', re.split("Chapter [0-9]+", my_text)))
print(splitted_str)
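
The pieces returned this way still carry the newlines that surrounded each heading, and the pattern only matches the exact spelling "Chapter". A small variation, offered as a sketch rather than a replacement: ignore case and strip each piece, keeping only those that are not blank. The `my_text` value and the name `cleaned` are illustrative.

```python
import re

# Stand-in for the file content; the value is illustrative.
my_text = "Chapter 01\n\nLorem ipsum\n\nCHAPTER 02\n\nconsectetur adipiscing\n"

# Split on the headings regardless of case.
pieces = re.split(r"chapter [0-9]+", my_text, flags=re.IGNORECASE)

# Strip surrounding whitespace and drop pieces that are empty afterwards.
cleaned = [piece.strip() for piece in pieces if piece.strip()]
print(cleaned)
```

Note that dropping blank pieces shifts the list indices if a chapter has no text at all, so position in `cleaned` is only a reliable chapter number when every chapter has a body.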
