I'm reading in text from a PDF and want to split a string on each parenthesized number, such as (1), while keeping that marker in the resulting pieces. So the string:

Some sentence. (1) Another Sentence. (2) Final Sentence.

Would turn into

Some sentence.
(1) Another Sentence.
(2) Final Sentence.

I've tried thestring.split('(') as a workaround, but some of the sentences contain parentheses of their own, which leads to incorrect splits. Thanks!

3 Answers

2

I would split on the regex pattern \s+(?=\(\d+\)):

import re

inp = "Some sentence. (1) Another Sentence. (2) Final Sentence."
# Split on whitespace that is followed by a parenthesized number
parts = re.split(r'\s+(?=\(\d+\))', inp)
print(parts)

This prints:

['Some sentence.', '(1) Another Sentence.', '(2) Final Sentence.']

The regex pattern used here says to split on one or more whitespace characters which are followed by something like (1), that is, a number contained within parentheses.
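
As a small sketch of why the lookahead helps with the original problem, the example below runs the same pattern on a made-up string that has ordinary parentheses inside the sentences. The helper name `split_on_markers` and the sample text are assumptions for illustration only.

```python
import re

def split_on_markers(text):
    # Split at whitespace that is immediately followed by "(digits)".
    # The lookahead (?=...) is zero-width, so the marker stays in the next piece.
    return re.split(r'\s+(?=\(\d+\))', text)

# Hypothetical input: "(see note)" and "(draft)" are not numeric markers.
sample = "Intro (see note). (1) First item (draft). (12) Second item."
parts = split_on_markers(sample)
print(parts)
# ['Intro (see note).', '(1) First item (draft).', '(12) Second item.']
```

Only parentheses that contain nothing but digits trigger a split, so `(see note)` and `(draft)` are left alone, and `\d+` handles multi-digit markers such as `(12)`.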

5 Comments

This worked too! Just out of curiosity, what would splitting on a number followed by a period (such as 1.) look like? Thanks so much for your time!
Split on \s+(?=\d+\.)
Hey Tim, thanks again for the reply. This didn't seem to work for some reason. Let's say you have "A sentence.1. Another sentence.2. Final Sentence).3. Another sentence." How would you go about splitting that?
I figured it out! Thanks again. One last question. So I have numbers that range from 1-50. (?=\d+\.) splits on one digit and a period, and (?=\d+\d+\.) splits on two digits. Is there a way to split the string correctly using one split? Or will I have to use two?
My answer already covers splitting on two digits. Are you saying you want to split on 50 but not 51?
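
For the follow-up case with no whitespace before the number, one possible sketch is to split on a zero-width position instead. This assumes Python 3.7 or later, where `re.split` accepts patterns that match an empty string. The helper name `split_on_numbered` is made up for illustration. The `(?<!\d)` lookbehind is an addition not taken from the thread: it keeps a multi-digit number such as 10 from being split between its digits.

```python
import re

def split_on_numbered(text):
    # Split at any position that is followed by "digits." and is not
    # itself preceded by a digit (so "10." is never cut into "1" and "0.").
    # Requires Python 3.7+ because the pattern matches an empty string.
    return re.split(r'(?<!\d)(?=\d+\.)', text)

print(split_on_numbered("A sentence.1. Another sentence.2. Final Sentence).3. Another sentence."))
# ['A sentence.', '1. Another sentence.', '2. Final Sentence).', '3. Another sentence.']

print(split_on_numbered("Intro.9. Nine.10. Ten.50. Fifty."))
# ['Intro.', '9. Nine.', '10. Ten.', '50. Fifty.']
```

A single pattern covers one-digit and two-digit numbers alike. Note that it would also split before a decimal such as 3.14, so it only suits text where a number followed by a period is always a marker.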
2

You can use (?<=\.)\s, which means "a whitespace character preceded by a dot":

import re

value = "Some sentence. (1) Another Sentence. (2) Final Sentence."
# Split on a single whitespace character that comes right after a period
res = re.split(r"(?<=\.)\s", value)
print(res)  # ['Some sentence.', '(1) Another Sentence.', '(2) Final Sentence.']

1 Comment

This worked! Just out of curiosity, what would splitting on a number followed by a period (such as 1.) look like? Thanks so much for your time!
1
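import re

text = "Some sentence. (1) Another Sentence. (2) Final Sentence."
# Use a raw string, and avoid naming the variable str (it shadows the built-in)
m = re.search(r'\([0-9]\).*\.', text)
# regex: escaped open paren, ONE DIGIT from 0-9 (use [0-9]+ for longer numbers),
# escaped close paren, any sequence of characters, then an escaped dot
# note: .* is greedy, so the match runs to the last dot in the string
# process the match object however you want, e.g. m.group()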
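For building and testing regexes, I would use Rubular. Keep in mind that it targets Ruby's regex flavor, which differs from Python's re module in places.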
