Split text into list based on specific pattern python

Question

I want to split my text into a list based on a certain pattern. For example, my text is:

134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries

I want to convert it into a list based on the unique number as below:

    [134. Lorem Ipsum is simply dummy text of the printing and typesetting industry, 
     135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
     136. It has survived not only five centuries]

I already tried using:

import re
xx = re.split(pattern="d{1,3}. ", string=file_read)
list = []

for xy in xx:
    xy = re.sub(pattern="\s+", repl=" ", string=xy)
    list.append(xy)

But the output is:

[134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s 136. It has survived not only five centuries]

In regex, a dot . means "any character". If you want it to be interpreted as a literal period, you have to escape it with a backslash, like \.
– Ben, Commented Mar 25, 2022 at 3:38
I can't work out the regex for the entire line, but re.findall(pattern=r"\d{1,3}\. \w+", string=file_read) gets the number and then the first word.
– Henry, Commented Mar 25, 2022 at 4:31
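
To illustrate the point about the dot made in the comments, here is a minimal sketch comparing an unescaped and an escaped dot. The sample string is made up for the illustration and is not from the question.

```python
import re

# Made-up sample: "12a " is not a list number, "34. " is.
sample = "12a 34. "

# Unescaped dot: "." matches any character, so "12a " also matches.
unescaped = re.findall(r"\d{1,3}. ", sample)

# Escaped dot: only digits followed by a literal period and a space match.
escaped = re.findall(r"\d{1,3}\. ", sample)

print(unescaped)  # ['12a ', '34. ']
print(escaped)    # ['34. ']
```

Using a raw string (the r prefix) keeps the backslashes intact for the regex engine.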

Cary Swoveland · Accepted Answer · 2022-03-25 05:15:34Z

6

You can write:

import re

# Named "text" rather than "str" so the built-in str type is not shadowed.
text = "134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries"

# Split on spaces that are followed by "<digits>. <capital letter>".
rgx = r' +(?=\d+\. +[A-Z])'
re.split(rgx, text)
  #=> ['134. Lorem Ipsum is simply dummy text of the printing and typesetting industry',
  #    "135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book",
  #    '136. It has survived not only five centuries']

See the Python demo and the regex demo.

As seen, the string is split on matches of one or more spaces. The regular expression reads, "match one or more spaces immediately followed by one or more digits followed by a period, followed by one or more spaces, followed by a capital letter".
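
As a rough sketch of how this could be applied to text read from a file (which may contain newlines), the helper below first collapses whitespace, as the loop in the question does, and then splits with the same lookahead. The function name split_numbered and the sample string are assumptions made for this illustration.

```python
import re

def split_numbered(text):
    """Split text before each '<digits>. <Capital>' marker, keeping the number."""
    # Collapse newlines, tabs and repeated spaces into single spaces.
    normalized = re.sub(r"\s+", " ", text).strip()
    # The lookahead (?=...) is zero-width, so only the spaces are consumed
    # by the split and the number stays at the start of each item.
    return re.split(r" +(?=\d+\. +[A-Z])", normalized)

# Hypothetical sample; "1500s," is not followed by a period, so no split there.
sample = "134. First item\n135. Second item from the 1500s, and more 136. Third"
items = split_numbered(sample)
print(items)
# ['134. First item', '135. Second item from the 1500s, and more', '136. Third']
```

Note that this pattern relies on each item starting with a capital letter; an item beginning with a lowercase letter or a digit would not be split off.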

edited Mar 25, 2022 at 5:15

answered Mar 25, 2022 at 5:00


4 Comments

YasserKhalil Over a year ago

Great. How can I get a list of results? I tried print(list(s)), but this returns a list of letters. Could you also explain the pattern, please?

Cary Swoveland Over a year ago

Sorry, I misread the question. It should be fixed now. I see you want re.split rather than re.sub. You could also use re.findall.

spickermann Over a year ago

@CarySwoveland: Congratulations on reaching a reputation of 100000 points!

Cary Swoveland Over a year ago

@spickermann, thanks. It took a mere decade, a speck in time. I see you are well on your way too.

The fourth bird · Accepted Answer · 2022-03-25 09:01:18Z

Alternatively, instead of splitting, you can match the parts you want, for example with re.findall.

Note that to match a digit you have to escape the d, as in \d{1,3}.

\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)

The pattern matches:

\b\d{1,3}\. A word boundary, then 1-3 digits, a literal dot and a space
.*? Match as few characters as possible (lazy)
(?= Positive lookahead, asserting that what follows to the right is
\b\d{1,3}\. |$ Either the next number pattern or the end of the string
) Close lookahead
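
To show why the lazy quantifier .*? matters here, this small sketch runs the same pattern with a lazy and a greedy quantifier. The short sample string is made up for the illustration.

```python
import re

s = "1. a 2. b 3. c"  # made-up sample

# Lazy: each match stops at the first position where the lookahead succeeds.
lazy = re.findall(r"\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)", s)

# Greedy: .* runs to the end of the string, where the $ alternative succeeds,
# so everything ends up in a single match.
greedy = re.findall(r"\b\d{1,3}\. .*(?=\b\d{1,3}\. |$)", s)

print(lazy)    # ['1. a ', '2. b ', '3. c']
print(greedy)  # ['1. a 2. b 3. c']
```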

See a regex demo and a Python demo.

Example

import re

pattern = r"\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)"
s = "134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries"

print(re.findall(pattern, s))

Output

[
'134. Lorem Ipsum is simply dummy text of the printing and typesetting industry ',
"135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book ",
'136. It has survived not only five centuries'
]
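
The first two items in this output keep a trailing space, because the lazy match runs right up to the next number. One possible way to tidy that up, sketched here with a shortened sample string, is to strip each match. The re.S flag is an addition not in the answer; it lets the dot match newlines in case the text comes from a multi-line file.

```python
import re

pattern = r"\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)"
# Shortened, made-up sample in the same shape as the question's text.
s = "134. Lorem Ipsum 135. Dummy text since the 1500s, and more 136. It has survived"

# strip() removes the trailing space left before the next number.
items = [m.strip() for m in re.findall(pattern, s, flags=re.S)]
print(items)
# ['134. Lorem Ipsum', '135. Dummy text since the 1500s, and more', '136. It has survived']
```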
