I want to split my text into a list based on a certain pattern. For example, my text is:

134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries

I want to convert it into a list based on the unique number as below:

    [134. Lorem Ipsum is simply dummy text of the printing and typesetting industry, 
     135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
     136. It has survived not only five centuries]

I already tried using:

import re
xx = re.split(pattern="d{1,3}. ", string=file_read)
list = []

for xy in xx:
    xy = re.sub(pattern="\s+", repl=" ", string=xy)
    list.append(xy)

But the output is:

[134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s 136. It has survived not only five centuries]
  • In regex, a dot . means "any character". If you want it to be interpreted as a period, you have to escape it with a backslash, like \. Commented Mar 25, 2022 at 3:38
  • still got the same result Commented Mar 25, 2022 at 4:11
  • I can't work out the regex for the entire line, but re.findall(pattern="\d{1,3}. \w+", string=file_read) gets the number and then the first word. Commented Mar 25, 2022 at 4:31
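
The points raised in these comments can be checked on a short string. The following is a minimal sketch; the sample sentence is made up for illustration and is not the asker's file. It compares the pattern from the question (no backslash before `d`), a pattern with `\d` but an unescaped dot, and a fully escaped pattern.

```python
import re

# Hypothetical sample text, chosen so that each pattern behaves differently
sample = "134. Early text from the 1500s onward 135. Later text"

# "d" without a backslash is a literal letter d, so nothing matches here
no_backslash = re.split("d{1,3}. ", sample)

# \d matches digits, but the bare dot matches any character,
# so "500s " inside "1500s " is also treated as a separator
bare_dot = re.split(r"\d{1,3}. ", sample)

# Escaping both gives separators only at "134. " and "135. "
escaped = re.split(r"\d{1,3}\. ", sample)

print(no_backslash)
print(bare_dot)
print(escaped)
```

Even the escaped version consumes the numbers and leaves an empty first element, which is why the answers below use a lookahead or re.findall instead.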

2 Answers

6

You can write:

import re

# Named "text" rather than "str" to avoid shadowing the built-in str type
text = "134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries"
rgx = r' +(?=\d+\. +[A-Z])'
re.split(rgx, text)
  #=> ['134. Lorem Ipsum is simply dummy text of the printing and typesetting industry',
  #    "135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book",
  #    '136. It has survived not only five centuries']

Python demo<-\(ツ)/->Regex demo

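As seen, the string is split on matches of one or more spaces. The regular expression reads, "match one or more spaces, provided they are immediately followed by one or more digits, a period, one or more spaces and a capital letter". Everything after the leading spaces sits inside a positive lookahead, so it is checked but not consumed, which is why each number stays attached to its item.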
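
The question also collapses runs of whitespace, so the two steps can be combined. This is a minimal sketch under that assumption; the helper name `split_numbered_items` and the sample input are hypothetical and not part of the original answer.

```python
import re

def split_numbered_items(text):
    """Split text into items that each start with a number and a period."""
    # Collapse newlines, tabs and repeated spaces, as the question's loop does
    normalized = re.sub(r"\s+", " ", text).strip()
    # Split on spaces that are followed by "<digits>. <Capital letter>"
    return re.split(r" +(?=\d+\. +[A-Z])", normalized)

items = split_numbered_items("134. First item\n135. Second   item 136. Third item")
print(items)
```

Note that this pattern relies on every item starting with a capital letter; items that begin with a lowercase letter or a digit would not be split off.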


4 Comments

Great. How can I get a list of results? I tried print(list(s)) but that returns a list of letters. Could you also explain the pattern, please?
Sorry, I mis-read the question. It should be fixed now. I see you want re.split rather than re.sub. You could also use re.findall.
@CarySwoveland: Congratulations on reaching a reputation of 100000 points!
@spickermann, thanks. It took a mere decade, a speck in time. I see you are well on your way too.
2

Alternatively, instead of splitting, you can match the parts you want, for example with re.findall.

Note that to match a digit you have to escape the d, as in \d{1,3}

\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)

The pattern matches:

  • \b\d{1,3}\. A word boundary, match 1-3 digits, a dot and space
  • .*? Match as few characters as possible (non-greedy)
  • (?= Positive lookahead to assert to the right
    • \b\d{1,3}\. |$ Match the number pattern to the right or the end of string
  • ) Close lookahead
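
The breakdown above can also be written into the pattern itself with re.VERBOSE. This is only a sketch of the same pattern, not a different technique; the name `ITEM` and the short test string are made up for illustration. In verbose mode unescaped whitespace is ignored, so the literal space is written as `[ ]`.

```python
import re

ITEM = re.compile(r"""
    \b\d{1,3}\.[ ]        # item number: word boundary, 1-3 digits, dot, space
    .*?                   # item text, as few characters as possible
    (?=                   # stop where the lookahead succeeds:
        \b\d{1,3}\.[ ]    #   at the next item number
        | $               #   or at the end of the string
    )
""", re.VERBOSE)

matches = ITEM.findall("1. a 2. b")
print(matches)
```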

See a regex demo and a Python demo.

Example

import re

pattern = r"\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)"
s = "134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries"

print(re.findall(pattern, s))

Output

[
'134. Lorem Ipsum is simply dummy text of the printing and typesetting industry ',
"135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book ",
'136. It has survived not only five centuries'
]
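
The first two matches keep a trailing space, because the lazy `.*?` runs right up to the next number. If that is unwanted, one option is to strip each match afterwards. A minimal sketch, using a shortened made-up sample string:

```python
import re

pattern = r"\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)"
# Shortened sample text for illustration
s = "134. First item 135. Second item 136. Third item"

# Strip the whitespace left at the end of each match
items = [match.strip() for match in re.findall(pattern, s)]
print(items)
```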
