
I am trying to extract matching groups from a Python string but facing issues.

The string looks like below.

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

I need to treat anything that starts with a number followed by capital letters as a title, and extract each title together with its contents.

This is the output I am expecting.

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried with the below regex

(\d\.\s[A-Z\s]*\s)

and get the below.

1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

If I try adding .* at the end of the regex, the matching groups are affected. I think I am missing something simple here. I tried everything I knew but couldn't solve it.

Any help here is appreciated.

  • You're missing lowercase letters in your character class. Commented Sep 17, 2019 at 1:49

4 Answers

Use (\d+\.[\da-z]*(?:\.[\da-z])* [A-Z]+[\S\s]*?(?=\d+\.|$))

Below is the relevant code:

import re
text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Raw strings (r'...') keep backslashes such as \d from being treated as
# Python string escapes.
result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present include any additional numbers/letters
                    r'(?:\.[\da-z])*' # Match additional subpoints.
                                      # Each of these subpoints must start with a '.'
                                      # And then have any combination of numbers/letters
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Match everything including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')',
                    text)
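As a side note, the same pattern can also be kept in a single commented string with re.VERBOSE. This is only a sketch of that layout; the name SECTION and the shortened sample text are made up for illustration. In verbose mode a bare space is ignored, so the literal space is written as [ ].

```python
import re

# Hypothetical name; the pattern itself is the one from the answer above.
SECTION = re.compile(
    r"""
    (                   # capture the whole section
      \d+\.             # a number followed by a dot
      [\da-z]*          # optional extra digits or lowercase letters
      (?:\.[\da-z])*    # optional subpoint parts such as .2 or .a
      [ ]               # a literal space (bare spaces are ignored in verbose mode)
      [A-Z]+            # at least one capital letter
      [\S\s]*?          # anything, including newlines, as little as possible
      (?=\d+\.|$)       # stop before the next number-dot pair or at the end
    )
    """,
    re.VERBOSE,
)

# Shortened sample text with digits inside the contents
sample = "1. TITLE ABC Contents for 14 days 2. TITLE BCD More text"
sections = SECTION.findall(sample)
print(sections)
```

The "14" inside the contents does not end the first section, because the lookahead only stops at digits that are immediately followed by a dot.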


4 Comments

This solution is really good and works for most cases. But if the contents contain digits, it has issues. For example, the text "1. TITLE ABC Contents of title ABC and some other text for 14 days" causes a problem.
I've edited my answer to work when there are numbers in the title
Thanks for the solution. If the content has subpoints like 2.1, it does not work. For example: "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 2.2 text part 2.3 text part 3. TITLE CDC Contents of title cdc". Any pointers on this?
I've also made it work with multiple subpoints, so 1., 1.1, 1.a, 1.2.a, 1.2.3.4.5 will all be valid.
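For the case where subpoints such as 2.2 appear inside the contents and should stay with their parent section, one possible alternative (not the approach used in this answer) is to split the text only in front of top-level headings. The sketch below assumes a heading is a number, a dot, whitespace, and a word of at least two capital letters; the function name split_sections is hypothetical. Splitting on an empty match requires Python 3.7 or later.

```python
import re

# Assumption: a top-level heading looks like "<number>. <WORD IN CAPS>".
# The lookahead matches the empty position in front of such a heading.
HEADING_START = re.compile(r"(?=\b\d+\.\s+[A-Z]{2,}\b)")

def split_sections(text):
    """Split text in front of each top-level heading and drop empty pieces."""
    pieces = HEADING_START.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]

sample = ("1. TITLE ABC Contents 2. TITLE BCD text 2.2 text part "
          "2.3 text part 3. TITLE CDC Contents of title cdc")
sections = split_sections(sample)
for section in sections:
    print(section)
```

Here "2.2" and "2.3" are not treated as headings because the dot is followed by a digit rather than by whitespace and capital letters.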

In your regex you're missing the lowercase letters in the character class, so it matches only the uppercase words.
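A small sketch of what happens with the regex from the question, using a shortened sample string chosen for illustration. The class [A-Z\s] stops at the first lowercase letter, and appending a greedy .* makes the first match run to the end of the line, so everything lands in one group.

```python
import re

# Shortened sample text (an assumption for illustration)
text = "1. TITLE ABC Contents of title ABC 2. TITLE BCD More text"

# Original pattern: the class stops at the first lowercase letter,
# so only the uppercase part of each section is captured.
titles_only = re.findall(r"(\d\.\s[A-Z\s]*\s)", text)
print(titles_only)

# With a greedy .* appended, the first match swallows the rest of the line.
greedy = re.findall(r"(\d\.\s[A-Z\s]*\s.*)", text)
print(greedy)
```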

You can simply use this

(\d+\.[\s\S]+?)(?=\d+\.|$)

Sample code

import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
# \d+ (rather than \d) so that multi-digit numbers such as "10." keep all their digits
result = re.findall(r'(\d+\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

output


['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']


Note: You can also replace [\s\S]+? with .+? if you use the single-line flag (re.DOTALL in Python), so that . matches newline characters too.
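A minimal sketch of that note, using a short made-up sample that contains a newline. It compares the [\s\S] form, the dot form with re.DOTALL, and the dot form without the flag.

```python
import re

# Made-up sample with a newline inside the first section
text = "1. TITLE ABC first\nline 2. TITLE BCD second"

# [\s\S] matches any character, including newlines
with_class = re.findall(r"(\d+\.[\s\S]+?)(?=\d+\.|$)", text)

# With re.DOTALL the dot matches newlines as well
with_dotall = re.findall(r"(\d+\..+?)(?=\d+\.|$)", text, flags=re.DOTALL)

# Without the flag the dot stops at the newline, so the first section is lost
without_flag = re.findall(r"(\d+\..+?)(?=\d+\.|$)", text)

print(with_class)
print(with_dotall)
print(without_flag)
```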

import re
a = r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# \d+ allows multi-digit numbers; (?:\s|$) lets the last section run to the
# end of the string instead of dropping its final word
res = re.findall(r'(\d+\.\s[A-Za-z0-9\s]*(?:\s|$))', a)
for e in res:
    print(e)

output

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

1 Comment

I think you meant 'don't need'. Got it.

You can use re.findall with re.split:

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
pattern = r'\d+\.\s[A-Z]+'                      # a number, a dot, whitespace, capital letters
t = re.findall(pattern, s)                      # the heading starts, e.g. '1. TITLE'
c = list(filter(None, re.split(pattern, s)))    # the text between the heading starts
result = [f'{a}{b}' for a, b in zip(t, c)]
print(result)

Output:

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']

1 Comment

The string would not necessarily contain the word TITLE. Anything starting with a number followed by all-caps text is assumed to be a title. The data is just a demo; there can be as many as 1000 titles in my case.
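A rough sketch of how this requirement could be handled without relying on the word TITLE. It assumes a title is a number, a dot, and then one or more words of at least two capital letters; HEADING and parse_sections are hypothetical names, not part of any answer above. The pattern is compiled once, which may help when there are many titles. One known limitation: if the contents begin with an all-caps word, that word is treated as part of the title.

```python
import re

# Assumption: heading = number, dot, whitespace, then all-caps words.
HEADING = re.compile(r"\b(\d+)\.\s+([A-Z]{2,}(?:\s+[A-Z]{2,})*)\b")

def parse_sections(text):
    """Return a list of (number, title, contents) tuples."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # Contents run from the end of this heading to the start of the next
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        contents = text[match.end():end].strip()
        sections.append((int(match.group(1)), match.group(2), contents))
    return sections

sample = ("1. HEADING ONE Some text for 14 days "
          "2. OTHER PART More text 2.2 text part "
          "3. LAST Final text")
parsed = parse_sections(sample)
for number, title, contents in parsed:
    print(number, title, contents, sep=" | ")
```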
