Extract matching groups from string python regex

Question

I am trying to extract matching groups from a Python string but facing issues.

The string looks like below.

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

I need to treat anything that starts with a number followed by capital letters as a title, and extract that title together with its contents.

This is the output I am expecting.

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried with the below regex

(\d\.\s[A-Z\s]*\s)

and get the below.

1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

If I add .* at the end of the regex, the matching groups are affected. I think I am missing something simple here. I tried everything I knew but couldn't solve it.

Any help here is appreciated.

You're missing lowercase letters in your character class.
– Code Maniac, Commented Sep 17, 2019 at 1:49
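A minimal sketch of what this comment points at, using a shortened sample string (the string and variable names are illustrative, not from the question). It shows the original pattern stopping after the capitals, a greedy .* swallowing the rest of the string into one match, and a lazy .*? with a lookahead stopping before the next numbered heading.

```python
import re

text = "1. TITLE ABC Contents 2. TITLE BCD More"

# Original pattern: [A-Z\s] has no lowercase letters, so each match
# ends right after the run of capitals.
titles_only = re.findall(r"(\d\.\s[A-Z\s]*\s)", text)

# Appending a greedy .* consumes everything up to the end of the line,
# so the second heading ends up inside the first match.
greedy = re.findall(r"(\d\.\s[A-Z\s]*\s.*)", text)

# A lazy .*? plus a lookahead stops before the next "digit. " or at the end.
lazy = re.findall(r"(\d\.\s[A-Z\s]*\s.*?)(?=\d\.\s|$)", text)

print(titles_only)
print(greedy)
print(lazy)
```

The lazy variant is only one way to stop at the next heading; the answers below use similar lookaheads.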

hostingutilities.com · Accepted Answer · 2019-09-18 01:38:08Z

2

Use (\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))

Below is the relevant code

import re
text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

# Raw strings (r'...') keep backslashes such as \d from being treated as string escapes.
result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present include any additional numbers/letters
                    r'(?:\.[\da-z]+)*' # Match additional subpoints.
                                       # Each of these subpoints must start with a '.'
                                       # And then have any combination of numbers/letters
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Match everything including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')'
                    , text)
print(result)

And here's a more detailed explanation of each regex character used
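As a hedged alternative layout, the same pattern can be written once with re.VERBOSE, which allows whitespace and comments inside the pattern itself. The name SECTION_RE and the sample string are illustrative. This sketch assumes subpoints may be more than one character long (hence `[\da-z]+`), and the literal space is written as `[ ]` because verbose mode ignores bare spaces.

```python
import re

# Illustrative name; a sketch of the answer's pattern in verbose form.
SECTION_RE = re.compile(r"""
    (
      \d+\.              # leading number and a dot
      [\da-z]*           # optional first subpoint directly after the dot
      (?:\.[\da-z]+)*    # any further dot-separated subpoints
      [ ]                # a literal space ends the numbering
      [A-Z]+             # at least one capital letter starts the title
      [\s\S]*?           # lazily take everything, newlines included
      (?=\d+\.|$)        # stop before the next number-dot or at the end
    )
""", re.VERBOSE)

sample = "1. TITLE ABC text for 14 days 2.1 SUB more text"
sections = SECTION_RE.findall(sample)
print(sections)
```

Note that the lookahead stops at any number followed by a dot, so a numbered passage that is not followed by capital letters (for example "2.2 text part") ends the previous match and is not returned as an item of its own.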

edited Sep 18, 2019 at 1:38

answered Sep 17, 2019 at 1:59

hostingutilities.com


4 Comments

Ashok KS Over a year ago

This solution is really good and works for most cases. But if the contents contain digits, it has issues. For example, there is a problem if the text is "1. TITLE ABC Contents of title ABC and some other text for 14 days".

hostingutilities.com Over a year ago

I've edited my answer to work when there are numbers in the title

Ashok KS Over a year ago

Thanks for the solution. If the content has subpoints like 2.1, it does not work.

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 2.2 text part 2.3 text part 3. TITLE CDC Contents of title cdc

Any pointers on this?

hostingutilities.com Over a year ago

I've also made it work with multiple subpoints, so 1., 1.1, 1.a, 1.2.a, 1.2.3.4.5 will all be valid.

Code Maniac · Accepted Answer · 2019-09-17 02:16:01Z

1

Your regex is missing the lowercase letters in its character class, so it matches only the uppercase words.

You can simply use this:

(\d+\.[\s\S]+?)(?=\d+\.|$)

Sample code

import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
# \d+ (rather than a single \d) keeps multi-digit numbers such as "10." whole.
# The raw string keeps the backslashes from being treated as string escapes.
result = re.findall(r'(\d+\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

output

['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']

Regex demo

Note: you can also replace [\s\S]+? with .*? if you use the single-line flag (re.DOTALL in Python), so that . matches newline characters too.
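A small sketch of that note, assuming a made-up two-line sample string. It uses \d+ for the number so multi-digit numbering stays whole. With re.DOTALL the dot crosses the newline; without it, the first section cannot reach the next heading and is not matched at all.

```python
import re

# Illustrative sample with a newline inside the first section.
text = "1. TITLE ABC first line\nsecond line 2. TITLE BCD rest"
pattern = r"(\d+\..*?)(?=\d+\.|$)"

# With DOTALL, "." also matches "\n", so the first section spans both lines.
with_flag = re.findall(pattern, text, flags=re.DOTALL)

# Without DOTALL, "." stops at the newline and the lookahead never succeeds
# for the first section, so only the second one is found.
without_flag = re.findall(pattern, text)

print(with_flag)
print(without_flag)
```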

edited Sep 17, 2019 at 2:16

answered Sep 17, 2019 at 1:55

Code Maniac


moys · Accepted Answer · 2019-09-17 02:16:10Z

0

import re
a = r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# The pattern must end on whitespace, so the final word of the last section is left out.
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
for e in res:
    print(e)

output

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title

edited Sep 17, 2019 at 2:16

answered Sep 17, 2019 at 2:09

moys


1 Comment

moys Over a year ago

I think you meant 'don't need'. Got it.

Ajax1234 · Accepted Answer · 2019-09-17 02:25:28Z

0

You can use re.findall with re.split:

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
pattern = r'\d+\.\s[A-Z]+'
t = re.findall(pattern, s)                    # the headings, e.g. '1. TITLE'
c = list(filter(None, re.split(pattern, s)))  # the text that follows each heading
result = [f'{a}{b}' for a, b in zip(t, c)]
print(result)

Output:

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
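The findall-plus-split pairing above could also be collapsed into one re.split call that splits on the whitespace just before each heading. This is a sketch rather than part of the answer: it assumes a heading is a number, a dot, one whitespace character, and a capital letter, and the names HEADING_BOUNDARY and sections are illustrative.

```python
import re

s = ("1. TITLE ABC Contents of title ABC and some other text "
     "2. TITLE BCD This would have contents on title BCD and maybe something else "
     "3. TITLE CDC Contents of title cdc")

# Split on whitespace that is immediately followed by "number. Capital".
# The lookahead is zero-width, so the heading itself stays in the next piece.
HEADING_BOUNDARY = re.compile(r"\s+(?=\d+\.\s[A-Z])")

sections = [part for part in HEADING_BOUNDARY.split(s) if part]
print(sections)
```

Unlike the zip-based version, the pieces come back without trailing whitespace, and any text before the first heading would be returned as the first item.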

edited Sep 17, 2019 at 2:25

answered Sep 17, 2019 at 1:51

Ajax1234


1 Comment

Ashok KS Over a year ago

The string will not necessarily contain the word TITLE. Anything starting with a number followed by text in all caps is assumed to be a title. The data is just a demo; there can be as many as 1000 titles in my case.
