python regex to return empty string

Question

From each string in a list, I want to extract the part that precedes a space followed by a number (strings with no such suffix should come back unchanged), in Python.

# INPUT
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
# EXPECTED OUTPUT
output = ['bits', 'scrap', 'bits and pieces', 'junk']

I managed to do this using re.sub or re.split:

output = [re.sub(r" [0-9].*", "", t) for t in text]
# OR
output = [re.split(r' \d', t)[0] for t in text]

When I tried to use re.search and re.findall instead, I got None or an empty list for the strings that contain no number:

[re.search(r'(.*) \d', t) for t in text]
#[None, <_sre.SRE_Match object; span=(0, 7), match='scrap 1'>, None, <_sre.SRE_Match object; span=(0, 6), match='junk 3'>]

[re.findall(r'(.*?) \d', t) for t in text]
#[[], ['scrap'], [], ['junk']]

Can anyone help me with the regex that can return expected output for re.search and re.findall?

Wiktor Stribiżew · Accepted Answer · 2018-02-13 09:03:17Z

4

You may remove the digit-and-dot substrings at the end of the string only with

import re
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
print([re.sub(r'\s+\d+(?:\.\d+)*$', '', x) for x in text])
# => output = ['bits', 'scrap', 'bits and pieces', 'junk']

See the Python demo

The pattern is

\s+ - 1+ whitespaces (note: if those digits can be "glued" to some other text, replace + (one or more occurrences) with * quantifier (zero or more occurrences))
\d+ - 1 or more digits
(?:\.\d+)* - 0 or more sequences of
- \. - a dot
- \d+ - 1 or more digits
$ - end of string.
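
As a rough illustration of the note on \s+ versus \s*, the sketch below runs both variants over a few made-up inputs (the sample strings and the constant names are assumptions for the demo, not part of the original answer):

```python
import re

# Pattern from the answer: strip a trailing " <digits>(.<digits>)*" suffix.
TRAILING_VERSION = r'\s+\d+(?:\.\d+)*$'
# Variant with \s* for digits "glued" to the preceding text.
GLUED_VERSION = r'\s*\d+(?:\.\d+)*$'

# Hypothetical inputs chosen to show the difference.
samples = ['scrap 1.2', 'scrap1.2', 'abc 5.6 def 6.8.9']

strict = [re.sub(TRAILING_VERSION, '', s) for s in samples]
glued = [re.sub(GLUED_VERSION, '', s) for s in samples]
print(strict)  # ['scrap', 'scrap1.2', 'abc 5.6 def']
print(glued)   # ['scrap', 'scrap', 'abc 5.6 def']
```

Because of the $ anchor, only the number at the very end is removed; the 5.6 in the middle of the last sample is left alone by both variants.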

See the regex demo.

To do the same with re.findall, you can use

# To get 'abc 5.6 def' (not 'abc') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d[\d.]*)?$', x)
# To get 'abc' (not 'abc 5.6 def') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d.*)?$', x)

See this regex demo.

However, this regex is not very efficient because of the lazy .*? construct. Pattern details:

^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars (use re.DOTALL to match all) as few as possible (so that the next optional group could be tested at each position)
(?: \d[\d.]*)? - an optional non-capturing group matching
- ' ' - a space
- \d - a digit
- [\d.]* - zero or more digits or . chars
- (OR, in the second pattern) .* - any 0+ chars other than line break chars, as many as possible
$ - end of string.
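
A minimal sketch of how the second pattern could be applied to the list from the question, with both re.findall and re.search. It assumes single-line, non-empty strings like the ones shown; the variable names are illustrative only:

```python
import re

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']

# Second findall pattern from the answer: the suffix group is optional,
# so strings without a number still match in full.
pattern = re.compile(r'^(.*?)(?: \d.*)?$')

# findall returns a one-element list for each of these strings; take element 0.
via_findall = [pattern.findall(t)[0] for t in text]
# search returns a match object; group(1) is the captured prefix.
via_search = [pattern.search(t).group(1) for t in text]

print(via_findall)  # ['bits', 'scrap', 'bits and pieces', 'junk']
print(via_search)   # ['bits', 'scrap', 'bits and pieces', 'junk']
```

Since the pattern is anchored at both ends and the numeric suffix is optional, every string in this list produces a match, so neither call falls into the [] or None case.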

edited Feb 13, 2018 at 9:03

answered Feb 13, 2018 at 8:39

Wiktor Stribiżew


9 Comments

addicted Over a year ago

I can do it with re.sub. I can't do it with re.search or re.findall

addicted Over a year ago

Is re.sub faster than re.findall with (.*?)? What is the issue with the .*? construct? I use it a lot in my Python code.

Wiktor Stribiżew Over a year ago

@addicted I added a re.findall compatible regex to the answer. The .*? is a lazily quantified dot; it works the opposite way to greedy backtracking: it is skipped first, and the subsequent subpatterns are tried. If they fail, the .*? consumes one more char and all the subsequent subpatterns are tried again, and so on. In effect, the match is found much more slowly than it could be when it lies far to the right of the position where .*? starts matching. On the other hand, .* would cause backtracking issues in the opposite situation.
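
To make the lazy versus greedy difference concrete, here is a small sketch using the sample string from the answer (it shows which match each quantifier ends up with, not the timing):

```python
import re

s = 'abc 5.6 def 6.8.9'  # sample string from the answer

# Lazy: expands one char at a time and stops at the first " <digit>".
lazy = re.search(r'(.*?) \d', s).group(1)
# Greedy: grabs everything, then backtracks to the last " <digit>".
greedy = re.search(r'(.*) \d', s).group(1)

print(lazy)    # abc
print(greedy)  # abc 5.6 def
```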

Arne Over a year ago

@addicted Are you asking what the question mark means in this context?

Wiktor Stribiżew Over a year ago

@addicted No, the output of these methods cannot be changed. Use extra programming logic to get the type of result you need. re.findall always returns [] if nothing is found, and re.search returns None when there is no match.
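
One possible shape for that "extra programming logic", sketched with the lazy pattern from the question. The helper name prefix_before_number is hypothetical, and falling back to the whole string is an assumption based on the expected output in the question:

```python
import re

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']

def prefix_before_number(s):
    """Hypothetical helper: text before the first ' <digit>', or s itself."""
    m = re.search(r'(.*?) \d', s)
    # re.search returns None on no match, so fall back to the input.
    return m.group(1) if m else s

output = [prefix_before_number(t) for t in text]
# Same idea with findall: an empty list is falsy, so fall back to [t].
output_findall = [(re.findall(r'(.*?) \d', t) or [t])[0] for t in text]

print(output)          # ['bits', 'scrap', 'bits and pieces', 'junk']
print(output_findall)  # ['bits', 'scrap', 'bits and pieces', 'junk']
```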
