I want to extract the part of each string in a list that comes before a space followed by a number, in Python. Strings without such a suffix should be kept unchanged.

# INPUT
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
# EXPECTED OUTPUT
output = ['bits', 'scrap', 'bits and pieces', 'junk']

I managed to do this using re.sub or re.split:

import re
output = [re.sub(r" [0-9].*", "", t) for t in text]
# OR
output = [re.split(r' \d', t)[0] for t in text]

When I tried re.search and re.findall, I got None or an empty list for the strings that contain no number:

[re.search(r'(.*) \d', t) for t in text]
#[None, <_sre.SRE_Match object; span=(0, 7), match='scrap 1'>, None, <_sre.SRE_Match object; span=(0, 6), match='junk 3'>]

[re.findall(r'(.*?) \d', t) for t in text]
#[[], ['scrap'], [], ['junk']]

Can anyone help me with a regex that returns the expected output with re.search and re.findall?

1 Answer

You can remove digit-and-dot substrings only when they appear at the end of the string with

import re
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
print([re.sub(r'\s+\d+(?:\.\d+)*$', '', x) for x in text])
# => output = ['bits', 'scrap', 'bits and pieces', 'junk']

See the Python demo

The pattern is

  • \s+ - 1+ whitespaces (note: if those digits can be "glued" to some other text, replace + (one or more occurrences) with * quantifier (zero or more occurrences))
  • \d+ - 1 or more digits
  • (?:\.\d+)* - 0 or more sequences of
    • \. - a dot
    • \d+ - 1 or more digits
  • $ - end of string.
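As a rough illustration of what the $ anchor buys, the sketch below compares the pattern above with the unanchored one from the question. The helper name strip_trailing_version and the input 'abc 5.6 def 6.8.9' are made up for this example and are not from the question.

```python
import re

# Pattern from the answer: whitespace, a dotted number, then end of string.
TRAILING_VERSION = re.compile(r'\s+\d+(?:\.\d+)*$')

def strip_trailing_version(s):
    """Remove a trailing ' 1.2'-style suffix; other strings pass through unchanged."""
    return TRAILING_VERSION.sub('', s)

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
cleaned = [strip_trailing_version(t) for t in text]
print(cleaned)

# Hypothetical input with a number in the middle of the string.
sample = 'abc 5.6 def 6.8.9'
anchored = strip_trailing_version(sample)       # only the final ' 6.8.9' goes
unanchored = re.sub(r' [0-9].*', '', sample)    # cuts at the first ' 5'
print(anchored, '|', unanchored)
```

Which behaviour is wanted depends on the data; for the four strings in the question both patterns give the same result.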

See the regex demo.

To do the same with re.findall, you can use

x = 'abc 5.6 def 6.8.9'
# To get 'abc 5.6 def' (not 'abc') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d[\d.]*)?$', x)  # ['abc 5.6 def']
# To get 'abc' (not 'abc 5.6 def') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d.*)?$', x)  # ['abc']

See this regex demo.

However, this regex is not very efficient because of the .*? construct, which forces the rest of the pattern to be retried at each position. Here,

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars other than line break chars (use re.DOTALL to match all) as few as possible (so that the next optional group could be tested at each position)
  • (?: \d[\d.]*)? - an optional non-capturing group matching
    • ' ' - a space
    • \d - a digit
    • [\d.]* - zero or more digits or . chars
    • (OR) .* - any 0+ chars other than line break chars, as many as possible
  • $ - end of string.
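Applied to the list from the question, the second pattern can be used with both re.search and re.findall. This is a minimal sketch that assumes every input is a single-line string, so the anchored pattern always matches and .group(1) and [0] are safe; the variable names are chosen for this example only.

```python
import re

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']

# Group 1 is the text before an optional ' <digit>...' suffix.
pattern = re.compile(r'^(.*?)(?: \d.*)?$')

# re.search: take group 1 of the match object.
via_search = [pattern.search(t).group(1) for t in text]

# re.findall: with one capturing group it returns a list of group 1 values,
# and the ^ anchor means there is exactly one per string.
via_findall = [pattern.findall(t)[0] for t in text]

print(via_search)
print(via_findall)
```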

9 Comments

I can do it with re.sub. I can't do it with re.search or re.findall
Is re.sub faster than re.findall with (.*?)? What is the problem with the .*? construct? I use it a lot in my Python code.
@addicted I added a re.findall-compatible regex to the answer. The .*? is a lazily quantified dot. It works the opposite way to a greedy .*: it first matches nothing, and the subsequent subpatterns are tried. If they fail, the .*? consumes one more char and all the subsequent subpatterns are tried again, and so on. As a result, the match is found much more slowly than it could be when it lies far to the right of the position where .*? started. On the other hand, a greedy .* would cause backtracking issues in the opposite situation.
@addicted Are you asking what the question mark means in this context?
@addicted No, the output of these methods cannot be changed. Use extra programming logic to get the type of result you need. re.findall always returns [] if nothing is found, and re.search returns None when there is no match.
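One possible shape for that extra logic, keeping the question's original '(.*?) \d' pattern and falling back to the whole string when nothing is found. The function names here are hypothetical and only serve the example.

```python
import re

def head_via_search(s):
    # re.search returns None when there is no ' <digit>' in the string.
    m = re.search(r'(.*?) \d', s)
    return m.group(1) if m else s

def head_via_findall(s):
    # re.findall returns [] when there is no ' <digit>' in the string.
    found = re.findall(r'(.*?) \d', s)
    return found[0] if found else s

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
from_search = [head_via_search(t) for t in text]
from_findall = [head_via_findall(t) for t in text]
print(from_search)
print(from_findall)
```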