Python Regex remove numbers and numbers with punctuation

Question

I have the following string

 line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"

I would like to remove the numbers 1234567 7852853427.111 using regular expressions.

I have this re

nline = re.sub("^\d+\s|\s\d+\s|\s\d\w\d|\s\d+$", " ", line)

but it is not doing what I hoped it would.

Can anyone point me in the right direction?

A few loose remarks on why your attempt did not work: the start anchor seems correct, but that end anchor does not. It's not the end of the string, by far! Also, all of those | split the entire regex into distinct parts - that is, the first part matches the start of the string but the second one does not. You may want to read up on creating groups with parentheses.
– Jongware, Commented Sep 19, 2016 at 22:17
Most of the current suggestions more or less kill every sequence of digits inside the string. Can you be reasonably sure that there never will be digits in the part you want to keep? How about removing "the first two words"? Or "everything before http://"? Your title mentions punctuation - should 1..2 at the beginning be removed?
– Jongware, Commented Sep 19, 2016 at 22:36
If your regex requirement is not strict, it is better to use a built-in solution. For the current line, line.split()[-1] is much easier.
– godaygo, Commented Sep 19, 2016 at 23:07
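
To make the first comment concrete, here is a small sketch of what the original pattern does on the sample line, next to one possible grouped and anchored rewrite. The grouped pattern (named leading_numbers here) is an illustrative assumption, not something proposed in the thread, and it assumes the numbers always come first on the line.

```python
import re

line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
url = "https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"

# Original attempt: each | starts an independent alternative.
# Only the first alternative (^\d+\s) finds a match, and it consumes the
# space after 1234567. Nothing else in the line matches afterwards, so the
# decimal number is left behind.
attempt = re.sub(r"^\d+\s|\s\d+\s|\s\d\w\d|\s\d+$", " ", line)
print(repr(attempt))

# Hypothetical rewrite: one anchor, then a repeated non-capturing group
# matching "integer or decimal number followed by whitespace".
leading_numbers = r"^(?:\d+(?:\.\d+)?\s+)+"
cleaned = re.sub(leading_numbers, "", line)
print(cleaned)
```

Because the anchor sits outside the group, it applies to the whole run of numbers rather than to a single alternative.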

anubhava · Accepted Answer · 2016-09-20 18:16:15Z

6

You can use:

>>> import re
>>> line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
>>> print(re.sub(r'\b\d+(?:\.\d+)?\s+', '', line))
https://en.wikipedia.org/wiki/Dictionary_(disambiguation)

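The regex \b\d+(?:\.\d+)?\s+ matches an integer or decimal number followed by one or more whitespace characters. \b asserts a word boundary, so digits that directly follow a letter, digit, or underscore are not matched.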
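
As a rough sketch, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. The name NUMBER_THEN_SPACE and the second sample line (with digits inside the URL) are made up for illustration.

```python
import re

NUMBER_THEN_SPACE = re.compile(r"""
    \b            # word boundary: no letter, digit or underscore right before
    \d+           # integer part
    (?:\.\d+)?    # optional decimal part
    \s+           # one or more whitespace characters after the number
""", re.VERBOSE)

line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
print(NUMBER_THEN_SPACE.sub("", line))

# Hypothetical line: the digits in "page2" and "item10" follow a letter and
# are not followed by whitespace, so they are left alone.
sample = "42 3.14 https://example.com/page2/item10"
print(NUMBER_THEN_SPACE.sub("", sample))
```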

edited Sep 20, 2016 at 18:16

answered Sep 19, 2016 at 22:12

anubhava


5 Comments

Jongware Over a year ago

An anchor at the start would be a bit safer :)

anubhava Over a year ago

\b will work but ^ won't because 7852853427.111 is not at the start.

Jongware Over a year ago

Ah, because you consider each number separately. I was wondering why - perhaps OP needs to clarify or add more than one example. I was actually thinking of something as straightforward as ^[\d.\s]+ ...

Morpheus Over a year ago

@RadLexus I think this solution will strip the numbers that appear in the URL. I need the numbers in the URL to stay intact.

anubhava Over a year ago

@Morpheus: Try my updated answer now. It shouldn't affect numbers in URL.
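
The trade-off discussed in these comments can be sketched as follows. The sample lines are hypothetical and chosen to show where the two patterns differ: the per-number pattern removes any number that is followed by whitespace, wherever it appears, while the anchored ^[\d.\s]+ from the comment only strips a leading run of digits, dots, and whitespace.

```python
import re

per_number = r"\b\d+(?:\.\d+)?\s+"   # pattern from the answer
leading_run = r"^[\d.\s]+"           # pattern suggested in the comment

# Hypothetical line where a number inside the URL is followed by a space.
sample = "12 3.5 https://example.com/archive/2016 mirror"

print(re.sub(per_number, "", sample))   # also removes "2016 "
print(re.sub(leading_run, "", sample))  # leaves the URL part untouched

# The anchored character class is looser about what counts as a number:
# it also strips a malformed prefix such as "1..2".
print(re.sub(leading_run, "", "1..2 https://example.com"))
```

Which behaviour is preferable depends on whether digits followed by whitespace can occur in the part of the line that should be kept.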

Moses Koledoye · Accepted Answer · 2016-09-19 22:18:30Z

2

Here's a non-regex approach, if your regex requirement is not entirely strict, using itertools.dropwhile:

>>> from itertools import dropwhile
>>> ''.join(dropwhile(lambda x: not x.isalpha(), line))
'https://en.wikipedia.org/wiki/Dictionary_(disambiguation)'

answered Sep 19, 2016 at 22:18

Moses Koledoye


B. Farkas · Accepted Answer · 2016-09-19 22:29:47Z

0

I think this is what you want:

nline = re.sub(r"\d+\s\d+\.\d+", "", line)

It removes the numbers from line. Note that the space in front of "http..." is left in place, because the pattern does not match it; to drop it as well, add \s* to the end of the pattern or call nline.strip().

If you also want to record the individual number strings you could put them in groups like this:

>>> result = re.search(r"(\d+)\s(\d+\.\d+)", line)
>>> print(result.group(0))
1234567 7852853427.111
>>> print(result.group(1))
1234567
>>> print(result.group(2))
7852853427.111

A great way to learn and practice regular expressions is regex101.

answered Sep 19, 2016 at 22:29

B. Farkas


Community · Accepted Answer · 2017-05-23 11:51:46Z

0

Though you are asking for a regular expression, a better solution would be to use str.split, assuming that your string will always be in the format {number} {number} {hyperlink}.

As @godaygo said, you can use this:

line = line.split()[-1]

The string will be split on whitespace, and we select the last substring.

If you want to access all parts (assuming there's always three), you can use this instead:

num1, num2, url = line.split()
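
Unpacking with line.split() raises a ValueError when the line does not have exactly three whitespace-separated parts. A slightly more defensive sketch is shown below; the helper name split_record is hypothetical, and it still assumes the {number} {number} {hyperlink} layout described above.

```python
def split_record(line):
    """Split '<number> <number> <url>' into its three fields."""
    # maxsplit=2 keeps everything after the second field together,
    # even if the last field itself contains whitespace.
    parts = line.strip().split(maxsplit=2)
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 whitespace-separated fields, got {len(parts)}: {line!r}"
        )
    num1, num2, url = parts
    return num1, num2, url


line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
num1, num2, url = split_record(line)
print(url)
```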

edited May 23, 2017 at 11:51

CommunityBot


answered Sep 20, 2016 at 18:23

mbomb007
