I have the following string

 line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"

I would like to remove the numbers 1234567 7852853427.111 using regular expressions.

I have this re

nline = re.sub(r"^\d+\s|\s\d+\s|\s\d\w\d|\s\d+$", " ", line)

but it is not doing what I hoped it would.

Can anyone point me in the right direction?

  • A few loose remarks on why your attempt did not work: the start anchor seems correct, but the end anchor does not. It's not the end of the string, by far! Also, all of those | split the entire regex into distinct alternatives - that is, the first part matches at the start of the string but the second one does not. You may want to read up on creating groups with parentheses. Commented Sep 19, 2016 at 22:17
  • Where is this string coming from? HTML parsing? Commented Sep 19, 2016 at 22:19
  • Most of the current suggestions more or less kill every sequence of digits inside the string. Can you be reasonably sure that there will never be digits in the part you want to keep? How about removing "the first two words"? Or "everything before http://"? Your title mentions punctuation - should 1..2 at the beginning be removed? Commented Sep 19, 2016 at 22:36
  • If your regex requirement is not strict, it is better to use a built-in solution. For the current line, line.split()[-1] is much easier. Commented Sep 19, 2016 at 23:07
  • @RadLexus I think there will be digits in the URL. Commented Sep 20, 2016 at 18:05

4 Answers

You can use:

>>> import re
>>> line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
>>> print(re.sub(r'\b\d+(?:\.\d+)?\s+', '', line))
https://en.wikipedia.org/wiki/Dictionary_(disambiguation)

The regex \b\d+(?:\.\d+)?\s+ matches an integer or decimal number followed by one or more whitespace characters. \b is a word boundary.
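
As a quick check that this pattern leaves digits inside the URL alone, here is a minimal sketch. The second sample line (with 2016 and 42 in the path) is made up for illustration and is not from the question. Note that any run of digits followed by whitespace would still be removed, wherever it appears in the string.

```python
import re

# An integer or decimal number followed by one or more whitespace characters.
NUMBER_THEN_SPACE = re.compile(r'\b\d+(?:\.\d+)?\s+')

line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
# Hypothetical input with digits in the URL, for illustration only.
line_with_digits = "1234567 7852853427.111 https://example.com/2016/page42"

# The digits in the URL are not followed by whitespace, so they are kept.
print(NUMBER_THEN_SPACE.sub('', line))
print(NUMBER_THEN_SPACE.sub('', line_with_digits))
```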

5 Comments

An anchor at the start would be a bit safer :)
\b will work but ^ won't because 7852853427.111 is not at the start.
Ah, because you consider each number separately. I was wondering why - perhaps OP needs to clarify or add more than one example. I was actually thinking of something as straightforward as ^[\d.\s]+ ...
@RadLexus I think this solution will strip the numbers that appear in the URL. I need the numbers in the URL to stay intact.
@Morpheus: Try my updated answer now. It shouldn't affect numbers in URL.
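
The anchored pattern ^[\d.\s]+ mentioned in the comments above could be sketched as follows. This assumes the part to keep never starts with a digit, a dot, or whitespace; the second sample line is made up for illustration.

```python
import re

# Leading run of digits, dots, and whitespace, anchored at the start of the string.
LEADING_NUMBERS = re.compile(r'^[\d.\s]+')

line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
# Hypothetical input with digits in the URL, for illustration only.
line_with_digits = "1234567 7852853427.111 https://example.com/2016/page42"

# Only the prefix is removed; digits later in the string are untouched.
print(LEADING_NUMBERS.sub('', line))
print(LEADING_NUMBERS.sub('', line_with_digits))
```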

Here's a non-regex approach, if your regex requirement is not entirely strict, using itertools.dropwhile:

>>> from itertools import dropwhile
>>> ''.join(dropwhile(lambda x: not x.isalpha(), line))
'https://en.wikipedia.org/wiki/Dictionary_(disambiguation)'

I think this is what you want:

nline = re.sub(r"\d+\s\d+\.\d+", "", line)

It removes the numbers from line. If you want to keep the space in front of "http..." your second parameter should of course be " ".

If you also want to record the individual number strings you could put them in groups like this:

>>> result = re.search(r"(\d+)\s(\d+\.\d+)", line)
>>> print(result.group(0))
1234567 7852853427.111
>>> print(result.group(1))
1234567
>>> print(result.group(2))
7852853427.111
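
One thing to watch: re.search returns None when nothing matches, so calling .group() on the result would raise an AttributeError. A minimal sketch that guards against this is shown here; leading_numbers is a hypothetical helper name, not something from the question.

```python
import re

# An integer, one whitespace character, then a decimal number.
PAIR = re.compile(r"(\d+)\s(\d+\.\d+)")

def leading_numbers(line):
    """Return the two number strings, or None if the line has no such pair."""
    match = PAIR.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)

line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
print(leading_numbers(line))
print(leading_numbers("no numbers here"))
```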

A great way to learn and practice regular expressions is regex101.

Though you are asking for a regular expression, a better solution would be to use str.split, assuming that your string will always be in the format {number} {number} {hyperlink}.

As @godaygo said, you can use this:

line = line.split()[-1]

The string will be split on whitespace, and we select the last substring.

If you want to access all parts (assuming there's always three), you can use this instead:

num1, num2, url = line.split()
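
The unpacking above raises a ValueError whenever the line does not split into exactly three parts. If the input might be malformed, a small wrapper can make that failure explicit. This is only a sketch; parse_line is a hypothetical helper name.

```python
def parse_line(line):
    """Split '{number} {number} {hyperlink}' into its three parts."""
    parts = line.split()  # split on any run of whitespace
    if len(parts) != 3:
        raise ValueError(
            "expected 3 whitespace-separated fields, got %d" % len(parts)
        )
    num1, num2, url = parts
    return num1, num2, url

line = "1234567 7852853427.111 https://en.wikipedia.org/wiki/Dictionary_(disambiguation)"
print(parse_line(line))
```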
