This is actually a follow-up to my earlier question here. I had not been clear in that question, and since it has already been answered, I felt it was better to post a new one instead.

I have a dataframe like below:

Column1    Column2    Column3    Column4                     Column5
5FQ        1.047      S$55.3     UG44.2 as of 02/Jun/2016    S$8.2 mm
600        (1.047)    S$23.3     AG5.6 as of 02/Jun/2016     S$58 mm
KI2        1.695      S$5.35     RR59.5 as of 02/Jun/2016    S$705 mm
88G        0.0025     S$(5.3)    NW44.2 as of 02/Jun/2016    S$112 mm
60G        5.63       S$78.4     UG21.2 as of 02/Jun/2016    S$6.21 mm
90F        (5.562)    S$(88.3)   IG46.2 as of 02/Jun/2016    S$8 mm

I am trying to use regex to drop all the words and letters, keeping only the numbers. However, if a number is enclosed in parentheses, I would like to turn it into a negative number instead.

Desired output

Column1    Column2    Column3    Column4       Column5
5          1.047      55.3       44.2          8.2
600        -1.047     23.3       5.6           58
2          1.695      5.35       59.5          705
88         0.0025     -5.3       44.2          112
60         5.63       78.4       21.2          6.21
90         -5.562     -88.3      46.2          8

Would this be possible? I've tried playing around with this code, but was not sure what the appropriate regex combination should be.

df.apply(lambda x: x.astype(str).str.extract(r'(\d+\.?\d*)', expand=True).astype(float))
  • I will write this as a comment, as my descriptive answer got downvoted: (\d+\.?\d*) matches any run of digits with an arbitrary number of decimals, including the 02 and 2016 of the date. Additionally, you are missing the sign. I would first replace every '(' (escaped) with '-', then delete everything matching the date format, then remove (replace with the empty string) anything that is not a space, digit or dot, something like [^0-9 .]* (you need to look it up, as regex syntax varies from one environment to another). After that you have your result separated by spaces; just match (\d+\.?\d*) and the value is in the capture group. (A rough sketch of this approach follows below.)
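
A rough sketch of the steps described in that comment (not code from the thread; it assumes the dates always look like "as of 02/Jun/2016" and uses a hypothetical variable name cleaned):

import pandas as pd

# 1) turn '(' into '-', 2) drop the date, 3) strip everything except digits, dots and '-'
cleaned = (df.astype(str)
             .replace(r'\(', '-', regex=True)                      # '(1.047)' -> '-1.047)'
             .replace(r'as of \d{2}/\w{3}/\d{4}', '', regex=True)  # drop the date suffix (assumed format)
             .replace(r'[^0-9.\-]', '', regex=True)                # keep only digits, dots and '-'
             .apply(pd.to_numeric, errors='coerce'))               # strings -> numbers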

3 Answers

r1 = r'\((\d+\.?\d*)\)'    # a number wrapped in parentheses, with the number captured
r2 = r'(-?\d+\.?\d*)'      # an optionally signed number

# regex=True needs to be passed explicitly on pandas >= 2.0
df.stack().str.replace(r1, r'-\1', n=1, regex=True) \
          .str.extract(r2, expand=False).unstack()

This produces the desired output shown in the question.


1 Comment

Thanks so much for this! Just wondering: I noticed that if a value has a comma in it, e.g. $1,005A, everything is dropped except the 1. Is there a way to keep it as just 1005?
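
One possible tweak for that case (not part of the original answer; it assumes commas only ever appear as thousands separators): strip the commas first, then apply the same replace/extract steps.

r1 = r'\((\d+\.?\d*)\)'
r2 = r'(-?\d+\.?\d*)'
(df.stack()
   .str.replace(',', '', regex=False)           # '$1,005A' -> '$1005A'
   .str.replace(r1, r'-\1', n=1, regex=True)    # '(5.3)'   -> '-5.3'
   .str.extract(r2, expand=False)
   .unstack())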

UPDATE: $1,005A --> 1005 (example in 1st row, column Column3)

In [131]: df
Out[131]:
  Column1  Column2   Column3                   Column4    Column5
0     5FQ    1.047   $1,005A  UG44.2 as of 02/Jun/2016   S$8.2 mm
1     600  (1.047)    S$23.3   AG5.6 as of 02/Jun/2016    S$58 mm
2     KI2    1.695    S$5.35  RR59.5 as of 02/Jun/2016   S$705 mm
3     88G   0.0025   S$(5.3)  NW44.2 as of 02/Jun/2016   S$112 mm
4     60G     5.63    S$78.4  UG21.2 as of 02/Jun/2016  S$6.21 mm
5     90F  (5.562)  S$(88.3)  IG46.2 as of 02/Jun/2016     S$8 mm

In [132]: to_replace = [r'\(([\d\.]+)\)', r'.*?([\d\.\,\-]+).*', ',']

In [133]: vals = [r'-\1', r'\1', '']

In [134]: df.replace(to_replace=to_replace, value=vals, regex=True)
Out[134]:
  Column1 Column2 Column3 Column4 Column5
0       5   1.047    1005    44.2     8.2
1     600  -1.047    23.3     5.6      58
2       2   1.695    5.35    59.5     705
3      88  0.0025    -5.3    44.2     112
4      60    5.63    78.4    21.2    6.21
5      90  -5.562   -88.3    46.2       8
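
One follow-up not covered in the answer itself: the replaced cells are still strings, so if numeric dtypes are needed for further calculations, a conversion step (using a hypothetical variable name cleaned) can be added:

import pandas as pd

cleaned = df.replace(to_replace=to_replace, value=vals, regex=True)
cleaned = cleaned.apply(pd.to_numeric, errors='coerce')   # strings -> floats, anything invalid -> NaN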

OLD answer:

Yet another solution, which uses only DataFrame.replace() method:

In [28]: to_replace = [r'\(([\d\.]+)\)', r'.*?([\d\.-]+).*']

In [29]: vals = [r'-\1', r'\1']

In [30]: df.replace(to_replace=to_replace, value=vals, regex=True)
Out[30]:
  Column1 Column2 Column3 Column4 Column5
0       5   1.047    55.3    44.2     8.2
1     600  -1.047    23.3     5.6      58
2       2   1.695    5.35    59.5     705
3      88  0.0025    -5.3    44.2     112
4      60    5.63    78.4    21.2    6.21
5      90  -5.562   -88.3    46.2       8

1 Comment

Thanks for this MaxU. I was also wondering: if the columns have a comma within the values, e.g. $1,005A, this code drops everything and just keeps the value 1. Is there a way to modify the code so that it shows only 1005?

You could come up with:

import re

rx = re.compile(r'\d+[\d.]*')

def onlynumbers(value):
    # non-string cells pass through unchanged
    if not isinstance(value, str):
        return value
    # take the first number in the string; rx.search returns only the
    # first match, so the date further to the right is never picked up
    match = rx.search(value)
    if match is None:
        return value
    number = match.group(0)
    # an opening parenthesis before the number marks it as negative,
    # e.g. '(1.047)' or 'S$(5.3)'
    if '(' in value[:match.start()]:
        return '-' + number
    return number

df.applymap(onlynumbers)

This returns the desired output shown in the question.
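
A small side note (not from the original answer): DataFrame.applymap was deprecated in pandas 2.1 in favour of the element-wise DataFrame.map, so on recent pandas the last line becomes:

df.map(onlynumbers)   # element-wise, same result as applymap on older versions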

4 Comments

And how exactly did you get rid of the date? You should eliminate that one first, as described in my answer below.
@chrisvp: No, I should not - rx.search() returns only the first match which is not the date.
Ok, but column 5 would be 02, column 6 would be 2016 and only column 7 would be 8.2. So you'd need to skip 5 and 6, which boils down to eliminating the date. r'\d+[\d.]*' could be shortened to r'[\d.]+'
Not necessarily: consider the differences between [.\d]+ (yours), \d[.\d]* (mine) and the even stricter \d[.\d]*\d. Sometimes shortening comes at the price of inaccuracy (see the short demo after these comments).
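
A short demo of those three patterns (not from the thread; the sample strings are made up to show the edge cases):

import re

samples = ['S$8.2 mm', '5FQ', 'no digits here.']
for pattern in (r'[.\d]+', r'\d[.\d]*', r'\d[.\d]*\d'):
    matches = []
    for s in samples:
        m = re.search(pattern, s)
        matches.append(m.group(0) if m else None)
    print(pattern, matches)

# [.\d]+     -> ['8.2', '5', '.']     also matches a bare dot
# \d[.\d]*   -> ['8.2', '5', None]    must start with a digit
# \d[.\d]*\d -> ['8.2', None, None]   needs at least two characters, so single digits are missed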
