This is actually a follow-up to my earlier question here. I had not been clear in that question, and since it has already been answered, I felt it was better to post a new one instead.

I have a dataframe like below:

Column1    Column2    Column3    Column4                     Column5
5FQ        1.047      S$55.3     UG44.2 as of 02/Jun/2016    S$8.2 mm
600        (1.047)    S$23.3     AG5.6 as of 02/Jun/2016     S$58 mm
KI2        1.695      S$5.35     RR59.5 as of 02/Jun/2016    S$705 mm
88G        0.0025     S$(5.3)    NW44.2 as of 02/Jun/2016    S$112 mm
60G        5.63       S$78.4     UG21.2 as of 02/Jun/2016    S$6.21 mm
90F        (5.562)    S$(88.3)   IG46.2 as of 02/Jun/2016    S$8 mm

I am trying to use regex to drop all the words and letters, keeping only the numbers. However, if a number is enclosed in parentheses, I would like to turn it into a negative number instead.

Desired output

Column1    Column2    Column3    Column4       Column5
5          1.047      55.3       44.2          8.2
600        -1.047     23.3       5.6           58
2          1.695      5.35       59.5          705
88         0.0025     -5.3       44.2          112
60         5.63       78.4       21.2          6.21
90         -5.562     -88.3      46.2          8

Would this be possible? I've tried playing around with this code, but was not sure what the appropriate regex combination should be.

df.apply(lambda x: x.astype(str).str.extract(r'(\d+\.?\d*)', expand=True).astype(float))
  • I will write this as a comment, as my descriptive answer got downvoted: (\d+\.?\d*) matches any run of digits with an arbitrary number of decimals, including the 02 and 2016 of the date. Additionally, you are missing the sign. I would first replace every '(' (escaped) with '-', then delete everything matching the date format, then remove (replace with the empty string) anything that is not a space, digit or dot, something like [^0-9 .]* (you need to look it up, as regex syntax varies from one environment to another). After that you have your result separated by spaces; just match (\d+\.?\d*) and the value is in the capture group. (A rough sketch of this approach follows below.)
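
A rough sketch of the steps described in that comment (not code from the thread; it assumes the dates always look like "as of 02/Jun/2016" and uses a hypothetical variable name cleaned):

import pandas as pd

# 1) turn '(' into '-', 2) drop the date, 3) strip everything except digits, dots and '-'
cleaned = (df.astype(str)
             .replace(r'\(', '-', regex=True)                      # '(1.047)' -> '-1.047)'
             .replace(r'as of \d{2}/\w{3}/\d{4}', '', regex=True)  # drop the date suffix (assumed format)
             .replace(r'[^0-9.\-]', '', regex=True)                # keep only digits, dots and '-'
             .apply(pd.to_numeric, errors='coerce'))               # strings -> numbers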

3 Answers

r1 = r'\((\d+\.?\d*)\)'    # a number wrapped in parentheses, with the number captured
r2 = r'(-?\d+\.?\d*)'      # an optionally signed number

# regex=True needs to be passed explicitly on pandas >= 2.0
df.stack().str.replace(r1, r'-\1', n=1, regex=True) \
          .str.extract(r2, expand=False).unstack()

This produces the desired output shown in the question.


1 Comment

Thanks so much for this! Just wondering: I noticed that if a value has a comma in it, e.g. $1,005A, everything is dropped except the 1. Is there a way to keep it as just 1005?
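
One possible tweak for that case (not part of the original answer; it assumes commas only ever appear as thousands separators): strip the commas first, then apply the same replace/extract steps.

r1 = r'\((\d+\.?\d*)\)'
r2 = r'(-?\d+\.?\d*)'
(df.stack()
   .str.replace(',', '', regex=False)           # '$1,005A' -> '$1005A'
   .str.replace(r1, r'-\1', n=1, regex=True)    # '(5.3)'   -> '-5.3'
   .str.extract(r2, expand=False)
   .unstack())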

UPDATE: $1,005A --> 1005 (example in 1st row, column Column3)

In [131]: df
Out[131]:
  Column1  Column2   Column3                   Column4    Column5
0     5FQ    1.047   $1,005A  UG44.2 as of 02/Jun/2016   S$8.2 mm
1     600  (1.047)    S$23.3   AG5.6 as of 02/Jun/2016    S$58 mm
2     KI2    1.695    S$5.35  RR59.5 as of 02/Jun/2016   S$705 mm
3     88G   0.0025   S$(5.3)  NW44.2 as of 02/Jun/2016   S$112 mm
4     60G     5.63    S$78.4  UG21.2 as of 02/Jun/2016  S$6.21 mm
5     90F  (5.562)  S$(88.3)  IG46.2 as of 02/Jun/2016     S$8 mm

In [132]: to_replace = [r'\(([\d\.]+)\)', r'.*?([\d\.\,\-]+).*', ',']

In [133]: vals = [r'-\1', r'\1', '']

In [134]: df.replace(to_replace=to_replace, value=vals, regex=True)
Out[134]:
  Column1 Column2 Column3 Column4 Column5
0       5   1.047    1005    44.2     8.2
1     600  -1.047    23.3     5.6      58
2       2   1.695    5.35    59.5     705
3      88  0.0025    -5.3    44.2     112
4      60    5.63    78.4    21.2    6.21
5      90  -5.562   -88.3    46.2       8
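
One follow-up not covered in the answer itself: the replaced cells are still strings, so if numeric dtypes are needed for further calculations, a conversion step (using a hypothetical variable name cleaned) can be added:

import pandas as pd

cleaned = df.replace(to_replace=to_replace, value=vals, regex=True)
cleaned = cleaned.apply(pd.to_numeric, errors='coerce')   # strings -> floats, anything invalid -> NaN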

OLD answer:

Yet another solution, which uses only DataFrame.replace() method:

In [28]: to_replace = [r'\(([\d\.]+)\)', r'.*?([\d\.-]+).*']

In [29]: vals = [r'-\1', r'\1']

In [30]: df.replace(to_replace=to_replace, value=vals, regex=True)
Out[30]:
  Column1 Column2 Column3 Column4 Column5
0       5   1.047    55.3    44.2     8.2
1     600  -1.047    23.3     5.6      58
2       2   1.695    5.35    59.5     705
3      88  0.0025    -5.3    44.2     112
4      60    5.63    78.4    21.2    6.21
5      90  -5.562   -88.3    46.2       8

1 Comment

Thanks for this MaxU. I was also wondering: if the columns have a comma within the values, e.g. $1,005A, this code drops everything and just keeps the value 1. Is there a way to modify the code so that it shows only 1005?

You could come up with:

import re

rx = re.compile(r'\d+[\d.]*')

def onlynumbers(value):
    # non-string cells pass through unchanged
    if not isinstance(value, str):
        return value
    # take the first number in the string; rx.search returns only the
    # first match, so the date further to the right is never picked up
    match = rx.search(value)
    if match is None:
        return value
    number = match.group(0)
    # an opening parenthesis before the number marks it as negative,
    # e.g. '(1.047)' or 'S$(5.3)'
    if '(' in value[:match.start()]:
        return '-' + number
    return number

df.applymap(onlynumbers)

This returns the desired output shown in the question.
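
A small side note (not from the original answer): DataFrame.applymap was deprecated in pandas 2.1 in favour of the element-wise DataFrame.map, so on recent pandas the last line becomes:

df.map(onlynumbers)   # element-wise, same result as applymap on older versions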

4 Comments

And how exactly did you get rid of the date? You should eliminate that one first, as described in my answer below.
@chrisvp: No, I should not - rx.search() returns only the first match which is not the date.
Ok, but column 5 would be 02, column 6 would be 2016 and only column 7 would be 8.2. So you'd need to skip 5 and 6, which boils down to eliminating the date. r'\d+[\d.]*' could be shortened to r'[\d.]+'
Not necessarily: consider the differences between [.\d]+ (yours), \d[.\d]* (mine) and the even stricter \d[.\d]*\d. Sometimes shortening comes at the price of inaccuracy (see the short demo after these comments).
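
A short demo of those three patterns (not from the thread; the sample strings are made up to show the edge cases):

import re

samples = ['S$8.2 mm', '5FQ', 'no digits here.']
for pattern in (r'[.\d]+', r'\d[.\d]*', r'\d[.\d]*\d'):
    matches = []
    for s in samples:
        m = re.search(pattern, s)
        matches.append(m.group(0) if m else None)
    print(pattern, matches)

# [.\d]+     -> ['8.2', '5', '.']     also matches a bare dot
# \d[.\d]*   -> ['8.2', '5', None]    must start with a digit
# \d[.\d]*\d -> ['8.2', None, None]   needs at least two characters, so single digits are missed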
