Regular expression python dataframe element

Question

I had a question (answered excellently) here: Python parse dataframe element

Unfortunately, my data source has other conditions which need to be handled.

Current pattern is

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'

trans_field_attr = df['Data Type'].str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]

This handles the (precision,scale) version perfectly, e.g. NUMBER(22,4). Unfortunately, it does not capture the value in brackets when there is only a single value.

For example:

0        VARCHAR2(1)
1        VARCHAR2(1)
2        VARCHAR2(1)
3        VARCHAR2(1)
4        VARCHAR2(1)
5            DATE(7)
6            DATE(7)
7            DATE(7)
8            DATE(7)
9        VARCHAR2(1)
10           DATE(7)
11       VARCHAR2(3)
12       VARCHAR2(3)
13               NaN
14       VARCHAR2(3)
15      NUMBER(22,4)

How could the pattern be improved to pick up single values as well?

Apologies but I really struggled to take it further from piRSquared's answer...

R Nar · Accepted Answer · 2016-05-31 18:22:33Z

Wrap the comma and the second number in a non-capturing group, then add a ? (zero-or-one) quantifier after it, like below.

([^\(]+)(\(([^,]*)(?:,(.*))?\))?
                  (?:     )? <= this part means that the comma and everything following it
                                is optional, just like the ? quantifier at the very end.
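As a rough sketch of how this pattern slots into the extraction line from the question: the sample values below are taken from the question's data, and the column names `type`, `precision` and `scale` are only illustrative assumptions.

```python
import pandas as pd

# Sample values from the question; None stands in for the missing (NaN) row.
df = pd.DataFrame({"Data Type": ["VARCHAR2(1)", "DATE(7)", None, "NUMBER(22,4)"]})

# Pattern from this answer: the comma and the second value are optional.
pattern = r'([^\(]+)(\(([^,]*)(?:,(.*))?\))?'

# The non-capturing group adds no column, so the capture groups are still
# name (0), whole bracketed part (1), first value (2) and second value (3).
trans_field_attr = df["Data Type"].str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]

# Hypothetical column names, chosen here only for readability.
trans_field_attr.columns = ["type", "precision", "scale"]
print(trans_field_attr)
```

Rows with a single bracketed value should get that value in the second column and a missing value in the third, while the `iloc[:, [0, 2, 3]]` selection from the question stays unchanged.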

1 Comment

R Nar Over a year ago

Just a note that it is worth looking at oz123's answer if you want something that is a lot more scalable (say, if you want more than two values between the brackets).

oz123 · Accepted Answer · 2016-05-31 18:29:31Z

If you are just trying to extract the number between the brackets, you can use a much simpler version:

In [1]: import re
In [2]: rgx=re.compile(r"\w+\((?P<num>\d*\,*\d*)")
In [5]: m=rgx.match("VARCHAR(22,22)")
In [10]: m.groupdict()
Out[10]: {'num': '22,22'}

In [16]: m=rgx.match("VARCHAR(22)")
In [17]: m.groupdict()['num']
Out[17]: '22'

1 Comment

R Nar Over a year ago

It might be worth mentioning that this would require additional parsing (something like m.groupdict()['num'].split(',')), since it seems that the OP wants to split each element up.
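The additional parsing mentioned here could look roughly like the sketch below. It reuses the pattern from the answer as a raw string; `bracket_values` is a hypothetical helper name, and returning an empty list for strings without brackets is an assumption about the desired behaviour.

```python
import re

# Raw-string form of the pattern from the answer above.
rgx = re.compile(r"\w+\((?P<num>\d*,*\d*)")

def bracket_values(data_type):
    """Return the bracketed values as a list of strings (hypothetical helper)."""
    m = rgx.match(data_type)
    if m is None:
        # No opening bracket after the type name, so there is nothing to split.
        return []
    return m.group("num").split(",")

print(bracket_values("VARCHAR(22,22)"))  # ['22', '22']
print(bracket_values("VARCHAR(22)"))     # ['22']
```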
