Modify data column with regex

Question

I have a dataset called data. There's a column called networkDomain that looks like this (data['networkDomain']):

0                amazonaws.com
1               vodafone-ip.de
2             ask4internet.com
3                   actcorp.in
4                    (not set)
5                    (not set)
6                   druknet.bt
7              unknown.unknown
8         alliancebroadband.in
9                  vsnl.net.in
10          grandenetworks.net
11             superonline.net
12                   (not set)
13             unknown.unknown
14             unknown.unknown
15                  fidnet.com
16                   (not set)
17             telepacific.net
18                    pldt.net
19        networkbackup.com.au

I would like to keep all the values ending with '.com' or '.net' using regex and set all other values to 0.

I've tried data['networkDomain'][data['networkDomain'].str.contains(".com$|.net$", regex=True)] which returns:

0                  amazonaws.com
2               ask4internet.com
10            grandenetworks.net
11               superonline.net
15                    fidnet.com
17               telepacific.net
18                      pldt.net
22                       tdc.net
24                     qwest.net
26                     hinet.net
27                     ztomy.com
29                netvigator.com
30                    level3.net
31                   virginm.net
32                        rr.com
41                 sbcglobal.net
49                      pldt.net
51                  1asiacom.net
56                     yesup.net
59                 btireland.net
60                     avast.com

How can I set all the other values in data['networkDomain'] that don't end with '.net' or '.com' to 0?

'0', NULL, or do you mean that you want to delete those values? – Luuk, commented Jul 20, 2019 at 14:46

user459872 · Accepted Answer · 2019-07-21 10:17:27Z

1

You can use Series.apply, which calls a function on every value of the column and returns the results as a new Series.

>>> import re
>>> import pandas as pd
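>>> regex = re.compile(r"\.(?:com|net)$")  # escaped dot: match a literal "."
>>>
>>> def my_func(value):
...     # Series.apply passes each cell value, not a whole row
...     if regex.search(value):
...         return value
...     return 0  # default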
...
>>> df = pd.DataFrame(
...     [
...         {"Domain": " amazonaws.com"},
...         {"Domain": " amazonaws2.com"},
...         {"Domain": " amazonaws.net"},
...         {"Domain": "(not set)"},
...     ]
... )
>>>
>>> df["Domain"] = df["Domain"].apply(my_func)
>>> print(df)
            Domain
0    amazonaws.com
1   amazonaws2.com
2    amazonaws.net
3                0
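As a possible alternative (not part of the original answer), the same result can probably be had without a Python-level function by combining Series.str.contains with Series.where. This is only a sketch: the sample values are made up to mirror the question, and the cast to object is an assumption made so the column can hold both strings and the integer 0.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's column
data = pd.DataFrame(
    {
        "networkDomain": [
            "amazonaws.com",
            "vodafone-ip.de",
            "(not set)",
            "pldt.net",
            "networkbackup.com.au",
        ]
    }
)

# True where the value ends in ".com" or ".net"; the dot is escaped so it is literal
keep = data["networkDomain"].str.contains(r"\.(?:com|net)$", regex=True, na=False)

# Cast to object so strings and the integer 0 can share the column,
# then keep values where the mask is True and substitute 0 elsewhere
data["networkDomain"] = data["networkDomain"].astype(object).where(keep, 0)
print(data)
```

Note that "networkbackup.com.au" becomes 0 here, because the pattern is anchored at the end of the string.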

edited Jul 21, 2019 at 10:17

answered Jul 20, 2019 at 15:16

user459872


ComplicatedPhenomenon · Accepted Answer · 2019-07-20 15:24:22Z

1

Find the rows that don't satisfy the condition and set the value in each of those rows to 0:

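import re

# Iterate over (index label, value) pairs so .loc receives the real label
for idx, value in data['networkDomain'].items():
    # str() guards against non-string values such as NaN
    if re.search(r'\.(com|net)$', str(value)) is None:
        data.loc[idx, 'networkDomain'] = 0
print(data)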
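A loop-free variant of the same idea, offered only as a sketch: build a boolean mask of the non-matching rows and assign through .loc in one step. The frame below is hypothetical and deliberately uses an index that is not 0..n-1, to show that the assignment goes by label rather than by position.

```python
import pandas as pd

# Hypothetical data with a non-default index, where positions and labels differ
data = pd.DataFrame(
    {"networkDomain": ["fidnet.com", "(not set)", "druknet.bt", "tdc.net"]},
    index=[10, 20, 30, 40],
)

# Allow the column to hold both strings and the integer 0
data["networkDomain"] = data["networkDomain"].astype(object)

# Boolean mask of rows that do NOT end in ".com" or ".net"
no_match = ~data["networkDomain"].str.contains(r"\.(?:com|net)$", na=False)

# Label-based assignment through the mask
data.loc[no_match, "networkDomain"] = 0
print(data)
```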

answered Jul 20, 2019 at 15:24

ComplicatedPhenomenon


Tom Bailey · Accepted Answer · 2019-07-20 15:25:59Z

1

Use Series.apply() to apply a function to every value in the Series. Note that the args argument must be passed as a tuple:

from pandas import DataFrame
import re

d = {'col': [1, 2, 3], 'col2': ['a.net', 2, 3]}
df = DataFrame(d)

def mask0(s, pattern):
    # Convert to str so non-string values (e.g. ints) can be tested too
    s = str(s)
    if pattern.search(s):
        return s
    return 0

# Escape the dot and anchor the alternation at the end of the string
pat = re.compile(r'\.(net|com)$')
df['col2'] = df['col2'].apply(mask0, args=(pat,))

print(df)
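A few pitfalls are easy to hit when writing a pattern for this task, so here is a small check using only the standard re module. The sample strings are made up for illustration. Square brackets create a character class (one character from a set), not an alternation; re.match only anchors at the start of the string; and an unescaped dot matches any character.

```python
import re

# Character class: matches ONE character from the set . n e t | c o m
char_class = re.compile(r".+[\.net|\.com]")
# Alternation with an escaped dot, anchored at the end of the string
alternation = re.compile(r"\.(?:net|com)$")

samples = ["a.net", "vodafone-ip.de", "telecom", "networkbackup.com.au"]

for s in samples:
    # Print whether each pattern accepts the sample
    print(s, bool(char_class.match(s)), bool(alternation.search(s)))

# An unescaped dot matches any character, so ".com$" accepts "telecom"
print(bool(re.search(r".com$", "telecom")))
print(bool(re.search(r"\.com$", "telecom")))
```

The character-class pattern accepts all four samples, while the anchored alternation accepts only "a.net".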

answered Jul 20, 2019 at 15:25

Tom Bailey
