Multiple conditions for new column using Regex on Pandas dataframe

Question

So, I have a df similar to the one below in Pandas:

Name        URL
X           http://www.x.com/abc/xyz/url.html
X           http://www.x.com/yyz/hue/end.html
Othername   http://website.othername.com/abc.html
Othername   http://home.othername.com/someword/word.html
Example     http://www.example.com/999/something/index.html

I want to add an "Extract" column, presumably using regex, as below:

Name        URL                                              Extract
X           http://www.x.com/abc/xyz/url.html                abc
X           http://www.x.com/yyz/hue/end.html                yyz 
Othername   http://website.othername.com/abc.html            website
Othername   http://home.othername.com/someword/word.html     home
Example     http://www.example.com/999/something/index.html  999

As you may see, the parts I want to extract vary according to the website. So, for the value 'X' under 'Name', I'd have to apply one regex pattern. For 'Othername', another pattern.

I have 6 different names (and 6 different patterns) for this.

I tried using np.where, but I could only make it work for one of the websites, not for multiple conditions. As follows:

df['Extract'] = np.where(df['Name'] == 'X', df.URL.str.extract(r'www\.x\.com\/(.*?)/'),'')

I also tried creating a function for this:

def ext(c):
    if c['Name'] == 'X':
        c.URL.str.extract(r'www\.x\.com\/(.*?)/')
    elif c['Name'] == 'Example':
        c.URL.str.extract(r'www\.example\.com\/(.*?)/')
    (...)
    else:
        return ''

df['Extract'] = df.apply(ext)
df

How can I make this work for the different str I have under 'Name'?

MaxU - stand with Ukraine · Accepted Answer · 2017-11-24 22:10:31Z

Try this:

In [87]: df['Extract'] = (df.URL.replace([r'http[s]?://www\.[^/]*\/', r'http[s]?://'], ['',''], regex=True)
    ...:                    .str.extract(r'([^/.]*)', expand=False))
    ...:

In [88]: df
Out[88]:
        Name                                              URL  Extract
0          X                http://www.x.com/abc/xyz/url.html      abc
1          X                http://www.x.com/yyz/hue/end.html      yyz
2  Othername            http://website.othername.com/abc.html  website
3  Othername     http://home.othername.com/someword/word.html     home
4    Example  http://www.example.com/999/something/index.html      999
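To see what the chained call is doing, here is a minimal sketch that applies the same two substitutions and the same final pattern to plain strings with the standard re module. The helper name first_token is made up for illustration, and the sketch assumes the two replacement patterns are applied in the order they are listed.

```python
import re

# The two patterns from the answer, applied one after the other.
STRIP_PATTERNS = [r'http[s]?://www\.[^/]*\/', r'http[s]?://']

def first_token(url):
    # Hypothetical helper, not part of pandas.
    # 1) Drop "http(s)://www.<host>/" if present, 2) drop any remaining scheme.
    for pattern in STRIP_PATTERNS:
        url = re.sub(pattern, '', url)
    # Keep the leading run of characters that are neither "/" nor ".".
    return re.match(r'([^/.]*)', url).group(1)

urls = [
    'http://www.x.com/abc/xyz/url.html',
    'http://website.othername.com/abc.html',
    'http://www.example.com/999/something/index.html',
]
print([first_token(u) for u in urls])  # ['abc', 'website', '999']
```

So for www hosts the whole host is removed and the first path segment is left at the front of the string; for other hosts only the scheme is removed, which leaves the subdomain at the front.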

answered Nov 24, 2017 at 22:10


Jan · Accepted Answer · 2017-11-24 22:37:56Z

You can use a conditional regex:

import re
rx = re.compile(r'https?://(www)?(?(1)[^/+]+/([^/]+)|([^.]+))')
def extract(col):
    m = rx.match(col)
    if m is not None:
        return m.group(3) if m.group(3) is not None else m.group(2)
    else:
        return ''

df['Extract'] = df['URL'].apply(extract)

This assumes that you want the first path segment after the host when the subdomain is www, and the subdomain itself otherwise.

Broken down this says:

https?://   # match http:// or https://
(www)?      # capture www into group 1 if it is there
(?(1)       # check if it was matched
    [^/+]+/ # ... and if so, skip ahead past the next / ...
    ([^/]+) # ... and capture the following path segment into group 2
|           # else
    ([^.]+) # capture the part directly after http:// (the subdomain)
)           # into group 3

See a demo on regex101.com.
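If each name really does need its own pattern, as the question describes, another option is a lookup table keyed by name. The following is only a sketch: PATTERNS and extract_for are hypothetical names, the 'X' and 'Example' patterns are adapted from the question, and the 'Othername' pattern is an assumption, since the question does not show it.

```python
import re

# Hypothetical lookup table: one compiled pattern per value of 'Name'.
PATTERNS = {
    'X': re.compile(r'www\.x\.com/(.*?)/'),
    'Example': re.compile(r'www\.example\.com/(.*?)/'),
    # Assumed pattern (not given in the question): capture the subdomain.
    'Othername': re.compile(r'https?://([^.]+)\.othername\.com'),
}

def extract_for(name, url):
    """Return group 1 of the pattern registered for name, or '' if none applies."""
    pattern = PATTERNS.get(name)
    if pattern is None:
        return ''
    m = pattern.search(url)
    return m.group(1) if m else ''

rows = [
    ('X', 'http://www.x.com/abc/xyz/url.html'),
    ('Othername', 'http://home.othername.com/someword/word.html'),
    ('Example', 'http://www.example.com/999/something/index.html'),
    ('Unknown', 'http://www.unknown.com/a/b.html'),
]
print([extract_for(n, u) for n, u in rows])  # ['abc', 'home', '999', '']
```

With the DataFrame from the question, something like df['Extract'] = [extract_for(n, u) for n, u in zip(df['Name'], df['URL'])] should fill the column row by row. Note that the ext function in the question has three separate problems: df.apply(ext) without axis=1 passes columns rather than rows, a single row's URL is a plain string with no .str accessor, and the if/elif branches never return a value.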

edited Nov 24, 2017 at 22:37

answered Nov 24, 2017 at 22:10


1 Comment

xtian Over a year ago

The df.col.apply(extract) solution helped me apply different regexes to the same column without replacing the previously extracted values. Woot!
