So, I have a df similar to the one below in Pandas:

Name        URL
X           http://www.x.com/abc/xyz/url.html
X           http://www.x.com/yyz/hue/end.html
Othername   http://website.othername.com/abc.html
Othername   http://home.othername.com/someword/word.html
Example     http://www.example.com/999/something/index.html

I want to add an "Extract" column (using regex, I guess), as below:

Name        URL                                              Extract
X           http://www.x.com/abc/xyz/url.html                abc
X           http://www.x.com/yyz/hue/end.html                yyz 
Othername   http://website.othername.com/abc.html            website
Othername   http://home.othername.com/someword/word.html     home
Example     http://www.example.com/999/something/index.html  999

As you may see, the parts I want to extract vary according to the website. So, for the value 'X' under 'Name', I'd have to apply one regex pattern. For 'Othername', another pattern.

I have 6 different names (and 6 different patterns) for this.

I tried using np.where, but I could only make it work for one of the websites, not for multiple conditions:

df['Extract'] = np.where(df['Name'] == 'X', df.URL.str.extract(r'www\.x\.com\/(.*?)/'),'')

I also tried creating a function for this:

def ext(c):
    if c['Name'] == 'X':
        c.URL.str.extract(r'www\.x\.com\/(.*?)/')
    elif c['Name'] == 'Example':
        c.URL.str.extract(r'www\.example\.com\/(.*?)/')
    (...)
    else:
        return ''

df['Extract'] = df.apply(ext)
df

How can I make this work for the different strings I have under 'Name'?

2 Answers

Try this:

In [87]: df['Extract'] = (df.URL.replace([r'http[s]?://www\.[^/]*\/', r'http[s]?://'], ['',''], regex=True)
    ...:                    .str.extract(r'([^/.]*)', expand=False))
    ...:

In [88]: df
Out[88]:
        Name                                              URL  Extract
0          X                http://www.x.com/abc/xyz/url.html      abc
1          X                http://www.x.com/yyz/hue/end.html      yyz
2  Othername            http://website.othername.com/abc.html  website
3  Othername     http://home.othername.com/someword/word.html     home
4    Example  http://www.example.com/999/something/index.html      999
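
To make the two steps in the one-liner above easier to follow, here is a rough plain-`re` equivalent for a single URL. It is only a sketch: the function name `extract_token` is made up for illustration, and it assumes the replacements are applied in the order listed.

```python
import re

def extract_token(url):
    # Step 1: for www hosts, drop the scheme and the whole host,
    # up to and including the first slash after it.
    stripped = re.sub(r'http[s]?://www\.[^/]*\/', '', url)
    # Step 2: for everything else, drop only the scheme.
    stripped = re.sub(r'http[s]?://', '', stripped)
    # Step 3: keep the leading run of characters that are neither '/' nor '.'.
    return re.match(r'[^/.]*', stripped).group(0)

urls = [
    'http://www.x.com/abc/xyz/url.html',
    'http://website.othername.com/abc.html',
    'http://www.example.com/999/something/index.html',
]
tokens = [extract_token(u) for u in urls]
```

For www URLs, step 1 leaves the path, so the first path segment is kept; for other hosts, step 2 leaves the host, so the subdomain is kept.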

You can use a conditional regex:

import re
rx = re.compile(r'https?://(www)?(?(1)[^/+]+/([^/]+)|([^.]+))')
def extract(col):
    m = rx.match(col)
    if m is not None:
        return m.group(3) if m.group(3) is not None else m.group(2)
    else:
        return ''

df['Extract'] = df['URL'].apply(extract)

This assumes that you want the first path segment after the host when the subdomain is www, and the subdomain itself otherwise.


Broken down this says:

https?://   # match http:// or https://
(www)?      # capture www into group 1 if it is there
(?(1)       # check whether group 1 matched
    [^/+]+/ # ... if so, skip ahead past the next slash ...
    ([^/]+) # ... and capture the following path segment into group 2
|           # else
    ([^.]+) # capture the part directly after http:// (up to the first dot)
)           # into group 3
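
The breakdown above can also be kept inside the code itself. As a sketch, the same pattern is compiled here with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the regex; the behaviour should match the one-line version.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so the
# explanation lives next to each piece of the regex.
rx = re.compile(r"""
    https?://      # scheme: http:// or https://
    (www)?         # group 1: "www", if present
    (?(1)          # conditional on group 1
        [^/+]+/    #   yes: skip the rest of the host, up to the first slash
        ([^/]+)    #   group 2: first path segment
    |
        ([^.]+)    #   no: group 3 is the subdomain
    )
""", re.VERBOSE)

def extract(url):
    m = rx.match(url)
    if m is None:
        return ''
    # Group 3 is set for non-www hosts, group 2 for www hosts.
    return m.group(3) if m.group(3) is not None else m.group(2)

www_result = extract('http://www.x.com/abc/xyz/url.html')
sub_result = extract('http://home.othername.com/someword/word.html')
```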

See a demo on regex101.com.

1 Comment

the df.col.apply(extract) solution helped me apply different regex's to the same column without replacing the previously extracted values--woot!
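
If the patterns really do need to differ per 'Name', as the question describes, one possible approach is a lookup table from name to pattern. This is only a sketch: `PATTERNS` and `extract_for` are hypothetical names, the X and Example patterns follow the question's own attempts, and the Othername pattern is an assumption inferred from the expected output.

```python
import re

# Hypothetical mapping: one compiled pattern per value of 'Name',
# each with exactly one capture group.
PATTERNS = {
    'X': re.compile(r'www\.x\.com/(.*?)/'),
    'Example': re.compile(r'www\.example\.com/(.*?)/'),
    # Assumed pattern: the subdomain in front of othername.com.
    'Othername': re.compile(r'://([^.]+)\.othername\.com'),
}

def extract_for(name, url):
    """Return the capture group of the pattern registered for name, or ''."""
    pattern = PATTERNS.get(name)
    if pattern is None:
        return ''
    m = pattern.search(url)
    return m.group(1) if m else ''

rows = [
    ('X', 'http://www.x.com/abc/xyz/url.html'),
    ('X', 'http://www.x.com/yyz/hue/end.html'),
    ('Othername', 'http://website.othername.com/abc.html'),
    ('Othername', 'http://home.othername.com/someword/word.html'),
    ('Example', 'http://www.example.com/999/something/index.html'),
]
extracted = [extract_for(name, url) for name, url in rows]
```

On a DataFrame this could be applied row by row, for example with df.apply(lambda row: extract_for(row['Name'], row['URL']), axis=1); adding a new site then only means adding one entry to the table.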
