separating and extracting part of strings of URLs using regex?

Question

I have a df with variable named url. Each url string in url has a unique six character alphanumeric ID in the URL string. Ive been trying to extract a specific part of each string, the article_id from all urls, and then add it to the df as a new variable.

For example, xwpd7w is the article_id for https://www.vice.com/en_us/article/xwpd7w/how-a-brooklyn-gang-may-have-gotten-crazy-rich-dealing-for-el-chapo

How do I extract article_ids from all urls in the df based on their position next to /article/? Using any method, regex or not?

I have so far done the following:

df.url.str.split()

ex output: [https://www.vice.com/en_au/article/j539yy/smo...

df['cutcurls'] = df.url.str.join(sep=' ')
ex output: h t t p s : / / w w w . v i c e . c o m / e n

Any ideas?

kantal · Accepted Answer · 2019-10-10 17:27:04Z

1

Apply the "str.extract" method.

df=pd.DataFrame({"url":["https://www.vice.com/en_us/article/xwpd7w/how-a-brooklyn-gang-may-have-gotten-crazy-rich-dealing-for-el-chapo","https://www.www.www//en_us/article/idId2019/buzzwords"]}) 

df["articel_id"]= df.url.str.extract(r"/article/([^/]+)")

    Out:
        url articel_id
        0  https://www.vice.com/en_us/article/xwpd7w/how-...     xwpd7w
        1  https://www.www.www//en_us/article/idId2019/bu...   idId2019

([^/]+): groups consecutive non '/' characters

answered Oct 10, 2019 at 17:27

kantal

2,4072 gold badges10 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

separating and extracting part of strings of URLs using regex?

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related