0

I have a df with variable named url. Each url string in url has a unique six character alphanumeric ID in the URL string. Ive been trying to extract a specific part of each string, the article_id from all urls, and then add it to the df as a new variable.

For example, xwpd7w is the article_id for https://www.vice.com/en_us/article/xwpd7w/how-a-brooklyn-gang-may-have-gotten-crazy-rich-dealing-for-el-chapo

How do I extract article_ids from all urls in the df based on their position next to /article/? Using any method, regex or not?

I have so far done the following:

df.url.str.split()

ex output: [https://www.vice.com/en_au/article/j539yy/smo...

df['cutcurls'] = df.url.str.join(sep=' ')
ex output: h t t p s : / / w w w . v i c e . c o m / e n

Any ideas?

1 Answer 1

1

Apply the "str.extract" method.

df=pd.DataFrame({"url":["https://www.vice.com/en_us/article/xwpd7w/how-a-brooklyn-gang-may-have-gotten-crazy-rich-dealing-for-el-chapo","https://www.www.www//en_us/article/idId2019/buzzwords"]}) 

df["articel_id"]= df.url.str.extract(r"/article/([^/]+)")

    Out:
        url articel_id
        0  https://www.vice.com/en_us/article/xwpd7w/how-...     xwpd7w
        1  https://www.www.www//en_us/article/idId2019/bu...   idId2019

([^/]+): groups consecutive non '/' characters

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.