Any way to create pandas dataframe by parsing/splitting list of urls?

Question

I want to create a pandas DataFrame from a list of URLs, splitting each URL by its hierarchy and creating a new column for each part. More specifically, I want to break each URL up into protocol, domain, path, query, and fragment. I think this is doable with pandas, and I tried the solution below, but it didn't give the output I expected.

example data snippet

Here is an example data snippet in a CSV file, and here is my attempt:

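import pandas as pd
from urllib.parse import urlsplit  # Python 3; in Python 2 this was urlparse.urlsplit

df = pd.read_csv('example data snippet.csv')
# urlsplit returns (scheme, netloc, path, query, fragment) for each url
df['protocol'], df['domain'], df['path'], df['query'], df['fragment'] = zip(*df['url'].map(urlsplit))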

The attempt above wasn't successful because its output doesn't match what I expect, so I am wondering whether there is a better way to do this with pandas. Can anyone point out how to make this work?

desired output

I want to split each URL and create a new column for each component. The columns of my final pandas DataFrame would look like this:

df.columns=['id', 'title', 'news source', 'topic', 'news category']

For example, given these URLs:

'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/'

I would want:

news source = ['variety.com', 'variety.com']
topic = ['tax-march-donald-trump-protest', 'list-2018-oscar-nominations']
news category = ['biz', 'film']

How can I do this kind of parsing for a given list of URLs and add the results as new columns in a pandas DataFrame? Thanks in advance.

kztd · Accepted Answer · 2019-03-23 02:03:54Z


How many URLs do you have?

I think I would go one by one, because you're ignoring a varying amount of each URL and you'll need to write rules for what to ignore.

If you use url.split("/") you'll get a list, but then you need to remove the pieces you don't need and keep the ones you want.
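To see what there is to filter, here is a small sketch of what split produces for one of the URLs from the question. Note the purely numeric year and the empty string left by the trailing slash:

```python
url = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'

# Splitting on "/" keeps every segment, including the empty one after the final slash
parts = url.split("/")
print(parts)
# ['variety.com', '2017', 'biz', 'news', 'tax-march-donald-trump-protest-1202031487', '']
```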

Once you have what you want, it will be in a shape that you can put straight into a DataFrame:

import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

cols = ['c1', 'c2', 'c3', 'c4']
make_me = []
for url in urls:
    lst = url.split("/")
    # your business rules go here: drop empty and purely numeric segments
    make_me.append([x for x in lst if x and not x.isdigit()])

df = pd.DataFrame(make_me, columns=cols)
df


   c1           c2    c3    c4
0  variety.com  biz   news  tax-march-donald-trump-protest-1202031487
1  variety.com  film  news  list-2018-oscar-nominations-1202668757

Then you could reference each column as you like:

df.c1

0    variety.com
1    variety.com
Name: c1, dtype: object

and still have it all together and indexed. I think the rules might get tough, and you might need to make them domain-specific.
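As an illustration of a domain-specific rule, here is a minimal sketch that produces the column names from the question. It assumes every URL follows the variety.com pattern `<domain>/<year>/<category>/<section>/<slug>-<number>/` and that the trailing number can serve as the id. The function name parse_variety is made up for this example, and the 'title' column is left out because the question doesn't say where it comes from.

```python
from urllib.parse import urlsplit

import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']


def parse_variety(url):
    """Hypothetical rule for URLs shaped like the two examples above."""
    # urlsplit only fills in netloc when the URL has a scheme or a leading "//"
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    # Drop the empty strings produced by the leading and trailing slashes
    segments = [s for s in parts.path.split("/") if s]
    year, category, section, slug = segments[:4]
    # Assumption: the text after the last "-" is a numeric article id
    topic, _, article_id = slug.rpartition("-")
    return {"id": article_id,
            "news source": parts.netloc,
            "topic": topic,
            "news category": category}


df = pd.DataFrame([parse_variety(u) for u in urls])
print(df)
```

URLs from other sites would most likely need their own rule, and a URL with fewer than four path segments would raise a ValueError here.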

edited Mar 23, 2019 at 2:03

answered Mar 23, 2019 at 1:53


6 Comments

woody Over a year ago

Thanks for your solution. How can I modify the c4 column to get tax-march-donald-trump-protest instead of tax-march-donald-trump-protest-1202031487? Thank you.

kztd Over a year ago

Again, split it on "-" and get rid of the last piece, but that might only work for variety.com.
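A minimal sketch of that suggestion with the pandas string accessor, assuming the DataFrame has the c4 column from the answer (the c4_text column name is made up here):

```python
import pandas as pd

df = pd.DataFrame({"c4": ["tax-march-donald-trump-protest-1202031487",
                          "list-2018-oscar-nominations-1202668757"]})

# Split once from the right on "-" and keep everything before the last piece
df["c4_text"] = df["c4"].str.rsplit("-", n=1).str[0]
print(df["c4_text"].tolist())
# ['tax-march-donald-trump-protest', 'list-2018-oscar-nominations']
```

This drops the last piece whether or not it is a number, so it only fits slugs that always end in an id.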

woody Over a year ago

Thanks. Is it possible to keep only the text in c4? Could you point out how to do that?

kztd Over a year ago

Check out the string method isdigit().

woody Over a year ago

I tried this: for i in df.c4: lst = i.split("-"); res.append([''.join(x) for x in lst if not x.isdigit()]) but the digits in the middle were removed as well. I just want to get rid of the digits at the tail of the text. Any better idea? Thank you.
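The isdigit() filter in the attempt above is applied to every piece, so an all-digit piece in the middle such as 2018 is dropped along with the tail. One possible alternative, sketched here under the assumption that the unwanted part is always a final "-" followed only by digits, is an anchored regular expression (the c4_text column name is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"c4": ["tax-march-donald-trump-protest-1202031487",
                          "list-2018-oscar-nominations-1202668757",
                          "no-numeric-tail"]})

# "$" anchors the match to the end, so only a trailing "-<digits>" is removed
df["c4_text"] = df["c4"].str.replace(r"-\d+$", "", regex=True)
print(df["c4_text"].tolist())
# ['tax-march-donald-trump-protest', 'list-2018-oscar-nominations', 'no-numeric-tail']
```

Values without a numeric tail are left unchanged.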
