I want to create a pandas DataFrame from a list of URLs, splitting each URL by its hierarchy and creating a new column for each part. More specifically, I want to break each URL up into protocol, domain, path, query, and fragment. I think this is doable with pandas; I tried the solution below, but it didn't give the result I expected.

example data snippet

Here is an example data snippet in a CSV file, and here is my attempt:

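import pandas as pd
from urllib.parse import urlsplit  # Python 3; in Python 2 this lived in the urlparse module

df = pd.read_csv('example data snippet.csv')
# urlsplit returns (scheme, netloc, path, query, fragment) for each URL
df['protocol'], df['domain'], df['path'], df['query'], df['fragment'] = zip(*df['url'].map(urlsplit))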

The attempt above wasn't successful because its output doesn't match what I expect, so I am wondering whether there is a better way to do this with pandas. Can anyone point out how to make this work?
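One possible reason for the unexpected output, assuming the URLs in the CSV have no scheme such as https://, is that urlsplit then leaves the domain empty and puts everything into the path. The sketch below uses two made-up URLs (not from the actual CSV) to show the difference:

```python
import pandas as pd
from urllib.parse import urlsplit

# Hypothetical URLs for illustration only; the second one has no scheme.
df = pd.DataFrame({
    "url": [
        "https://variety.com/2017/biz/news/some-story-123/?ref=home#top",
        "variety.com/2017/biz/news/some-story-123/",
    ]
})

# urlsplit returns a 5-field tuple: scheme, netloc, path, query, fragment.
cols = ["protocol", "domain", "path", "query", "fragment"]
parsed = pd.DataFrame(
    [tuple(p) for p in df["url"].map(urlsplit)],
    columns=cols,
    index=df.index,
)
df = df.join(parsed)

# Row 0 gets protocol "https" and domain "variety.com".
# Row 1 has no scheme, so protocol and domain are empty strings
# and the whole string ends up in "path".
print(df[cols])
```

If that is what is happening, either prepend a scheme before parsing or split the path on "/" by hand.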

desired output

I want to split each URL and create a new column for each component. The columns of my final pandas DataFrame would look like this:

df.columns=['id', 'title', 'news source', 'topic', 'news category']

For example, for these URLs, I would expect:

'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/'

news source = ['variety.com', 'variety.com']
topic = ['tax-march-donald-trump-protest', 'list-2018-oscar-nominations']
news category = ['biz', 'film']

How can I do this kind of parsing for a given list of URLs and add the results as new columns in a pandas DataFrame? Thanks in advance.

1 Answer

How many URLs do you have?

I think I would go through them one by one, because you're ignoring an arbitrary amount of each URL and you'll need to write rules for what to ignore.

If you use url.split("/") you'll get a list, but then you need to remove the parts you don't need and keep the ones you want.

Once you have what you want, it will be in a nice shape that you can put into a DataFrame:

import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

cols = ['c1', 'c2', 'c3', 'c4']
make_me = []
for url in urls:
    lst = url.split("/")
    # your business rules go here; this one drops empty strings
    # and all-digit parts such as the year
    make_me.append([x for x in lst if x and not x.isdigit()])

df = pd.DataFrame(make_me, columns=cols)
df


    c1           c2    c3    c4
0   variety.com  biz   news  tax-march-donald-trump-protest-1202031487
1   variety.com  film  news  list-2018-oscar-nominations-1202668757

Then you could reference each column as you like:

df.c1

>
0    variety.com
1    variety.com
Name: c1, dtype: object

and still have it all together and indexed. I think the rules might get tough and you might need to make them domain specific.
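If every URL follows the same layout as the two variety.com examples (domain/year/category/section/slug-id/, which is an assumption and may not hold for other sites), a rough sketch that fills the column names from the question could look like this. It picks parts by position and drops whatever follows the last "-" in the slug:

```python
import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

df = pd.DataFrame({"url": urls})

# Assumed layout: domain / year / category / section / slug-id
parts = df["url"].str.strip("/").str.split("/", expand=True)

df["news source"] = parts[0]
df["news category"] = parts[2]
# Drop everything after the last "-" (assumed to be the numeric article id).
df["topic"] = parts[4].str.rsplit("-", n=1).str[0]

print(df[["news source", "news category", "topic"]])
```

Picking by position is simple but fragile: a URL with an extra or missing path segment would land in the wrong column, so other domains would probably need their own rules.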

6 Comments

Thanks for your solution. How can I modify the c4 column to get tax-march-donald-trump-protest instead of tax-march-donald-trump-protest-1202031487? Thank you.
Again, split it on "-" and get rid of the last piece, but that might only work for variety.com.
Thanks. Is it possible to keep only the text in c4? Could you point out how to do that?
Check out the string method isdigit().
I tried this: for i in df.c4: lst=i.split("-") res.append([''.join(x) for x in lst if not x.isdigit()]) but the digits in the middle were removed too. I only want to remove the digits at the tail of the text. Any better idea? Thank you.
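One way to remove only the trailing number, sketched here under the assumption that the id is always a final run of digits after a hyphen, is a regular expression anchored at the end of the string. Filtering every piece with isdigit() also removes pieces like 2018 in the middle, which is why those disappeared. The Series below is a stand-in for df.c4:

```python
import pandas as pd

# Stand-in for df.c4 from the answer above.
c4 = pd.Series([
    "tax-march-donald-trump-protest-1202031487",
    "list-2018-oscar-nominations-1202668757",
])

# "-\d+$" matches a hyphen plus digits only at the very end of the string,
# so "2018" in the middle of the second value is left alone.
cleaned = c4.str.replace(r"-\d+$", "", regex=True)

print(cleaned.tolist())
```

Values that do not end in "-digits" pass through unchanged, so this should be safe to apply to the whole column.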