Extracting Path from URLs in DataFrame

Question

I'm sure the answer to this is simple - I just can't find it for some reason.

I'd like to extract the URL path from a DataFrame of URLs without using a for loop, as I'll be running this against 1M+ rows and loops are too slow.

import pandas as pd
from urllib.parse import urlparse

d = {'urls': ['https://www.example.com/ex/1','https://www.example.com/1/ex']}
df = pd.DataFrame(data=d)
df
df['urls'].apply(urlparse)

Above is where I'm at, which returns a Series of ParseResult objects holding all the parts of each URL as parsed by urllib.

The desired end result is a DataFrame like the below:

d = {'urls': ['https://www.example.com/ex/1','https://www.example.com/1/ex'], 'url_path': ['/ex/1', '/1/ex']}

If anyone knows how to solve this, I'd appreciate the help!

Thanks!

Why do you need a dataframe? What you've shown is just a dictionary, and looping over a dictionary or over the strings of a dataframe would take about the same execution time.
– OneCricketeer, Commented Dec 13, 2021 at 3:25

tozCSS · Accepted Answer · 2021-12-13 17:38:29Z

The docstring of urlparse says that its result is a named 6-tuple with these fields: <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
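As a quick illustration of that tuple layout, here is a minimal sketch using a made-up URL (the query string and fragment are only there to fill the fields; they are not from the question). It shows that the path can be read either by position or by field name:

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to populate several fields of the result.
parts = urlparse("https://www.example.com/ex/1?q=1#frag")

scheme = parts.scheme        # 'https'
netloc = parts.netloc        # 'www.example.com'
path_by_name = parts.path    # '/ex/1'
path_by_index = parts[2]     # same value, read by position in the 6-tuple
query = parts.query          # 'q=1'
fragment = parts.fragment    # 'frag'
```

Reading `parts.path` instead of `parts[2]` gives the same value and may be easier to follow later.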

So the solution is two commands:

Get the element at index 2 of the urlparse result, which is the path.
To convert the df into your desired format, pass orient='list' to the DataFrame's to_dict method.

df['paths'] = df['urls'].apply(lambda x: urlparse(x)[2])
df.to_dict(orient='list')

Results in

{'urls': ['https://www.example.com/ex/1', 'https://www.example.com/1/ex'],
 'paths': ['/ex/1', '/1/ex']}
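Since the question is about speed on 1M+ rows, note that apply with a Python function still calls that function once per row, so it is not a truly vectorized operation. One alternative, sketched here as an assumption rather than a benchmarked recommendation, is a plain list comprehension over the column. The helper name url_paths is hypothetical, and urlsplit is swapped in for urlparse because it skips the rarely used <params> split; for URLs like the ones in the question both return the same path.

```python
from urllib.parse import urlsplit

def url_paths(urls):
    """Return the path component of each URL in an iterable of strings.

    Hypothetical helper for illustration; it accepts a list or a pandas
    Series of URL strings, since both can be iterated the same way.
    """
    return [urlsplit(u).path for u in urls]

# Sample data copied from the question.
sample = ['https://www.example.com/ex/1', 'https://www.example.com/1/ex']
paths = url_paths(sample)
```

With the df from the question, df['url_path'] = url_paths(df['urls']) would add the column under the name the question asked for. Whether this is faster than apply is worth timing on your own data.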

edited Dec 13, 2021 at 17:38

answered Dec 13, 2021 at 3:27
