I'm sure the answer to this is simple - I just can't find it for some reason.

I'd like to extract the URL path from a DataFrame of URLs without using a for loop, as I'll be running this against 1M+ rows and loops are too slow.

import pandas as pd
from urllib.parse import urlparse

d = {'urls': ['https://www.example.com/ex/1','https://www.example.com/1/ex']}
df = pd.DataFrame(data=d)
df
df['urls'].apply(urlparse)

Above is where I'm at, which returns a Series of ParseResult objects holding all the parts of each URL as parsed by urllib.

The desired end result is a DataFrame like the below:

d = {'urls': ['https://www.example.com/ex/1','https://www.example.com/1/ex'], 'url_path': ['/ex/1', '/1/ex']}

If anyone knows how to solve this, I'd appreciate the help!

Thanks!

  • Why do you need a DataFrame? What you've shown is just a dictionary, and looping over a dictionary or over the strings of a DataFrame would take about the same execution time. Commented Dec 13, 2021 at 3:25

1 Answer

The docstring of urlparse says that its result is a named 6-tuple with these fields: <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
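
As a quick illustration of those fields, here is a minimal sketch. The URL is a made-up example (not from the question), and the variable names are only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL with a query string and a fragment.
parts = urlparse("https://www.example.com/ex/1?q=2#frag")

# The result unpacks like a plain 6-tuple...
scheme, netloc, path, params, query, fragment = parts

# ...and each field is also reachable by position or by name.
path_by_index = parts[2]
path_by_name = parts.path
```

Index 2 and the `.path` attribute refer to the same field, so either works for pulling out the path.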

So the solution takes two commands:

  1. get the element at index 2 (the path) of the tuple returned by urlparse
  2. to convert the df into your desired format, pass orient='list' to the DataFrame's to_dict method

df['paths'] = df['urls'].apply(lambda x: urlparse(x)[2])
df.to_dict(orient='list')

Results in

{'urls': ['https://www.example.com/ex/1', 'https://www.example.com/1/ex'],
 'paths': ['/ex/1', '/1/ex']}
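
A possible variation, sketched under the assumption that every value in the column is a string: read the path through the `.path` attribute rather than index 2, and build the column with a list comprehension. The column name `url_path` is taken from the question; whether this is actually faster than `apply` on 1M+ rows is something to measure rather than assume.

```python
from urllib.parse import urlparse

import pandas as pd

df = pd.DataFrame({"urls": ["https://www.example.com/ex/1",
                            "https://www.example.com/1/ex"]})

# Access the path by attribute name instead of by index 2.
# This is still a Python-level loop over the column, just written
# without Series.apply; it assumes no missing (NaN) values.
df["url_path"] = [urlparse(u).path for u in df["urls"]]

result = df.to_dict(orient="list")
```

Here `result` holds the same two-key dictionary as the desired output in the question.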