Python how to parse 2 URLs from a string and then map it back?

Question

I have a column in a pandas DataFrame where some of the values have this format: "From https://....com?gclid=... to https://...com". I would like to clean only the first URL so that the gclid and other IDs are removed, and then write the result back into the DataFrame, e.g.: "From https://....com to https://...com"

I know there is a Python module called urllib, but if I apply it to this string and read the path from the result, it only parses the first URL and I lose the rest of the string, which is just as important as the first part.

Could somebody please help me? Thank you!

I'm not entirely clear what you're asking. Can you post some code showing the specific behavior you want?
– larsks, Commented Apr 30, 2020 at 13:41
If your string is as simple as "From https://....com to https://...com", you can use text.replace("From ", "").replace(" to ", ' ').split(" ") to get the list ["https://....com", "https://...com"].
– furas, Commented Apr 30, 2020 at 13:46
If I do this: t = urllib.parse.urlparse("From https:///?gclid=... to https://"), this is what I get back: ParseResult(scheme='', netloc='', path='From https://', params='', query='gclid=.. to https://', fragment=''). My problem is that the second URL ends up in the query part, so if I read t.path, I get back only the "From https://" part instead of the cleaned first URL followed by the second URL. (I would like to delete the ID and other unique identifiers from the first URL and then put the result back in place of the original value.)
– Szabolcs Magyar, Commented Apr 30, 2020 at 13:48
If you want to remove ?gclid=..., you can use a regex to replace it.
– furas, Commented Apr 30, 2020 at 13:49
Once you have the list ["https://....com?gclid=", "https://...com"], you can take the first element and split('?') it, keeping only the part before the ?.
– furas, Commented Apr 30, 2020 at 13:51
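The comments above suggest splitting the string first and cleaning each URL on its own, which also avoids the urlparse problem of the whole sentence being treated as one URL. A minimal sketch of that idea follows, assuming the tokens are separated by single spaces and URLs start with http:// or https://. The function name strip_query_from_urls and the example.com / example.org addresses are made up for illustration; note that it cleans every URL in the string, not only the first.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_from_urls(text):
    """Drop the query string and fragment from every http(s) URL in text."""
    cleaned = []
    for token in text.split(" "):
        if token.startswith(("http://", "https://")):
            parts = urlsplit(token)
            # Rebuild the URL from scheme, host and path only.
            token = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        cleaned.append(token)
    # Non-URL words ("From", "to") pass through unchanged.
    return " ".join(cleaned)


print(strip_query_from_urls(
    "From https://example.com/page?gclid=abc123 to https://example.org/"
))
```

On a DataFrame column, a function like this could be applied with df['text'].map(strip_query_from_urls).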

furas · Accepted Answer · 2020-04-30 14:12:07Z

1

If you are working with a DataFrame, use .str.replace(), which can take a regex, to find text like "?.... " (text that starts with ? and runs up to the next space; in other words, ? followed by one or more non-space characters: '\?[^ ]+')

import pandas as pd

df = pd.DataFrame({'text': ["From https://....com?gclid=... to https://...com"]})

# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['text'] = df['text'].str.replace(r'\?[^ ]+', '', regex=True)

Result

                                     text
0  From https://....com to https://...com

BTW: you can also use a stricter regex to make sure the ?... part belongs to a URL that starts with http.

df['text'] = df['text'].str.replace(r'(http[^?]+)\?[^ ]+', r'\1', regex=True)

The parentheses (...) capture the URL before ?..., and \1 in the replacement puts it back (now without ?...)
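To see the capture-group idea outside pandas, here is a small sketch using the standard re module. The pattern is a slightly tightened variant of the one above (it stops the URL at any whitespace, not only at a space), and the name drop_queries, the first_only flag, and the example.com / example.org URLs are illustrative assumptions, not part of the original answer.

```python
import re

# Group 1 captures an http(s) URL up to the first "?";
# the rest of the pattern matches the query up to the next whitespace.
URL_QUERY = re.compile(r'(https?://[^\s?]+)\?\S*')


def drop_queries(text, first_only=False):
    # count=0 replaces every match; count=1 touches only the first URL
    # that actually has a query string.
    return URL_QUERY.sub(r'\1', text, count=1 if first_only else 0)


samples = [
    "From https://example.com/a?gclid=abc to https://example.org/b",
    "From https://example.com/a to https://example.org/b?ref=xyz&id=7",
    "is this ok?yes",  # no http prefix, so the text is left alone
]
for s in samples:
    print(drop_queries(s))
```

The same pattern string should also work with df['text'].str.replace(..., regex=True); the first_only option is there because the question asked to clean only the first URL.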

edited Apr 30, 2020 at 14:12

answered Apr 30, 2020 at 13:56

furas


1 Comment

Szabolcs Magyar Over a year ago

wow, thank you, the BTW part was exactly what I needed! Have a nice day kind Stranger!
