I have a column in a pandas DataFrame where some of the values have this format: "From https://....com?gclid=... to https://...com". I would like to strip the query string from the first URL only, so that the gclid and other IDs vanish, and then write the result back into the DataFrame, e.g.: "From https://....com to https://...com"

I know there is a Python module called urllib, but if I apply it to this string and read the path attribute of the result, it only parses the first URL and I lose the second part, which is just as important as the first.

Could somebody please help me? Thank you!

  • I'm not entirely clear what you're asking. Can you post some code showing the specific behavior you want? Commented Apr 30, 2020 at 13:41
  • If your string is as simple as "From https://....com to https://...com", you can use text.replace("From ", "").replace(" to ", ' ').split(" ") to get the list ["https://....com", "https://...com"]. Commented Apr 30, 2020 at 13:46
  • If I do this: t = urllib.parse.urlparse("From https:///?gclid=... to https://"), this is what I get back: ParseResult(scheme='', netloc='', path='From https://', params='', query='gclid=.. to https://', fragment=''). My problem is that the second URL ends up in the query part, so if I read t.path, I get back only the "From https://" part instead of the first URL parsed plus the second URL. (I would like to delete the ID and other unique identifiers from the first URL and then put the result back in place of the original value.) Commented Apr 30, 2020 at 13:48
  • If you want to remove ?gclid=..., you can use a regex to replace it. Commented Apr 30, 2020 at 13:49
  • If you have the list ["https://....com?gclid=", "https://...com"], you can take the first element and call split('?') on it to drop the query string. Commented Apr 30, 2020 at 13:51
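
The split-based idea from the comments above can be sketched roughly as follows. This is only an illustration, assuming every value has the shape "From <url> to <url>"; the function name strip_first_query and the example.com URLs are hypothetical and not from the original post.

```python
def strip_first_query(text: str) -> str:
    """Drop the query string from the first URL in 'From <url> to <url>'."""
    # Split once at the first " to "; URLs contain no spaces, so this
    # separates "From <first url>" from the rest of the string.
    head, sep, tail = text.partition(" to ")
    # Keep only what precedes the first "?" in the head.
    head = head.split("?", 1)[0]
    return head + sep + tail


# Hypothetical sample value
print(strip_first_query(
    "From https://a.example.com/page?gclid=abc123 to https://b.example.com"
))
```

Values without a "?" in the first URL pass through unchanged, and the second URL is never touched.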

1 Answer


If you use a DataFrame, use .str.replace(), which accepts a regex, to find text like "?.... " that starts with ? and ends at the next space. In other words, match a ? followed by one or more non-space characters: '\?[^ ]+'

import pandas as pd

df = pd.DataFrame({'text': ["From https://....com?gclid=... to https://...com"]})

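# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['text'] = df['text'].str.replace(r'\?[^ ]+', '', regex=True)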

Result

                                     text
0  From https://....com to https://...com

BTW: you can also use a more specific regex to make sure the match is part of a URL that starts with http.

df['text'] = df['text'].str.replace(r'(http[^?]+)\?[^ ]+', r'\1', regex=True)

I use (...) to capture the URL before the ?... part and put it back using \1 (now without the ?... part).
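
Since the question mentions urllib, here is a hedged sketch of how it could be combined with a regex so that only the first URL is parsed and rebuilt without its query string. The names URL_RE, _drop_query and strip_first_url_query, as well as the example.com URLs, are made up for illustration. It assumes URLs start with http:// or https:// and contain no spaces.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a URL is "http(s)://" followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")


def _drop_query(match: re.Match) -> str:
    """Rebuild the matched URL with an empty query string."""
    parts = urlsplit(match.group(0))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def strip_first_url_query(text: str) -> str:
    # count=1 rewrites only the first URL; any later URL is left as-is.
    return URL_RE.sub(_drop_query, text, count=1)


# Hypothetical sample value
print(strip_first_url_query(
    "From https://a.example.com/page?gclid=abc123&x=1 to https://b.example.com?keep=1"
))
```

On a DataFrame this could be applied with df['text'] = df['text'].map(strip_first_url_query). Unlike the plain regex, it leaves a query string on the second URL intact.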

1 Comment

wow, thank you, the BTW part was exactly what I needed! Have a nice day kind Stranger!
