I need to clean up some URLs by removing their unique tracking codes, so that in reporting they can be counted as a group rather than as thousands of individual pages.
The code to remove is in the middle of the URL and varies in length.
An example URL is:
https://www.website.co.uk/product/?commcodeABBB/home-page/
I am trying to get this:
https://www.website.co.uk/product/home-page/
I have similar code working for removing the end of a url string:
df["URL"] = df["URL"].str.replace(r'\/id.*','/',regex=True)
I have tried to modify it for my new scenario.
df["URL"] = df["URL"].str.replace(r'\/\?commcode.{0,5}','/',regex=True)
In this scenario the regex \/\?commcode.{0,5} does match /?commcodeABBB/ in the example. However, the length of the code string in my URLs varies, so it won't work on everything.
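To make the limitation concrete, here is a small sketch that uses Python's built-in re module instead of pandas. The longer and shorter tracking codes (ABBBCCDD and AB) are made-up values for illustration only, as is the name FIXED_WIDTH.

```python
import re

# Same pattern as above: "/?commcode" followed by at most 5 of any character
FIXED_WIDTH = r'\/\?commcode.{0,5}'

urls = [
    "https://www.website.co.uk/product/?commcodeABBB/home-page/",      # the example URL
    "https://www.website.co.uk/product/?commcodeABBBCCDD/home-page/",  # hypothetical longer code
    "https://www.website.co.uk/product/?commcodeAB/home-page/",        # hypothetical shorter code
]

# Apply the fixed-width pattern to each URL
cleaned = [re.sub(FIXED_WIDTH, '/', u) for u in urls]

for before, after in zip(urls, cleaned):
    print(before, '->', after)
```

Only the first URL comes out as wanted. With the longer code, part of the code is left behind; with the shorter code, the pattern eats into the next path segment.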
I cannot work out how to write it so that it takes everything from ?commcode up to and including the next /. I looked at \w and \W for the 'in-between' part; however, \w doesn't match /, only alphanumeric characters and underscores.
I have read many other posts about similar issues, but nothing I can find quite addresses this. I cannot use code that counts from the start or end of the string, because the length changes, as does the number of / in the URL, so I cannot use a 'between the 2nd and 3rd /' method either.
Any ideas please?
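One possible approach to "everything from ?commcode up to and including the next /" is a negated character class: [^/]* matches any run of characters that are not /, however long it is. The following is a minimal sketch, not a tested solution for the real data. It uses Python's built-in re module, and the name strip_commcode and the extra test URLs are assumptions for illustration. It also assumes the tracking code is always followed by a /.

```python
import re

# "/?commcode", then any number of non-slash characters, then the closing "/"
COMMCODE = re.compile(r'/\?commcode[^/]*/')

def strip_commcode(url: str) -> str:
    """Replace the '/?commcode.../' segment with a single '/'."""
    return COMMCODE.sub('/', url)

urls = [
    "https://www.website.co.uk/product/?commcodeABBB/home-page/",      # the example URL
    "https://www.website.co.uk/product/?commcodeABBBCCDD/home-page/",  # hypothetical longer code
    "https://www.website.co.uk/product/?commcodeAB/home-page/",        # hypothetical shorter code
    "https://www.website.co.uk/product/home-page/",                    # no tracking code
]

results = [strip_commcode(u) for u in urls]
for r in results:
    print(r)
```

Since pandas' str.replace with regex=True accepts the same pattern syntax, the equivalent call would presumably be df["URL"].str.replace(r'/\?commcode[^/]*/', '/', regex=True). A URL that ends with the code and has no trailing / would be left unchanged by this pattern.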