
I need to clean up some URLs by removing their unique tracking codes, so that in reporting they can be counted as a group rather than as thousands of individual pages.

The code to remove is in the middle of the URL and varies in length.

An example URL is:

https://www.website.co.uk/product/?commcodeABBB/home-page/

I am trying to get this:

https://www.website.co.uk/product/home-page/

I have similar code working for removing the end of a URL string:

df["URL"] = df["URL"].str.replace(r'\/id.*', '/', regex=True)

I have tried to modify it for my new scenario.

df["URL"] = df["URL"].str.replace(r'\/\?commcode.{0,5}', '/', regex=True)

In this scenario the regex \/\?commcode.{0,5} does match /?commcodeABBB/, but the length of the code in my URLs varies, so it won't work on everything.

I cannot work out how to write it so that it matches everything from ?commcode up to and including the next /. I looked at \w and \W for the 'in-between' part, but \w matches only word characters, not /.

I have read many other posts about similar issues, but none that I can find quite addresses this. I cannot use code that counts from the start or end of the string, because the length changes, as does the number of / characters in the URL, so I cannot use a 'between the 2nd and 3rd /' method either.

Any ideas please?

2 Answers


Use

df["URL"] = df["URL"].str.replace(r'/\?commcode[^/]*', '', regex=True)

See proof.

Explanation

--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \?                       '?'
--------------------------------------------------------------------------------
  commcode                 'commcode'
--------------------------------------------------------------------------------
  [^/]*                    any character except: '/' (0 or more times
                           (matching the most amount possible))
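
To make the explanation concrete, here is a minimal sketch that applies the same pattern with Python's standard `re` module instead of pandas. The helper name `strip_commcode` and the second URL are made up for illustration; only the first URL comes from the question. The same pattern string can be passed to `Series.str.replace` with `regex=True`.

```python
import re

# Same pattern as above: "/?commcode" followed by any run of non-slash characters.
# The match stops just before the next "/", so that slash is kept.
PATTERN = re.compile(r"/\?commcode[^/]*")


def strip_commcode(url: str) -> str:
    """Remove the '/?commcode...' segment (hypothetical helper name)."""
    return PATTERN.sub("", url)


urls = [
    "https://www.website.co.uk/product/?commcodeABBB/home-page/",      # from the question
    "https://www.website.co.uk/product/?commcodeA1-b=2?x/home-page/",  # made-up longer code
    "https://www.website.co.uk/product/home-page/",                    # no code: left unchanged
]

cleaned = [strip_commcode(u) for u in urls]
for before, after in zip(urls, cleaned):
    print(before, "->", after)
```

Because the leading `/` is part of the match and the trailing `/` is not, replacing with an empty string leaves exactly one slash between `product` and `home-page`.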

3 Comments

That is brilliant! I've spent all day trying to work this out. Thank you. I also like regex101.com much more than Pythex, which I was using, as you can hover over the characters. I used your proof link to test adding extra characters and numbers to the string and it worked for all of the scenarios. I don't understand why the greedy match * comes last (after the single-character match for /), but it works.
@MizzH See explanation, [^/] matches any single character that is not /, and * allows matching zero or more of such consecutive characters.
This also works if other characters such as =, - and ? are included in the string.

You can do:

r'\/\?commcode[A-Za-z0-9]*'

to specify which character groups you want included.

1 Comment

Thank you. This works well too and answers my question. I also tested it in regex101.
