remove part of URL string with regex in column of pandas dataframe

Question

I need to clean up some URLs by removing their unique tracking codes, so that in reporting they can be counted as a group rather than as thousands of individual pages.

The code to remove is in the middle of the URL and varies in length.

example url is

https://www.website.co.uk/product/?commcodeABBB/home-page/

I am trying to get this

https://www.website.co.uk/product/home-page/

I have similar code working for removing the end of a url string:

df["URL"] = df["URL"].str.replace(r'\/id.*', '/', regex=True)

I have tried to modify it for my new scenario.

df["URL"] = df["URL"].str.replace(r'\/\?commcode.{0,5}', '/', regex=True)

In this scenario the regex \/\?commcode.{0,5} does select ?commcodeABBB/; however, the length of the code string in my URLs varies, so it won't work on everything.

I cannot work out how to write it so that it takes everything from ?commcode up to and including the next /. I looked at \w and \W for the 'in-between' part; however, \w doesn't recognise /, only alphanumeric characters.

I have read many other posts about similar issues, but nothing I can find quite addresses this. I cannot use code that counts from the start or end of the string, because the length changes, as does the number of / in the URL, so I cannot use a 'between the 2nd and 3rd /' method either.

Any ideas please?

Ryszard Czech · Accepted Answer · 2020-09-28 20:42:07Z

2

Use

df["URL"] = df["URL"].str.replace(r'/\?commcode[^/]*', '', regex=True)

See proof.

Explanation

--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \?                       '?'
--------------------------------------------------------------------------------
  commcode                 'commcode'
--------------------------------------------------------------------------------
  [^/]*                    any character except: '/' (0 or more times
                           (matching the most amount possible))
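
As a minimal sketch of the pattern above applied to a whole column: the DataFrame below is made up for illustration, and only its first URL comes from the question. `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Hypothetical sample data; only the first URL is taken from the question.
df = pd.DataFrame({
    "URL": [
        "https://www.website.co.uk/product/?commcodeABBB/home-page/",
        "https://www.website.co.uk/product/?commcodeA1-b=2?x/other/page/",
        "https://www.website.co.uk/product/home-page/",  # no tracking code
    ]
})

# Remove "/?commcode" plus everything up to (but not including) the next "/".
df["URL"] = df["URL"].str.replace(r"/\?commcode[^/]*", "", regex=True)

cleaned = df["URL"].tolist()
print(cleaned)
```

Because the match stops just before the next /, that slash stays in place and the URL without a tracking code is left unchanged.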


3 Comments

Mizz H Over a year ago

That is brilliant! I've spent all day trying to work this out. Thank you. I also like regex101.com much more than Pythex, which I was using, as you can hover over the characters. I used your proof link to test adding extra characters and numbers to the string, and it worked for all of the scenarios. I don't understand why the greedy match * comes last (after the single-character match for /), but it works.

Ryszard Czech Over a year ago

@MizzH See explanation, [^/] matches any single character that is not /, and * allows matching zero or more of such consecutive characters.

Mizz H Over a year ago

this also works if other characters such as = - and ? are included in the string

whege · Accepted Answer · 2020-09-28 20:43:30Z

1

You can do:

'\/\?commcode[A-Za-z0-9]*'

to specify which character groups you want included.
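
To make the trade-off concrete, here is a small sketch comparing this character-class pattern with the negated class `[^/]*` from the other answer. It uses plain `re.sub` rather than pandas, and the second URL (with `-` and `=` in the code) is a hypothetical example, not one from the question.

```python
import re

NEGATED = r"/\?commcode[^/]*"        # pattern from the other answer
ALNUM = r"/\?commcode[A-Za-z0-9]*"   # pattern from this answer

plain = "https://www.website.co.uk/product/?commcodeABBB/home-page/"
# Hypothetical URL whose tracking code contains non-alphanumeric characters.
mixed = "https://www.website.co.uk/product/?commcodeAB-12=x/home-page/"

alnum_plain = re.sub(ALNUM, "", plain)      # code is purely alphanumeric
alnum_mixed = re.sub(ALNUM, "", mixed)      # stops at the first "-"
negated_mixed = re.sub(NEGATED, "", mixed)  # runs up to the next "/"

print(alnum_plain)
print(alnum_mixed)
print(negated_mixed)
```

The explicit character class is the stricter choice: it only removes codes made of letters and digits, so it leaves a remainder behind if a code ever contains other characters.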


1 Comment

Mizz H Over a year ago

Thank you. This works well also, and answers my question. I also tested it in regex101.
