Remove utm_* parameters from URL in Python

Question

I've been trying to remove all utm_* parameters from a list of URLs. The closest thing I have found is this: https://gist.github.com/626834.

Any ideas?

Jon Clements · Accepted Answer · 2012-07-24 23:03:47Z

9

It's a bit long, but it uses the standard library's URL-parsing functions and avoids regular expressions.

# Python 3 imports; in Python 2 these lived in the urllib and urlparse modules.
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

url = 'http://whatever.com/somepage?utm_one=3&something=4&utm_two=5&utm_blank&something_else'

parsed = urlparse(url)
# keep_blank_values=True keeps valueless parameters such as 'something_else'
qd = parse_qs(parsed.query, keep_blank_values=True)
filtered = {k: v for k, v in qd.items() if not k.startswith('utm_')}
newurl = urlunparse([
    parsed.scheme,
    parsed.netloc,
    parsed.path,
    parsed.params,
    urlencode(filtered, doseq=True), # query string
    parsed.fragment
])

print(newurl)
# http://whatever.com/somepage?something=4&something_else=


1 Comment

Giles Thomas Over a year ago

One problem is that this can change the ordering of the query params, and it adds "=" to params with no values. This shouldn't matter, but (as I found while trying to write something like this) there are sites out there that rely on details like that. For example, on one particular site, example.com/?32423&foo=bar&a=b cannot be rewritten as example.com/?a=b&foo=bar&32423=. Yes, the site is stupid and wrong, and it shouldn't rely on the query param ordering. But if it's a real-world site (and it wasn't a small one), you can't necessarily ignore it :-(
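To make the concern in this comment concrete, here is a minimal sketch (Python 3, standard library only) of what a parse_qs/urlencode round trip does to a query string. The helper name roundtrip and the second sample query are made up for illustration. Note that Python 3.7+ dictionaries keep insertion order, so reordering shows up mainly with repeated keys, whereas Python 2 dictionaries had arbitrary order.

```python
from urllib.parse import parse_qs, urlencode

def roundtrip(query):
    """Parse a query string and re-encode it, as the answer above does."""
    return urlencode(parse_qs(query, keep_blank_values=True), doseq=True)

# A valueless parameter gains a trailing "=".
print(roundtrip("32423&foo=bar&a=b"))   # 32423=&foo=bar&a=b

# Repeated keys are grouped together, which reorders the parameters.
print(roundtrip("a=1&b=2&a=3"))         # a=1&a=3&b=2
```

If a target site is sensitive to either change, an approach that filters the raw query string without re-encoding it is the safer choice.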

mVChr · Accepted Answer · 2012-07-24 23:00:08Z

2

import re
from urllib.parse import urlparse, urlunparse  # Python 2: from urlparse import ...

url = 'http://www.someurl.com/page.html?foo=bar&utm_medium=qux&baz=qoo'
parsed_url = list(urlparse(url))
# Index 4 of the parsed tuple is the query string
parsed_url[4] = '&'.join(
    [x for x in parsed_url[4].split('&') if not re.match(r'utm_', x)])
utmless_url = urlunparse(parsed_url)

print(utmless_url)  # http://www.someurl.com/page.html?foo=bar&baz=qoo


2 Comments

jadkik94 Over a year ago

Why use re here? A simple x.startswith('utm_') would do it, and better.

mVChr Over a year ago

Yep, the re expression can be replaced with startswith, which I didn't know about until I saw Jon Clements's answer. :)
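Following the suggestion in these comments, a sketch of the same split-and-filter approach with startswith instead of re might look like this. The function name strip_utm and the use of urlsplit/urlunsplit are choices made for this example, not part of the original answer. Because the query string is only split on '&' and never re-encoded, the surviving parameters keep their order and exact spelling.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_utm(url):
    """Drop every query parameter whose name starts with 'utm_'."""
    parts = urlsplit(url)
    # Filter the raw 'key=value' chunks; nothing is decoded or re-encoded.
    kept = [p for p in parts.query.split('&') if not p.startswith('utm_')]
    return urlunsplit(parts._replace(query='&'.join(kept)))

print(strip_utm('http://www.someurl.com/page.html?foo=bar&utm_medium=qux&baz=qoo'))
# http://www.someurl.com/page.html?foo=bar&baz=qoo
print(strip_utm('http://example.com/?32423&utm_x=1&foo=bar&a=b'))
# http://example.com/?32423&foo=bar&a=b
```

When every parameter is removed, urlunsplit omits the now-empty query, so no dangling "?" is left behind.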

jadkik94 · Accepted Answer · 2012-07-24 23:12:11Z

Simple, it works, and it's based on the link you posted. BUT it uses re, so I'm not sure it won't break for some reason that I can't think of :)

import re

def trim_utm(url):
    if "utm_" not in url:
        return url
    # Groups: text up to and including the last '?', the query string, the '#fragment' (if any)
    matches = re.findall(r'(.+\?)([^#]*)(.*)', url)
    if len(matches) == 0:
        return url
    match = matches[0]
    query = match[1]
    sanitized_query = '&'.join([p for p in query.split('&') if not p.startswith('utm_')])
    return match[0]+sanitized_query+match[2]

if __name__ == "__main__":
    tests = [   "http://localhost/index.php?a=1&utm_source=1&b=2",
                "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
                "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
                "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
                "http://localhost/index.php?utm_a=a",
                "http://localhost/index.php?a=utm_a",
                "http://localhost/index.php?a=1&b=2",
                "http://localhost/index.php",
                "http://localhost/index.php#hash2"
            ]

    for t in tests:
        trimmed = trim_utm(t)
        print(t)
        print(trimmed)
        print()

Adders · Accepted Answer · 2017-10-29 09:03:46Z

1

How about this? Nice and simple:

url = 'https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763?utm_source=feedburner&utm_medium=feed&utm_campaign=feed-main'

# Note: find() returns -1 when '?utm' is absent, so this assumes the URL contains it
print(url[:url.find('?utm')])

https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763


1 Comment

Pat Myron Over a year ago

This would remove other parameters as well.
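A short sketch of two failure modes, using made-up localhost URLs: slicing at '?utm' discards every parameter after it, and when '?utm' is absent str.find returns -1, so the slice silently drops the last character. The helper name cut_at_utm is hypothetical and simply wraps the expression from the answer.

```python
def cut_at_utm(url):
    # Same slicing expression as in the answer above
    return url[:url.find('?utm')]

# Everything after '?utm' goes, including the unrelated parameter b=2.
print(cut_at_utm('http://localhost/index.php?utm_source=1&b=2'))
# http://localhost/index.php

# Here '?utm' never occurs (the utm_ parameter follows '&'), so find()
# returns -1 and the slice only chops off the final character.
print(cut_at_utm('http://localhost/index.php?a=1&utm_source=1'))
# http://localhost/index.php?a=1&utm_source=
```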

PHPGuyZim · Accepted Answer · 2021-02-15 21:49:48Z

Using regex

import re
def clean_url(url):
    return re.sub(r'(?<=[?&])utm_[^&]+&?', '', url)

What's going on? We use a regular expression to find every substring that looks like utm_somekey=somevalue and is preceded by either "?" or "&", and replace it (together with a trailing "&", if present) with an empty string.
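To see what the pattern actually matches, re.findall can be run on a few sample URLs. This is only an illustrative sketch, and the variable name UTM_PATTERN is made up here. The lookbehind (?<=[?&]) requires a preceding "?" or "&" without consuming it, and [^&]+ then runs to the next "&" or the end of the string, so it also swallows a "#fragment" that directly follows the last utm_ parameter.

```python
import re

UTM_PATTERN = re.compile(r'(?<=[?&])utm_[^&]+&?')

# A utm_ parameter in the middle is matched together with its trailing '&'.
print(UTM_PATTERN.findall("http://localhost/index.php?a=1&utm_source=1&b=2"))
# ['utm_source=1&']

# 'utm_' inside a value is preceded by '=', so the lookbehind rejects it.
print(UTM_PATTERN.findall("http://localhost/index.php?a=utm_a"))
# []

# A fragment right after the last utm_ parameter is part of the match.
print(UTM_PATTERN.findall("http://localhost/index.php?a=1&b=2&utm_something=no#hash"))
# ['utm_something=no#hash']
```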

Testing it:

tests = [   "http://localhost/index.php?a=1&utm_source=1&b=2",
            "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
            "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
            "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
            "http://localhost/index.php?utm_a=a",
            "http://localhost/index.php?a=utm_a",
            "http://localhost/index.php?a=1&b=2",
            "http://localhost/index.php",
            "http://localhost/index.php#hash2"
        ]

for t in tests:
    print(clean_url(t))

http://localhost/index.php?a=1&b=2
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2&
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?
http://localhost/index.php?a=utm_a
http://localhost/index.php?a=1&b=2
http://localhost/index.php
http://localhost/index.php#hash2

dw1 · Accepted Answer · 2023-11-23 21:18:48Z

The popular answer based on the URL modules modifies and rearranges the parameters, which can break poorly designed sites (see the comment on that answer), so I settled on regex. The regex answers I tried had problems too, so my final result is this:

import re

def removeURLTracking(url):
    url = re.sub(r'(?<=[?&])utm_[^&#]+&?', '', url)
    url = url.replace('&#', '#').rstrip('?&')
    return url

tests = ["http://localhost/index.php?a=1&utm_source=1&b=2",
         "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
         "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
         "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
         "http://localhost/index.php?utm_a=a",
         "http://localhost/index.php?a=utm_a",
         "http://localhost/index.php?a=1&b=2",
         "http://localhost/index.php",
         "http://localhost/index.php#hash2"
         ]

for t in tests:
    print(removeURLTracking(t))

"""
http://localhost/index.php?a=1&b=2
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php
http://localhost/index.php?a=utm_a
http://localhost/index.php?a=1&b=2
http://localhost/index.php
http://localhost/index.php#hash2
"""

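The printed results above can also be checked automatically. The sketch below repeats the function and pairs a few of the test URLs with the outputs listed in the docstring, so a regression shows up as a failed assertion. The CASES name and its structure are assumptions made for illustration.

```python
import re

def removeURLTracking(url):
    url = re.sub(r'(?<=[?&])utm_[^&#]+&?', '', url)
    url = url.replace('&#', '#').rstrip('?&')
    return url

# (input, expected) pairs taken from the tests and output listed above
CASES = [
    ("http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
     "http://localhost/index.php?a=1&b=2#hash"),
    ("http://localhost/index.php?utm_a=a", "http://localhost/index.php"),
    ("http://localhost/index.php?a=utm_a", "http://localhost/index.php?a=utm_a"),
    ("http://localhost/index.php#hash2", "http://localhost/index.php#hash2"),
]

for given, expected in CASES:
    assert removeURLTracking(given) == expected, given
print("all cases pass")
```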