8

I've been trying to remove all utm_* parameters from a list of URLs. The closest thing I have found is this: https://gist.github.com/626834.

Any ideas?

0

6 Answers 6

9

It's a bit long but uses the url* modules, and avoids re's.

from urllib import urlencode
from urlparse import urlparse, parse_qs, urlunparse

url = 'http://whatever.com/somepage?utm_one=3&something=4&utm_two=5&utm_blank&something_else'

parsed = urlparse(url)
qd = parse_qs(parsed.query, keep_blank_values=True)
filtered = dict( (k, v) for k, v in qd.iteritems() if not k.startswith('utm_'))
newurl = urlunparse([
    parsed.scheme,
    parsed.netloc,
    parsed.path,
    parsed.params,
    urlencode(filtered, doseq=True), # query string
    parsed.fragment
])

print newurl
# 'http://whatever.com/somepage?something=4&something_else'
Sign up to request clarification or add additional context in comments.

1 Comment

A problem is that this will change the ordering of the query params, and will add "=" to params with no values. This shouldn't be a problem, but (I've found while trying to write something like this) there are sites out there that rely on stuff like that. For example, on one particular site, example.com/?32423&foo=bar&a=b cannot be rewritten as example.com/?a=b&foo=bar&32423= Yes, the site is stupid and wrong and it shouldn't rely on the query param ordering. But if it's a real-world site (and it wasn't a small one) then you can't necessarily ignore it :-(
2
import re
from urlparse import urlparse, urlunparse

url = 'http://www.someurl.com/page.html?foo=bar&utm_medium=qux&baz=qoo'
parsed_url = list(urlparse(url))
parsed_url[4] = '&'.join(
    [x for x in parsed_url[4].split('&') if not re.match(r'utm_', x)])
utmless_url = urlunparse(parsed_url)

print utmless_url  # 'http://www.someurl.com/page.html?foo=bar&baz=qoo'

2 Comments

Why use re here? A simple x.startswith('utm_') would do it, and better.
Yep, the re expression can be replaced with startswith, which I didn't know about until I saw Jon Clement's answer. :)
1

Simple, and works, and based on the link you posted, BUT it's re... so, not sure it won't break for some reason that I can't think of :)

import re

def trim_utm(url):
    if "utm_" not in url:
        return url
    matches = re.findall('(.+\?)([^#]*)(.*)', url)
    if len(matches) == 0:
        return url
    match = matches[0]
    query = match[1]
    sanitized_query = '&'.join([p for p in query.split('&') if not p.startswith('utm_')])
    return match[0]+sanitized_query+match[2]

if __name__ == "__main__":
    tests = [   "http://localhost/index.php?a=1&utm_source=1&b=2",
                "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
                "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
                "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
                "http://localhost/index.php?utm_a=a",
                "http://localhost/index.php?a=utm_a",
                "http://localhost/index.php?a=1&b=2",
                "http://localhost/index.php",
                "http://localhost/index.php#hash2"
            ]

    for t in tests:
        trimmed = trim_utm(t)
        print t
        print trimmed
        print 

Comments

1

How about this. Nice and simple:

url = 'https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763?utm_source=feedburner&utm_medium=feed&utm_campaign=feed-main'

print url[:url.find('?utm')]

https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763

1 Comment

this would remove other parameters as well
0

Using regex

import re
def clean_url(url):
    return re.sub(r'(?<=[?&])utm_[^&]+&?', '', url)

What's going on? We are using regular expressions to find all instances of strings that look like utm_somekey=somevalue which is preceded by either "?" or "&".

Testing it:

tests = [   "http://localhost/index.php?a=1&utm_source=1&b=2",
            "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
            "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
            "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
            "http://localhost/index.php?utm_a=a",
            "http://localhost/index.php?a=utm_a",
            "http://localhost/index.php?a=1&b=2",
            "http://localhost/index.php",
            "http://localhost/index.php#hash2"
        ]

for t in tests:
    print(clean_url(t))

http://localhost/index.php?a=1&b=2
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2&
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?
http://localhost/index.php?a=utm_a
http://localhost/index.php?a=1&b=2
http://localhost/index.php
http://localhost/index.php#hash2

Comments

0

The popular url modules answer modifies and rearranges the parameters which can break poorly-designed sites (see the comment) so I settled on regex, but there are problems with those I tried, too. My final result is this:

import re

def removeURLTracking(url):
    url = re.sub(r'(?<=[?&])utm_[^&#]+&?', '', url)
    url = url.replace('&#', '#').rstrip('?&')
    return url

tests = ["http://localhost/index.php?a=1&utm_source=1&b=2",
         "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
         "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
         "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
         "http://localhost/index.php?utm_a=a",
         "http://localhost/index.php?a=utm_a",
         "http://localhost/index.php?a=1&b=2",
         "http://localhost/index.php",
         "http://localhost/index.php#hash2"
         ]

for t in tests:
    print(removeURLTracking(t))

"""
http://localhost/index.php?a=1&b=2
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php
http://localhost/index.php?a=utm_a
http://localhost/index.php?a=1&b=2
http://localhost/index.php
http://localhost/index.php#hash2
"""

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.