I have a dataframe with a column of URLs that I would like to parse into new columns, with each row holding the value of a specified parameter if it is present in the URL. I am using a function that loops through each row in the dataframe column and parses the specified URL parameter, but when I try to select the column after the function has finished I get a KeyError. Should I be setting the value of this new column in a different manner? Is there a more effective approach than looping through the values in my table and running this process?

Error:

KeyError: 'utm_source'

Example URLs (df['landing_page_url']):

https://lp.example.com/test/lp
https://lp.example.com/test/ny/?utm_source=facebook&ref=test&utm_campaign=ny-newyork_test&utm_term=nice
https://lp.example.com/test/ny/?utm_source=facebook
NaN
https://lp.example.com/test/la/?utm_term=lp-test&utm_source=facebook

Code:

import pandas as pd
import numpy as np
import math
from urllib.parse import parse_qs, urlparse

def get_query_field(url, field):
    if isinstance(url, str):
        try:
            return parse_qs(urlparse(url).query)[field][0]
        except KeyError:
            return ''
    else:
        return ''


for i in df['landing_page_url']:
    print(i)  # returns URL
    print(get_query_field(i, 'utm_source'))  # returns proper values
    df['utm_source'] == get_query_field(i, 'utm_source')
    df['utm_campaign'] == get_query_field(i, 'utm_campaign')
    df['utm_term'] == get_query_field(i, 'utm_term')

2 Answers

1

I don't think your for loop will work. Each iteration works on the entire column you are trying to set, so even with a proper assignment it would overwrite every row each time. I haven't timed it against your loop, but I'm nearly certain this will be faster than iterating.

from urllib.parse import parse_qs, urlparse

# Simplify the function here as recommended by Nick
def get_query_field(url, field):
    if isinstance(url, str):
        return parse_qs(urlparse(url).query).get(field, [''])[0]
    return ''

# Use apply to create new columns based on the URL
df['utm_source'] = df['landing_page_url'].apply(get_query_field, args=('utm_source',))
df['utm_campaign'] = df['landing_page_url'].apply(get_query_field, args=('utm_campaign',))
df['utm_term'] = df['landing_page_url'].apply(get_query_field, args=('utm_term',))
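
Calling `.apply` once per field parses every URL three times. If that matters, here is a minimal sketch of parsing each URL once and collecting all the fields together. It uses only the standard library and the example URLs from the question; `extract_fields` and `UTM_FIELDS` are hypothetical names made up for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical constant: the query parameters to pull out of each URL.
UTM_FIELDS = ("utm_source", "utm_campaign", "utm_term")

def extract_fields(url, fields=UTM_FIELDS):
    """Parse the URL once and return {field: first value, or '' if absent}."""
    # Non-string inputs (e.g. NaN from a missing cell) yield empty values.
    query = parse_qs(urlparse(url).query) if isinstance(url, str) else {}
    return {field: query.get(field, [""])[0] for field in fields}

urls = [
    "https://lp.example.com/test/lp",
    "https://lp.example.com/test/ny/?utm_source=facebook&ref=test&utm_campaign=ny-newyork_test&utm_term=nice",
    float("nan"),
    "https://lp.example.com/test/la/?utm_term=lp-test&utm_source=facebook",
]

# One dict per URL, in the same order as the input.
rows = [extract_fields(u) for u in urls]
for row in rows:
    print(row)
```

With pandas, something like `pd.DataFrame([extract_fields(u) for u in df['landing_page_url']], index=df.index)` should give one column per field, which could then be attached with `df.join(...)`. Whether that beats three `.apply` calls on your data is worth timing.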

1 Comment

Thanks for your answer. Incredibly faster approach and good to know about the .apply() approach!
1

Instead of

try:
   return parse_qs(urlparse(url).query)[field][0]
except KeyError:
   return ''

You can just do:

return parse_qs(urlparse(url).query).get(field, [''])[0]

The trick here is to use my_dict.get(key, default) instead of my_dict[key]. The default is returned if the key doesn't exist, so no KeyError is raised.
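
To see why the default has to be a list, here is a small standalone sketch using one of the example URLs from the question. The variable names are only for illustration.

```python
from urllib.parse import parse_qs, urlparse

url = "https://lp.example.com/test/la/?utm_term=lp-test&utm_source=facebook"

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlparse(url).query)
print(params)

# Present key: .get returns the stored list, and [0] picks the first value.
source = params.get("utm_source", [""])[0]

# Missing key: .get returns the default [''] instead of raising KeyError,
# so the trailing [0] still works and yields an empty string.
campaign = params.get("utm_campaign", [""])[0]

print(repr(source), repr(campaign))
```

One thing to keep in mind: by default `parse_qs` drops parameters with blank values, so a URL ending in `?utm_source=` is treated the same as one without `utm_source` at all.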

Is there a more effective approach than looping through the values in my table and running this process?

Not really. Looping through each URL is going to have to be done either way. Right now though, you are operating on the whole dataframe column for every URL, meaning that if two different URLs have different sources in the query, the last one in the list will win once the assignment works. I have no idea if this is intentional or not.

Also note: this line

df['utm_source'] == get_query_field(i, 'utm_source')

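is not actually assigning anything. == is a comparison operator ("does the left side match the right side?"). You probably meant to use = or df.append({'utm_source': get_query_field(..)}). Note that df.append returns a new DataFrame rather than modifying df in place, and it was removed in pandas 2.0.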
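
A small sketch of the difference, assuming pandas is installed and using a toy two-row frame in place of the real data. It also shows where the KeyError in the question likely comes from: the column has to be looked up before == can compare anything.

```python
import pandas as pd

df = pd.DataFrame({"landing_page_url": [
    "https://lp.example.com/test/lp",
    "https://lp.example.com/test/ny/?utm_source=facebook",
]})

# Comparing against a column that does not exist yet raises KeyError,
# because df['utm_source'] is looked up before == is evaluated.
try:
    df["utm_source"] == "facebook"
    raised = False
except KeyError:
    raised = True

# Assigning a scalar with = creates the column and broadcasts the value to every row.
df["utm_source"] = "facebook"
broadcast = df["utm_source"].tolist()

# Assigning a list with the same length as the frame sets one value per row.
df["utm_source"] = ["", "facebook"]
per_row = df["utm_source"].tolist()

print(raised, broadcast, per_row)
```

The scalar case is why a single = inside the loop leaves every row holding the last URL's value.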

3 Comments

Hey Nick thanks for that great shortcut! And to your question about the overriding, it is not intentional. I want to add the parsed parameter values as rows in the right index order in their respective columns. Should I use a .append() method instead of what I am doing?
Hmm, I received the error when I tried .append(..) TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid, and setting it to a single = (great catch), overwrites it with the last value as you mentioned
@cphill try using this syntax: df.append({'utm_source': get_query_field(..)})
