Parsing urls from a dataframe

Question

I am trying to parse the URLs in a dataframe to get their path component. My dataframe has 3 columns: 'url', 'impressions', and 'clicks'. I want to replace each URL with its path. Here is my code:

    import pandas as pd
    from urllib.parse import urlparse

    fic_in = 'file.csv'

    df = pd.read_csv(fic_in)
    obj = urlparse(df['url'])  # this is the line that raises the ValueError
    df['url'] = obj.path
    print(df)

The CSV file contains thousands of URLs plus 2 other columns of information about them. For a technical reason, I can't parse the URLs by manipulating the CSV directly; I have to parse them in the dataframe. When I run this code, I get the following error, which I don't really understand:

File "/Users/adamn/Desktop/test_lambda.py", line 33, in <module>
    obj = urlparse(df['url'])
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 389, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 125, in _coerce_args
    return _decode_args(args) + (_encode_result,)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 109, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 109, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/generic.py", line 1442, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I understand that there is an error, but what am I doing that isn't allowed? And how can I fix it, or is there another way to get this done?

Thanks for helping.

Have you tried a regular expression filter, if that's not working for you?
– Nischay Namdev, Commented May 25, 2021 at 16:24
Can you provide the whole stack trace around that error message? It might help troubleshoot this, if the existing answer doesn't already solve your problem.
– joanis, Commented May 25, 2021 at 20:20
@NischayNamdev Well no, I haven't. I thought it would be easier with urllib because the library was made for this.
– AdamD97, Commented May 25, 2021 at 21:39
@joanis Yes, of course. I will add the full error message in a comment.
– AdamD97, Commented May 25, 2021 at 21:42

Tom McLean · Accepted Answer · 2021-05-26 07:42:34Z

urlparse only accepts one string at a time, not a whole Series, so apply it to each value in the column:

    df["url"] = df["url"].astype(str).apply(lambda x: urlparse(x).path)
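
As a minimal, self-contained sketch of the same idea: the sample rows below are hypothetical stand-ins for file.csv, and the column is named `url` as in the question. Only the element-wise `apply` call comes from the answer itself.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical sample data standing in for the contents of file.csv
df = pd.DataFrame(
    {
        "url": [
            "https://example.com/products/shoes?color=red",
            "https://example.com/about",
        ],
        "impressions": [120, 45],
        "clicks": [10, 3],
    }
)

# urlparse handles one string at a time, so call it on each element
df["url"] = df["url"].astype(str).apply(lambda u: urlparse(u).path)
print(df)
```

The query string is not part of `.path`, so the first row keeps only `/products/shoes`.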

edited May 26, 2021 at 7:42

answered May 25, 2021 at 16:27

3 Comments

AdamD97 Over a year ago

I just tried this and got a new error that I don't really understand either:

File "/Users/adamndubois/Desktop/test_lambda.py", line 33, in <module>
    df['url'] =df['url'].apply(lambda x: urlparse(x).path)
AttributeError: 'float' object has no attribute 'decode'

Tom McLean Over a year ago

@AdamD97 I modified the code to force the URL column to be a string. However, if it was not already a string, you should check that the data in the column is correct.

AdamD97 Over a year ago

Yes, the type of my 'url' column was 'object'. It works that way, thank you very much!
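
One plausible cause of the 'float' error, assuming the CSV has empty cells in the url column, is that pandas reads those cells as NaN, which is a float. Casting with `astype(str)` avoids the exception but turns each NaN into the literal string "nan". The sketch below is an alternative that leaves missing values untouched; `path_or_missing` is a hypothetical helper name and the sample rows are made up.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical data with one missing URL, as read_csv would produce for an empty cell
df = pd.DataFrame({"url": ["https://example.com/a/b", float("nan")], "clicks": [5, 0]})


def path_or_missing(value):
    # Parse only real strings; pass anything else (such as NaN) through unchanged
    return urlparse(value).path if isinstance(value, str) else value


df["url"] = df["url"].apply(path_or_missing)
print(df)
```

Keeping NaN as NaN means the missing rows can still be found later with `df["url"].isna()`.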
