Python - dataframe url parsing issue

Question

I am trying to extract the domain from each URL in one column and put it into another column. It works on a single string, but it fails when I apply it to a DataFrame. How do I apply this to a DataFrame?

Tried:

from urllib.parse import urlparse
import numpy as np  # needed for np.nan below
import pandas as pd
id1 = [1,2,3]
ls = ['https://google.com/tensoflow','https://math.com/some/website',np.nan]
df = pd.DataFrame({'id':id1,'url':ls})
df
# urlparse(df['url']) # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# df['url'].map(urlparse) # AttributeError: 'float' object has no attribute 'decode'

Working on a single string:

string = 'https://google.com/tensoflow'
parsed_uri = urlparse(string)
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
result

Desired output column:

col3
https://google.com/
https://math.com/
nan

Error

Please post the exact full error messages you're getting.
– ForceBru, Commented May 1, 2019 at 19:51
@ForceBru just added the error
– sharp, Commented May 1, 2019 at 19:58

hygull · Accepted Answer · 2019-05-01 20:56:08Z

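You can use pandas.Series.apply() with a function that skips missing values.

urlparse fails on the NaN row because NaN is a float, not a string, so the lambda below checks pd.isna(url) before parsing.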

» Initialization and imports

>>> from urllib.parse import urlparse
>>> import pandas as pd
>>> id1 = [1,2,3]
>>> import numpy as np
>>> ls = ['https://google.com/tensoflow','https://math.com/some/website',np.nan]
>>> ls
['https://google.com/tensoflow', 'https://math.com/some/website', nan]
>>>

» Inspect the newly created DataFrame.

>>> df = pd.DataFrame({'id':id1,'url':ls})
>>> df
   id                            url
0   1   https://google.com/tensoflow
1   2  https://math.com/some/website
2   3                            NaN
>>> 
>>> df["url"]
0     https://google.com/tensoflow
1    https://math.com/some/website
2                              NaN
Name: url, dtype: object
>>>

» Apply a function to the url column using pandas.Series.apply(func). The first call keeps missing values as NaN; the second converts them to the string 'nan'.

>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
0    https://google.com/
1      https://math.com/
2                    NaN
Name: url, dtype: object
>>> 
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
0    https://google.com/
1      https://math.com/
2                    nan
Name: url, dtype: object
>>> 
>>>
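
As an alternative to the inline pd.isna check, Series.map accepts na_action="ignore", which passes missing values through without calling the function. The following is a minimal sketch under that assumption; the helper name base_url is hypothetical and not part of the original answer.

```python
from urllib.parse import urlparse

import numpy as np
import pandas as pd


def base_url(url):
    """Return 'scheme://netloc/' for a URL string (hypothetical helper)."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}/"


df = pd.DataFrame({
    "id": [1, 2, 3],
    "url": ["https://google.com/tensoflow", "https://math.com/some/website", np.nan],
})

# na_action="ignore" leaves NaN entries untouched instead of passing them to base_url
col3 = df["url"].map(base_url, na_action="ignore")
print(col3)
```

This keeps the missing value as a real NaN rather than the string 'nan', which tends to be easier to filter later with isna() or dropna().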

» Store the above result in a variable (optional, just to simplify the next step).

>>> s = df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
>>> s
0    https://google.com/
1      https://math.com/
2                    nan
Name: url, dtype: object
>>>

» Finally, build a new DataFrame with the result as col3.

>>> df2 = pd.DataFrame({"col3": s})
>>> df2
                  col3
0  https://google.com/
1    https://math.com/
2                  nan
>>>
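
Since the question asks for the result as another column, it may be simpler to assign the Series straight onto the original DataFrame instead of building a second one. A short sketch, assuming the same sample data and a hypothetical helper named base_url:

```python
from urllib.parse import urlparse

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "url": ["https://google.com/tensoflow", "https://math.com/some/website", np.nan],
})


def base_url(url):
    """Return 'scheme://netloc/' or NaN for missing input (hypothetical helper)."""
    if pd.isna(url):
        return np.nan
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}/"


# Assigning a Series to a new key adds it as a column, aligned on the index
df["col3"] = df["url"].apply(base_url)
print(df)
```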

» To make sure, what is s and what is df2, check types (again, not mandatory).

>>> type(s)
<class 'pandas.core.series.Series'>
>>> 
>>> 
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
>>>
