Pandas apply: multiple conditions and multiple function arguments

Asked 6 years, 7 months ago

Modified 6 years, 7 months ago

Viewed 2k times

In a nutshell, I am trying to apply my function to selected rows. I have it working by subsetting the dataframe, running the function, and then merging the subset back into the main dataframe. However, that is cumbersome, and there has to be a more efficient solution that escapes me. I found several useful posts (here, here and here) that helped improve my code.

Here is a sample dataframe:

import numpy as np
import pandas as pd

data = {'firm': ['Smith', 'Jones', 'Smith New York', 'Jones International', 'Winter'],
        'id': [np.nan, 732, 216, np.nan, 1714],
        'url1': ['url', np.nan, 'url', 'url', 'url'],
        'url2': ['url', 'url', 'url', np.nan, 'url'],
        'text': ['foo', 'bar', np.nan, np.nan, 'foo bar']}
df = pd.DataFrame(data)

The function below parses the website. The user can set the keyword to search the already downloaded files and use that stored data if present. If the last crawl happened a while ago, a new crawl for the updated website is needed.

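def fetch(id, url, **kwargs):
    # 'backup' arrives as a keyword argument, e.g. fetch(id, url, backup='Yes')
    backup = kwargs.get('backup')
    if backup == 'Yes':
        print('Fetching {} from {}'.format(id, url))
        # Actual fetching code
    else:
        print('Loading stored data for {}'.format(id))
        # Actual loading code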

The function works, as I tested it on individual URLs, but I run into problems when I try to apply it. I have multiple conditions for when to run it. Currently I use them to subset the dataframe. Note: if two URLs are present, url1 is preferred. According to the pandas documentation, keyword arguments can be passed to apply. Initially I tried np.where. There are 4 conditions in total; below are two:

df['content'] = np.where(df['text'].isna() & df['url1'].notnull() &
                            df['url2'].notnull() & df['firm'].str.contains('Smith'),
                         df['url1'].apply(fetch, args=df['id'], backup='Yes'),
                         np.where(df['text'].isna() & df['url1'].notnull() & 
                                    df['url2'].isna() & df['firm'].str.contains('Smith'),
                                  df['url1'].apply(fetch, backup='Yes'),
                                  np.nan))
TypeError: fetch() takes 2 positional arguments but --some other number-- were given
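
For reference, this TypeError is consistent with how Series.apply forwards arguments: each call is made as func(value, *args, **kwargs), so a Series passed as args is unpacked into one positional argument per element. A minimal sketch of the forwarding, using a hypothetical stand-in function describe (not the real fetch) and made-up URLs:

```python
import pandas as pd

def describe(url, id, backup='No'):
    # Hypothetical stand-in for fetch: Series.apply passes the element first,
    # then the extra positional args, then the keyword arguments.
    return '{}|{}|{}'.format(url, id, backup)

urls = pd.Series(['a.com', 'b.com'])

# args must be a tuple of extra positional arguments shared by every call
same_id = urls.apply(describe, args=(732,), backup='Yes')
print(same_id.tolist())
```

Because the same args tuple is reused for every element, this form can pass a scalar but cannot supply a different id for each URL.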

Hence, passing a pandas Series does not work, and I cannot figure out how to pass it as a scalar. Another failed approach, with only two of the columns/series:

df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
    df['url2'].notnull() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch nothing
df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
    df['url2'].isna() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch one
TypeError: ("fetch() missing 1 required positional argument: 'url1'", 'occurred at index id')
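
The "occurred at index id" part of the message fits the default behaviour of DataFrame.apply: with axis=0 the function is called once per column and receives that whole column as a single Series, so there is nothing to fill the second parameter. A minimal sketch, assuming the goal is one call per row and using a hypothetical stub fetch_stub with made-up values:

```python
import pandas as pd

def fetch_stub(id, url):
    # Hypothetical stand-in for fetch; it only reports what it was given.
    return 'id={} url={}'.format(id, url)

pairs = pd.DataFrame({'id': [732, 216], 'url1': ['u1', 'u2']})

# axis=0 (default): the function is called once per column with a Series
seen = pairs.apply(lambda col: type(col).__name__)
print(seen.tolist())

# axis=1: the function is called once per row; unpack the fields explicitly
per_row = pairs.apply(lambda row: fetch_stub(row['id'], row['url1']), axis=1)
print(per_row.tolist())
```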

And finally I tried lambda:

df['text'].where(df['text'].isna() & df['url1'].notnull() & df['url2'].isna()
   & df['firm'].str.contains('Smith'), df[['id', 'url1']].apply(lambda x,y: get_XML(x,y)))
TypeError: ("<lambda>() missing 1 required positional argument: 'y'", 'occurred at index id')
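
Two things seem to be at play here. The lambda is again called with one column at a time, so y is never supplied. In addition, np.where and Series.where receive fully evaluated arguments, so an apply placed inside them runs over every row before any condition is applied. One possible way to run the function on selected rows only, without subsetting and merging, is to build a boolean mask and apply row-wise to df.loc[mask]. A minimal sketch on the sample data, where fetch_stub is a hypothetical stand-in for fetch and only a subset of the conditions is used:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firm': ['Smith', 'Jones', 'Smith New York', 'Jones International', 'Winter'],
    'id': [np.nan, 732, 216, np.nan, 1714],
    'url1': ['url', np.nan, 'url', 'url', 'url'],
    'url2': ['url', 'url', 'url', np.nan, 'url'],
    'text': ['foo', 'bar', np.nan, np.nan, 'foo bar'],
})

def fetch_stub(id, url, backup='Yes'):
    # Hypothetical stand-in for fetch; returns a string instead of crawling.
    return 'fetched {} from {} (backup={})'.format(id, url, backup)

# Named conditions are easier to read and reuse than one long expression
needs_text = df['text'].isna()
has_url1 = df['url1'].notna()
is_smith = df['firm'].str.contains('Smith')
mask = needs_text & has_url1 & is_smith

# Run the function only on the selected rows; assigning the result aligns on
# the index, so rows outside the mask end up as NaN in the new column
df['content'] = df.loc[mask].apply(
    lambda row: fetch_stub(row['id'], row['url1'], backup='Yes'), axis=1)
print(df[['firm', 'content']])
```

If the mask selects no rows, apply may return an empty DataFrame rather than a Series, so a guard such as checking mask.any() first may be needed.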

I assume I am missing something simple, but obviously crucial. Any pointers are appreciated.

Edit - Solution

I took comments from Damien Ayers (see below) to heart and simplified the code. This then also put me on the path to the solution:

def get_ft(text, xml, id, url1, url2, firm, backup='Yes'):
    if pd.notnull(id):
        if pd.isna(text) and pd.notnull(url1) and (pd.notnull(url2) or pd.isna(url2)):
            if 'Smith' in firm:
                # fetch only accepts backup as a keyword argument
                return fetch(id, url1, backup=backup)
            ... code continues

And here is the proper use of apply and lambda, thanks to this discussion:

df['text_new'] = df.apply(lambda x: get_ft(x['text'], x['id'], x['url1'],
                                           x['url2'], x['firm'], backup), axis=1)

Much cleaner and, more importantly, it works.

edited Apr 23, 2019 at 15:40

asked Apr 22, 2019 at 22:57

raummensch

In your np.where attempt there are two spots where url1 in df['url] is missing a closing quote mark. But maybe that's an error that happened when writing the question?
– Damien Ayers, Apr 22, 2019 at 23:18

Thanks @DamienAyers. I fixed it here. It did not do the trick, unfortunately.
– raummensch, Apr 22, 2019 at 23:21

When the conditionals start getting this complicated, it's often worth writing them out in long form as separate functions or named variables, rather than combining so much into a single expression.
– Damien Ayers, Apr 22, 2019 at 23:21

The TypeError: ("fetch() missing 1 required positional argument: 'url'", 'occurred at index id') error is due to fetch expecting a keyword argument url, but pandas is calling it with the keyword url1, since that's what's in the dataframe.
– Damien Ayers, Apr 22, 2019 at 23:32

There are also some more typos in the code examples that make it a bit harder to run: fd instead of df, and fetch hasn't defined the backup variable and has a = instead of ==.
– Damien Ayers, Apr 22, 2019 at 23:34
