In a nutshell, I am trying to apply my function to selected rows only. I have it working by subsetting the dataframe, running the function on the subset, and then merging the subset back into the main dataframe. However, that is cumbersome, and there has to be a more efficient solution that escapes me. I found several useful posts (here, here and here) that helped improve my code.

Here is a sample dataframe:

import numpy as np
import pandas as pd

data = {'firm': ['Smith', 'Jones', 'Smith New York', 'Jones International', 'Winter'],
        'id': [np.nan, 732, 216, np.nan, 1714],
        'url1': ['url', np.nan, 'url', 'url', 'url'],
        'url2': ['url', 'url', 'url', np.nan, 'url'],
        'text': ['foo', 'bar', np.nan, np.nan, 'foo bar']}
df = pd.DataFrame(data)

The function below parses the website. The user can set a keyword that decides whether to crawl the site or to load previously downloaded files, if they are present. If the last crawl happened a while ago, a new crawl of the updated website is needed.

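def fetch(id, url, **kwargs):
    # backup arrives as a keyword argument; defaulting to 'Yes' is an assumption
    backup = kwargs.get('backup', 'Yes')
    if backup == 'Yes':
        print('Fetching {} from {}'.format(id, url))
        # Actual fetching code
    else:
        print('Loading stored data for {}'.format(id))
        # Actual loading code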
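The "last crawl happened a while ago" check is not shown in the question. A minimal sketch of one way to decide it, assuming the stored data lives in a file whose modification time marks the last crawl; the name `needs_refetch` and the 30-day default are hypothetical:

```python
import os
import time

def needs_refetch(path, max_age_days=30):
    """Return True if the stored file is missing or older than max_age_days."""
    # Hypothetical helper: the file's modification time stands in for the last crawl.
    if not os.path.exists(path):
        return True
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_days * 24 * 60 * 60
```

The result could then be mapped to `backup='Yes'` or `backup='No'` before calling `fetch`.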

The function works, as I tested it on individual URLs, but I run into problems when I try to apply it. I have multiple conditions for when to run it. Currently I use them to subset the dataframe. Note: if two urls are present, url1 is preferred. According to the pandas documentation, keyword arguments can be passed through apply. Initially I tried np.where. There are 4 conditions in total, below are two:

df['content'] = np.where(df['text'].isna() & df['url1'].notnull() &
                            df['url2'].notnull() & df['firm'].str.contains('Smith'),
                         df['url1'].apply(fetch, args=df['id'], backup='Yes'),
                         np.where(df['text'].isna() & df['url1'].notnull() &
                                    df['url2'].isna() & df['firm'].str.contains('Smith'),
                                  df['url1'].apply(fetch, backup='Yes'),
                                  np.nan))
TypeError: fetch() takes 2 positional arguments but --some other number-- were given
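A likely reading of this error: `Series.apply` forwards `args` as extra positional arguments to every single call, so `args=df['id']` hands the whole id column to each call instead of pairing one id with one url. In addition, `np.where` evaluates both branches for all rows before it selects, so the function would run on rows that should be skipped. Below is a minimal sketch that pairs id and url row by row for the selected rows only. `fetch_stub` is a hypothetical stand-in for the real `fetch`, and the example URLs are made up:

```python
import numpy as np
import pandas as pd

def fetch_stub(id, url, **kwargs):
    # Hypothetical stand-in for fetch(); returns a label instead of crawling.
    return 'fetched {} from {}'.format(id, url)

df = pd.DataFrame({
    'firm': ['Smith', 'Jones', 'Smith New York'],
    'id': [np.nan, 732, 216],
    'url1': ['a.example', np.nan, 'b.example'],
    'text': ['foo', 'bar', np.nan],
})

# Boolean mask for the rows that need fetching.
mask = df['text'].isna() & df['url1'].notna() & df['firm'].str.contains('Smith')
selected = df.loc[mask]

# Call the function once per selected row, pairing each id with its url.
results = pd.Series(
    [fetch_stub(i, u, backup='Yes') for i, u in zip(selected['id'], selected['url1'])],
    index=selected.index,
    dtype=object,
)

# Assignment aligns on the index, so rows outside the mask become NaN.
df['content'] = results
```

Because the results keep the original index, no separate merge step is needed.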

Passing a pandas Series through args therefore does not work, and I cannot figure out how to pass the id as a scalar. Another failed approach, using only two of the columns/series:

df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
    df['url2'].notnull() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch nothing
df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
    df['url2'].isna() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch one
TypeError: ("fetch() missing 1 required positional argument: 'url1'", 'occurred at index id')
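The "occurred at index id" part of the message suggests that `DataFrame.apply` was running with its default `axis=0`, which passes each column to the function as one Series, so the function receives a single argument. A small sketch of the difference, assuming a two-column frame; `fetch_stub` is again a hypothetical stand-in for `fetch`:

```python
import pandas as pd

def fetch_stub(id, url, **kwargs):
    # Hypothetical stand-in for fetch(); returns a label instead of crawling.
    return 'fetched {} from {}'.format(id, url)

df = pd.DataFrame({'id': [732, 216], 'url1': ['a.example', 'b.example']})

# Default axis=0: the function is called once per column with a whole Series.
seen_by_column = []
df.apply(lambda s: seen_by_column.append(s.name))

# axis=1: the function is called once per row; the row is still a single argument.
seen_by_row = []
df.apply(lambda row: seen_by_row.append(tuple(row)), axis=1)

# Unpack the row inside the lambda to call a two-argument function.
content = df.apply(lambda row: fetch_stub(row['id'], row['url1'], backup='Yes'), axis=1)
```

Here `seen_by_column` collects the column names and `seen_by_row` collects one tuple per row.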

And finally I tried lambda:

df['text'].where(df['text'].isna() & df['url1'].notnull() & df['url2'].isna()
   & df['firm'].str.contains('Smith'), df[['id', 'url1']].apply(lambda x,y: get_XML(x,y)))
TypeError: ("<lambda>() missing 1 required positional argument: 'y'", 'occurred at index id')
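Two things seem to be going on here. The lambda receives one argument (a column or a row), never two, and `Series.where` keeps the values where the condition is True and replaces the rest, which is the opposite of the intent. `Series.mask` replaces where the condition is True. A hedged sketch with a one-argument lambda over rows; `fetch_stub` is a hypothetical stand-in and the data is made up:

```python
import numpy as np
import pandas as pd

def fetch_stub(id, url, **kwargs):
    # Hypothetical stand-in for fetch(); returns a label instead of crawling.
    return 'fetched {} from {}'.format(id, url)

df = pd.DataFrame({
    'id': [101, 216, 305],
    'url1': ['a.example', 'b.example', 'c.example'],
    'text': ['foo', np.nan, 'bar'],
})

cond = df['text'].isna() & df['url1'].notna()

# Compute replacements for the matching rows only, then align with the full index.
replacement = df.loc[cond].apply(
    lambda row: fetch_stub(row['id'], row['url1'], backup='Yes'), axis=1
).reindex(df.index)

# mask() replaces values where cond is True; where() would do the reverse.
df['text_new'] = df['text'].mask(cond, replacement)
```

This sketch assumes at least one row matches the condition; with no matching rows, `apply` with `axis=1` returns an empty DataFrame and would need separate handling.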

I assume I am missing something simple, but obviously crucial. Any pointers are appreciated.


Edit - Solution


I took comments from Damien Ayers (see below) to heart and simplified the code. This then also put me on the path to the solution:

def get_ft(text, id, url1, url2, firm, backup='Yes'):
    if pd.notnull(id):
        # url1 is preferred, so url2 may be present or missing here
        if pd.isna(text) and pd.notnull(url1):
            if 'Smith' in firm:
                return fetch(id, url1, backup=backup)
            ... code continues

And here is the proper use of apply and lambda, thanks to this discussion:

df['text_new'] = df.apply(lambda x: get_ft(x['text'], x['id'], x['url1'],
                                           x['url2'], x['firm'], backup='Yes'), axis=1)

Much cleaner and, more importantly, it works.
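The same conditions can also be written as named boolean Series, so the function is only called for the rows that need it. This is a sketch under assumptions, not the code from the question: `fetch_stub` is a hypothetical stand-in for `fetch`, the URLs are made up, and treating url2 as a fallback when url1 is missing is my reading of "url1 is preferred":

```python
import numpy as np
import pandas as pd

def fetch_stub(id, url, backup='Yes'):
    # Hypothetical stand-in for fetch(); returns a label instead of crawling.
    return 'fetched {} from {}'.format(id, url)

df = pd.DataFrame({
    'firm': ['Smith', 'Jones', 'Smith New York', 'Smith Ltd'],
    'id': [np.nan, 732, 216, 40],
    'url1': ['a.example', np.nan, 'b.example', np.nan],
    'url2': ['x.example', 'y.example', 'z.example', 'w.example'],
    'text': ['foo', 'bar', np.nan, np.nan],
})

# One named condition per rule keeps each rule readable and testable.
has_id = df['id'].notna()
needs_text = df['text'].isna()
is_smith = df['firm'].str.contains('Smith')

# Assumption: url1 wins when both are present, url2 is only a fallback.
best_url = df['url1'].fillna(df['url2'])

to_fetch = has_id & needs_text & is_smith & best_url.notna()

# Start from the existing text and fill in only the selected rows.
df['text_new'] = df['text'].copy()
for idx in df.index[to_fetch]:
    df.loc[idx, 'text_new'] = fetch_stub(df.loc[idx, 'id'], best_url[idx])
```

Rows outside `to_fetch` keep their existing text, so nothing has to be merged back afterwards.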

  • In your np.where attempt there's two spots where url1 in df['url] is missing a closing quote mark. But maybe that's an error that happened when writing the question? Commented Apr 22, 2019 at 23:18
  • Thanks @DamienAyers. I fixed it here. It did not do the trick, unfortunately. Commented Apr 22, 2019 at 23:21
  • When the conditionals start getting this complicated, it's often worth writing it out in long form as separate functions or named variables, rather than combining so much into a single expression. Commented Apr 22, 2019 at 23:21
  • The TypeError: ("fetch() missing 1 required positional argument: 'url'", 'occurred at index id') error is due to fetch expecting a keyword argument url, but pandas is calling it with the keyword url1, since that's what's in the dataframe. Commented Apr 22, 2019 at 23:32
  • There are also some more typos in the code examples that make it a bit harder to run: fd instead of df, and fetch hasn't defined the backup variable and has a = instead of ==. Commented Apr 22, 2019 at 23:34
