Using Python, Pandas and Apply/Lambda, how can I write a function which creates multiple new columns?

Question

Apologies for the messy title: Problem as follows:

I have some data frame of the form:

df1 = 
    Entries
0    "A Level"
1    "GCSE"
2    "BSC"

I also have a data frame of the form:

df2 = 
    Secondary    Undergrad
0    "A Level"    "BSC"
1    "GCSE"       "BA"
2    "AS Level"   "MSc"

I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match, are saved (Words_Present):

def word_search(df,group,words):
Yes, No = 0,0
Words_Present = []    
    for i in words:    
    match_object = re.search(i,df)
    if match_object:
        Words_Present.append(i)
        Yes = 1    
    else:
        No = 0
if Yes == 1:
    Attribute = 1

return Attribute

I apply this function over all entries in df1, and all columns in df2, using the following iteration:

for i in df2:

    terms = df2[i].values.tolist()

    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))

This yields an output df which looks something like:

df1 = 
    Entries    Secondary     undergrad
0   "A Level"     1              0
1    "GCSE"       1              0
2    "AS Level"   1              0

I want to amend the Word_Search function to output a the Words_Present list as well as the Attribute, and input these into a new column, so that my eventual df1 array looks like:

Desired dataframe:

    Entries        Secondary  Words Found          undergrad  Words Found
0   "A Level"     1            "A Level"                0
1    "GCSE"       1            "GCSE"                   0
2    "AS Level"   1            "AS Level"               0

If I do:

def word_search(df,group,words):
Yes, No = 0,0
Words_Present = []    
    for i in words:    
    match_object = re.search(i,df)
    if match_object:
        Words_Present.append(i)
        Yes = 1    
    else:
        No = 0
if Yes == 1:
    Attribute = 1
if Yes == 0:
    Attribute = 0

return Attribute,Words_Present

My function therefore now has multiple outputs. So applying the following:

for i in df2:

    terms = df2[i].values.tolist()

    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))

My Output Looks like this:

    Entries        Secondary                   undergrad
0   "A Level"     [1,"A Level"]                0
1    "GCSE"       [1, "GCSE"]                  0
2    "AS Level"   [1, "AS Level"]              0

The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.

Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?

I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:

Applying function with multiple arguments to create a new pandas column

Create multiple columns in Pandas Dataframe from one function

Apply pandas function to column to create multiple new columns?

For example, I have also tried:

for i in df2:

    terms = df2[i].values.tolist()

    [df1[i],df1[i]+"Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x,i,terms))])

but this simply yields errors such as:

raise ValueError('Length of values does not match length of ' 'index')

Is there a way to use apply, but still extract the extra information directly into multiple columns?

Many thanks, apologies for the length.

I've answered your question, but I'd like to make a side remark about expressions like df1['course'][0:1]: This is called chained indexing, which should be avoided in pandas because it's slow, and often leads to problems when you want to change an element in the dataframe. Instead, df1.loc[0:1,'course'] is the correct way to refer to the same thing. — Ken Wei
– Ken Wei, Commented Dec 2, 2016 at 16:01
(Also, to nitpick, your word_search() function will return an error if no matching word is found, because Attribute is not assigned unless Yes == 1!) — Ken Wei
– Ken Wei, Commented Dec 2, 2016 at 16:04
Your first point: Thank you, I did not know, and was writing this code based on a file I inherited. Your second point: I made a mistake in copying my code: I had meant to include in both parts of the if statement. I missed it in the rush to get this up. Thank you for pointing both out. [Both are now updated] — Chuck
– Chuck, Commented Dec 2, 2016 at 16:06

Ken Wei · Accepted Answer · 2016-12-02 16:35:23Z

1

The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().

However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.

It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:

df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1

EDIT: In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.

However, if you have exact matches, you can do everything by a combination of pivots and joins:

keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in df.category:
    df[colname] = (df.Entries == colname) * 1 # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']

The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.

Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!

For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:

secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values
def(s):
    if s.Entries.isin(secondary_words):
        return pd.Series({'Entries':s.Entries, 'Secondary':1, 'secondary_words_found':s.Entries, 'Undergrad':0, 'undergrad_words_found':''})
    elif s.Entries.isin(undergrad_words ):
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':1, 'undergrad_words_found':s.Entries})
    else:
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':0, 'undergrad_words_found':''})

This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.

edited Dec 2, 2016 at 16:35

answered Dec 2, 2016 at 15:31

Ken Wei

3,1381 gold badge12 silver badges31 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

Chuck Over a year ago

Ken You are correct in your deduction of "It looks like you are trying to classify Entries into Secondary or Undergrad". That is entirely correct. The number of categories is however much longer (Secondary, undergrad, Postgrad... etc. ) and even more so for the associated words for each category. That was my reason for using the for loop. There actually may be a case where there are multiple key words registered - I want to save all of these words in a list of strings for each entry in df1. How might this change what you have written? Thank you so much for your help :)

Chuck Over a year ago

In fact, there are so many entries, I will have to do it automatically, as I do not have time to write out ...str.extract('(A Level|GCSE|AS Level)') for each of the texts in each category. Though, I should be able to make this more automated and include some of the Re module to make this work?

Ken Wei Over a year ago

Do you expect to get exact matches? i.e. the entries are exactly equal to 'A level' or 'GCSE', etc., and so cases like, say, 'GCSE-O' would not have a match (assuming 'GCSE-O' is not in your table of names)? In that case, there's an even cleaner way of doing things.

Ken Wei Over a year ago

I've edited my answer. You should start by cleaning the data: do df1.Entries.unique() to keep track of all the possibilities, then use a dictionary to define your replacements, e.g. repl_dict = {'A-Level':'A Level', 'ALevel': 'A Level'} will replace any occurrence of 'A-Level' or 'ALevel' with 'A Level'. Then do df1.replace({'Entries':repl_dict}, inplace = True).

Ken Wei Over a year ago

A pragmatic alternative is to move everything into the same form: e.g. start with df['Entries'] = df.Entries.str.lower().str.replace({'-':''}), which for example, replaces 'A-level', 'A level' and 'Alevel' with 'alevel'. Then just make one replacement of 'alevel' with 'A Level' and you're done. (I'd recommend this--try removing all non-alphabetic characters, and see how many unique entries there are. These are the number of replacements you need to type out, which will most likely be fewer than explicitly identifying every variation.)

|

Collectives™ on Stack Overflow

Using Python, Pandas and Apply/Lambda, how can I write a function which creates multiple new columns?

1 Answer 1

10 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related