0

Apologies for the messy title: Problem as follows:

I have some data frame of the form:

df1 = 
    Entries
0    "A Level"
1    "GCSE"
2    "BSC"

I also have a data frame of the form:

df2 = 
    Secondary    Undergrad
0    "A Level"    "BSC"
1    "GCSE"       "BA"
2    "AS Level"   "MSc"

I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match, are saved (Words_Present):

def word_search(df,group,words):
Yes, No = 0,0
Words_Present = []    
    for i in words:    
    match_object = re.search(i,df)
    if match_object:
        Words_Present.append(i)
        Yes = 1    
    else:
        No = 0
if Yes == 1:
    Attribute = 1

return Attribute

I apply this function over all entries in df1, and all columns in df2, using the following iteration:

for i in df2:

    terms = df2[i].values.tolist()

    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))

This yields an output df which looks something like:

df1 = 
    Entries    Secondary     undergrad
0   "A Level"     1              0
1    "GCSE"       1              0
2    "AS Level"   1              0

I want to amend the Word_Search function to output a the Words_Present list as well as the Attribute, and input these into a new column, so that my eventual df1 array looks like:

Desired dataframe:

    Entries        Secondary  Words Found          undergrad  Words Found
0   "A Level"     1            "A Level"                0
1    "GCSE"       1            "GCSE"                   0
2    "AS Level"   1            "AS Level"               0

If I do:

def word_search(df,group,words):
Yes, No = 0,0
Words_Present = []    
    for i in words:    
    match_object = re.search(i,df)
    if match_object:
        Words_Present.append(i)
        Yes = 1    
    else:
        No = 0
if Yes == 1:
    Attribute = 1
if Yes == 0:
    Attribute = 0

return Attribute,Words_Present

My function therefore now has multiple outputs. So applying the following:

for i in df2:

    terms = df2[i].values.tolist()

    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))

My Output Looks like this:

    Entries        Secondary                   undergrad
0   "A Level"     [1,"A Level"]                0
1    "GCSE"       [1, "GCSE"]                  0
2    "AS Level"   [1, "AS Level"]              0

The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.

Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?

I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:

Applying function with multiple arguments to create a new pandas column

Create multiple columns in Pandas Dataframe from one function

Apply pandas function to column to create multiple new columns?

For example, I have also tried:

for i in df2:

    terms = df2[i].values.tolist()

    [df1[i],df1[i]+"Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x,i,terms))])

but this simply yields errors such as:

raise ValueError('Length of values does not match length of ' 'index')

Is there a way to use apply, but still extract the extra information directly into multiple columns?

Many thanks, apologies for the length.

3
  • 1
    I've answered your question, but I'd like to make a side remark about expressions like df1['course'][0:1]: This is called chained indexing, which should be avoided in pandas because it's slow, and often leads to problems when you want to change an element in the dataframe. Instead, df1.loc[0:1,'course'] is the correct way to refer to the same thing. Commented Dec 2, 2016 at 16:01
  • 1
    (Also, to nitpick, your word_search() function will return an error if no matching word is found, because Attribute is not assigned unless Yes == 1!) Commented Dec 2, 2016 at 16:04
  • Your first point: Thank you, I did not know, and was writing this code based on a file I inherited. Your second point: I made a mistake in copying my code: I had meant to include in both parts of the if statement. I missed it in the rush to get this up. Thank you for pointing both out. [Both are now updated] Commented Dec 2, 2016 at 16:06

1 Answer 1

1

The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().

However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.

It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:

df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1

EDIT: In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.

However, if you have exact matches, you can do everything by a combination of pivots and joins:

keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in df.category:
    df[colname] = (df.Entries == colname) * 1 # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']

The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.

Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!

For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:

secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values
def(s):
    if s.Entries.isin(secondary_words):
        return pd.Series({'Entries':s.Entries, 'Secondary':1, 'secondary_words_found':s.Entries, 'Undergrad':0, 'undergrad_words_found':''})
    elif s.Entries.isin(undergrad_words ):
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':1, 'undergrad_words_found':s.Entries})
    else:
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':0, 'undergrad_words_found':''})

This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.

Sign up to request clarification or add additional context in comments.

10 Comments

Ken You are correct in your deduction of "It looks like you are trying to classify Entries into Secondary or Undergrad". That is entirely correct. The number of categories is however much longer (Secondary, undergrad, Postgrad... etc. ) and even more so for the associated words for each category. That was my reason for using the for loop. There actually may be a case where there are multiple key words registered - I want to save all of these words in a list of strings for each entry in df1. How might this change what you have written? Thank you so much for your help :)
In fact, there are so many entries, I will have to do it automatically, as I do not have time to write out ...str.extract('(A Level|GCSE|AS Level)') for each of the texts in each category. Though, I should be able to make this more automated and include some of the Re module to make this work?
Do you expect to get exact matches? i.e. the entries are exactly equal to 'A level' or 'GCSE', etc., and so cases like, say, 'GCSE-O' would not have a match (assuming 'GCSE-O' is not in your table of names)? In that case, there's an even cleaner way of doing things.
I've edited my answer. You should start by cleaning the data: do df1.Entries.unique() to keep track of all the possibilities, then use a dictionary to define your replacements, e.g. repl_dict = {'A-Level':'A Level', 'ALevel': 'A Level'} will replace any occurrence of 'A-Level' or 'ALevel' with 'A Level'. Then do df1.replace({'Entries':repl_dict}, inplace = True).
A pragmatic alternative is to move everything into the same form: e.g. start with df['Entries'] = df.Entries.str.lower().str.replace({'-':''}), which for example, replaces 'A-level', 'A level' and 'Alevel' with 'alevel'. Then just make one replacement of 'alevel' with 'A Level' and you're done. (I'd recommend this--try removing all non-alphabetic characters, and see how many unique entries there are. These are the number of replacements you need to type out, which will most likely be fewer than explicitly identifying every variation.)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.