In pandas, how can one column be derived from multiple other columns?

For example, let's say I wanted to annotate my dataset with the correct form of address for each subject, perhaps to label some plots so I can tell who the results are for.

Take a dataset:

import pandas as pd

data = [('male', 'Homer', 'Simpson'), ('female', 'Marge', 'Simpson'), ('male', 'Bart', 'Simpson'), ('female', 'Lisa', 'Simpson'), ('infant', 'Maggie', 'Simpson')]
people = pd.DataFrame(data, columns=["gender", "first_name", "last_name"])

So we have:

   gender first_name last_name
0    male      Homer   Simpson
1  female      Marge   Simpson
2    male       Bart   Simpson
3  female       Lisa   Simpson
4  infant     Maggie   Simpson

And a function that I want to apply to each row, storing the result in a new column:

def get_address(gender, first, last):
    title=""
    if gender=='male':
        title='Mr'
    elif gender=='female':
        title='Ms'

    if title=='':
        return first + ' '+ last
    else:
        return title + ' ' + first[0] + '. ' + last

Currently my method is:

people['address'] = map(lambda row: get_address(*row),people.get_values())



   gender first_name last_name         address
0    male      Homer   Simpson   Mr H. Simpson
1  female      Marge   Simpson   Ms M. Simpson
2    male       Bart   Simpson   Mr B. Simpson
3  female       Lisa   Simpson   Ms L. Simpson
4  infant     Maggie   Simpson  Maggie Simpson

This works, but it is not elegant. It also feels bad to convert to an unindexed list and then assign back into an indexed column.

  • You can use apply with the axis=1 argument to apply by row. Commented Aug 1, 2014 at 5:48
  • That seems to be the answer. Would you like to make it one so I can accept it? Commented Aug 1, 2014 at 6:13

2 Answers

What you are looking for is apply(func,axis=1). This will apply a function row-wise through your dataframe.

In your example modify your method get_address to...

def get_address(row):#row is a pandas series with col names as indexes
    title=""
    gender = row['gender']     #extract gender from pandas series
    first = row['first_name']  #extract firstname from pandas series
    last = row['last_name']    #extract lastname from pandas series

    if gender=='male':
        title='Mr'
    elif gender=='female':
        title='Ms'

    if title=='':
        return first + ' '+ last
    else:
        return title + ' ' + first[0] + '. ' + last

Then call people.apply(get_address,axis=1), which returns a new column. (Actually this is a pandas Series with the correct index, which is how the dataframe knows how to add it as a column correctly.) To add it to your dataframe, use:

people['address'] = people.apply(get_address,axis=1)
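
For reference, here is a minimal self-contained sketch of the same approach, assuming Python 3 and a reasonably recent pandas. The dictionary lookup for the title is just a shorthand for the if/elif above, not something the question requires. The assert checks that the Series returned by apply carries the same index as the dataframe, which is why the assignment lines up.

```python
import pandas as pd

data = [('male', 'Homer', 'Simpson'), ('female', 'Marge', 'Simpson'),
        ('male', 'Bart', 'Simpson'), ('female', 'Lisa', 'Simpson'),
        ('infant', 'Maggie', 'Simpson')]
people = pd.DataFrame(data, columns=["gender", "first_name", "last_name"])

def get_address(row):
    # row is a pandas Series; the column names are its index labels
    title = {'male': 'Mr', 'female': 'Ms'}.get(row['gender'], '')
    if title == '':
        return row['first_name'] + ' ' + row['last_name']
    return title + ' ' + row['first_name'][0] + '. ' + row['last_name']

# axis=1 passes one row at a time to get_address
address = people.apply(get_address, axis=1)

# the result is indexed like the dataframe, so the assignment aligns row by row
assert address.index.equals(people.index)
people['address'] = address
print(people)
```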

You can do this without any explicit looping:

In [70]: df
Out[70]:
   gender first_name last_name
0    male      Homer   Simpson
1  female      Marge   Simpson
2    male       Bart   Simpson
3  female       Lisa   Simpson
4  infant     Maggie   Simpson

In [71]: title = df.gender.replace({'male': 'Mr', 'female': 'Ms', 'infant': ''})

In [72]: initial = np.where(df.gender != 'infant', df.first_name.str[0] + '. ', df.first_name + ' ')
In [73]: initial
Out[73]: array(['H. ', 'M. ', 'B. ', 'L. ', 'Maggie '], dtype=object)

In [74]: address = (title + ' ' + Series(initial) + df.last_name).str.strip()

In [75]: address
Out[75]:
0     Mr H. Simpson
1     Ms M. Simpson
2     Mr B. Simpson
3     Ms L. Simpson
4    Maggie Simpson
dtype: object

Check out the documentation for the Series.str methods; they're pretty rad. Most methods from str are implemented, in addition to goodies like extract.
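
As a small illustration of extract, here is a sketch that pulls the pieces back out of address strings like the ones above. The regular expression and the group names (title, initial, last) are made up for this example; rows that do not match the pattern come back as missing values.

```python
import pandas as pd

names = pd.Series(['Mr H. Simpson', 'Ms M. Simpson', 'Maggie Simpson'])

# Each named group becomes a column of the resulting DataFrame.
# 'Maggie Simpson' has no title, so the pattern does not match that row.
pattern = r'^(?P<title>Mr|Ms) (?P<initial>[A-Z])\. (?P<last>\w+)$'
parts = names.str.extract(pattern)
print(parts)
```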

6 Comments

The string manipulation was just an example. While these string methods are good to know, they don't help with my actual problem, which cannot be done with concatenation. (My actual problem involves parsing strings into lists, then checking for the presence of a 1 or 0 in one list and, if so, marking the corresponding element in the other list with an asterisk, but I didn't want to put that in my example as it is long and harder to follow.) I suspect I could do something with the str methods, but I think it would be even harder to follow.
The more general apply will be slower. It's better to find a way to vectorize the operations. When you have more data, a general apply will not scale very well, especially the row-by-row version: each row is converted to a Series of uniform type, which is annoying to use and inefficient if you have mixed types.
You should post your original problem, which I think can be solved with isin and where.
For your reference, string operations are not really very vectorisable (because they don't come up against the BLAS libraries). The string functions you mention appear to be largely implemented with for-loops: github.com/pydata/pandas/blob/master/pandas/core/strings.py. They are more readable, though.
Actually most of those are implemented in Cython which speeds up loops considerably. By vectorization I simply meant applying operations on whole sequences rather than single elements at a time, which is unrelated to the use of BLAS. What I'm saying is that spending a bit of time trying to avoid apply will probably yield reusable and more performant code.
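
When the per-row logic really cannot be vectorized, one possible middle ground (a sketch, not benchmarked here) is to keep the original three-argument function and walk the columns in parallel with zip. This avoids building a Series for every row, and wrapping the result in a Series with the dataframe's index keeps the assignment index-aware. The shortened dataset and the dictionary lookup are simplifications for this example.

```python
import pandas as pd

people = pd.DataFrame(
    [('male', 'Homer', 'Simpson'), ('female', 'Marge', 'Simpson'),
     ('infant', 'Maggie', 'Simpson')],
    columns=['gender', 'first_name', 'last_name'])

def get_address(gender, first, last):
    # plain Python function of scalars, as in the question
    title = {'male': 'Mr', 'female': 'Ms'}.get(gender, '')
    if title == '':
        return first + ' ' + last
    return title + ' ' + first[0] + '. ' + last

# zip iterates the three columns together, yielding plain Python values
values = [get_address(g, f, l)
          for g, f, l in zip(people['gender'], people['first_name'], people['last_name'])]

# attach the dataframe's own index so the new column aligns by label
people['address'] = pd.Series(values, index=people.index)
print(people)
```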