Splitting a string in a Python DataFrame

Question

I have a DataFrame in Python with a column of names (such as Joseph Haydn, Wolfgang Amadeus Mozart, Antonio Salieri and so forth).

I want to get a new column with the last names: Haydn, Mozart, Salieri and so forth.

I know how to split a string, but I could not find a way to apply it to a Series or a DataFrame column.

column.str.split. Add some example code, and you will likely get an answer.
– cel, Commented Sep 6, 2015 at 15:53

Community · Accepted Answer · 2017-05-23 11:45:11Z

32

If you have:

import pandas
data = pandas.DataFrame({"composers": [ 
    "Joseph Haydn", 
    "Wolfgang Amadeus Mozart", 
    "Antonio Salieri",
    "Eumir Deodato"]})

Assuming you want only the first name (and not a middle name like Amadeus):

data.composers.str.split(r'\s+').str[0]

This will give:

0      Joseph
1    Wolfgang
2     Antonio
3       Eumir
dtype: object

You can assign this to a new column in the same DataFrame:

data['firstnames'] = data.composers.str.split(r'\s+').str[0]

Last names would be:

data.composers.str.split(r'\s+').str[-1]

which gives:

0      Haydn
1     Mozart
2    Salieri
3    Deodato
dtype: object

(See also "Python Pandas: selecting element in array column" for accessing elements in an 'array' column.)
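As a variation on the last-name lookup above, here is a minimal sketch that stores the result in a new column, which is what the question asks for. It uses str.rsplit(n=1), an alternative that does not appear in the original answer, so each name is split at most once from the right. The column name lastnames is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({"composers": [
    "Joseph Haydn",
    "Wolfgang Amadeus Mozart",
    "Antonio Salieri",
    "Eumir Deodato"]})

# Split at most once, starting from the right, on any whitespace,
# then keep the final piece of each resulting list.
data["lastnames"] = data["composers"].str.rsplit(n=1).str[-1]

print(data)
```

Because only one split is performed, middle names never need to be separated and rejoined.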

For everything but the last name, you can apply " ".join(..) to all but the last element ([:-1]) of each row:

data.composers.str.split(r'\s+').str[:-1].apply(lambda parts: " ".join(parts))

which gives:

0              Joseph
1    Wolfgang Amadeus
2             Antonio
3               Eumir
dtype: object
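The "all but the last name" step can probably also be written without apply. The sketch below shows two variants under the same sample data; neither is from the original answer, and the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({"composers": [
    "Joseph Haydn",
    "Wolfgang Amadeus Mozart",
    "Antonio Salieri",
    "Eumir Deodato"]})

parts = data["composers"].str.split()

# Variant 1: drop the last element of each list, then join the rest
# with the .str accessor instead of apply + lambda.
given_joined = parts.str[:-1].str.join(" ")

# Variant 2: split once from the right and keep the left-hand piece.
given_rsplit = data["composers"].str.rsplit(n=1).str[0]

print(given_joined)
print(given_rsplit)
```

The two variants agree on this data but differ for single-word names: the first yields an empty string, while the second returns the name itself.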

edited May 23, 2017 at 11:45

CommunityBot

answered Sep 6, 2015 at 16:02

Andre Holzner

3 Comments

Rene Decol Over a year ago

Thanks Andre. I had almost arrived at the same solution, but yours is more elegant. In any case, I was intrigued by the double use of "str" in "data.composers.str.split('\s+').str[-1]". I would never have been able to deduce that by logic alone. Thanks anyway.

Andre Holzner Over a year ago

I arrived at this solution iteratively: by googling 'pandas dataframe strings' I found pandas.pydata.org/pandas-docs/stable/text.html, where I searched for split. (Incidentally, you'll also find an example about split when you do help(data.composers), after the variable data has been defined as above.) The second part (accessing elements of columns whose entries are lists themselves) I found in the linked answer stackoverflow.com/questions/26069235/…
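To make the double use of .str less mysterious, this small sketch inspects the intermediate result. It assumes the same sample data as the answer; the name parts is illustrative.

```python
import pandas as pd

data = pd.DataFrame({"composers": [
    "Joseph Haydn",
    "Wolfgang Amadeus Mozart",
    "Antonio Salieri",
    "Eumir Deodato"]})

# First .str: split every string, producing a Series whose entries are lists.
parts = data["composers"].str.split()
print(parts.iloc[1])

# Second .str: the accessor also works element-wise on those lists,
# so it can report their length or index into them.
print(parts.str.len())
print(parts.str[-1])
```

So the first .str operates on strings and the second on the lists that split produced.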

ggorlen Over a year ago

I don't think you need '\s+'. That's the default of split().
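A quick sketch comparing the two spellings on deliberately messy input. The sample names and the explicit dtype=object are assumptions made for this illustration, so that the plain Python string implementation is used regardless of the pandas version.

```python
import pandas as pd

# Extra interior spaces, a tab, and one trailing space.
names = pd.Series(
    ["Joseph  Haydn", "Wolfgang\tAmadeus Mozart", "Antonio Salieri "],
    dtype=object,
)

# No pattern: split on runs of whitespace, ignoring leading/trailing ones.
default_last = names.str.split().str[-1]

# Explicit regular expression: trailing whitespace leaves an empty last piece.
regex_last = names.str.split(r"\s+").str[-1]

print(default_last.tolist())
print(regex_last.tolist())
```

On clean data the two give the same result; with stray leading or trailing whitespace, the argument-free form appears to be the more forgiving choice.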

Mahdi4SM · Accepted Answer · 2020-05-11 08:19:21Z

-1

Try this to solve your problem:

import pandas as pd
df = pd.DataFrame(
    {'composers':
        [ 
            'Joseph Haydn', 
            'Wolfgang Amadeus Mozart', 
            'Antonio Salieri',
            'Eumir Deodato',
        ]
    }
)

df['lastname'] = df['composers'].str.split(n = 0, expand = False).str[1]

The resulting DataFrame looks like this:

                 composers lastname
0             Joseph Haydn    Haydn
1  Wolfgang Amadeus Mozart  Amadeus
2          Antonio Salieri  Salieri
3            Eumir Deodato  Deodato

edited May 11, 2020 at 8:19

answered May 11, 2020 at 7:46

Mahdi4SM

1 Comment

ggorlen Over a year ago

str[1] is the wrong index. It just appears to work on this cherry-picked input, but breaks on others. If your df has Mozart first, it gives "Amadeus" for that column rather than "Mozart". Better: df['composers'].str.split().str[-1] but then it's basically the same as the existing answer, so I don't think this answer adds value even if fixed.
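The difference between the two indices can be checked directly. This is a minimal sketch using the answer's sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"composers": [
    "Joseph Haydn",
    "Wolfgang Amadeus Mozart",
    "Antonio Salieri",
    "Eumir Deodato"]})

words = df["composers"].str.split()

# Index 1 is the second word, which is only the last name
# when a name has exactly two words.
second_word = words.str[1]

# Index -1 is always the final word.
last_word = words.str[-1]

print(second_word.tolist())
print(last_word.tolist())
```

For a single-word name, index 1 does not exist and the result is a missing value, whereas index -1 still returns the name.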
