Substring on pandas dataframe column

Question

I want to extract a substring (a title such as Mr., Mrs., Miss, etc.) from a column (Name) in a pandas dataframe and then write the new column (Title) back into the dataframe.

In the Name column of the dataframe I have a name such as "Braund, Mr. Owen Harris". The two delimiters are the comma (,) and the period (.).

I have attempted to use a split method, but this only splits the original string in two within a list, so I still end up with ['Braund', ' Mr. Owen Harris'] in the list.

import pandas as pd
#import re
df_Train = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTliZmavBsJCFDiEwxcSIIftu-0gR9p34n8Bq4OUNL4TxwHY-JMS6KhZEbWr1bp91UqHPkliZBBFgwh/pub?gid=1593012114&single=true&output=csv')
a= df_Train['Name'].str.split(',')
for i in a:
    print(i[1])

I am thinking this might be a situation where regex comes into play. My reading suggests a lookahead (?=,) and lookbehind (?<='.') approach should do the trick. For example:

import re
a= df_Train['Name'].str.split(r'(?=,)*(?<='.'))
for i in a:
    print(i)
    print(i[1])

But I am running into errors (EOL while scanning string literal). Can someone point me in the right direction?

Cheers Mike

Scott Boston · Accepted Answer · 2017-11-14 03:00:37Z

8

You can do it like this:

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()

Output head(5):

0       Mr
1      Mrs
2     Miss
3      Mrs
4       Mr
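The question also asks how to write the result back as a new Title column, which the one-liner does not show. A minimal sketch of that step follows, assuming a small hand-made frame (`df` and its three rows are hypothetical stand-ins for df_Train, not data loaded from the original CSV):

```python
import pandas as pd

# Hypothetical sample rows in the same "Surname, Title. Given names" layout.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

# Take the text after the first comma, then the text before the next period,
# and assign the stripped result to a new column.
df["Title"] = (
    df["Name"]
    .str.split(",").str[1]
    .str.split(".").str[0]
    .str.strip()
)

print(df["Title"].tolist())
```

The same assignment should work on df_Train directly, provided every Name value contains at least one comma followed by a period.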

Summary of results

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()\
             .value_counts()

Output

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Lady              1
Mme               1
Sir               1
Ms                1
the Countess      1
Jonkheer          1
Don               1
Capt              1
Name: Name, dtype: int64

edited Nov 14, 2017 at 3:00

answered Nov 13, 2017 at 22:57

Scott Boston


5 Comments

maxymoo Over a year ago

what dataset are you using for this example?

Mike Over a year ago

Ok Great. I wasn't aware you could chain the methods like that.

maxymoo Over a year ago

ah missed that, just fyi you can just call .value_counts() on the resulting series rather than .to_frame().groupby('Name')['Name'].count()

deadcode Over a year ago

I don't understand how exactly this works. What does the str[1] part do? Can someone explain?

Scott Boston Over a year ago

@deadcode, the .str accessor also works on lists, and str[1] retrieves the second element of the list created by split(','). So in the case of "Braund, Mr. Owen Harris", split(',') returns a list of two elements ["Braund", " Mr. Owen Harris"], then we use str[1] to get the second element " Mr. Owen Harris", and we split that string into [" Mr", " Owen Harris"] using split('.'), then use the .str accessor again on the list to get the first element with .str[0]. The final .str.strip() removes the leading space.
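For readers following the explanation, the chain can be unrolled into one intermediate per step. This is only an illustrative sketch on a one-row Series; the variable names (`names`, `parts`, `after_comma`, `pieces`, `title`) are made up here and do not appear in the answer:

```python
import pandas as pd

# Hypothetical one-row sample in the dataset's name layout.
names = pd.Series(["Braund, Mr. Owen Harris"])

# Step 1: split on the comma; each element becomes a Python list.
parts = names.str.split(",")

# Step 2: .str[1] indexes into each list and takes the second item.
after_comma = parts.str[1]

# Step 3: split the remainder on the period.
pieces = after_comma.str.split(".")

# Step 4: take the first item and drop the surrounding whitespace.
title = pieces.str[0].str.strip()

print(parts.iloc[0], after_comma.iloc[0], pieces.iloc[0], title.iloc[0], sep="\n")
```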

maxymoo · Accepted Answer · 2017-11-13 23:11:22Z

2

The error comes from the single quotes around the period inside your single-quoted regex string literal: the first inner quote ends the string early. I think you meant to use an escaped period, i.e. r'(?=,)*(?<=\.)'. However, you don't need lookahead/lookbehind here; it is more usual and simpler to use a capture group. In this case that would be:

df_Train['Name'].str.extract(r", (\w*)\.")
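A small sketch of the capture-group approach on sample rows may help show its behaviour. The three names below are hypothetical rows in the dataset's layout, and the second pattern is an assumed variant, not part of the answer; it is included because `\w*` only matches a single-word title, so a title such as "the Countess" comes back as NaN.

```python
import pandas as pd

# Hypothetical sample rows in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# expand=False returns a Series when the pattern has one capture group.
one_word = names.str.extract(r", (\w*)\.", expand=False)

# Assumed variant: capture everything between the comma and the first
# period, so multi-word titles are kept as well.
any_title = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(one_word.tolist())
print(any_title.tolist())
```

Which pattern is preferable depends on whether multi-word titles should be kept or treated as missing.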

edited Nov 13, 2017 at 23:11

answered Nov 13, 2017 at 22:59

maxymoo
