4

I want to extract a substring (a title such as Mr., Mrs., Miss, etc.) from a column (Name) in a pandas dataframe and then write it back into the dataframe as a new column (Title).

In the Name column of the dataframe I have names such as "Braund, Mr. Owen Harris". The two delimiters are the comma (,) and the period (.).

I have attempted to use a split method, but this only splits the original string in two within a list, so I still end up with ['Braund', ' Mr. Owen Harris'] in the list.

import pandas as pd
#import re
df_Train = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTliZmavBsJCFDiEwxcSIIftu-0gR9p34n8Bq4OUNL4TxwHY-JMS6KhZEbWr1bp91UqHPkliZBBFgwh/pub?gid=1593012114&single=true&output=csv')
a= df_Train['Name'].str.split(',')
for i in a:
    print(i[1])

I am thinking this might be a situation where regex comes into play. My reading suggests a lookahead (?=,) and lookbehind (?<='.') approach should do the trick. For example:

import re
a= df_Train['Name'].str.split(r'(?=,)*(?<='.'))
for i in a:
    print(i)
    print(i[1])

But I am running into errors (EOL while scanning string literal). Can someone point me in the right direction?

Cheers Mike

2 Answers

8

You can do it like this:

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()

Output head(5):

0       Mr
1      Mrs
2     Miss
3      Mrs
4       Mr

Summary of results:

df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()\
             .value_counts()

Output

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Lady              1
Mme               1
Sir               1
Ms                1
the Countess      1
Jonkheer          1
Don               1
Capt              1
Name: Name, dtype: int64
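To write the result back as the Title column the question asks for, the chained expression can be assigned to a new column. This is a minimal sketch, assuming a small hand-made DataFrame `df` with sample names in the same format rather than the full CSV:

```python
import pandas as pd

# Hypothetical sample rows in the "Surname, Title. Given names" format
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

# Keep the text after the comma, then the text before the first period,
# and strip the leading space left over from the first split
df["Title"] = (
    df["Name"]
    .str.split(",").str[1]
    .str.split(".").str[0]
    .str.strip()
)

print(df[["Name", "Title"]])
```

The same assignment should work on df_Train, provided every Name value contains both a comma and a period.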

5 Comments

What dataset are you using for this example?
OK, great. I wasn't aware you could chain the methods like that.
Ah, I missed that. Just FYI, you can call .value_counts() directly on the resulting series rather than .to_frame().groupby('Name')['Name'].count().
I don't understand how exactly this works. What does the str[1] part do? Can someone explain?
@deadcode, the .str accessor also works on lists, and .str[1] retrieves the second element of the list created by split(','). So in the case of "Braund, Mr. Owen Harris", split(',') returns a list of two elements, ["Braund", " Mr. Owen Harris"]. We use .str[1] to get the second element, " Mr. Owen Harris", and split that string into [" Mr", " Owen Harris"] using split('.'). Then we use the .str accessor again on that list to get the first element with .str[0], and .str.strip() removes the leading space.
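The chain described in that comment can be unpacked one step at a time. This is only an illustrative sketch on a single hand-typed name; the intermediate variable names are made up for the example:

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris"])

# Step 1: split on the comma; each element becomes a list
parts = names.str.split(",")

# Step 2: .str[1] indexes into each list and takes the second item
after_comma = parts.str[1]

# Step 3: split that piece on the period
title_parts = after_comma.str.split(".")

# Step 4: take the first item and strip the leading space
title = title_parts.str[0].str.strip()

for step in (parts, after_comma, title_parts, title):
    print(repr(step.iloc[0]))
```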
2

The error comes from the single quotes around the period inside your single-quoted regex string literal: the quote before the period ends the string early, and the quote after it opens a new string that is never closed (hence "EOL while scanning string literal"). I think you meant to use an escaped period, i.e. r'(?=,)*(?<=\.)'. However, you don't need lookahead/lookbehind here; it is simpler and more usual to use a capture group to describe the part you want. In this case that would be:

df_Train['Name'].str.extract(r", (\w*)\.")
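One caveat with \w* is that it cannot match a title containing a space, so a name like the "the Countess" entry gets no match. The sketch below compares that pattern with a slightly looser one on a few hand-typed sample rows. The looser pattern is a suggestion of mine and is not part of the answer above; expand=False asks str.extract to return a Series when there is a single capture group.

```python
import pandas as pd

# Hypothetical sample rows in the same format as the Name column
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    ]
})

# Pattern from the answer: word characters between ", " and "."
strict = df["Name"].str.extract(r", (\w*)\.", expand=False)

# Looser alternative (an assumption): anything up to the first period
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

print(strict.tolist())
print(df["Title"].tolist())
```

With the stricter pattern the third row comes back as NaN, while the looser one captures the multi-word title.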
