String manipulations using Python Pandas

Question

I have some name and ethnicity data, for example:

John Wick    English
Black Widow  French

I then manipulate each name into the form shown below:

John Wick  -> john#wick??????????????????????????????????
Black Widow -> black#widow????????????????????????????????

I then use a for loop to create multiple columns, each containing a 3-character substring of the padded name.

I also try to count the alphabetic characters in each name using re.findall.

I have two questions: 1) Is the for loop efficient? Can I replace it with better code, even though it works as is? 2) I can't get the code that counts the alphabetic characters to work. Any suggestions?

import pandas as pd
from pandas import DataFrame
import re

# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")  # raw string keeps the backslashes literal
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity

# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]

# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()

# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '') # Retain space and hyphen

# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace('[\s]', '#')

# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens

# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")

# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]

# Count number of characters
frame3["name_len"] = len(re.findall('[a-zA-Z]', name))

# Test outputs
print frame3

Bob Haffner · Accepted Answer · 2015-03-27 12:31:04Z

1) Regarding the loop, I can't think of a better way than what you're already doing.

2) Try frame3["name_len"] = frame3["name"].map(lambda x : len(re.findall('[a-zA-Z]', x)))
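
To make point 2 concrete: re.findall expects a single string, so passing the whole name column fails, while mapping the function over the column calls it once per name. Below is a minimal sketch on toy data; the three sample names are made up for illustration and are not from the asker's CSV. The str.count line is an alternative that is not part of the original suggestion, shown only for comparison.

```python
import re

import pandas as pd

# Toy stand-in for the asker's cleaned data (hypothetical values).
frame3 = pd.DataFrame({"name": ["john#wick", "black#widow", "anne-marie#o"]})

# Suggested fix: apply re.findall to each name separately and count the matches.
frame3["name_len"] = frame3["name"].map(lambda x: len(re.findall("[a-zA-Z]", x)))

# Alternative (an assumption, not from the answer): pandas' own
# Series.str.count counts regex matches per element without an explicit lambda.
frame3["name_len_alt"] = frame3["name"].str.count("[a-zA-Z]")

print(frame3)
```

Both columns should hold the same counts, since "#" and "-" are not matched by the character class.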

1 Comment

Bob Haffner Over a year ago

@KubiK888 Yes, become familiar with pandas' map() and apply(). They are very powerful.
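
As a rough illustration of the difference between the two methods mentioned in this comment, the sketch below uses map() for a one-value-in, one-value-out transformation and apply() with a function that returns a Series, which pandas expands into one column per entry. The trigrams helper, the padding width of 15, and the sample rows are all hypothetical and chosen to keep the output small. This is not claimed to be faster than the asker's for loop; it only shows another way to express the same windows.

```python
import pandas as pd

# Toy padded names (hypothetical; the question pads to 43 characters).
frame = pd.DataFrame({"name_filled": ["john#wick??????", "black#widow????"]})


def trigrams(s, n=5):
    """Return the first n overlapping 3-character windows of s as a Series."""
    labels = ["substr" + str(i + 1) for i in range(n)]
    return pd.Series([s[i:i + 3] for i in range(n)], index=labels)


# Series.map: the function receives one element and returns one value.
name_len = frame["name_filled"].map(lambda s: len(s.rstrip("?")))

# Series.apply with a function that returns a Series: the result is a
# DataFrame with one column per label (substr1 ... substr5).
substrings = frame["name_filled"].apply(trigrams)

# Attach the new columns next to the original one.
result = pd.concat([frame, substrings], axis=1)
print(result)
```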
