I have some name and ethnicity data, for example:

John Wick    English
Black Widow  French

I then lowercase each name, replace the spaces with "#", and pad the right side with "?" up to a fixed width, like this:

John Wick  -> john#wick??????????????????????????????????
Black Widow -> black#widow????????????????????????????????
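For readers who want to try the transformation on a single string, here is a minimal plain-Python sketch. The helper name `normalize_name` is made up for illustration; the width of 43 and the "?" fill character are taken from the pandas code further down.

```python
import re

def normalize_name(name, width=43, fill="?"):
    """Lowercase, strip unwanted characters, mark spaces with '#', right-pad."""
    # Keep only letters, whitespace, and hyphens (hypothetical helper, not pandas).
    cleaned = re.sub(r"[^a-z\s\-]", "", name.lower())
    # Every whitespace character becomes '#'.
    marked = re.sub(r"\s", "#", cleaned)
    # Pad on the right with `fill` up to `width` characters.
    return marked.ljust(width, fill)

padded = normalize_name("John Wick")
print(padded[:12])   # john#wick???
print(len(padded))   # 43
```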

I then use a for loop to create 40 new columns, each holding one overlapping 3-character substring of the padded name.
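To make the windowing concrete, this small sketch shows what those substrings look like for one padded name. The function name `trigrams` is hypothetical; the count of 40 windows and the padded width of 43 mirror the loop in the code below.

```python
def trigrams(padded, count=40):
    """Return `count` overlapping 3-character windows at offsets 0..count-1."""
    return [padded[i:i + 3] for i in range(count)]

padded = "john#wick".ljust(43, "?")
grams = trigrams(padded)
print(grams[:4])   # ['joh', 'ohn', 'hn#', 'n#w']
print(grams[-1])   # ???
```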

I also try to count the letters in each name using re.findall.

I have two questions:

1) Is the for loop efficient? Can I replace it with better code, even though it works as is?
2) I can't get the code that counts the letters to work. Any suggestions?

import pandas as pd
from pandas import DataFrame
import re

# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")  # raw string so backslashes are not treated as escapes
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity

# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]

# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()

# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '', regex=True) # Retain space and hyphen

# Replace each whitespace character with "#"
frame3.loc[:, "name"] = frame3["name"].str.replace(r'\s', '#', regex=True)

# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens

# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")

# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]

# Count number of characters
frame3["name_len"] = len(re.findall('[a-zA-Z]', name))

# Test outputs
print(frame3)

1 Answer

1) Regarding the loop, I can't think of a better way than what you're already doing.
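If you want a variation anyway, one option is to build all the substring columns first and attach them with a single concat. This is only a sketch on made-up sample rows: it does the same slicing work, so any speed difference is likely small, and the main change is that the frame is extended once rather than 40 times.

```python
import pandas as pd

# Two sample rows standing in for the real data (illustrative only).
frame3 = pd.DataFrame({"name": ["john#wick", "black#widow"]})
frame3["name_filled"] = frame3["name"].str.pad(width=43, side="right", fillchar="?")

# Build all 40 substring columns in one dict, then attach them in one step.
substrings = pd.DataFrame(
    {"substr%d" % (i + 1): frame3["name_filled"].str[i:i + 3] for i in range(40)},
    index=frame3.index,
)
frame3 = pd.concat([frame3, substrings], axis=1)

print(frame3[["substr1", "substr2", "substr40"]])
```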

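2) re.findall expects a single string, but you are passing it a whole Series. Apply it to each name instead:

frame3["name_len"] = frame3["name"].map(lambda x: len(re.findall('[a-zA-Z]', x)))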
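As a quick check on sample data (the three names below are invented), the map() approach can be compared with pandas' built-in Series.str.count, which counts regex matches per element and should give the same numbers here:

```python
import re
import pandas as pd

# Sample cleaned names (illustrative only).
names = pd.Series(["john#wick", "black#widow", "anne-marie"])

# Approach from the answer: run re.findall on each string individually.
via_map = names.map(lambda x: len(re.findall("[a-zA-Z]", x)))

# Alternative: let pandas count the regex matches in each element.
via_count = names.str.count("[a-zA-Z]")

print(via_map.tolist())    # [8, 10, 9]
print(via_count.tolist())  # [8, 10, 9]
```

Note that the original line also used `name`, which points at the column before cleaning; counting on frame3["name"] uses the cleaned values.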

1 Comment

@KubiK888 Yes, become familiar with pandas' map() and apply(). They are very powerful.
