I would like to create a SplitName() function that 1) converts all letters to lower case, 2) splits each name entry on spaces (i.e. "John Snow" into "John" and "Snow"), and 3) builds a Pandas data frame with the split parts as new columns (one called "first name" and another called "last name").

I am able to create a new Series from the data frame, convert the names to lower case, and split them on spaces. But I don't know how to build an overall data frame that holds the original data frame's information as well as the new lower-cased and split values.

def SplitName():
    data = pd.read_csv("C:\data.csv")
    frame2 = DataFrame(data)
    frame2.columns = ["Name", "Ethnicity", "Event_Place", "Birth_Place"]
    name_lower = frame2["Name"].str.lower() # make names lower case
    name_split = name_lower.str.split() # split string element by space
    name_split_smallList = name_split[0:10] # small set to easily handle
    #print name_split_smallList
    '''for lastName in name_split_smallList:
        print lastName[0] + " " + lastName[-1]'''

    name_lower_list = name_lower.tolist()
    frame_all = frame2 + name_lower_list
    print frame_all[0:10]

1 Answer

To create a new column in a data frame, assign a Series to a new column name with an equals sign, the same way you would assign a value to a variable.
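As a minimal sketch of that assignment, assuming a small made-up data frame in place of the CSV from the question (the sample rows and the `name_lower` column name are illustrative, not from the original data):

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
df = pd.DataFrame({"Name": ["John Snow", "Arya Stark"],
                   "Ethnicity": ["A", "B"]})

# Assigning to a new key adds a column; the existing columns are untouched.
df["name_lower"] = df["Name"].str.lower()
```

The original columns stay in place, so the result holds both the original information and the new lower-cased values.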

The following assumes that the CSV file has a header called 'Name' and that every Name splits into exactly two parts, i.e. there are no middle names. This might be something you want to check first: a single-word name would raise an IndexError at index position 1, and any parts beyond the second would be dropped. The function creates a data frame by reading the CSV file, then creates two Series objects of lowered strings. The 'first_name' Series takes the lowered string at index position 0 for each value of Name split by whitespace, and the 'second_name' Series takes the lowered string at index position 1. Both Series are built with list comprehensions.

import pandas as pd

def SplitName():
    DF = pd.read_csv(r"C:\data.csv")  # read_csv already returns a DataFrame
    # Lower-case each name, split on whitespace, keep the first / second part.
    DF['first_name'] = pd.Series([Name.lower().split()[0] for Name in DF['Name']], index=DF.index)
    DF['second_name'] = pd.Series([Name.lower().split()[1] for Name in DF['Name']], index=DF.index)
    return DF
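If some names might contain middle names, one possible variation is to use the vectorized `.str` accessor and take the first and last word instead of positions 0 and 1. This is only a sketch under assumptions: the function name `split_name`, its `column` parameter, the `last_name` column name, and the sample rows are all hypothetical, and it takes a data frame as an argument instead of reading the CSV itself.

```python
import pandas as pd

def split_name(df, column="Name"):
    """Return a copy of df with lower-cased first_name / last_name columns."""
    out = df.copy()
    parts = out[column].str.lower().str.split()  # list of words per row
    out["first_name"] = parts.str[0]   # first word of each name
    out["last_name"] = parts.str[-1]   # last word, skipping any middle names
    return out

# Hypothetical sample rows, one of them with a middle name.
sample = pd.DataFrame({"Name": ["John Snow", "Mary Ann Smith"]})
result = split_name(sample)
```

Note that with this approach a single-word name would end up with the same value in both new columns, so it may still be worth checking the data first.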