I have the following table. The 'location' column repeats the state, so I am trying to remove the state from it so that it contains only the city name:

year    location          state   success
2009    New York, NY      NY      1
2009    New York, NY      NY      1
2009    Chicago, IL       IL      1
2009    New York, NY      NY      1
2009    Boston, MA        MA      1
2009    Long Beach, CA    CA      1
2009    Atlanta, GA       GA      1

I have tried the following code:

x = KS_clean.column(1)
np.chararray.split(x, ',')

How can I split the string so the result only contains the city name like the following:

array('New York', 'New York', 'Chicago', ...,) 

so that I can put it back inside the table?

Sorry if this is a basic question, but I am new to Python and still learning. Thanks.

  • Your data looks like a pandas DataFrame, not a numpy array. Please check. Commented Aug 12, 2017 at 6:37
  • It is a pandas DataFrame but when I extract the column (var x) and check its type it says numpy.ndarray Commented Aug 12, 2017 at 6:42
  • How did you get the dataframe in the first place? It looks odd. When you select a column, you must get a Series, not anything-numpy. Commented Aug 12, 2017 at 6:54

1 Answer

I think you need to work with a DataFrame first (e.g. one created by read_csv):

import pandas as pd
from io import StringIO

temp = u"""year;location;state;success
2009;New York, NY;NY;1
2009;New York, NY;NY;1
2009;Chicago, IL;IL;1
2009;New York, NY;NY;1
2009;Boston, MA;MA;1
2009;Long Beach, CA;CA;1
2009;Atlanta, GA;GA;1"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";")

print (type(df))
<class 'pandas.core.frame.DataFrame'>

print (df)
   year        location state  success
0  2009    New York, NY    NY        1
1  2009    New York, NY    NY        1
2  2009     Chicago, IL    IL        1
3  2009    New York, NY    NY        1
4  2009      Boston, MA    MA        1
5  2009  Long Beach, CA    CA        1
6  2009     Atlanta, GA    GA        1

Then split each string with str.split and select the first element of each resulting list with str[0]:

df['location'] = df['location'].str.split(', ').str[0]
print (df)
   year    location state  success
0  2009    New York    NY        1
1  2009    New York    NY        1
2  2009     Chicago    IL        1
3  2009    New York    NY        1
4  2009      Boston    MA        1
5  2009  Long Beach    CA        1
6  2009     Atlanta    GA        1
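
To make the two steps easier to follow, here is a minimal sketch on a small Series. The names `locations`, `parts` and `cities` are only illustrative and are not part of the original data:

```python
import pandas as pd

# A small stand-in for the 'location' column (illustrative values only)
locations = pd.Series(["New York, NY", "Chicago, IL", "Long Beach, CA"])

# Step 1: str.split turns every string into a list of pieces
parts = locations.str.split(", ")
print(parts.tolist())

# Step 2: str[0] picks the first piece of each list, i.e. the city
cities = parts.str[0]
print(cities.tolist())
```

A value without a comma stays a one-element list after the split, so str[0] should return it unchanged.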

Finally, if necessary, convert the DataFrame to a NumPy array with values:

arr = df.values
print (arr)
[[2009 'New York' 'NY' 1]
 [2009 'New York' 'NY' 1]
 [2009 'Chicago' 'IL' 1]
 [2009 'New York' 'NY' 1]
 [2009 'Boston' 'MA' 1]
 [2009 'Long Beach' 'CA' 1]
 [2009 'Atlanta' 'GA' 1]]
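
If you only want the city names as a 1-D array, as in the question, something like the sketch below might work. It assumes a small hand-built DataFrame rather than your real KS_clean table, and the second route assumes the column is already a NumPy array of strings, as described in the comments:

```python
import numpy as np
import pandas as pd

# Illustrative data only, shaped like the table in the question
df = pd.DataFrame({
    "year": [2009, 2009, 2009],
    "location": ["New York, NY", "Chicago, IL", "Boston, MA"],
    "state": ["NY", "IL", "MA"],
})

# pandas route: strip the state, then take the single column as a 1-D array
cities = df["location"].str.split(", ").str[0].to_numpy()
print(cities)

# NumPy-only route: partition each string at the first ', ' and keep the left part.
# astype(str) converts an object array to a fixed-width string array first.
raw = df["location"].to_numpy().astype(str)
cities_np = np.char.partition(raw, ", ")[:, 0]
print(cities_np)
```

Either array can be assigned back with df['location'] = cities.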