I am using pandas to read .csv data files. For one of my files I can index by column title. For the other I get this error:

File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 1023, in _check_have
    raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named State'

The code I used is:

filename = "PovertyEstimates.csv"
#filename = "nm.csv"

f = open(filename)
import pandas as pd

data = pd.read_csv(f)#, index_col=0)
print data['State']

Even when I pass index_col I get the same error (unless it is 0). When I print the data from the failing csv file in my terminal, it is not separated into columns like the data from the working file. Instead, the items in each row are printed consecutively, separated by spaces. I believe this incorrect separation is the problem.

I am using LibreOffice Calc on Ubuntu Linux. For the improperly formatted file (which appears in perfect format in LibreOffice) the terminal output is:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3194 entries, 0 to 3193
Data columns:
FIPStxt State   Area_name   Rural-urban_Continuum Code_2003       Urban_Influence_Code_2003 Rural-urban_Continuum Code_20013      Urban_Influence_Code_20013    POVALL_2011 CI90LBAll_2011    CI90UBALL_2011    PCTPOVALL_2011  CI90LBALLP_2011 CI90UBALLP_2011 POV017_2011 CI90LB017_2011  CI90UB017_2011  PCTPOV017_2011  CI90LB017P_2011 CI90UB017P_2011 POV517_2011 CI90LB517_2011  CI90UB517_2011  PCTPOV517_2011  CI90LB517P_2011 CI90UB517P_2011 MEDHHINC_2011   CI90LBINC_2011  CI90UBINC_2011  POV05_2011  CI90LB05_2011   CI90UB05_2011   PCTPOV05_2011   CI90LB05P_2011       CI90UB05P_2011    3194  non-null values
dtypes: object(1)

The first few lines of the csv file are:

FIPStxt State   Area_name   Rural-urban_Continuum Code_2003       
01000   AL  Alabama      
01001   AL  Autauga County  2   2
01003   AL  Baldwin County  4   5
  • what platform are you using, and can you provide an example of your terminal output? Commented Feb 21, 2014 at 23:15
  • Can you show us the first few lines of the csv (so we can see this). Is it comma separated? Commented Feb 21, 2014 at 23:46
  • The last block I added has it. The spaces are where the columns separate. I wanted to index using the titles along the top row ('State' or 'FIPStxt') Commented Feb 21, 2014 at 23:58
  • Maybe I'm just unfamiliar with Python, but is data = pd.read_csv(f)#, index_col=0) missing a left paren? Commented Feb 22, 2014 at 6:04

3 Answers

The spaces are probably the problem. You need to tell pandas what separator to use when parsing the CSV.

data = pd.read_csv(f, sep=" ")

The problem, though, is that every space is treated as a separator (e.g. Autauga County becomes two columns). It would be best to convert that file to an actual comma-separated (or semicolon-separated, etc.) file, or to make sure that compound values are quoted ("Autauga County") and then specify the quotechar:

data = pd.read_csv(f, sep=" ", quotechar='"')
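As a minimal sketch of the quoting approach, the snippet below parses a small in-memory sample instead of the real file. The sample text is an assumption made up for illustration (the real file does not appear to quote its values), and only the sep and quotechar parameters of read_csv are used.

```python
import io
import pandas as pd

# Hypothetical sample: space-separated, with compound values quoted.
sample = (
    'FIPStxt State Area_name\n'
    '1001 AL "Autauga County"\n'
    '1003 AL "Baldwin County"\n'
)

# Spaces inside the quotes are kept as part of the value, not treated as separators.
data = pd.read_csv(io.StringIO(sample), sep=" ", quotechar='"')

print(data.columns.tolist())
print(data["Area_name"].tolist())
```

With the quotes in place, Area_name stays a single column even though its values contain spaces.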

According to the pandas documentation:

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

By default, read_csv splits each line on commas. If your file doesn't contain commas, each line is read as a single field, so the whole file ends up in one column. Passing the sep parameter tells pandas where to split.
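To see this for yourself, here is a small sketch that compares the default separator with an explicit one. It assumes the file is tab-delimited, which the question does not confirm, and uses a made-up in-memory sample rather than PovertyEstimates.csv.

```python
import io
import pandas as pd

# Hypothetical sample, assumed to be tab-delimited like the file in the question.
sample = (
    "FIPStxt\tState\tArea_name\n"
    "1001\tAL\tAutauga County\n"
    "1003\tAL\tBaldwin County\n"
)

# Default separator (comma): the whole header becomes one column name,
# so looking up 'State' fails with a KeyError.
wrong = pd.read_csv(io.StringIO(sample))
print(len(wrong.columns), "State" in wrong.columns)

# Explicit tab separator: each field gets its own column.
right = pd.read_csv(io.StringIO(sample), sep="\t")
print(right["State"].tolist())
```

Checking len(data.columns) right after reading is a quick way to tell whether the separator was guessed wrong.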

The issue is that your file is not actually comma-separated. pd.read_csv expects comma-separated input by default. Your file has a different format, perhaps tab delimited.

We can use datatable's handy fread instead of pandas and then convert. fread will autodetect the file type:

$ pip3 install datatable
$ python3
>>> import datatable as dt
>>> import pandas as pd
>>> DT = dt.fread("PovertyEstimates.csv")
>>> data_pd = DT.to_pandas()
>>> DT.head()

   | index  FIPStxt  State    Area_name  Rural-urban_Continuum  Code_2003
   | int32  str32    str32    str32                      int32      int32
-- + -----  -------  -------  ---------  ---------------------  ---------
 0 |  1000  AL       Alabama  NA                            NA         NA
 1 |  1001  AL       Autauga  County                         2          2
 2 |  1003  AL       Baldwin  County                         4          5

>>> data_pd.head()

   index FIPStxt    State Area_name  Rural-urban_Continuum  Code_2003
0   1000      AL  Alabama      None                    NaN        NaN
1   1001      AL  Autauga    County                    2.0        2.0
2   1003      AL  Baldwin    County                    4.0        5.0

If you need the first column to include the leading zero, specify column types with that column as a string.

>>> DT = dt.fread("PovertyEstimates.csv", columns=[dt.str32, dt.str32, dt.str32, dt.str32, dt.int32, dt.int32])

>>> DT.head()
   | index  FIPStxt  State    Area_name  Rural-urban_Continuum  Code_2003
   | str32  str32    str32    str32                      int32      int32
-- + -----  -------  -------  ---------  ---------------------  ---------
 0 | 01000  AL       Alabama  NA                            NA         NA
 1 | 01001  AL       Autauga  County                         2          2
 2 | 01003  AL       Baldwin  County                         4          5
[3 rows x 6 columns]
>>> DT.ltypes
(ltype.str, ltype.str, ltype.str, ltype.str, ltype.int, ltype.int)
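
If you would rather stay in pandas, the leading zeros can be kept in a similar way by passing dtype to read_csv. This is only a sketch: the tab separator and the in-memory sample are assumptions, not taken from the actual file.

```python
import io
import pandas as pd

# Hypothetical tab-delimited sample with zero-padded FIPS codes.
sample = "FIPStxt\tState\n01000\tAL\n01001\tAL\n"

# Without a dtype, pandas infers integers and drops the leading zero.
as_int = pd.read_csv(io.StringIO(sample), sep="\t")
print(as_int["FIPStxt"].tolist())

# Forcing the column to str keeps the codes exactly as written.
as_str = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"FIPStxt": str})
print(as_str["FIPStxt"].tolist())
```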