read_csv columns encoding

Question

I am quite new to Python.

I am trying to automate some data analysis of building energy consumption data using Python. I am using Python 2.7.3, pandas 0.12, and Canopy with qtconsole.

These are the steps I am following:

Paste the data from my simulation software into Excel
Export to csv from Excel
Import the csv into a pandas dataframe
Perform my analysis

In the interactive console I write the following code

import pandas as pd
rooms = pd.read_csv('IES Results - Rooms.csv', index_col='Room # (Real)')
systems = pd.read_csv('IES Results - Systems.csv',index_col='Room #')
all_values = pd.concat([rooms,systems],axis=1)
all_values = all_values.T.drop_duplicates().T
columns = [u'Room ID',u'Room Name',u'Floor Area (m²) (Real)',u'Volume (m³) (Real)']
selected_values = all_values[columns]

Unfortunately I get the following error

KeyError: "[u'Floor Area (m\\xb2) (Real)' u'Volume (m\\xb3) (Real)'] not in index"

As you can see all the columns with a superscript are not interpreted correctly and they cannot be found in the dataframe.

When I write

all_values.columns

The column headers are displayed correctly in the IPython console. I then copy and paste the values I am interested in to create the 'columns' list that I pass to 'selected_values = all_values[columns]'.

I have done quite a bit of research, but I cannot get my head around it.

I have tried specifying various encodings, but I do not really understand what is happening.

I have been stuck for more than a day.

Can you please help?

To get from an encoding (in the csv file) to Unicode (in the program) you need to use the encoding keyword argument to pd.read_csv pandas.pydata.org/pandas-docs/stable/generated/…
– mechanical_meat, Commented Feb 1, 2014 at 7:06
Hi, I'd like to try and reproduce this, but I can't get Excel to export to CSV with superscripts. Can you look at your CSV file and let me know how the superscripts are displayed there?
– LondonRob, Commented Feb 1, 2014 at 18:34
Hi, I just 'save as' csv. I have tried to open the file with Notepad++ and I see the superscripts. I am using Excel 2010 in Windows 7 64bit.
– Rojj, Commented Feb 2, 2014 at 8:04
I think that the problem is the copy and paste that I use to create the 'columns' list. If I print the list after I have created it I lose the superscripts.
– Rojj, Commented Feb 2, 2014 at 8:25
@bernie I have tried to use the keyword, but there is no difference. I have saved the csv in UTF-8 from Libre and used encoding='UTF-8'.
– Rojj, Commented Feb 2, 2014 at 12:24
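
As a small illustration of the encoding point raised in the comments above, the sketch below decodes the same header bytes with two different codecs. The header text is taken from the question. Using cp1252 as the mismatched codec is only an assumption for illustration, since the real encoding of the file depends on how it was saved.

```python
# Illustrative sketch: decode the same bytes with two different codecs.
# The header text comes from the question; cp1252 is an assumed example
# of a mismatched codec, not necessarily what Excel wrote.
header_bytes = u'Floor Area (m\xb2) (Real)'.encode('utf-8')

as_utf8 = header_bytes.decode('utf-8')      # matching codec: superscript survives
as_cp1252 = header_bytes.decode('cp1252')   # mismatched codec: an extra character appears

print(repr(as_utf8))
print(repr(as_cp1252))
```

The encoding keyword of pd.read_csv selects the codec used for this decoding step when the file is read.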

Svend Feldt · Accepted Answer · 2014-02-01 07:25:06Z

OK, if I were doing something like this, I would:

1) Get rid of Excel. Do you need it? Why does your simulation program not dump the data itself? If it can't, then instead of pasting into Excel, paste it into a txt file and parse it from Python.

2) Get rid of the superscripts. Do you really need them? I would remove them, at least in the analysis stage, and restore them when some sort of presentation is needed.

4 Comments

Rojj Over a year ago

Thanks for your answer. I have tried to paste the content into a csv file created with Notepad++. Same result.

Svend Feldt Over a year ago

Can we see an example of the file content? Did you try stripping the superscript?

Rojj Over a year ago

No, I haven't. I want to minimize manual intervention on the file. This is the file: dropbox.com/s/p0tgb4hzg9aokis/IES%20Results%20-%20SO.csv

Svend Feldt Over a year ago

I would strip them in Python prior to using pandas.
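
A minimal sketch of what stripping the superscripts in Python could look like, assuming the header names have already been decoded to unicode text. The helper name plain_header is hypothetical. It relies on Unicode NFKC normalization from the standard library, which is one possible approach and not necessarily what the commenter had in mind.

```python
import unicodedata

def plain_header(name):
    """Replace compatibility characters such as superscript digits with plain ones."""
    # NFKC maps U+00B2 (superscript two) to '2' and U+00B3 (superscript three) to '3'.
    return unicodedata.normalize('NFKC', name)

# Sample headers taken from the question, written with escapes to stay ASCII-safe.
headers = [u'Room ID', u'Floor Area (m\xb2) (Real)', u'Volume (m\xb3) (Real)']
print([plain_header(h) for h in headers])
```

After this step the names contain only ASCII characters, so they can be typed or pasted without any encoding concerns.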

Rojj · Accepted Answer · 2014-02-04 02:53:29Z

After a bit of testing I have discovered that the problem was in the copy and paste that I was using to create 'columns'.

Compare the difference.

This 'columns' is created with copy and paste

columns_cp = [u'Room ID',u'Room Name',u'Floor Area (m²) (Real)',u'Volume (m³) (Real)']
columns_cp
Out[29]: 
[u'Room ID',
 u'Room Name',
 u'Floor Area (m\xb2) (Real)',
 u'Volume (m\xb3) (Real)']

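As you can see, the pasted names are unicode strings in which the superscripts are the single code points \xb2 and \xb3. They do not match the column names stored in the dataframe.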
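
One way to read this mismatch, offered as a sketch and not a definitive diagnosis: the pasted name is unicode text, while a name read from a UTF-8 file without decoding is a sequence of bytes, and the two do not compare equal. The variable names pasted and from_file are hypothetical stand-ins.

```python
# Hypothetical stand-ins: 'pasted' mimics an entry of columns_cp,
# 'from_file' mimics the same header as raw UTF-8 bytes.
pasted = u'Floor Area (m\xb2) (Real)'
from_file = pasted.encode('utf-8')

print(repr(from_file))                        # the superscript is now two bytes
print(from_file == pasted)                    # bytes and text do not match
print(from_file.decode('utf-8') == pasted)    # decoding first makes them match
```

Decoding the bytes (or encoding the pasted text) puts both sides in the same form before they are compared.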

A better way to do it is the following

columns = [all_values.columns.tolist()[i] for i in [0,1,8,11]]

where 0, 1, 8 and 11 are the positions of the columns I am interested in. This is the output of columns:

columns
Out[30]: 
['Room ID',
 'Room Name',
 'Floor Area (m\xc2\xb2) (Real)',
 'Volume (m\xc2\xb3) (Real)']

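As you can see, these names are the UTF-8 encoded byte strings (\xc2\xb2 and \xc2\xb3) exactly as they are stored in the dataframe, so the lookup succeeds: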

selected_values = all_values[columns]
In [32]: selected_values
Out[32]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 288 entries, 0 to 457
Data columns (total 4 columns):
Room ID                   288  non-null values
Room Name                 288  non-null values
Floor Area (m²) (Real)    288  non-null values
Volume (m³) (Real)        288  non-null values
dtypes: object(4)

It works as I expected and the superscripts are not lost.
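
If hard-coded positions feel fragile, a possible alternative is to look the names up by their decoded text. The sketch below uses only the standard library; the helpers to_text and match_columns are hypothetical, and the sample list is a stand-in for all_values.columns assuming the names are UTF-8 byte strings as in Out[30].

```python
def to_text(name, encoding='utf-8'):
    """Return name as unicode text, decoding it first if it is a byte string."""
    return name.decode(encoding) if isinstance(name, bytes) else name

def match_columns(available, wanted, encoding='utf-8'):
    """Map each wanted name onto the matching original label in available."""
    lookup = dict((to_text(label, encoding), label) for label in available)
    # A KeyError here means a wanted name has no counterpart in available.
    return [lookup[to_text(name, encoding)] for name in wanted]

# Stand-in for all_values.columns: UTF-8 byte strings, as in Out[30].
available = [b'Room ID', b'Room Name', b'Floor Area (m\xc2\xb2) (Real)']
wanted = [u'Room ID', u'Floor Area (m\xb2) (Real)']
print(match_columns(available, wanted))
```

With the dataframe from the question, this would be called as match_columns(all_values.columns, columns_cp), and the returned labels could then be passed to all_values[...].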

Cheers
