2

I am quite new to Python.

I am trying to automate some analysis of building energy consumption data using Python. I am using Python 2.7.3, pandas 0.12, and Canopy with qtconsole.

These are the steps I am following:

  1. Paste the data from my simulation software into Excel
  2. Export to CSV from Excel
  3. Import the CSV into a pandas DataFrame
  4. Perform my analysis

In the interactive console I write the following code:

import pandas as pd
rooms = pd.read_csv('IES Results - Rooms.csv', index_col='Room # (Real)')
systems = pd.read_csv('IES Results - Systems.csv',index_col='Room #')
all_values = pd.concat([rooms,systems],axis=1)
all_values = all_values.T.drop_duplicates().T
columns = [u'Room ID',u'Room Name',u'Floor Area (m²) (Real)',u'Volume (m³) (Real)']
selected_values = all_values[columns]

Unfortunately I get the following error:

KeyError: "[u'Floor Area (m\\xb2) (Real)' u'Volume (m\\xb3) (Real)'] not in index"

As you can see, the columns with a superscript in their name are not interpreted correctly and cannot be found in the DataFrame.

When I write

all_values.columns

The column headers are displayed correctly in the IPython console. I then copy and paste the values I am interested in to create the 'columns' list that I pass to 'selected_values = all_values[columns]'.

I have done quite a bit of research, but I cannot get my head around it.

I have tried specifying various encodings, but I do not really understand what is happening.

I have been stuck for more than a day.

Can you please help?

5
  • To get from an encoding (in the csv file) to Unicode (in the program) you need to use the encoding keyword argument to pd.read_csv pandas.pydata.org/pandas-docs/stable/generated/… Commented Feb 1, 2014 at 7:06
  • Hi, I'd like to try and reproduce this, but I can't get Excel to export to CSV with superscripts. Can you look at your CSV file and let me know how the superscripts are displayed there? Commented Feb 1, 2014 at 18:34
  • Hi, I just 'save as' csv. I have tried to open the file with Notepad++ and I see the superscripts. I am using Excel 2010 in Windows 7 64bit Commented Feb 2, 2014 at 8:04
  • I think that the problem is the copy and paste that I use to create the 'columns' list. If I print the list after I have created it I lose the superscripts. Commented Feb 2, 2014 at 8:25
  • @bernie I have tried to use the keyword, but there is no difference. I have saved the csv in UTF-8 from Libre and used encoding='UTF-8'. Commented Feb 2, 2014 at 12:24

2 Answers

1

OK, if I were doing something like this:

1) Get rid of Excel. Do you need it? Why does your simulation program not dump the data itself? If it can't, then instead of pasting into Excel, paste it into a text file and parse it from Python.

2) Get rid of the superscripts. Do you really need them? I would remove them, at least at the analysis stage, and restore them when some sort of presentation is needed.


4 Comments

Thanks for your answer. I have tried to paste the content into a CSV file created with Notepad++. Same result.
Can we see an example of the file content? Did you try stripping the superscript?
No, I haven't. I want to minimize manual intervention on the file. This is the file: dropbox.com/s/p0tgb4hzg9aokis/IES%20Results%20-%20SO.csv
I would strip them in Python prior to using pandas.
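
Stripping the superscripts in Python, as suggested in the last comment, could look roughly like the sketch below. `strip_superscripts` and `SUPERSCRIPTS` are hypothetical names, not pandas functions, and the sketch assumes that any header arriving as a byte string is UTF-8 encoded.

```python
# Hypothetical helper (not part of pandas): maps superscript digits to plain ones.
SUPERSCRIPTS = {u'\xb2': u'2', u'\xb3': u'3'}


def strip_superscripts(label, encoding='utf-8'):
    """Return the label as text with superscript 2 and 3 replaced by plain digits."""
    # Byte strings (e.g. headers read from a UTF-8 file) are decoded first.
    if isinstance(label, bytes):
        label = label.decode(encoding)
    for sup, plain in SUPERSCRIPTS.items():
        label = label.replace(sup, plain)
    return label


# Sample headers modelled on the question: one text label and one UTF-8 byte label.
headers = [u'Room ID', u'Floor Area (m\xb2) (Real)', b'Volume (m\xc2\xb3) (Real)']
cleaned = [strip_superscripts(h) for h in headers]
print(cleaned)
```

The cleaned list could then be assigned back to the DataFrame's `columns` attribute, so that later selections only need plain ASCII labels.
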
1

After a bit of testing I discovered that the problem was in the copy and paste I was using to create 'columns'.

Compare the two versions.

This 'columns' list is created with copy and paste:

columns_cp = [u'Room ID',u'Room Name',u'Floor Area (m²) (Real)',u'Volume (m³) (Real)']
columns_cp
Out[29]: 
[u'Room ID',
 u'Room Name',
 u'Floor Area (m\xb2) (Real)',
 u'Volume (m\xb3) (Real)']

As you can see, the superscripts now appear as the escapes \xb2 and \xb3. These are Unicode strings, and they do not match the column labels stored in the DataFrame.
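
A minimal sketch of why the pasted list fails, assuming the CSV is UTF-8 encoded: the superscript two is a single code point (U+00B2) in a Unicode string but two bytes (\xc2\xb2) in UTF-8, so a text label and a byte label only match once both are converted to the same type. The variable names are illustrative only.

```python
# One code point in text, two bytes in UTF-8.
text_label = u'Floor Area (m\xb2) (Real)'       # what the pasted u'...' literal holds
byte_label = b'Floor Area (m\xc2\xb2) (Real)'   # what a UTF-8 encoded file holds

encoded = text_label.encode('utf-8')   # text -> bytes
decoded = byte_label.decode('utf-8')   # bytes -> text

print(encoded == byte_label)   # True: same label once both are bytes
print(decoded == text_label)   # True: same label once both are text
print(len(text_label))         # length in code points
print(len(byte_label))         # length in bytes, one more than above
```

Comparing `text_label` with `byte_label` directly evaluates to False, which is consistent with the KeyError in the question.
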

A better way to do it is to take the labels directly from the DataFrame:

columns = [all_values.columns.tolist()[i] for i in [0, 1, 8, 11]]

where 0, 1, 8 and 11 are the positions of the columns I am interested in. This is the output of columns:

columns
Out[30]: 
['Room ID',
 'Room Name',
 'Floor Area (m\xc2\xb2) (Real)',
 'Volume (m\xc2\xb3) (Real)']

As you can see, these labels keep the UTF-8 bytes (\xc2\xb2, \xc2\xb3) exactly as stored in the DataFrame, so the selection works:

selected_values = all_values[columns]
In [32]: selected_values
Out[32]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 288 entries, 0 to 457
Data columns (total 4 columns):
Room ID                   288  non-null values
Room Name                 288  non-null values
Floor Area (m²) (Real)    288  non-null values
Volume (m³) (Real)        288  non-null values
dtypes: object(4)

It works as I expected, and the superscripts are not lost.
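
An alternative to selecting by position, sketched here under the assumption that the labels are UTF-8 byte strings as in Out[30], is to decode every label to text once so that a list written with u'...' literals can be matched by name. `to_text` is a hypothetical helper, not a pandas function.

```python
def to_text(label, encoding='utf-8'):
    """Decode a byte-string label to text; leave text labels unchanged."""
    if isinstance(label, bytes):
        return label.decode(encoding)
    return label


# Labels as they appear in Out[30] above (UTF-8 byte strings).
raw_labels = [b'Room ID', b'Room Name',
              b'Floor Area (m\xc2\xb2) (Real)', b'Volume (m\xc2\xb3) (Real)']
text_labels = [to_text(label) for label in raw_labels]

# The list from the question, written as Unicode literals.
wanted = [u'Room ID', u'Room Name',
          u'Floor Area (m\xb2) (Real)', u'Volume (m\xb3) (Real)']

# Any name in `wanted` that still cannot be found after decoding.
missing = [name for name in wanted if name not in text_labels]
print(missing)
```

With a DataFrame, this would amount to assigning the decoded list to `all_values.columns` before selecting by name, which avoids hard-coding column positions that may change between exports.
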

Cheers
