
My goal with this script is to:

1. Read timeseries data in from an Excel file (>100,000 rows), along with the header rows (Labels, Units).
2. Convert the Excel numeric dates to the best datetime object for a pandas DataFrame.
3. Be able to use timestamps to reference rows and the series labels to reference columns.

So far I have used xlrd to read the Excel data into lists, made a pandas Series from each list with the time list as the index, combined the Series with the series headers into a Python dictionary, and passed the dictionary to a pandas DataFrame. Despite my efforts, the df.index seems to be set to the column headers, and I'm not sure at what point to convert the dates into datetime objects.

I just started using Python 3 days ago, so any advice would be great! Here's my code:

    import xlrd
    import pandas as pd
    from pandas import Series

    #Open excel workbook and first sheet
    wb = xlrd.open_workbook(r"C:\GreenCSV\Calgary\CWater.xlsx")
    sh = wb.sheet_by_index(0)

    #Read rows containing labels and units
    Labels = sh.row_values(1, start_colx=0, end_colx=None)
    Units = sh.row_values(2, start_colx=0, end_colx=None)

    #Initialize list to hold data
    Data = [None] * (sh.ncols)

    #read column by column and store in list
    for colnum in range(sh.ncols):
        Data[colnum] = sh.col_values(colnum, start_rowx=5, end_rowx=None)

    #Delete unnecessary rows and columns
    del Labels[3],Labels[0:2], Units[3], Units[0:2], Data[3], Data[0:2]   

    #Create Pandas Series
    s = [None] * (sh.ncols - 4)
    for colnum in range(sh.ncols - 4):
        s[colnum] = Series(Data[colnum+1], index=Data[0])

    #Create Dictionary of Series
    dictionary = {}
    for i in range(sh.ncols-4):
        dictionary[i]= {Labels[i] : s[i]}

    #Pass Dictionary to Pandas DataFrame
    df = pd.DataFrame.from_dict(dictionary)
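
For the date-conversion step, this is roughly what I was planning to try next. It is only a sketch: it assumes the values in Data[0] are Excel serial date numbers and leans on xlrd's xldate_as_tuple, which I haven't verified against my actual file yet:

    #Sketch: convert Excel serial date numbers to datetime objects
    from datetime import datetime

    def to_datetimes(serials, datemode):
        #xldate_as_tuple turns an Excel serial number into a
        #(year, month, day, hour, minute, second) tuple
        return [datetime(*xlrd.xldate_as_tuple(v, datemode)) for v in serials]

    timestamps = to_datetimes(Data[0], wb.datemode)
    #timestamps could then replace Data[0] as the Series index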
3 Comments
    Did you try pd.read_excel? (pandas.pydata.org/pandas-docs/dev/io.html) Commented Jul 17, 2013 at 22:53
  • Thanks for your comment! I'll give it a shot, but if it's anything like pd.read_csv I will need to use code like this, because pd.read_csv only seems to work properly when there is only one line of column headers and no blanks before the data. Commented Jul 17, 2013 at 23:23
  • You can skip the second row with the 'skiprows' option. IMO it's definitely worthwhile to look at the options for pd.read_csv (especially skiprows, skipinitialspace, parse_dates); a rough sketch of that is below. Commented Jul 17, 2013 at 23:31
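
Following up on that suggestion, here is a rough sketch of what pd.read_csv might look like for this layout. The CSV filename, the exact skiprows offsets, and the assumption that the exported CSV writes the timestamps as date strings (rather than Excel serial numbers) are all guesses based on the xlrd code above, not taken from the actual file:

    import pandas as pd

    #Sketch: labels are in the second row of the file, units in the third,
    #and the data starts a few rows later, so skip everything except the
    #labels row; parse the first column as dates and use it as the index
    df = pd.read_csv(r"C:\GreenCSV\Calgary\CWater.csv",
                     skiprows=[0, 2, 3, 4],   #title/units/blank rows (guessed offsets)
                     header=0,                #labels row becomes the column names
                     index_col=0,             #first column holds the timestamps
                     parse_dates=True)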

1 Answer


You can use pandas directly here; I usually like to create a dictionary of DataFrames (with the keys being the sheet names):

In [11]: xl = pd.ExcelFile(r"C:\GreenCSV\Calgary\CWater.xlsx")

In [12]: xl.sheet_names  # in your example it may be different
Out[12]: [u'Sheet1', u'Sheet2', u'Sheet3']

In [13]: dfs = {sheet: xl.parse(sheet) for sheet in xl.sheet_names}

In [14]: dfs['Sheet1'] # access DataFrame by sheet name

You can check out the docs on parse, which offers some more options (for example skiprows); these allow you to parse individual sheets with much more control...
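
For example, with a layout like the one in the question (labels in the second row, units in the third, and the data starting a few rows down), something along these lines might work; the skiprows offsets below are guesses for illustration, and whether the timestamp column comes through as datetimes depends on how the cells are formatted in the workbook:

In [15]: df = xl.parse('Sheet1', skiprows=[0, 2, 3, 4], index_col=0)  # keep the labels row, drop units/blank rows, index by the first column

In [16]: df.index  # if the timestamp cells are real Excel dates, this should already be datetime-like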


2 Comments

Thanks for your answer, but this seems very slow. I am only loading one sheet (~90k rows) and it takes about 40 seconds; in MATLAB, a comparable xlsread() call, which uses COM, takes about 10 seconds. Also, I will have 9 workbooks like this to load. Is there a faster way to do this using COM in Python? Anyone using this will certainly have Excel and will be on Windows 7.
I converted my file to CSV and am using pd.read_csv; it still takes about 35 seconds. Still very slow, but I guess that might be because of the datetime conversion.
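
For reference, one way to go through COM from Python is pywin32. The following is only a minimal sketch, assuming Excel and the win32com package are installed; it hasn't been timed against xlrd or read_csv here:

    #Sketch: pull the whole used range through COM in a single call (pywin32)
    import win32com.client

    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False
    wb = excel.Workbooks.Open(r"C:\GreenCSV\Calgary\CWater.xlsx")
    try:
        sheet = wb.Sheets(1)
        values = sheet.UsedRange.Value   #tuple of row tuples, fetched in one COM round trip
    finally:
        wb.Close(False)
        excel.Quit()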
