
As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.

To start, here is my basic class definition for creating a new pandas.DataFrame from a .csv file:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

Now, this works pretty well, and calling the class in my __main__.py successfully creates a pandas DataFrame:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

But I noticed that this process automatically set the first row of the .csv as the pandas.DataFrame.columns index. Instead, I decided to number the columns. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a dataframe, counting the columns, and then reloading the dataframe with the proper column names generated by range().

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their 
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

Keeping my processing in __main__.py the same, I got back a dataframe with the correct number of columns (500 in this case) with proper names (0...499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and reload it like so:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.         #<---- +++++++
        self.csvfile.close()           #<----  Added
        # Re-open file.                #<----  Block
        self.csvfile = open(filepath)  #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

Closing the file and re-opening it correctly returned a pandas.DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.

My question is why does closing the file and re-opening it make a difference?

1 Answer


When you open a file with

open(filepath)

a file object is returned. A file object is an iterator, and an iterator is good for only one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

reads the contents and exhausts the iterator. Subsequent calls to pd.read_csv then find the iterator empty.
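This one-pass behavior is easy to see without pandas at all; here is a minimal sketch using io.StringIO as a stand-in for the open file handle:

```python
import io

# io.StringIO behaves like an open text file: reading consumes it,
# and a second read finds nothing left.
buf = io.StringIO("a,b,c\n1,2,3\n")

first = buf.read()   # consumes the entire buffer
second = buf.read()  # the handle is now exhausted

print(len(first))   # 12 (all the characters)
print(len(second))  # 0  (nothing left to read)
```

pd.read_csv behaves the same way when handed an already-consumed file object: there is simply no data left for it to parse.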

Note that you could avoid this problem by just passing the file path to pd.read_csv:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)


        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath, 
                                        names=range(self.numcolumns))    

pd.read_csv will then open (and close) the file for you.

PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.
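A minimal sketch of the seek(0) variant, using io.StringIO in place of a real file for illustration:

```python
import io
import pandas as pd

buf = io.StringIO("10,20\n30,40\n")

df1 = pd.read_csv(buf, header=None)  # first pass consumes the buffer
buf.seek(0)                          # rewind to the start of the "file"
df2 = pd.read_csv(buf, header=None)  # second pass sees the same data again

print(df2.shape)  # (2, 2)
```

The rewind works because seek(0) moves the file position back to the beginning, so the next read starts over from the first byte.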


Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        self.numcolumns = len(self.csvdataframe.columns)
        self.csvdataframe.columns = range(self.numcolumns)

3 Comments

Thanks for the info about the file iterator. That makes sense. I will make a change to pass the 'filepath' instead of the open file. However, renaming the columns as you suggest at the end will replace the column names, meaning I lose the first row of data.
Then add header=None so the first row of data will be part of the data, not interpreted as column names.
Ah yes, I forgot about header=None...I am having issues getting that to work but that is a separate issue. Thanks for answering my original question! I was just curious about the lower-level, "behind-the-scenes" interaction that was causing the 'openfile' behavior. Thanks!
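For reference, a minimal sketch of the header=None behavior mentioned above: pandas keeps the first row as data and numbers the columns 0..n-1 itself, which also removes the need to read the file twice.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with no header row.
buf = io.StringIO("1,2,3\n4,5,6\n")

df = pd.read_csv(buf, header=None)  # first row stays as data
print(df.columns.tolist())  # [0, 1, 2] -- auto-numbered columns
print(len(df))              # 2 rows, none lost to a header
```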
