As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.
To start, here is my basic class definition for creating a new pandas.DataFrame from a .csv file:
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)
Now, this works pretty well, and calling the class in my __main__.py successfully creates a pandas DataFrame:
from dataMatrix import dataMatrix
testObject = dataMatrix('/path/to/csv/file')
But I noticed that this process automatically used the first row of the .csv as the DataFrame's column labels (pandas.DataFrame.columns). Instead, I wanted the columns to be numbered. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a DataFrame, counting the columns, and then reloading the DataFrame with the proper number of column names using range().
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)
        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))
Keeping my processing in __main__.py the same, I got back a DataFrame with the correct number of columns (500 in this case) and the proper names (0...499), but it was otherwise empty (no row data).
Scratching my head, I decided to close self.csvfile and reload it like so:
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)
        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Close the .csv file.         #<---- +++++++
        self.csvfile.close()           #<---- Added
        # Re-open file.                #<---- Block
        self.csvfile = open(filepath)  #<---- +++++++
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))
After closing the file and re-opening it, I correctly got back a pandas DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.
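For comparison, here is a minimal sketch of a variant that rewinds the same file object with seek(0) instead of closing and re-opening it. It assumes the open file object keeps its read position between the two read_csv calls, which I have not confirmed is the whole story; the tiny temporary .csv and the variable names are made up for illustration and are not my real data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical 2-row, 3-column CSV written to a temporary file (illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2,3\n4,5,6\n")
    tmp_path = tmp.name

with open(tmp_path) as csvfile:
    # First pass: load once just to count the columns.
    first = pd.read_csv(csvfile)
    numcolumns = len(first.columns)

    # Rewind the same handle to the start of the file instead of
    # closing and re-opening it.
    csvfile.seek(0)

    # Second pass: number the columns; every row is now treated as data.
    second = pd.read_csv(csvfile, names=range(numcolumns))

os.remove(tmp_path)
print(second.shape)
```

If the rewind behaves the same as the close/re-open block, that would suggest the second read_csv call in my earlier version simply started from wherever the first call left off.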
My question is: why does closing the file and re-opening it make a difference?
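As a side note on the original goal of numbered columns, a single read with header=None may be enough, since read_csv then labels the columns with integers and treats the first row as data. This is only a sketch on an in-memory stand-in for the file (the csv_text string is invented for illustration), not the class above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "1,2,3\n4,5,6\n"

# header=None tells read_csv that the file has no header row, so the
# columns are labelled 0, 1, 2, ... and the first row is kept as data.
df = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(df.columns))
print(df.shape)
```

This avoids reading the file twice, so the question of where the file object is positioned would not come up at all.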