Reading Multiple CSV Files into Python Pandas Dataframe

Question

The general use case is to read multiple CSV log files from a target directory into a single pandas DataFrame for quick-turnaround statistical analysis and charting. The reason for using pandas rather than MySQL is that this import (or append) and the statistical analysis will be run periodically throughout the day.

The script below attempts to read all of the CSV files (which share the same layout) into a single pandas DataFrame and adds a year column for each file read.

The problem is that the script now reads only the very last file in the directory, rather than all files in the target directory as intended.

# Assemble all of the data files into a single DataFrame & add a year field
# 2010 is the last available year
years = range(1880, 2011)

for year in years:
    path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenates everything into a single Dataframe
names = pd.concat(pieces, ignore_index=True)

# Expected row total should be 1690784
names
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33838 entries, 0 to 33837
Data columns:
name      33838  non-null values
sex       33838  non-null values
births    33838  non-null values
year      33838  non-null values
dtypes: int64(2), object(2)

# Start aggregating the data at the year & gender level using groupby or pivot
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
# Prints pivot table
total_births.tail()

Out[35]:
sex     F   M
year        
2010    1759010     1898382

What type of object is pieces? Is it a list or a DataFrame?
– Greg Reda, Commented Apr 5, 2013 at 21:17

Greg Reda · Accepted Answer · 2013-04-05 21:36:06Z

13

The append method of a DataFrame does not behave like the append method of a list. DataFrame.append() does not modify the object in place; it returns a new object, so the result has to be assigned back:

years = range(1880, 2011)

names = pd.DataFrame()
for year in years:
    path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    names = names.append(frame, ignore_index=True)

Or you can use concat, which takes a list of the objects to join:

years = range(1880, 2011)

names = pd.DataFrame()
for year in years:
    path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    names = pd.concat([names, frame], ignore_index=True)
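Both loops above grow the result one frame at a time, and DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the first variant only runs on older versions. A minimal sketch of the alternative, assuming pieces is a plain Python list and the files follow the yob<year>.txt naming from the question: collect the frames in the list and call pd.concat once at the end. The helper name load_names and the sample rows are made up for illustration and are not from the original post.

```python
import os
import tempfile

import pandas as pd

# Column names as used in the answers in this thread.
COLUMNS = ["name", "sex", "births"]


def load_names(directory, years, columns=COLUMNS):
    """Read yob<year>.txt for each year and return one combined DataFrame.

    Hypothetical helper for illustration only.
    """
    pieces = []  # a plain list, so .append() mutates it in place
    for year in years:
        path = os.path.join(directory, "yob%d.txt" % year)
        frame = pd.read_csv(path, names=columns)
        frame["year"] = year
        pieces.append(frame)
    # A single concat at the end avoids re-copying the accumulated rows
    # on every iteration.
    return pd.concat(pieces, ignore_index=True)


# Tiny invented sample files so the sketch runs without the book's data set.
with tempfile.TemporaryDirectory() as tmp:
    samples = [(1880, "Ann,F,10\nBob,M,12\n"), (1881, "Ann,F,11\n")]
    for year, rows in samples:
        with open(os.path.join(tmp, "yob%d.txt" % year), "w") as fh:
            fh.write(rows)
    names = load_names(tmp, range(1880, 1882))

print(names)
```

With the real data, the call would presumably be load_names(<your names directory>, range(1880, 2011)).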

edited Apr 5, 2013 at 21:36

answered Apr 5, 2013 at 21:30

Greg Reda


2 Comments

user892627 Over a year ago

Thanks, @gjreda. I used the first method you provided, and the outcome was exactly as desired.

user892627 Over a year ago

In [3]: # Expected row total should be 1690784
        names
Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns:
births    1690784  non-null values
name      1690784  non-null values
sex       1690784  non-null values
year      1690784  non-null values
dtypes: int64(2), object(2)

cromastro · Answer · 2013-08-05 (edited by ljs.dev, 2013-09-14 04:51:19Z)

0

I could not get either of the above answers to work. The first answer was close, but the blank line between the second and third lines after the for wasn't right. I used the code snippet below in Canopy. For those who are interested, this problem comes from an example in "Python for Data Analysis" (an enjoyable book so far otherwise).

import pandas as pd

years = range(1880,2011)
columns = ['name','sex','births']
names = pd.DataFrame()

for year in years:
    path = 'C:/PythonData/pydata-book-master/pydata-book-master/ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    names = names.append(frame,ignore_index=True)
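If the goal is every file in a directory rather than a fixed range of years, the same loop can be driven by glob instead of range. This is only a sketch under assumptions: it assumes each file name contains a four-digit year, as in yob1880.txt, and the helper name load_directory, its pattern default and the sample rows are hypothetical.

```python
import glob
import os
import re
import tempfile

import pandas as pd


def load_directory(directory, pattern="yob*.txt", columns=("name", "sex", "births")):
    """Concatenate every file matching `pattern`, tagging rows with the
    four-digit year found in the file name. Hypothetical helper."""
    pieces = []
    # sorted() keeps the row order stable across platforms.
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        match = re.search(r"(\d{4})", os.path.basename(path))
        if match is None:
            continue  # skip files whose name carries no four-digit year
        frame = pd.read_csv(path, names=list(columns))
        frame["year"] = int(match.group(1))
        pieces.append(frame)
    if not pieces:
        # Nothing matched: return an empty frame with the expected columns.
        return pd.DataFrame(columns=list(columns) + ["year"])
    return pd.concat(pieces, ignore_index=True)


# Invented sample files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    for year in (2009, 2010):
        with open(os.path.join(tmp, "yob%d.txt" % year), "w") as fh:
            fh.write("Ann,F,10\nBob,M,12\n")
    combined = load_directory(tmp)

print(combined.groupby("year")["births"].sum())
```

Because new log files are picked up on every call, re-running the function periodically fits the import-then-analyse workflow described in the question.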

edited Sep 14, 2013 at 4:51

ljs.dev


answered Aug 5, 2013 at 1:08

cromastro


1 Comment

scharfmn Over a year ago

The example is on pp. 33-34 of Python for Data Analysis, and it uses pd.concat.

user3290447 · Answer · 2014-02-09 20:16:46Z

-3

Remove the blank line between:

    frame = pd.read_csv(path, names=columns)

and

    frame['year'] = year

so that it reads:

    for year in years:
        path ='C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
        frame = pd.read_csv(path, names=columns)
        frame['year'] = year
        names = pd.concat([names, frame], ignore_index=True)

answered Feb 9, 2014 at 20:16

user3290447


1 Comment

DSM Over a year ago

The blank line has no effect in Python code. It could only have an effect if you were pasting lines into a console or something.
