python, pandas and importing multiple csv's into a dataframe

Question

My code grabs multiple CSV files from a directory and puts all the data into a DataFrame that I call "df". Every CSV has the same format, but the files can have different lengths, so this is what I want to do:

I want a column in my df (DataFrame) that marks the second-to-last row of each CSV I pull in before it moves on to the next one. I have modified the output below to show what I mean. Suppose I call this column BeforeLast. A value of 0 means the row is not the second-to-last row of its CSV; a value of 1 means it is.

How can I do this as Python is pulling in each CSV?

import pandas as pd
import glob
import os


path =r'X:\PublicFiles\TradingData\CSV\RealMarkets\Weekly\Futures\Contracts\Corn C'
allFiles = glob.glob(path + "/*.csv")  ##'*' means any file name can be grabbed
df = pd.DataFrame()
list_ = []

for file_ in allFiles:
    names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']
    df = pd.read_csv(file_, index_col = None, names = names)
    list_.append(df)
frame = pd.concat(list_)

Here is a sample of my current dataFrame (df)

    Date       Open    High     Low   Close   Vol  OI
0   20141212  427.00  427.00  427.00  427.00    0   0
1   20141219  429.00  429.00  424.00  424.00    0   0
2   20141226  424.00  425.00  423.00  425.00    0   0
3   20150102  422.75  422.75  417.50  417.50    0   0

This is what I want

    Date       Open    High     Low   Close   Vol  OI  BeforeLast
0   20141212  427.00  427.00  427.00  427.00    0   0  0
1   20141219  429.00  429.00  424.00  424.00    0   0  0
2   20141226  424.00  425.00  423.00  425.00    0   0  1
3   20150102  422.75  422.75  417.50  417.50    0   0  0 (this is the last piece of data in this csv and now it moves on to the next)
4   20141226  424.00  425.00  423.00  425.00    0   0  0
5   20150102  422.75  422.75  417.50  417.50    0   0  0
6   20141226  424.00  425.00  423.00  425.00    0   0  1
7   20150102  422.75  422.75  417.50  417.50    0   0  0

user3602063 · Accepted Answer · 2015-09-09 19:58:24Z

2

Try this. You do not need a list; just append each file's rows to the original data frame.

.iloc[-2, -1] selects the second-to-last row, last column. The new column was just added, so it is the last one.

I added an index reset because in my test I ran into duplicate index numbers.

import pandas as pd
import glob


path = r'X:\PublicFiles\TradingData\CSV\RealMarkets\Weekly\Futures\Contracts\Corn C'
allFiles = glob.glob(path + "/*.csv")  # '*' matches any file name
names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']
df = pd.DataFrame()

for file_ in allFiles:
    df_temp = pd.read_csv(file_, index_col=None, names=names)
    df_temp['beforelast'] = 0   # new column, added as the last column
    df_temp.iloc[-2, -1] = 1    # second-to-last row, last column (needs at least 2 rows)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df_temp])

# each file restarts its index at 0, so renumber the combined rows
df = df.reset_index(drop=True)
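
A minimal sketch of the same idea with a guard for very short files, offered as an illustration and not as part of the original answer. The helper name `read_flagged` is hypothetical, and the two in-memory "files" are made-up stand-ins for the paths returned by glob.glob. Unlike the loop above, it collects the per-file frames and concatenates them once.

```python
import io
import pandas as pd

NAMES = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']

def read_flagged(source, names=NAMES):
    """Read one CSV and flag its second-to-last row (hypothetical helper)."""
    df_temp = pd.read_csv(source, index_col=None, names=names)
    df_temp['BeforeLast'] = 0
    if len(df_temp) >= 2:  # a one-row file has no second-to-last row
        df_temp.iloc[-2, df_temp.columns.get_loc('BeforeLast')] = 1
    return df_temp

# Made-up sample data standing in for two CSV files on disk
file_a = io.StringIO(
    "20141212,427,427,427,427,0,0\n"
    "20141219,429,429,424,424,0,0\n"
    "20141226,424,425,423,425,0,0\n"
)
file_b = io.StringIO("20150102,422.75,422.75,417.5,417.5,0,0\n")

# ignore_index=True renumbers the rows, so no separate reset_index is needed
frame = pd.concat([read_flagged(f) for f in (file_a, file_b)], ignore_index=True)
print(frame['BeforeLast'].tolist())  # [0, 1, 0, 0]
```

With real files, the list comprehension would iterate over allFiles instead of the two StringIO objects.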


2 Comments

antonio_zeus Over a year ago

Hey, this worked great! One thing I noticed was that my columns were rearranged into a different order. Is there a quick way to put them back in the order of my [names] list?

user3602063 Over a year ago

Try: df = df[names + ['beforelast']] (the new column is not in names, so add it to the list or it will be dropped).
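
A small sketch of the reordering, using made-up values. Selecting with a list of labels returns only the listed columns, in the listed order, so the flag column has to be in the list to survive.

```python
import pandas as pd

names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']

# A frame whose columns came back in a different order (made-up values)
df = pd.DataFrame({
    'Close': [427.0], 'beforelast': [0], 'Date': [20141212], 'High': [427.0],
    'Low': [427.0], 'OI': [0], 'Open': [427.0], 'Vol': [0],
})

# Keep the original column order and put the flag column at the end
df = df[names + ['beforelast']]
print(list(df.columns))
```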

kezzos · Accepted Answer · 2015-09-09 19:49:23Z

0

Just create a list to keep track of the last column when you are building your dataframe:

import pandas as pd

df = pd.DataFrame()
newcol = []

for i in range(10):
    # Load 10 files and get shape
    # length = df.shape[0]
    length = 10
    c = [0 for _ in range(length)]  # one flag per row of the current file
    c[-2] = 1                       # mark the second-to-last row
    newcol += c

df['BeforeLast'] = newcol

print(df)


1 Comment

kezzos Over a year ago

It doesn't matter how many files you have. Every time you load a file, just record how long it was by extending the newcol list. When all your files are loaded, add the new column to your complete dataframe.
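
A sketch of how that bookkeeping might look with actual frames. The two small frames are made-up stand-ins for the data read from each file; only the row counts matter here.

```python
import pandas as pd

# Stand-ins for the frames read from each file (made-up data)
frames = [
    pd.DataFrame({'Close': [427.0, 424.0, 425.0, 417.5]}),
    pd.DataFrame({'Close': [425.0, 417.5, 420.0]}),
]

newcol = []
for part in frames:
    flags = [0] * len(part)   # one flag per row of this file
    if len(part) >= 2:        # skip files too short to have a second-to-last row
        flags[-2] = 1
    newcol += flags

# Add the column once, after every file has been combined
df = pd.concat(frames, ignore_index=True)
df['BeforeLast'] = newcol
print(df['BeforeLast'].tolist())  # [0, 0, 1, 0, 0, 1, 0]
```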

Brian Pendleton · Accepted Answer · 2015-09-09 19:55:38Z

0

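import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.zeros(5, dtype=int)})
df.iloc[-2:-1] = 1  # the slice -2:-1 covers only the second-to-last row
print(df)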

   a
0  0
1  0
2  0
3  1
4  0

You can use this when you create each DataFrame.

Example in your code:

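import numpy as np
import pandas as pd

# allFiles is the glob.glob(...) list from the question
names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=names)
    before = np.zeros(len(df), dtype=int)  # one 0 per row
    before[-2] = 1                         # flag the second-to-last row
    df['before'] = before
    list_.append(df)
frame = pd.concat(list_)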
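
As an alternative that is not part of this answer, the flag can also be computed after concatenation, assuming each file is tagged with a key when the frames are combined. The sketch below uses made-up frames; `groupby(...).cumcount(ascending=False)` numbers the rows of each group from the end, so the last row gets 0 and the one before it gets 1.

```python
import pandas as pd

# Stand-ins for the frames read from each file (made-up data)
frames = [
    pd.DataFrame({'Close': [427.0, 424.0, 425.0, 417.5]}),
    pd.DataFrame({'Close': [425.0, 417.5, 420.0]}),
]

# keys= adds an outer index level that records which file each row came from
combined = pd.concat(frames, keys=range(len(frames)))

# Count rows from the end of each file: last row is 0, second-to-last is 1
from_end = combined.groupby(level=0).cumcount(ascending=False)
combined['BeforeLast'] = (from_end == 1).astype(int)

combined = combined.reset_index(drop=True)
print(combined['BeforeLast'].tolist())  # [0, 0, 1, 0, 0, 1, 0]
```

A file with a single row simply gets no 1, so no length check is needed in this variant.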


2 Comments

antonio_zeus Over a year ago

Well, although that seems useful, how would I incorporate it into my code as the DataFrame is being built? Maybe it could go within pd.read_csv(...)?

Brian Pendleton Over a year ago

I added a simpler version using a NumPy array; just add it to the df before you append.
