0

My code is grabbing multiple csv files from a directory, and putting all the data into a dataFrame I created and called "df". Each CSV is the same format, but can be of various lengths so this is what I want to do:

I want to have a column in my df (DataFrame) that records the second to last piece of data in each csv I pull in before it moves onto the next one. I have modified the output below to give you an example of what I mean. Let's suppose I call this column BeforeLast. When you see a 0 value, that means its not the second to last piece of data in the csv I pulled, if you see a 1 value it means its the second to last piece of data in the csv I pulled.

How can I do this as Python is pulling in each csv called upon?

import pandas as pd
import glob
import os


path =r'X:\PublicFiles\TradingData\CSV\RealMarkets\Weekly\Futures\Contracts\Corn C'
allFiles = glob.glob(path + "/*.csv")  ##'*' means any file name can be grabbed
df = pd.DataFrame()
list_ = []

for file_ in allFiles:
    names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']
    df = pd.read_csv(file_, index_col = None, names = names)
    list_.append(df)
frame = pd.concat(list_)

Here is a sample of my current dataFrame (df)

    Date       Open    High     Low   Close   Vol  OI
0   20141212  427.00  427.00  427.00  427.00    0   0
1   20141219  429.00  429.00  424.00  424.00    0   0
2   20141226  424.00  425.00  423.00  425.00    0   0
3   20150102  422.75  422.75  417.50  417.50    0   0

This is what I want

    Date       Open    High     Low   Close   Vol  OI  BeforeLast
0   20141212  427.00  427.00  427.00  427.00    0   0  0
1   20141219  429.00  429.00  424.00  424.00    0   0  0
2   20141226  424.00  425.00  423.00  425.00    0   0  1
3   20150102  422.75  422.75  417.50  417.50    0   0  0 (this is the last piece of data in this csv and now it moves on to the next)
4   20141226  424.00  425.00  423.00  425.00    0   0  0
5   20150102  422.75  422.75  417.50  417.50    0   0  0
6   20141226  424.00  425.00  423.00  425.00    0   0  1
7   20150102  422.75  422.75  417.50  417.50    0   0  0

3 Answers 3

2

Try this. You do not need a list. Just append to the original data frame.

.iloc[-2, -1] is the 2nd to last row, last col

I added a index reset as in my test I ran into duplicate index numbers.

import pandas as pd
import glob
import os


path =r'X:\PublicFiles\TradingData\CSV\RealMarkets\Weekly\Futures\Contracts\Corn C'
allFiles = glob.glob(path + "/*.csv")  ##'*' means any file name can be grabbed
df = pd.DataFrame()

for file_ in allFiles:
    names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']
    df_temp = pd.read_csv(file_, index_col = None, names = names)
    df_temp['beforelast'] = 0
    df_temp.iloc[-2,-1] = 1
    df = df.append(df_temp)

df = df.reset_index(drop=True)
Sign up to request clarification or add additional context in comments.

2 Comments

hey this worked great! one thing I noticed was that my columns re-arranged into some different order. any quick way to put them back the same way I have my [names] ordered?
try: df = df[names]
0

Just create a list to keep track of the last column when you are building your dataframe:

import pandas as pd

df = pd.DataFrame()
newcol = []

for i in range(10):
    # Load 10 files and get shape
    # length = df.shape[0]
    length = 10
    c = [0 for i in range(length)]
    c[-2] = 1
    newcol += c

df['BeforeLast'] = newcol

print df

1 Comment

It doesnt matter how many file you have. Everytime you load a file, just keep track of how long it was using the newcol list. When you have all your files loaded just add the new column to your complete dataframe
0
df = pd.DataFrame({'a': np.zeros(5)})
df[-2:-1] = 1
print df

   a
0  0
1  0
2  0
3  1
4  0

You can use this when you create each dataframe?

Example in your code:

for file_ in allFiles:
    names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI']
    df = pd.read_csv(file_, index_col = None, names = names)
    before = np.zeros(len(df))
    before[-2] = 1
    df['before'] = before
    list_.append(df)
frame = pd.concat(list_)

2 Comments

well although that seems useful, how would I incorporate that into my code as the dataFrame is being built? Maybe to incorporate that piece within pd.read_csv(.... ?
I added a simpler version using numpy array, then just add it to the df before you append.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.