
I have a function set up for Pandas that runs through a large number of rows in input.csv and puts the results into a Series. It then writes the Series to output.csv.

However, if the process is interrupted (for example by an unexpected event), the program will terminate and all data that would have gone into the csv is lost.

Is there a way to write the data continuously to the csv, regardless of whether the function finishes for all rows?

Preferably, each time the program starts, a blank output.csv would be created and then appended to while the function is running.

import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y
    return pd.Series([x, y])

df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)
  • Write in chunks as you go and append to the csv; use mode='a', header=False after the first write (see the sketch after these comments). Commented Jun 27, 2015 at 15:21
  • Also, does the order matter? Commented Jun 27, 2015 at 15:34
  • Do you mean the order of the columns? If so, yes, they need to be in a certain order. Commented Jun 27, 2015 at 15:38
  • You can use if os.path.isfile(). stackoverflow.com/questions/30991541/… Commented Jun 27, 2015 at 16:10
  • Here is an example: stackoverflow.com/questions/30776900/… Commented Jun 27, 2015 at 16:49
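
Following the first comment's suggestion, here is a minimal sketch of processing the input in chunks and appending each finished chunk as it completes, so an interruption only loses the chunk currently in progress. It assumes input.csv has a "Column A" as in the question; the chunk size and the crawl placeholder are illustrative, not from the thread.

import pandas as pd

def crawl(a):
    #Create x, y (placeholder for the real work)
    x, y = a, a
    return pd.Series([x, y])

chunksize = 100  #illustrative chunk size

#read the input in chunks; append each processed chunk immediately
for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=chunksize)):
    chunk[["Column X", "Column Y"]] = chunk["Column A"].apply(crawl)
    chunk.to_csv("output.csv",
        index=False,
        mode="w" if i == 0 else "a",  #start a fresh file, then append
        header=(i == 0))              #write the header only once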

4 Answers


This is a possible solution that will append the data to a new file as it reads the csv in chunks. If the process is interrupted, the new file will contain all the information up until the interruption.

import pandas as pd

#csv file to be read in
in_csv = '/path/to/read/file.csv'

#csv to write data to
out_csv = '/path/to/write/file.csv'

#get the number of lines of the csv file to be read
with open(in_csv) as f:
    number_lines = sum(1 for row in f)

#size of chunks of data to write to the csv
chunksize = 10

#start looping through the data, writing each chunk to the new file
for i in range(1, number_lines, chunksize):
    df = pd.read_csv(in_csv,
        header=None,
        nrows=chunksize,  #number of rows to read at each loop
        skiprows=i)       #skip rows that have already been read

    df.to_csv(out_csv,
        index=False,
        header=False,
        mode='a',             #append data to csv file
        chunksize=chunksize)  #size of data to append for each loop

2 Comments

Might just add import os; start = 1 + sum(1 for row in (open(out_csv))) if os.path.isfile(out_csv) else 1 and put that in the first position of the range() call
Didn't know about mode = 'a', great tip!
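
Building on the comment above, a sketch of how the answer's loop could resume from wherever a previous run stopped by counting the rows already written to out_csv. This assumes each input row produces exactly one output row; otherwise the offsets won't line up.

import os
import pandas as pd

in_csv = '/path/to/read/file.csv'
out_csv = '/path/to/write/file.csv'
chunksize = 10

with open(in_csv) as f:
    number_lines = sum(1 for row in f)

#resume after the rows already present in out_csv, if any
if os.path.isfile(out_csv):
    with open(out_csv) as f:
        start = 1 + sum(1 for row in f)
else:
    start = 1

for i in range(start, number_lines, chunksize):
    df = pd.read_csv(in_csv, header=None, nrows=chunksize, skiprows=i)
    df.to_csv(out_csv, index=False, header=False, mode='a')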

In the end, this is what I came up with. Thanks for helping out!

import pandas as pd

df1 = pd.read_csv("read.csv")

run = 0

def crawl(a):
    global run
    run = run + 1

    #Create x, y

    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])

    #write the header on the first pass, then append without it
    if run == 1:
        df2.to_csv("output.csv")
    else:
        df2.to_csv("output.csv", header=None, mode="a")

df1["Column A"].apply(crawl)

9 Comments

If you have suggestions for improvements, please post a full answer and I will change my selected answer accordingly.
This won't write the data if your program crashes; you will still lose everything.
@PadraicCunningham It will write the data for successful passes of crawl(a), but if there's a crash in the current pass, that data will be lost. Not sure how to prevent that except writing to the csv instantly after x and y have been attained.
You can catch the exception and write in the except block. Also, I think df2.to_csv(f, header=None, mode="a") if os.path.isfile(f) else df2.to_csv(f) is what your ifs are basically doing; I am not seeing how the global fits into the whole process either.
No worries. If you do catch the exceptions, you can also output where you are in your input file, so at least you don't have to go comparing manually; you could do it all programmatically.
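
As the last two comments suggest, here is a hedged sketch that wraps the per-row work in try/except so a failure is recorded and everything already written survives, and that uses os.path.isfile() instead of the global counter. The failed_rows.log file and the placeholder crawl body are illustrative, not from the thread.

import os
import pandas as pd

df1 = pd.read_csv("read.csv")
out_csv = "output.csv"

def crawl(a):
    #Create x, y (placeholder for the real work)
    x, y = a, a
    return x, y

for ix, a in enumerate(df1["Column A"]):
    try:
        x, y = crawl(a)
    except Exception:
        #note where the failure happened so the run can be resumed later
        with open("failed_rows.log", "a") as log:
            log.write("row %s failed\n" % ix)
        continue

    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])
    #write the header only if the file doesn't exist yet, otherwise append
    if os.path.isfile(out_csv):
        df2.to_csv(out_csv, header=False, mode="a", index=False)
    else:
        df2.to_csv(out_csv, index=False)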

I would suggest this:

with open("write.csv","a") as f:
    df.to_csv(f,header=False,index=False)

The argument "a" will append the new df to an existing file, and the file is closed when the with block finishes, so you should keep all of your intermediate results.
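
For the question as asked, this only helps if it is done for each intermediate result rather than once at the end. A minimal sketch under that assumption, with an illustrative crawl placeholder:

import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y (placeholder for the real work)
    x, y = a, a
    return x, y

#write the header once, then append one finished row at a time
pd.DataFrame(columns=["X", "Y"]).to_csv("write.csv", index=False)

for a in df["Column A"]:
    x, y = crawl(a)
    with open("write.csv", "a") as f:
        pd.DataFrame([[x, y]], columns=["X", "Y"]).to_csv(f, header=False, index=False)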



I've found a solution to a similar problem by looping over the dataframe with iterrows() and saving each row to the csv file, which in your case could be something like this:

for ix, row in df.iterrows():
    #assign back through df so the written row contains the crawled value
    df.loc[ix, 'Column A'] = crawl(row['Column A'])

    # if you wish to maintain the header, write it only with the first row
    if ix == 0:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8')
    else:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8', header=False)

