
I have a function set up for Pandas that runs through a large number of rows in input.csv and puts the results into a Series. It then writes the Series to output.csv.

However, if the process is interrupted (for example by an unexpected event), the program will terminate and all data that would have gone into the csv is lost.

Is there a way to write the data continuously to the csv, regardless of whether the function finishes for all rows?

Preferably, each time the program starts, a blank output.csv would be created and then appended to while the function is running.

import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y
    return pd.Series([x, y])

df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)
  • Write in chunks as you go and append to the csv; use mode='a', header=False after the first write (see the sketch after these comments). Commented Jun 27, 2015 at 15:21
  • Also, does the order matter? Commented Jun 27, 2015 at 15:34
  • Do you mean the order of the columns? If so, yes, they need to be in a certain order. Commented Jun 27, 2015 at 15:38
  • You can use if os.path.isfile(). stackoverflow.com/questions/30991541/… Commented Jun 27, 2015 at 16:10
  • Here is an example: stackoverflow.com/questions/30776900/… Commented Jun 27, 2015 at 16:49
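
Following the first comment's suggestion, here is a minimal sketch of processing the input in chunks and appending each finished chunk as it completes, so an interruption only loses the chunk currently in progress. It assumes input.csv has a "Column A" as in the question; the chunk size and the crawl placeholder are illustrative, not from the thread.

import pandas as pd

def crawl(a):
    #Create x, y (placeholder for the real work)
    x, y = a, a
    return pd.Series([x, y])

chunksize = 100  #illustrative chunk size

#read the input in chunks; append each processed chunk immediately
for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=chunksize)):
    chunk[["Column X", "Column Y"]] = chunk["Column A"].apply(crawl)
    chunk.to_csv("output.csv",
        index=False,
        mode="w" if i == 0 else "a",  #start a fresh file, then append
        header=(i == 0))              #write the header only once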

4 Answers


This is a possible solution that will append the data to a new file as it reads the csv in chunks. If the process is interrupted, the new file will contain all the information up until the interruption.

import pandas as pd

#csv file to be read in
in_csv = '/path/to/read/file.csv'

#csv to write data to
out_csv = '/path/to/write/file.csv'

#get the number of lines of the csv file to be read
with open(in_csv) as f:
    number_lines = sum(1 for row in f)

#size of chunks of data to write to the csv
chunksize = 10

#start looping through the data, writing each chunk to the new file
for i in range(1, number_lines, chunksize):
    df = pd.read_csv(in_csv,
        header=None,
        nrows=chunksize,  #number of rows to read at each loop
        skiprows=i)       #skip rows that have already been read

    df.to_csv(out_csv,
        index=False,
        header=False,
        mode='a',             #append data to csv file
        chunksize=chunksize)  #size of data to append for each loop

2 Comments

Might just add import os; start = 1 + sum(1 for row in (open(out_csv))) if os.path.isfile(out_csv) else 1 and put that in the first position of the range() call
Didn't know about mode = 'a', great tip!
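
Building on the comment above, a sketch of how the answer's loop could resume from wherever a previous run stopped by counting the rows already written to out_csv. This assumes each input row produces exactly one output row; otherwise the offsets won't line up.

import os
import pandas as pd

in_csv = '/path/to/read/file.csv'
out_csv = '/path/to/write/file.csv'
chunksize = 10

with open(in_csv) as f:
    number_lines = sum(1 for row in f)

#resume after the rows already present in out_csv, if any
if os.path.isfile(out_csv):
    with open(out_csv) as f:
        start = 1 + sum(1 for row in f)
else:
    start = 1

for i in range(start, number_lines, chunksize):
    df = pd.read_csv(in_csv, header=None, nrows=chunksize, skiprows=i)
    df.to_csv(out_csv, index=False, header=False, mode='a')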

In the end, this is what I came up with. Thanks for helping out!

import pandas as pd

df1 = pd.read_csv("read.csv")

run = 0

def crawl(a):
    global run
    run = run + 1

    #Create x, y

    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])

    #write the header on the first pass, then append without it
    if run == 1:
        df2.to_csv("output.csv")
    else:
        df2.to_csv("output.csv", header=None, mode="a")

df1["Column A"].apply(crawl)

9 Comments

If you have suggestions for improvements, please post a full answer and I will change my selected answer accordingly.
This won't write the data if your program crashes; you will still lose everything.
@PadraicCunningham It will write the data for successful passes of crawl(a), but if there's a crash in the current pass, that data will be lost. Not sure how to prevent that except writing to the csv instantly after x and y have been attained.
You can catch the exception and write in the except block. Also, I think df2.to_csv(f, header=None, mode="a") if os.path.isfile(f) else df2.to_csv(f) is what your ifs are basically doing; I am not seeing how the global fits into the whole process either.
No worries. If you do catch the exceptions, you can also output where you are in your input file, so at least you don't have to go comparing manually; you could do it all programmatically.
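
As the last two comments suggest, here is a hedged sketch that wraps the per-row work in try/except so a failure is recorded and everything already written survives, and that uses os.path.isfile() instead of the global counter. The failed_rows.log file and the placeholder crawl body are illustrative, not from the thread.

import os
import pandas as pd

df1 = pd.read_csv("read.csv")
out_csv = "output.csv"

def crawl(a):
    #Create x, y (placeholder for the real work)
    x, y = a, a
    return x, y

for ix, a in enumerate(df1["Column A"]):
    try:
        x, y = crawl(a)
    except Exception:
        #note where the failure happened so the run can be resumed later
        with open("failed_rows.log", "a") as log:
            log.write("row %s failed\n" % ix)
        continue

    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])
    #write the header only if the file doesn't exist yet, otherwise append
    if os.path.isfile(out_csv):
        df2.to_csv(out_csv, header=False, mode="a", index=False)
    else:
        df2.to_csv(out_csv, index=False)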

I would suggest this:

with open("write.csv","a") as f:
    df.to_csv(f,header=False,index=False)

The argument "a" will append the new df to an existing file, and the file is closed when the with block finishes, so you should keep all of your intermediate results.
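
For the question as asked, this only helps if it is done for each intermediate result rather than once at the end. A minimal sketch under that assumption, with an illustrative crawl placeholder:

import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y (placeholder for the real work)
    x, y = a, a
    return x, y

#write the header once, then append one finished row at a time
pd.DataFrame(columns=["X", "Y"]).to_csv("write.csv", index=False)

for a in df["Column A"]:
    x, y = crawl(a)
    with open("write.csv", "a") as f:
        pd.DataFrame([[x, y]], columns=["X", "Y"]).to_csv(f, header=False, index=False)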



I've found a solution to a similar problem by looping over the dataframe with iterrows() and saving each row to the csv file, which in your case could be something like this:

for ix, row in df.iterrows():
    #assign back through df so the written row contains the crawled value
    df.loc[ix, 'Column A'] = crawl(row['Column A'])

    # if you wish to maintain the header, write it only with the first row
    if ix == 0:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8')
    else:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8', header=False)

