
I have a csv file with 5 million rows. I want to split the file into smaller files, each containing a number of rows specified by the user.

I have developed the following code, but it is taking too much time to execute. Can anyone help me optimize it?

import csv
print "Please delete the previous created files. If any."

filepath = raw_input("Enter the File path: ")

line_count = 0
filenum = 1
try:
    in_file = raw_input("Enter Input File name: ")
    if in_file[-4:] == ".csv":
        split_size = int(raw_input("Enter size: "))
        print "Split Size ---", split_size
        print in_file, " will split into", split_size, "rows per file named as OutPut-file_*.csv (* = 1,2,3 and so on)"
        with open(in_file, 'r') as file1:
            row_count = 0
            reader = csv.reader(file1)
            for line in file1:
                #print line
                with open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a") as out_file:
                    if row_count < split_size:
                        out_file.write(line)
                        row_count = row_count + 1
                    else:
                        filenum = filenum + 1
                        row_count = 0
                line_count = line_count + 1
        print "Total Files Written --", filenum
    else:
        print "Please enter the Name of the file correctly."
except IOError as e:
    print "Oops..! Please Enter correct file path values", e
except ValueError:
    print "Oops..! Please Enter correct values"

I have also tried it without "with open".

  • What about some more conventional units than lakhs? ;) Commented Oct 10, 2017 at 8:40
  • What about seeking to different points with different file pointers and using all of them in parallel via coroutines/gevent? Commented Oct 10, 2017 at 8:47
  • I haven't tried that yet. Can you please help with that? Would multi-threading or multitasking help here? Commented Oct 10, 2017 at 12:56
  • And for some reason you were unable to remove your indian words? Commented Oct 10, 2017 at 14:34
  • @JamesZ Indian words like?? Commented Oct 11, 2017 at 8:09

2 Answers


Oops! You are consistently re-opening the output file on each row, and that is an expensive operation... Your code could become:

    ...
    with open(in_file, 'r') as file1:
        row_count = 0
        #reader = csv.reader(file1)   # unused here
        out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
        for line in file1:
            #print line
            if row_count >= split_size:
                out_file.close()
                filenum = filenum + 1
                out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
                row_count = 0
            out_file.write(line)
            row_count = row_count + 1
            line_count = line_count + 1
        out_file.close()   # don't forget to close the last output file
        ...

Ideally, you should also initialize out_file = None before the try block and ensure a clean close in the except blocks with if out_file is not None: out_file.close().
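
For illustration, here is a minimal sketch of that cleanup pattern, assuming the same variable names as above (a finally clause is used so the close runs on both the success and the error path, a slight variation on closing inside each except block):

    filenum = 1
    out_file = None
    try:
        filepath = raw_input("Enter the File path: ")
        out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
        # ... splitting loop as above ...
    except IOError as e:
        print "Oops..! Please Enter correct file path values", e
    finally:
        # runs whether or not an exception was raised
        if out_file is not None:
            out_file.close()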

Remark: this code only splits on line count (as yours did). That means it will give wrong output if the csv file contains newlines inside quoted fields...
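
If you do need to handle such files, here is a hedged sketch of a csv-aware version using csv.reader and csv.writer (split_csv and its parameter names are made up for this example, not part of your code):

    import csv

    def split_csv(in_path, out_dir, split_size):
        # csv.reader yields one logical row at a time, even when a quoted
        # field spans several physical lines
        with open(in_path, 'rb') as src:
            reader = csv.reader(src)
            filenum, row_count, out_file, writer = 1, 0, None, None
            for row in reader:
                if out_file is None or row_count >= split_size:
                    if out_file is not None:
                        out_file.close()
                    out_file = open(out_dir + "\\OutPut-file_" + str(filenum) + ".csv", 'wb')
                    writer = csv.writer(out_file)
                    filenum = filenum + 1
                    row_count = 0
                writer.writerow(row)
                row_count = row_count + 1
            if out_file is not None:
                out_file.close()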


2 Comments

Ohh, well in that case I need to check for the newlines, right?
@user2597209: if you want to allow newlines in quoted fields, you will have to parse the input file with a csv reader, and write the rows with a csv writer, or do the parsing by hand but it is complex with many corner cases.

You can definitely use Python's multiprocessing module.

These are the results I got with a csv file that had 1,000,000 lines in it.

import time
from multiprocessing import Pool

def saving_csv_normally(start):
  out_file = open('out_normally/' + str(start/batch_size) + '.csv', 'w')
  for i in range(start, start+batch_size):
    out_file.write(arr[i])
  out_file.close()

def saving_csv_multi(start):
  out_file = open('out_multi/' + str(start/batch_size) + '.csv', 'w')
  for i in range(start, start+batch_size):
    out_file.write(arr[i])
  out_file.close()

def saving_csv_multi_async(start):
  out_file = open('out_multi_async/' + str(start/batch_size) + '.csv', 'w')
  for i in range(start, start+batch_size):
    out_file.write(arr[i])
  out_file.close()

with open('files/test.csv') as file:
  arr = file.readlines()

print "length of file : ", len(arr)

batch_size = 100  # split size: number of rows per output file

start = time.time()
for i in range(0, len(arr), batch_size):
  saving_csv_normally(i)
print "time taken normally : ", time.time()-start

#multiprocessing
p = Pool()
start = time.time()
p.map(saving_csv_multi, range(0, len(arr), batch_size), chunksize=len(arr)/4)  # chunksize can be tuned as needed
print "time taken for multiprocessing : ", time.time()-start

# does the same thing asynchronously
start = time.time()
for i in p.imap_unordered(saving_csv_multi_async, range(0, len(arr), batch_size), chunksize=len(arr)/4): 
  continue
print "time taken for multiprocessing async : ", time.time()-start

Output showing the time taken by each approach:

length of file :  1000000
time taken normally :  0.733881950378
time taken for multiprocessing :  0.508712053299
time taken for multiprocessing async :  0.471592903137

I have defined three separate functions because a function passed to p.map can only take one parameter, and since I am storing the csv files in three different folders, I have written one function per folder.
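
If you want to avoid the three near-identical functions, one option (a sketch; save_chunk and the tuple layout are made up for this example) is to pack the start index and the output folder into a single tuple, since p.map only passes one argument per task:

def save_chunk(args):
  # unpack the single argument that p.map passes to the worker
  start, out_dir = args
  out_file = open(out_dir + '/' + str(start/batch_size) + '.csv', 'w')
  for i in range(start, start+batch_size):
    out_file.write(arr[i])
  out_file.close()

p = Pool()
tasks = [(i, 'out_multi') for i in range(0, len(arr), batch_size)]
p.map(save_chunk, tasks)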

