I was fiddling around with Python when my boss assigned me a rather daunting task.
He gave me a CSV file around 14 GB in size and asked whether I could inflate that CSV into a delimited file of roughly 4 TB by replicating its contents several times.
For example, take this CSV:
TIME_SK,ACCOUNT_NUMBER,ACCOUNT_TYPE_SK,ACCOUNT_STATUS_SK,CURRENCY_SK,GLACC_BUSINESS_NAME,PRODUCT_SK,PRODUCT_TERM_SK,NORMAL_BAL,SPECIAL_BAL,FINAL_MOV_YTD_BAL,NO_OF_DAYS_MTD,NO_OF_DAYS_YTD,BANK_FLAG,MEASURE_ID,SOURCE_SYSTEM_ID
20150131,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150131,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150131,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150131,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150131,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150131,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1
He wants me to inflate the size by replicating the contents of the CSV while altering the TIME_SK value in each copy, like below:
TIME_SK,ACCOUNT_NUMBER,ACCOUNT_TYPE_SK,ACCOUNT_STATUS_SK,CURRENCY_SK,GLACC_BUSINESS_NAME,PRODUCT_SK,PRODUCT_TERM_SK,NORMAL_BAL,SPECIAL_BAL,FINAL_MOV_YTD_BAL,NO_OF_DAYS_MTD,NO_OF_DAYS_YTD,BANK_FLAG,MEASURE_ID,SOURCE_SYSTEM_ID
20150131,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150131,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150131,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150131,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150131,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150131,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150131,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1
20150201,F290006G93996,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,865.57767676670005,0,865.57767676670005,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,F2900F036FF90,7,9,12,GWM BALANCE,502,0,-139.0556,0,-139.0556,30,121,N,GWM BALANCE,1
20150201,F070007GG6790,7,1,12,DEPOSIT INSURANCE EXPENSE,1008,0,14100.016698793699,0,14100.016698793699,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,F2F00040FG982,7,1,12,DEPOSIT INSURANCE EXPENSE,502,0,8410.4009848750993,0,8410.4009848750993,30,121,N,DEPOSIT INSURANCE EXPENSE,1
20150201,FF30009944863,7,9,12,ACCOUNT PRINCIPAL,502,0,-2367.9400000000001,0,-2367.9400000000001,30,121,N,GL BALANCE,1
20150201,F240002FG722F,7,1,12,ACCOUNT PRINCIPAL,502,0,-28978292.390000001,0,-28978292.390000001,30,121,N,GL BALANCE,1
20150201,F0G00FFF74293,7,1,12,ACCOUNT PRINCIPAL,1008,0,-855196.81000000006,0,-855196.81000000006,30,121,N,GL BALANCE,1
20150201,FF20007947687,7,9,12,GWM BALANCE,2425,0,-368.45897600000001,0,-368.45897600000001,30,121,N,GWM BALANCE,1
20150201,F200007938744,7,1,12,GWM BALANCE,502,0,-19977.173964000001,0,-19977.173964000001,30,121,N,GWM BALANCE,1
and so on.
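Just to give an idea of the scale and the date shift involved, here is a tiny sketch (only an illustration, not the actual script; the 14 GB and 4 TB figures are the ones from the task above):

    import datetime

    # Shifting a TIME_SK value by one day per replica: 20150131 -> 20150201, and so on
    time_sk = "20150131"
    shifted = datetime.datetime.strptime(time_sk, "%Y%m%d") + datetime.timedelta(days=1)
    print(shifted.strftime("%Y%m%d"))  # prints 20150201

    # Back-of-envelope count of replicas needed to grow ~14 GB into ~4 TB
    source_gb = 14.0
    target_gb = 4 * 1024.0
    print(int(round(target_gb / source_gb)))  # roughly 293 copies of the original data

So the script basically has to produce a few hundred date-shifted copies of the source file.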
I was able to write a Python script that does the task; however, when it was used on the real CSV file, tens of gigabytes in size with hundreds of millions of rows, it took far too long to complete (there was a time constraint back then; however, he has asked me to do it again now).
I am using Python's built-in csv writer. After a bit of research, I came up with two different approaches:
1. The Old and Trusted Iterator
This is the first version of my script; it does the job all right, but it took too long to tackle the humongous CSV.
. . . omitted . . .
with open('../csv/DAILY_DDMAST.csv', 'rb') as csvinput:
    with open('../result/DAILY_DDMAST_result1'+name_interval+'.csv', 'wb') as csvoutput:
        reader = csv.reader(csvinput)
        writer = csv.writer(csvoutput, lineterminator='\r\n')

        # This part copies the original CSV to a new file
        for row in reader:
            writer.writerow(row)
        print("Done copying. Time elapsed: %s seconds, Total time: %s seconds" %
              ((time.time() - start_time), (time.time() - start_time)))

        i = 0
        while i < 5:
            # This part replicates the content of CSV, with modifying the TIME_SK value
            counter_time = time.time()
            for row in reader:
                newdate = datetime.datetime.strptime(row[0], "%Y%m%d") + datetime.timedelta(days=i)
                row[0] = newdate.strftime("%Y%m%d")
                writer.writerow(row)
            csvinput.seek(0)
            next(reader, None)
            print("Done processing for i = %d. Time elapsed: %s seconds, Total time: %s seconds" %
                  (i+1, (counter_time - start_time), (time.time() - start_time)))
            i += 1
. . . omitted . . .
As I understand it, the script iterates over each row of the CSV with for row in reader, and then writes each row to the new file with writer.writerow(row). I also found that iterating over the source file again and again is rather repetitive and time-consuming, so I thought another approach might be more efficient...
2. The Bucket
This was intended as an "upgrade" to the first version of the script.
. . . omitted . . .
with open('../csv/DAILY_DDMAST.csv', 'rb') as csvinput:
    with open('../result/DAILY_DDMAST_result2'+name_interval+'.csv', 'wb') as csvoutput:
        reader = csv.reader(csvinput)
        writer = csv.writer(csvoutput, lineterminator='\r\n')

        csv_buffer = list()
        for row in reader:
            # Here, rather than directly writing the iterated row, I store it in a list.
            # Once the list reaches 1 million rows, it is written to the file and the "bucket" is emptied
            csv_buffer.append(row)
            if len(csv_buffer) > 1000000:
                writer.writerows(csv_buffer)
                del csv_buffer[:]
        writer.writerows(csv_buffer)
        print("Done copying. Time elapsed: %s seconds, Total time: %s seconds" %
              ((time.time() - start_time), (time.time() - start_time)))

        i = 0
        while i < 5:
            counter_time = time.time()
            del csv_buffer[:]
            for row in reader:
                newdate = datetime.datetime.strptime(row[0], "%Y%m%d") + datetime.timedelta(days=i)
                row[0] = newdate.strftime("%Y%m%d")
                # Same goes here
                csv_buffer.append(row)
                if len(csv_buffer) > 1000000:
                    writer.writerows(csv_buffer)
                    del csv_buffer[:]
            writer.writerows(csv_buffer)
            csvinput.seek(0)
            next(reader, None)
            print("Done processing for i = %d. Time elapsed: %s seconds, Total time: %s seconds" %
                  (i+1, (counter_time - start_time), (time.time() - start_time)))
            i += 1
. . . omitted . . .
I thought that by storing the rows in memory and then writing them all together with writerows, I could save time. However, that was not the case. I found out that even if I collect the rows to be written in a list, writerows still iterates over that list and writes the rows to the new file one by one, so it takes nearly as long as the first script...
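As far as I can tell, writerows is essentially just the iteration done for me: it loops over the sequence and writes one row at a time, roughly like the sketch below (this is my understanding of its behaviour, not the actual implementation). So collecting a million rows in a list mostly adds memory overhead on top of the buffering the output file object already does.

    def writerows_equivalent(writer, rows):
        # Roughly what writer.writerows(rows) amounts to, as I understand it:
        # each row is formatted and written individually, just as with writerow.
        for row in rows:
            writer.writerow(row)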
At this point, I don't know whether I should come up with a better algorithm or whether there is something I could use - something like writerows, only one that does not iterate but writes all the rows at once.
I don't know if such a thing is even possible, either.
Anyway, I need help on this, and if anyone could shed some light, I would be very thankful!