My application needs to process data periodically: it processes new data and then merges it with the old data. The data may have billions of rows but only two columns, where the first column is the row name and the second is the value. For example:
a00001,12
a00002,2321
a00003,234
The new data may contain new row names or existing ones, and I want to merge them. So in each processing run I have to read the old, large data file, merge in the new data, and then write the result to a new file.
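Here is a simplified sketch of the merge step (I'm assuming here that a new value simply replaces the old one when a row name repeats; `merge_rows` is just an illustrative helper name):

```python
def merge_rows(old_rows, new_rows):
    # Both arguments map row name -> value, e.g. {"a00001": 12}.
    merged = dict(old_rows)
    merged.update(new_rows)  # new rows overwrite duplicates and add new names
    return merged
```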
I find that reading and writing the data is the most time-consuming part. I have tried several I/O approaches:
- Plain text read and write. This is the most time-consuming approach.
- The Python pickle package; however, it is also not efficient for a large data file.

(Both approaches are roughly as in the sketch below.)
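For reference, here is approximately what the two approaches look like (file paths and function names are placeholders):

```python
import csv
import pickle

def load_text(path):
    # Plain-text approach: parse each "name,value" line.
    data = {}
    with open(path, newline="") as f:
        for name, value in csv.reader(f):
            data[name] = int(value)
    return data

def save_text(path, data):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, value in data.items():
            writer.writerow([name, value])

def load_pickle(path):
    # pickle approach: read/write the whole mapping as one binary blob.
    with open(path, "rb") as f:
        return pickle.load(f)

def save_pickle(path, data):
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
```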
Are there any other data I/O formats or packages that can load and write large data efficiently in Python?