
My application needs to process data periodically: it processes new data and then merges it with the old data. The data may have billions of rows with only two columns, where the first column is the row name and the second is the value. Here is an example:

a00001,12
a00002,2321
a00003,234

The new data may contain new row names or existing ones, and I want to merge them. So in each processing run I need to read the old, large data file, merge it with the new data, and write the result to a new file.

I find that the most time-consuming part is reading and writing the data. I have tried several data I/O approaches:

  1. Plain text read and write. This is the most time-consuming approach.
  2. Python's pickle package; however, it is not efficient for large data files.

Are there any other data I/O formats or packages that can load and write large data efficiently in Python?
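For reference, the plain-text merge is roughly like this (a simplified sketch; the file names are placeholders):

    import csv

    # Simplified sketch of the plain-text merge: read the old file into a dict,
    # apply the new rows (new names are added, existing names are overwritten),
    # and write everything back out.
    def merge_text(old_path, new_path, out_path):
        merged = {}
        with open(old_path, newline="") as f:
            for name, value in csv.reader(f):
                merged[name] = value
        with open(new_path, newline="") as f:
            for name, value in csv.reader(f):
                merged[name] = value
        with open(out_path, "w", newline="") as f:
            csv.writer(f).writerows(merged.items())

    merge_text("old.csv", "new.csv", "merged.csv")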

  • For processing billions of rows, my advice is to use Apache Spark and pyspark. Commented Apr 5, 2020 at 13:15
  • @HenriqueBranco Using Apache Spark would add hardware and maintenance costs. And the data may only have ten million rows at the beginning. Commented Apr 5, 2020 at 13:34

2 Answers


If you have such large amounts of data, it might be faster to try lowering the amount of data you have to read and write.

You could spread the data over multiple files instead of saving it all in one. When processing your new data, check what old data has to be merged and just read and write those specific files.

Your data has rows of the form:

name1, data1
name2, data2

Files containing old data:

db_1.dat,               db_2.dat,                 db_3.dat
name_1: data_1          name_1001: data_1001      name_2001: data_2001
.                       .                         .
.                       .                         .
.                       .                         .                
name_1000: data_1000    name_2000: data_2000      name_3000: data_3000 

Now you can check what data you need to merge and just read and write the specific files holding that data.

I'm not sure whether what you are trying to achieve allows a system like this, but it would speed up the process since there is less data to handle.
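A minimal sketch of this idea in Python (the shard count, file names, and hash-based routing here are just one possible scheme):

    import csv
    import os
    import zlib

    SHARD_COUNT = 16  # pick something that keeps individual files small enough

    def shard_path(name):
        # Route a row name to one of SHARD_COUNT files; any stable scheme works
        # (a hash as here, the key prefix, or explicit key ranges).
        return f"db_{zlib.crc32(name.encode()) % SHARD_COUNT}.csv"

    def merge_new_data(new_rows):
        # Group the incoming rows by the shard file they belong to.
        by_shard = {}
        for name, value in new_rows:
            by_shard.setdefault(shard_path(name), {})[name] = value

        # Only the affected shards are read and rewritten; untouched shards
        # are never opened.
        for path, updates in by_shard.items():
            merged = {}
            if os.path.exists(path):
                with open(path, newline="") as f:
                    merged = dict(csv.reader(f))
            merged.update(updates)
            with open(path, "w", newline="") as f:
                csv.writer(f).writerows(merged.items())

    merge_new_data([("a00001", "12"), ("a99999", "7")])

Splitting by key range instead of a hash (as mentioned in the comment below) works the same way; only the routing function changes.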


1 Comment

This method seems good. I can split the old data into different files based on key ranges, then merge the new data with the specific file. Thx!

Maybe this article could help you. It seems like feather and parquet may be interesting.
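For example, with pandas the two formats can be used like this (a minimal sketch; it assumes pandas and pyarrow are installed, and the column names are made up):

    import pandas as pd

    # Two-column data: row name and value.
    df = pd.DataFrame({"name": ["a00001", "a00002"], "value": [12, 2321]})

    # Parquet and Feather are binary, columnar formats that are typically much
    # faster to read and write than plain text or pickle for large tables.
    df.to_parquet("data.parquet")   # requires pyarrow or fastparquet
    df.to_feather("data.feather")   # requires pyarrow

    df2 = pd.read_parquet("data.parquet")
    df3 = pd.read_feather("data.feather")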

2 Comments

Welcome to StackOverflow! While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
This article is what I want. I will test these file formats. Thx!
