
My application needs to process data periodically: it processes new data and then merges it with the old data. The data may have billions of rows with only two columns, where the first column is the row name and the second is the value. Here is an example:

a00001,12
a00002,2321
a00003,234

The new data may contain new row names or existing ones, and I want to merge them. So in each processing run I need to read the old, large data file, merge it with the new data, and write the result to a new file.

I find that the most time-consuming part is reading and writing the data. I have tried several data I/O approaches:

  1. Plain text read and write. This is the most time-consuming approach.
  2. Python's pickle package; however, it is not efficient for large data files.

Are there any other data I/O formats or packages that can load and write large data efficiently in Python?
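For reference, the plain-text merge is roughly like this (a simplified sketch; the file names are placeholders):

    import csv

    # Simplified sketch of the plain-text merge: read the old file into a dict,
    # apply the new rows (new names are added, existing names are overwritten),
    # and write everything back out.
    def merge_text(old_path, new_path, out_path):
        merged = {}
        with open(old_path, newline="") as f:
            for name, value in csv.reader(f):
                merged[name] = value
        with open(new_path, newline="") as f:
            for name, value in csv.reader(f):
                merged[name] = value
        with open(out_path, "w", newline="") as f:
            csv.writer(f).writerows(merged.items())

    merge_text("old.csv", "new.csv", "merged.csv")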

  • For processing billions of rows, my advice is to use Apache Spark and pyspark. Commented Apr 5, 2020 at 13:15
  • @HenriqueBranco Using Apache Spark would add hardware and maintenance costs. And the data may only have ten million rows at the beginning. Commented Apr 5, 2020 at 13:34

2 Answers


If you have such large amounts of data, it might be faster to try lowering the amount of data you have to read and write.

You could spread the data over multiple files instead of saving it all in one. When processing your new data, check what old data has to be merged and just read and write those specific files.

Your data has rows of the form:

name1, data1
name2, data2

Files containing old data:

db_1.dat,               db_2.dat,                 db_3.dat
name_1: data_1          name_1001: data_1001      name_2001: data_2001
.                       .                         .
.                       .                         .
.                       .                         .                
name_1000: data_1000    name_2000: data_2000      name_3000: data_3000 

Now you can check what data you need to merge and just read and write the specific files holding that data.

I'm not sure whether what you are trying to achieve allows a system like this, but it would speed up the process since there is less data to handle.
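A minimal sketch of this idea in Python (the shard count, file names, and hash-based routing here are just one possible scheme):

    import csv
    import os
    import zlib

    SHARD_COUNT = 16  # pick something that keeps individual files small enough

    def shard_path(name):
        # Route a row name to one of SHARD_COUNT files; any stable scheme works
        # (a hash as here, the key prefix, or explicit key ranges).
        return f"db_{zlib.crc32(name.encode()) % SHARD_COUNT}.csv"

    def merge_new_data(new_rows):
        # Group the incoming rows by the shard file they belong to.
        by_shard = {}
        for name, value in new_rows:
            by_shard.setdefault(shard_path(name), {})[name] = value

        # Only the affected shards are read and rewritten; untouched shards
        # are never opened.
        for path, updates in by_shard.items():
            merged = {}
            if os.path.exists(path):
                with open(path, newline="") as f:
                    merged = dict(csv.reader(f))
            merged.update(updates)
            with open(path, "w", newline="") as f:
                csv.writer(f).writerows(merged.items())

    merge_new_data([("a00001", "12"), ("a99999", "7")])

Splitting by key range instead of a hash (as mentioned in the comment below) works the same way; only the routing function changes.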


1 Comment

This method seems good. I can split the old data into different files based on key ranges, then merge the new data with the specific file. Thx!

Maybe this article could help you. It seems like feather and parquet may be interesting.
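For example, with pandas the two formats can be used like this (a minimal sketch; it assumes pandas and pyarrow are installed, and the column names are made up):

    import pandas as pd

    # Two-column data: row name and value.
    df = pd.DataFrame({"name": ["a00001", "a00002"], "value": [12, 2321]})

    # Parquet and Feather are binary, columnar formats that are typically much
    # faster to read and write than plain text or pickle for large tables.
    df.to_parquet("data.parquet")   # requires pyarrow or fastparquet
    df.to_feather("data.feather")   # requires pyarrow

    df2 = pd.read_parquet("data.parquet")
    df3 = pd.read_feather("data.feather")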

2 Comments

Welcome to StackOverflow! While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
This article is what I want. I will test these file formats. Thx!
