
I have a large dataset that is almost 4 GB in CSV format, but I do not need the whole dataset, only some specific columns. Is it possible to read specific columns instead of reading the whole dataset using Python pandas? Will it increase the speed of reading the file?

Thank you very much in advance for any suggestions.

1 Answer


If you have 4 GB of memory, don't worry about it (the time it would take you to program a less memory-intensive solution isn't worth it). Read the entire dataset in using pd.read_csv and then subset to just the column that you need. If you don't have enough memory and you really do need to read the file line by line (i.e. row by row), modify this code to keep only the column of interest in memory.
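
A minimal sketch of that approach (the file path and column name here are placeholders, not from the question):

import pandas as pd

# read the whole CSV into memory, then keep only the column you need
df = pd.read_csv("big_file.csv")
col_of_interest = df["column_of_interest"]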

If you have plenty of memory and your problem is that you have multiple files in this format, then I would recommend using the multiprocessing package to parallelize the task.

from multiprocessing import Pool

pool = Pool(processes=your_processors_n)
# map your read-in function over the list of file paths; the result is a list of DataFrames
dataframes_list = pool.map(your_regular_expression_readin_func, [file1, file2, ..., filen])
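
For a more self-contained version of that idea, something like the following should work (the file names, separator pattern, and process count are assumptions, not from the question):

from multiprocessing import Pool
import pandas as pd

def read_one(path):
    # hypothetical read-in function; replace sep with your actual delimiter pattern
    return pd.read_csv(path, sep=r"\s+", engine="python")

if __name__ == "__main__":
    paths = ["part1.txt", "part2.txt", "part3.txt"]  # placeholder file names
    with Pool(processes=4) as pool:
        frames = pool.map(read_one, paths)
    combined = pd.concat(frames, ignore_index=True)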

8 Comments

I think my problem is not memory; the problem is reading speed. I am using a regular expression as the separator. Is that making it slow?
Your speed issue then is likely with the regular expression, and I would post a question about that regular expression. It obviously takes time to load the data, but you could always load it once, subset, and save only the column of interest so that the next time you need it, the data loads much faster.
Wait, maybe I misunderstood. You're using a regular expression to load a .csv file? I thought you were applying it post hoc. Use pandas.read_csv to read in a csv file, which, if you import pandas as pd, is pd.read_csv, as in my answer above.
So it's not really comma-separated, but something else? Is the script not completing when you implement the answers in the other question? It seems like you just need to run it once and then save it as an actual .csv file, which will load easily with pandas (unless you have a lot of datasets like this).
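
If that's the route you take, a rough sketch of the one-time conversion might look like this (the separator pattern and file names are assumptions):

import pandas as pd

# parse once with the slow regex separator (the python engine is needed for regex separators)
df = pd.read_csv("original_data.txt", sep=r"\s+", engine="python")

# save an actual comma-separated file
df.to_csv("converted.csv", index=False)

# later runs can load the real .csv quickly with the default C parser
df = pd.read_csv("converted.csv")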
