
I have a large dataset that is almost 4 GB in CSV format, but I do not need the whole dataset, only some specific columns. Is it possible to read specific columns instead of reading the whole dataset using Python pandas? Will it increase the speed of reading the file?

Thank you very much in advance for any suggestions.

1 Answer


If you have 4 GB of memory, don't worry about it (the time it would take you to program a less memory-intensive solution isn't worth it). Read the entire dataset in using pd.read_csv and then subset to just the column that you need. If you don't have enough memory and you really do need to read the file line by line (i.e. row by row), modify this code to keep only the column of interest in memory.
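
A minimal sketch of that approach (the file path and column name here are placeholders, not from the question):

import pandas as pd

# read the whole CSV into memory, then keep only the column you need
df = pd.read_csv("big_file.csv")
col_of_interest = df["column_of_interest"]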

If you have plenty of memory and your problem is that you have multiple files in this format, then I would recommend using the multiprocessing package to parallelize the task.

from multiprocessing import Pool

pool = Pool(processes=your_processors_n)
# map your read-in function over the list of file paths; the result is a list of DataFrames
dataframes_list = pool.map(your_regular_expression_readin_func, [file1, file2, ..., filen])
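
For a more self-contained version of that idea, something like the following should work (the file names, separator pattern, and process count are assumptions, not from the question):

from multiprocessing import Pool
import pandas as pd

def read_one(path):
    # hypothetical read-in function; replace sep with your actual delimiter pattern
    return pd.read_csv(path, sep=r"\s+", engine="python")

if __name__ == "__main__":
    paths = ["part1.txt", "part2.txt", "part3.txt"]  # placeholder file names
    with Pool(processes=4) as pool:
        frames = pool.map(read_one, paths)
    combined = pd.concat(frames, ignore_index=True)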

8 Comments

I think my problem is not memory; the problem is reading speed. I am using a regular expression as the separator. Is that making it slow?
Your speed issue then is likely with the regular expression, and I would post a question about that regular expression. It obviously takes time to load the data, but you could always load it once, subset, and save only the column of interest so that the next time you need it, the data loads much faster.
Wait, maybe I misunderstood. You're using a regular expression to load a .csv file? I thought you were applying it post hoc. Use pandas.read_csv to read in a csv file, which, if you import pandas as pd, is pd.read_csv, as in my answer above.
So it's not really comma-separated, but something else? Is the script not completing when you implement the answers in the other question? It seems like you just need to run it once and then save it as an actual .csv file, which will load easily with pandas (unless you have a lot of datasets like this).
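
If that's the route you take, a rough sketch of the one-time conversion might look like this (the separator pattern and file names are assumptions):

import pandas as pd

# parse once with the slow regex separator (the python engine is needed for regex separators)
df = pd.read_csv("original_data.txt", sep=r"\s+", engine="python")

# save an actual comma-separated file
df.to_csv("converted.csv", index=False)

# later runs can load the real .csv quickly with the default C parser
df = pd.read_csv("converted.csv")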
