
I want to read the file f (file size: 85 GB) in chunks into a dataframe. The following code was suggested:

chunksize = 5
TextFileReader = pd.read_csv(f, chunksize=chunksize)

However, this code gives me a TextFileReader, not a dataframe. I also don't want to concatenate the chunks to convert the TextFileReader into a dataframe, because of the memory limit. Please advise.

  • Sorry, what are you asking here? You can't load the entire dataframe into memory, which is why you read in chunks, so why do you think that concatenating all the chunks will solve this problem? Commented Sep 8, 2016 at 8:49
  • Storing them in a list!? I don't get what you actually want to achieve. Do you want to keep the chunks separately? Note that your TextFileReader is an iterable object from which you can retrieve the individual chunks via for chunk in TextFileReader. Commented Sep 8, 2016 at 8:51
  • You can use a for loop over the reader; each iteration gives you one dataframe for one chunk, and you can merge all the dataframes at the end. Commented Sep 8, 2016 at 8:51
  • 1
    So loop over TextFileReader as explained above and do with the chunks whatever you want (reduce them, group them, ...) Commented Sep 8, 2016 at 9:02
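As the comments say, each chunk is an ordinary dataframe, so you can aggregate them one at a time without ever concatenating. A minimal runnable sketch; the tiny in-memory CSV and the column name value are stand-ins for the real 85 GB file:

```python
import io

import pandas as pd

# Stand-in for the real 85 GB file: a tiny in-memory CSV so the pattern
# can run end to end. In practice, pass the path of the large file.
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n6\n")

chunksize = 2          # tiny here; something like 1_000_000 for real data
total = 0
rows = 0
for chunk in pd.read_csv(csv_data, chunksize=chunksize):
    # each chunk is an ordinary DataFrame with at most `chunksize` rows
    total += chunk["value"].sum()
    rows += len(chunk)

mean = total / rows    # running aggregate; nothing is ever concatenated
print(mean)
```

Only one chunk is in memory at any time, so this pattern works regardless of file size.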

3 Answers


As you are trying to process an 85 GB CSV file, reading all of the data in chunks and converting the chunks into one dataframe will certainly hit the memory limit. Try a different approach instead: filter as you read. For example, if your dataset has 600 columns and you are interested in only 50 of them, read just those 50 columns from the file (the usecols parameter of read_csv); that saves a lot of memory. Then process the rows as you read them. If you need to filter the data first, use a generator function: yield makes a function a generator, which means it does no work until you start looping over it.
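A minimal sketch of both ideas, usecols plus a generator that filters lazily. The column names, the filter condition, and the in-memory CSV are placeholders for illustration:

```python
import io

import pandas as pd

# Placeholder data standing in for a huge CSV with many columns.
csv_data = io.StringIO(
    "a,b,c\n"
    "1,10,x\n"
    "2,20,y\n"
    "3,30,z\n"
)

def filtered_chunks(source, wanted_cols, chunksize=2):
    # `yield` makes this a generator: no work happens until you loop over it
    for chunk in pd.read_csv(source, usecols=wanted_cols, chunksize=chunksize):
        yield chunk[chunk["a"] > 1]   # hypothetical filter condition

# Only the filtered subset is ever combined, so memory is bounded by the
# size of the filtered result rather than the whole file.
result = pd.concat(filtered_chunks(csv_data, ["a", "b"]), ignore_index=True)
print(result)
```

pd.concat accepts any iterable of dataframes, including a generator, so the chunks are consumed one at a time.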

For more information regarding generator function: Reading a huge .csv file

For efficient filtering, see: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3

For processing a smaller dataset:

Approach 1: convert the reader object to a dataframe directly:

full_data = pd.concat(TextFileReader, ignore_index=True)

Passing ignore_index=True to concat avoids duplicate index values across the chunks.

Approach 2: iterate over the reader, or use get_chunk, to obtain dataframes.

When you specify a chunksize to read_csv, the return value is an iterable object of type TextFileReader:

df = TextFileReader.get_chunk(3)   # next 3 rows as a dataframe

for chunk in TextFileReader:       # remaining chunks, chunksize rows each
    print(chunk)

Source: http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

df = TextFileReader.get_chunk(1)

This converts one chunk to a dataframe; get_chunk already returns a DataFrame, so no pd.DataFrame(...) wrapper is needed.

Checking the total number of chunks in the TextFileReader:

for chunk_number, chunk in enumerate(TextFileReader):
    # some code here, if needed
    pass

print("Total number of chunks is", chunk_number+1)

For a very large file, I would not recommend the second approach. For example, if the csv file contains 100,000 records, then chunksize=5 will create 20,000 chunks.


5 Comments

Ok, but the screenshot you shared shows it will still give a TextFileReader. So how should I convert that to a dataframe?
Try chunk_1 = pd.DataFrame(TextFileReader.get_chunk(1)). This will convert one chunk to a dataframe.
My data has millions of rows, so I can't use the 2nd approach. And the 1st approach involves concatenation, so I will hit the memory limit thanks to my 85 GB csv file. What should I do?
If you can convert your csv file to a compressed file format that pandas supports (gzip, for example), the data will be easier to read.
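For what it's worth, pandas can read gzip-compressed CSVs directly, and this combines with chunked reading. A small sketch; the in-memory gzipped buffer stands in for a real compressed file on disk:

```python
import gzip
import io

import pandas as pd

# Tiny CSV gzipped in memory as a stand-in for a real compressed file;
# with a .gz file on disk you would just pass its path (compression is
# then inferred from the extension).
raw = b"col1,col2\n1,a\n2,b\n3,c\n"
buf = io.BytesIO(gzip.compress(raw))

chunks = list(pd.read_csv(buf, compression="gzip", chunksize=2))
print([len(c) for c in chunks])
```

Compression reduces disk and I/O cost, but the decompressed chunks still need to fit in memory one at a time.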

If you want to receive a data frame as the result of working with chunks, you can do it this way: collect the filtered chunks in a list as you iterate, then concatenate them once at the end. (Concatenating inside the loop works too, but it copies all previously accumulated rows on every iteration, which gets slow.) The result is a dataframe filtered by the condition applied inside the for loop.

file = 'results.csv'
filtered_chunks = []
for chunk in pd.read_csv(file, chunksize=100000):
    # keep only the rows that satisfy the condition
    filtered_chunks.append(chunk[chunk['column1'] > 180])
df_filtered = pd.concat(filtered_chunks, ignore_index=True)

Note that read_csv accepts the file path directly; there is no need to open the file yourself.


get_chunk returns the next 100,000 rows from the reader directly as a dataframe:

  full_dataframe = TextFileReader.get_chunk(100000)
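A small runnable demonstration of what get_chunk returns; the in-memory CSV is a stand-in for the question's 85 GB file:

```python
import io

import pandas as pd

# Stand-in data; in the question, the reader comes from the 85 GB file.
reader = pd.read_csv(io.StringIO("x\n1\n2\n3\n4\n"), chunksize=2)

first = reader.get_chunk(3)    # next 3 rows, already a DataFrame
second = reader.get_chunk(1)   # the following row
print(len(first), len(second))
```

An explicit size passed to get_chunk overrides the reader's chunksize, so you can pull an arbitrary number of rows at a time.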

