Pandas DataFrame from dict/column generator

Question

Given a dict, where keys are column labels and values are Series, I can easily build a DataFrame as such:

dat = {'col_1': pd.Series([3, 2, 1, 0], index=range(10,50,10)),
       'col_2': pd.Series([4, 5, 6, 7], index=range(10,50,10))}

pd.DataFrame(dat)

    col_1  col_2
10      3      4
20      2      5
30      1      6
40      0      7

However I have a generator that gives (key, value) tuples:

gen = ((col, srs) for col, srs in dat.items())  # generator object

Now I can trivially use the generator to create a dict and make the same DataFrame:

pd.DataFrame(dict(gen))

However this evaluates all the generator Series first, and then sends them into Pandas, and so uses twice the memory (I presume). I'd like Pandas to iterate over the generator itself as it builds the DataFrame if possible.

I can pass the generator into the DataFrame constructor, but get an odd result:

gen = ((col, srs) for col, srs in dat.items())  # generator object
pd.DataFrame(gen)

       0                                             1
0  col_1  10    3
20    2
30    1
40    0
dtype: int64
1  col_2  10    4
20    5
30    6
40    7
dtype: int64

And I get the same result using pd.DataFrame.from_dict(gen) or pd.DataFrame.from_records(gen).
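
The odd result appears to come from the constructor treating each (key, value) tuple the generator yields as one row, rather than as one column. A minimal sketch that checks this on the toy data; the variable name rows is only for illustration:

```python
import pandas as pd

dat = {'col_1': pd.Series([3, 2, 1, 0], index=range(10, 50, 10)),
       'col_2': pd.Series([4, 5, 6, 7], index=range(10, 50, 10))}

gen = ((col, srs) for col, srs in dat.items())  # generator object

# Each tuple becomes one row: the label lands in column 0 and the
# whole Series is stored as a single object in column 1.
rows = pd.DataFrame(gen)

print(rows.shape)          # one row per tuple, one column per tuple element
print(list(rows.columns))  # default integer column labels
print(type(rows.iloc[0, 1]))
```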

So my questions are: Can I produce my original DataFrame by passing the generator gen to Pandas? And by doing so, would I reduce my memory usage (assuming a large data set, not the trivial toy example shown here)?

Thanks!

Hamzah Al-Qadasi · Accepted Answer · 2022-03-12 12:24:17Z

1

You can build your dataframe from the generator this way without the need for a conversion to a dictionary:

df = pd.DataFrame()

for col, srs in gen:
    df[col] = srs
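
As a self-contained sketch, the loop can be checked against the dict-based construction from the question. The comparison through to_dict is just one convenient way to check equality here. With very many columns, pandas may emit a PerformanceWarning about fragmentation when columns are inserted one at a time, so treat this as a sketch rather than a recommendation for wide frames.

```python
import pandas as pd

dat = {'col_1': pd.Series([3, 2, 1, 0], index=range(10, 50, 10)),
       'col_2': pd.Series([4, 5, 6, 7], index=range(10, 50, 10))}

gen = ((col, srs) for col, srs in dat.items())  # generator object

# Start from an empty frame and add one column per generator item.
# The first assignment sets the index; later ones align on it.
df = pd.DataFrame()
for col, srs in gen:
    df[col] = srs

# Compare with the frame built directly from the dict.
expected = pd.DataFrame(dat)
print(df.to_dict("list") == expected.to_dict("list"))
print(list(df.index) == list(expected.index))
```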

To compare the memory allocated by the two methods, I ran each one in a separate notebook (or in the same notebook after restarting it and clearing the outputs each time) so that one measurement would not affect the other.

First method: converting a generator to dictionary and then dataframe:

import pandas as pd
import psutil
import os


dat = {'col_1': pd.Series([3, 2, 1, 0], index=range(10,50,10)),
       'col_2': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_3': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_4': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_5': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_6': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_7': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_8': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_9': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_10': pd.Series([4, 5, 6, 7], index=range(10,50,10))}


gen = ((col, srs) for col, srs in dat.items())

df = pd.DataFrame(dict(gen))

print("Memory usage is {} MB".format(psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)))

#output

Memory usage is 78.55859375 MB

Second method: creating a dataframe by only iterating through the generator:

import pandas as pd
import psutil
import os


dat = {'col_1': pd.Series([3, 2, 1, 0], index=range(10,50,10)),
       'col_2': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_3': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_4': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_5': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_6': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_7': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_8': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_9': pd.Series([4, 5, 6, 7], index=range(10,50,10)),
       'col_10': pd.Series([4, 5, 6, 7], index=range(10,50,10))}


gen = ((col, srs) for col, srs in dat.items())

df = pd.DataFrame()

for col, srs in gen:
    df[col] = srs

print("Memory usage is {} MB".format(psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)))

#output
Memory usage is 78.69921875 MB

In conclusion, there is only a very small difference in memory usage between the two methods, but converting the generator to a dictionary is more efficient in terms of time.
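
The RSS figures above cover the whole process, including the interpreter and the pandas import, so a ten-column, four-row frame contributes very little to them. A rough sketch of an alternative comparison follows, using the standard library's tracemalloc on larger synthetic columns. The helper names make_gen, peak_bytes, via_dict and via_loop are hypothetical, and the sizes are arbitrary assumptions. No numbers are quoted here because they depend on the pandas and NumPy versions.

```python
import tracemalloc

import numpy as np
import pandas as pd


def make_gen(n_cols=20, n_rows=50_000):
    """Yield (label, Series) pairs lazily, one column at a time."""
    return ((f"col_{i}", pd.Series(np.arange(n_rows), index=range(n_rows)))
            for i in range(n_cols))


def via_dict(gen):
    # Materialize every Series in a dict, then build the frame.
    return pd.DataFrame(dict(gen))


def via_loop(gen):
    # Add one column per generator item to an initially empty frame.
    df = pd.DataFrame()
    for col, srs in gen:
        df[col] = srs
    return df


def peak_bytes(build):
    """Return the peak traced allocation while build() runs, plus the shape."""
    tracemalloc.start()
    df = build(make_gen())
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak, df.shape


peak_dict, shape_dict = peak_bytes(via_dict)
peak_loop, shape_loop = peak_bytes(via_loop)

print("dict: {:.1f} MB peak".format(peak_dict / (1024 * 1024)))
print("loop: {:.1f} MB peak".format(peak_loop / (1024 * 1024)))
```

Running each builder in a fresh process, as was done for the RSS measurements, would keep the two measurements from influencing each other.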

edited Mar 12, 2022 at 12:24

answered Feb 28, 2022 at 19:30


4 Comments

Hamzah Al-Qadasi Over a year ago

dat.items() is the generator object as he indicated in his question.

Hamzah Al-Qadasi Over a year ago

Yes, it is a dictionary. Thanks

AtlasEarth Over a year ago

Thanks - mine is a toy example, in actuality I do have a gen generator, but I don't have the underlying dat data. However I could adapt your answer to use the gen as @d.b suggests...

Hamzah Al-Qadasi Over a year ago

I edited the answer to work with the generator and added the allocated memory for both methods.

JANO · Accepted Answer · 2022-02-28 19:57:12Z

0

You can simply convert the generator to a dict and then create a dataframe from it:

# Create generator
dat = {'col_1': pd.Series([3, 2, 11, 0], index=range(10,50,10)),
       'col_2': pd.Series([4, 5, 6, 7], index=range(10,50,10))}

gen = ((col, srs) for col, srs in dat.items())


# Create df
pd.DataFrame.from_dict(dict(gen))

Output:

    col_1  col_2
10      3      4
20      2      5
30     11      6
40      0      7

edited Feb 28, 2022 at 19:57

answered Feb 28, 2022 at 19:50


1 Comment

AtlasEarth Over a year ago

Thanks, as I mentioned in my question I was looking to use the generator directly rather than first converting it to a dict.
