I have an operation that outputs a series of tuples like
[('a',1.0), ('c', 2.5)]. It does this for a lot of inputs, so the outputs would look like
[('a',1.0), ('c', 2.5)]
[('b',1.5), ('c', 2.5)]
[('a', 5.0), ('b',1.5), ('c', 2.75)]
which should output a dataframe that looks like
>>> df
     a    b     c
0  1.0  NaN  2.50
1  NaN  1.5  2.50
2  5.0  1.5  2.75
However, the column names are not known beforehand, so at some point during data generation I could end up with some ('z', 12.0).
I think the simplest way would be to create a dataframe for each row and concatenate the dataframes:
df_list = []
for row in rows:
    tuple_result = f(row)  # e.g. [('a', 1.0), ('c', 2.5)]
    df_list.append(pd.DataFrame([dict(tuple_result)]))  # single-row dataframe
df = pd.concat(df_list, axis=0, ignore_index=True)
and this will take care of all the NaNs and the column names. However, I will be doing this for a large number of rows, and I think this approach will be unnecessarily memory-intensive, since every row gets its own DataFrame object.
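For reference, here is a minimal runnable version of that sketch; since `f` and `rows` are not shown, they are stood in by a hardcoded list of example outputs:

```python
import pandas as pd

# stand-in for calling f(row) over all rows
outputs = [
    [('a', 1.0), ('c', 2.5)],
    [('b', 1.5), ('c', 2.5)],
    [('a', 5.0), ('b', 1.5), ('c', 2.75)],
]

# one single-row DataFrame per output, then concatenate
df_list = [pd.DataFrame([dict(tuples)]) for tuples in outputs]
df = pd.concat(df_list, axis=0, ignore_index=True)

# concat orders columns by first appearance (a, c, b here),
# so sort them if alphabetical order matters
df = df[sorted(df.columns)]
print(df)
```

This reproduces the target frame above, missing entries filled with NaN.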
Is there a better way to do this?