I have a dictionary of nested columns, where each column is itself a dictionary keyed by the index. When I try to convert it to a polars dataframe, it gets the column names and the values right, but each column has just one element: a struct holding all of the column's values, rather than being "expanded" into a series.

For example, say I have:

d = {'col1': {'0':'A','1':'B','2':'C'}, 'col2': {'0':1,'1':2,'2':3}}

Then, when I call pl.DataFrame(d) or pl.from_dict(d), I get:

col1           col2
---            ---
struct[3]      struct[3]
{"A","B","C"}  {1,2,3}

instead of a regular dataframe.

Any idea how to fix this?

Thanks in advance!

1 Answer

There's no particularly straightforward way to do that. You essentially have to take each column one at a time, unnest and unpivot it, and then join the columns back together.

Setup

import polars as pl

d = {'col1': {'0':'A','1':'B','2':'C'}, 'col2': {'0':1,'1':2,'2':3}}
df = pl.DataFrame(d)

To (what I think is the) desired output


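df_final = None
for col in df.columns:
    # Expand the single struct value into one column per inner key ("0", "1", "2").
    df_new = df[col].to_frame().unnest(col)
    # Turn those key columns into rows: "index" holds the key, `col` holds the value.
    df_new = df_new.unpivot(variable_name="index", value_name=col)
    if df_final is None:
        df_final = df_new
    else:
        # A full join keeps keys that appear in only some of the columns.
        df_final = df_final.join(df_new, on="index", how="full", coalesce=True)
df_final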
shape: (3, 3)
┌───────┬──────┬──────┐
│ index ┆ col1 ┆ col2 │
│ ---   ┆ ---  ┆ ---  │
│ str   ┆ str  ┆ i64  │
╞═══════╪══════╪══════╡
│ 0     ┆ A    ┆ 1    │
│ 1     ┆ B    ┆ 2    │
│ 2     ┆ C    ┆ 3    │
└───────┴──────┴──────┘

Simplified if index keys are guaranteed to be balanced

If you can be sure that the keys of your nested columns will always be uniform and sorted, you can use map_batches instead of a for loop with joins.

df.select(pl.all().map_batches(lambda s: (
    s.to_frame().unnest(s.name).unpivot()['value']
)))
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 1    │
│ B    ┆ 2    │
│ C    ┆ 3    │
└──────┴──────┘

Comments

+1. If uniform and sorted, OP can use a dict comprehension: pl.DataFrame({k: v.values() for k, v in d.items()}).
Thanks! So, there is no way to unnest them directly when creating the df? :/
And yes, sorry, the internal keys in every dict-col will always be the same and sorted (it's basically a df packed and passed through a flask response)
@Ghost: if you have control over that flow, you might just want to adjust the response, because it sounds like d is the result of df.to_dict(). If you change that to df.to_dict('list') (or df.to_dict('records')), you can pass the result to pl.DataFrame without problems.
If you're in Flask, try using get_data to get the raw JSON and letting polars parse it with pl.read_json(BytesIO(resp.get_data())) instead of using the native Python JSON parser. It should be faster.
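If the payload can't be changed at the source, the reshaping suggested in the comments can also be done in plain Python before polars sees the data. The following is a minimal sketch, not a built-in polars feature: the helper name `nested_to_columns` and its `index_name` parameter are hypothetical, and it assumes each inner dict maps index keys to scalar values. Unlike the one-line dict comprehension, it aligns values by key and fills gaps with None, so it doesn't rely on the keys being uniform and sorted.

```python
def nested_to_columns(d, index_name="index"):
    """Reshape {column: {key: value}} into {column: [values]} aligned by key.

    Hypothetical helper for illustration; missing keys become None.
    """
    # Collect every inner key once, in first-seen order (dicts keep insertion order).
    keys = list(dict.fromkeys(k for col in d.values() for k in col))
    out = {index_name: keys}
    for name, col in d.items():
        # Look each key up so values line up even if a column lacks some keys.
        out[name] = [col.get(k) for k in keys]
    return out


d = {'col1': {'0': 'A', '1': 'B', '2': 'C'}, 'col2': {'0': 1, '1': 2, '2': 3}}
columns = nested_to_columns(d)
print(columns)
```

The resulting dict of equal-length lists is the shape pl.DataFrame accepts directly, so pl.DataFrame(columns) should give one row per index key. If the index column isn't needed, it can be dropped from the dict before building the frame.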