Create dataframe from array

Question

I have data in the following form:

[('06/03/2018 17.35.18.211', 'param_a', 1),
 ('06/03/2018 17.35.19.211', 'param_b', 1),
 ('06/03/2018 17.35.20.211', 'param_c', 1),
 ('06/03/2018 17.35.21.211', 'param_a', 2),
 ('06/03/2018 17.35.22.211', 'param_b', 2),
 ('06/03/2018 17.35.22.211', 'param_c', 2)]

What would be the best way to create a dataframe out of it which looks like this:

                 timestamp   param_a   param_b   param_c
0  06/03/2018 17.35.18.211       1.0       NaN       NaN
1  06/03/2018 17.35.19.211       NaN       1.0       NaN
2  06/03/2018 17.35.20.211       NaN       NaN       1.0
3  06/03/2018 17.35.21.211       2.0       NaN       NaN
4  06/03/2018 17.35.22.211       NaN       2.0       2.0

jezrael · Accepted Answer · 2018-03-09 10:58:12Z

1

Use the DataFrame constructor with pivot, rename_axis and reset_index:

import pandas as pd

arr = [('06/03/2018 17.35.18.211', 'param_a', 1),
 ('06/03/2018 17.35.19.211', 'param_b', 1),
 ('06/03/2018 17.35.20.211', 'param_c', 1),
 ('06/03/2018 17.35.21.211', 'param_a', 2),
 ('06/03/2018 17.35.22.211', 'param_b', 2),
 ('06/03/2018 17.35.23.211', 'param_c', 2)]

df = pd.DataFrame(arr, columns=['timestamp','b','c'])
# pivot takes keyword-only arguments in pandas 2.0 and later
df = df.pivot(index='timestamp', columns='b', values='c').rename_axis(None, axis=1).reset_index()
print(df)
                 timestamp  param_a  param_b  param_c
0  06/03/2018 17.35.18.211      1.0      NaN      NaN
1  06/03/2018 17.35.19.211      NaN      1.0      NaN
2  06/03/2018 17.35.20.211      NaN      NaN      1.0
3  06/03/2018 17.35.21.211      2.0      NaN      NaN
4  06/03/2018 17.35.22.211      NaN      2.0      NaN
5  06/03/2018 17.35.23.211      NaN      NaN      2.0

But if the same pair of first and second values (timestamp and parameter name) occurs more than once, aggregation is necessary.
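To illustrate the duplicate case, here is a minimal sketch. The data is hypothetical (the first two rows deliberately share both timestamp and parameter name), and `'mean'` is only one possible aggregation; pick whichever suits your data.

```python
import pandas as pd

# Hypothetical data: the first two rows share timestamp AND parameter name.
arr = [('06/03/2018 17.35.18.211', 'param_a', 1),
       ('06/03/2018 17.35.18.211', 'param_a', 3),
       ('06/03/2018 17.35.19.211', 'param_b', 1)]

df = pd.DataFrame(arr, columns=['timestamp', 'b', 'c'])

# pivot cannot decide which of the duplicate values to keep, so it raises ValueError.
try:
    df.pivot(index='timestamp', columns='b', values='c')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead; 'mean' is an assumed choice.
wide = (df.pivot_table(index='timestamp', columns='b', values='c', aggfunc='mean')
          .rename_axis(None, axis=1)
          .reset_index())
print(wide)
```

Here the two param_a values for the shared timestamp collapse into a single cell holding their mean.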

edited Mar 9, 2018 at 10:58

answered Mar 9, 2018 at 10:51

jezrael


4 Comments

Kobe-Wan Kenobi Over a year ago

Hi, thanks for the answer. I've edited the question to try to make it more concrete about duplicates. So timestamp can be duplicated, but this works even in that case as expected. I'm not sure what other kind of duplication you had in mind. I thought that there might be a way to specify DataFrame constructor to avoid pivot, but this is fine as well. Thanks.

jezrael Over a year ago

@Marko - I mean data like this:

arr = [('06/03/2018 17.35.18.211', 'param_a', 1),
 ('06/03/2018 17.35.19.211', 'param_a', 1),
 ('06/03/2018 17.35.20.211', 'param_c', 1),
 ('06/03/2018 17.35.21.211', 'param_a', 2),
 ('06/03/2018 17.35.22.211', 'param_b', 2),
 ('06/03/2018 17.35.23.211', 'param_c', 2)]

- There are duplicates in the first and second rows ('06/03/2018 17.35.18.211', 'param_a'), so pivot cannot be used because it raises an error. In that case it is possible to use pivot_table.

Kobe-Wan Kenobi Over a year ago

Aha, I get it, you mean in case index and column name are same in two rows. Ok, thanks, that's not the case at the moment. Btw, in the example in your comment above, it should be ('06/03/2018 17.35.18.211', 'param_a', 1), ('06/03/2018 17.35.18.211', 'param_a', 1), so same timestamp and same column name. That's what you had in mind, right?

jezrael Over a year ago

@Marko - exactly. You are right; in that case it is necessary to use groupby + an aggregate function + unstack, or pivot_table.
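The groupby + aggregate + unstack route mentioned in this comment could look roughly like the sketch below. The rows are a shortened, hypothetical version of the data discussed above, with the first two rows sharing both timestamp and parameter name, and `first()` is an assumed aggregation (sum, mean or max would work the same way).

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 have the same timestamp and the same parameter name.
arr = [('06/03/2018 17.35.18.211', 'param_a', 1),
       ('06/03/2018 17.35.18.211', 'param_a', 1),
       ('06/03/2018 17.35.20.211', 'param_c', 1),
       ('06/03/2018 17.35.21.211', 'param_a', 2)]

df = pd.DataFrame(arr, columns=['timestamp', 'b', 'c'])

wide = (df.groupby(['timestamp', 'b'])['c']
          .first()                     # assumed aggregation: keep the first duplicate
          .unstack()                   # parameter names become columns
          .rename_axis(None, axis=1)   # drop the leftover 'b' label on the columns
          .reset_index())
print(wide)
```

Only parameters that actually occur in the data become columns, so this shortened example has no param_b column.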

Tai · Accepted Answer · 2018-03-09 11:18:31Z

You can also try this. (Note that get_dummies can be slow)

import pandas as pd

arr = [('06/03/2018 17.35.18.211', 'param_a', 1),
 ('06/03/2018 17.35.19.211', 'param_b', 1),
 ('06/03/2018 17.35.20.211', 'param_c', 1),
 ('06/03/2018 17.35.21.211', 'param_a', 2),
 ('06/03/2018 17.35.22.211', 'param_b', 2),
 ('06/03/2018 17.35.23.211', 'param_c', 2)]
df = pd.DataFrame(arr)
pd.concat([df[0], df[2].values[:,None] * df[1].str.get_dummies()], axis=1)

                         0  param_a  param_b  param_c
0  06/03/2018 17.35.18.211        1        0        0
1  06/03/2018 17.35.19.211        0        1        0
2  06/03/2018 17.35.20.211        0        0        1
3  06/03/2018 17.35.21.211        2        0        0
4  06/03/2018 17.35.22.211        0        2        0
5  06/03/2018 17.35.23.211        0        0        2

Or

v = df[1].str.get_dummies()
pd.concat([df[0], df[2].values[:,None] * v.where(v>0)], axis=1)


                         0  param_a  param_b  param_c
0  06/03/2018 17.35.18.211      1.0      NaN      NaN
1  06/03/2018 17.35.19.211      NaN      1.0      NaN
2  06/03/2018 17.35.20.211      NaN      NaN      1.0
3  06/03/2018 17.35.21.211      2.0      NaN      NaN
4  06/03/2018 17.35.22.211      NaN      2.0      NaN
5  06/03/2018 17.35.23.211      NaN      NaN      2.0

Vijith Vijayan · Accepted Answer · 2018-03-09 15:24:15Z

0

You are trying to create a dataframe that has 4 columns from 3-column data. If you want 4 columns, you have to reformat the data.

answered Mar 9, 2018 at 15:24

Vijith Vijayan

