How to properly create a pandas dataframe with the given data?

Question

I have the following experiment data:

experiment1_device = 'Dev2'
experiment1_values = [1, 2, 3, 4]
experiment1_timestamps = [1, 2, 3, 6]
experiment1_task = 'Oil Level'
experiment1_user = 'Sean'

experiment2_device = 'Dev1'
experiment2_values = [5, 6, 7, 8, 9]
experiment2_timestamps = [1, 2, 3, 4, 6]
experiment2_task = 'Ventilation'
experiment2_user = 'Martin'

experiment3_device = 'Dev1'
experiment3_values = [10, 11, 12, 13, 14]
experiment3_timestamps = [1, 2, 3, 4, 6]
experiment3_task = 'Ventilation'
experiment3_user = 'Sean'

Each experiment consists of:

A user who conducted the experiment
The task the user had to do
The device the user was using
A series of timestamps ...
and a series of data values, each read at the corresponding timestamp, so len(experimentx_values) == len(experimentx_timestamps)

The data is currently given in the above format, i.e. as separate variables, but I could change this format if needed, for example by putting everything in a dict.

In the expected output, Timestamp, User and Task should form a MultiIndex, the device should be the column name, and empty cells should just contain NaN.

I tried multiple approaches with pd.DataFrame.from_records but couldn't get the desired output format.

Any help is highly appreciated!

ALollz · Accepted Answer · 2020-12-08 15:19:20Z

Since the data are stored in all of those different variables, it will be a lot of writing. If you can, store the results of each experiment in a DataFrame (built directly from whatever outputs those values) and hold those DataFrames in a list, which cuts down on the number of variables you have floating around.
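If the output format can be changed, one possible layout is a list of dicts, one per experiment. This is only a sketch: the `experiments` list, its key names and the `to_frame` helper are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical container: one dict per experiment (key names are assumptions).
experiments = [
    {'device': 'Dev2', 'user': 'Sean', 'task': 'Oil Level',
     'timestamps': [1, 2, 3, 6], 'values': [1, 2, 3, 4]},
    {'device': 'Dev1', 'user': 'Martin', 'task': 'Ventilation',
     'timestamps': [1, 2, 3, 4, 6], 'values': [5, 6, 7, 8, 9]},
    {'device': 'Dev1', 'user': 'Sean', 'task': 'Ventilation',
     'timestamps': [1, 2, 3, 4, 6], 'values': [10, 11, 12, 13, 14]},
]


def to_frame(exp):
    """Build one DataFrame per experiment; scalar entries are repeated on every row."""
    return pd.DataFrame({'Timestamp': exp['timestamps'],
                         'User': exp['user'],
                         'Task': exp['task'],
                         exp['device']: exp['values']})


# A single list replaces the df1, df2, df3 style of variables.
frames = [to_frame(exp) for exp in experiments]
```

The resulting `frames` list can be passed straight to `pd.concat`.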

Given your variables, construct DataFrames as follows:

import pandas as pd

# Scalars (user, task) are repeated on every row; the device name becomes the value column.
df1 = pd.DataFrame({'Timestamp': experiment1_timestamps,
                    'User': experiment1_user,
                    'Task': experiment1_task,
                    experiment1_device: experiment1_values})

df2 = pd.DataFrame({'Timestamp': experiment2_timestamps,
                    'User': experiment2_user,
                    'Task': experiment2_task,
                    experiment2_device: experiment2_values})

df3 = pd.DataFrame({'Timestamp': experiment3_timestamps,
                    'User': experiment3_user,
                    'Task': experiment3_task,
                    experiment3_device: experiment3_values})

Now join them together into a single DataFrame, setting the index to the columns you want. It looks like your output wants the cartesian product of all possibilities, so we'll reindex to get the fully NaN rows back in:

df = pd.concat([df1, df2, df3]).set_index(['Timestamp', 'User', 'Task']).sort_index()

idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique() 
                                  for i in range(df.index.nlevels)])
df = df.reindex(idx)

                              Dev2  Dev1
Timestamp User   Task                   
1         Martin Ventilation   NaN   5.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  10.0
                 Oil Level     1.0   NaN
2         Martin Ventilation   NaN   6.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  11.0
                 Oil Level     2.0   NaN
3         Martin Ventilation   NaN   7.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  12.0
                 Oil Level     3.0   NaN
4         Martin Ventilation   NaN   8.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  13.0
                 Oil Level     NaN   NaN
6         Martin Ventilation   NaN   9.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  14.0
                 Oil Level     4.0   NaN
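The reindexing step can be seen in isolation on a smaller, made-up frame. This is a minimal sketch: the `small` data is invented, and it uses `index.levels` instead of `get_level_values(i).unique()`. The two give the same values here, but `levels` can retain unused entries after filtering, so treat that substitution as an assumption.

```python
import pandas as pd

# Invented two-row frame with a two-level index.
small = pd.DataFrame({'Timestamp': [1, 2],
                      'User': ['Sean', 'Martin'],
                      'Dev1': [10, 5]}).set_index(['Timestamp', 'User'])

# Cartesian product of the level values: 2 timestamps x 2 users = 4 keys.
full_idx = pd.MultiIndex.from_product(small.index.levels,
                                      names=small.index.names)

# Keys missing from the original frame come back as all-NaN rows.
full = small.reindex(full_idx)
```

Note that `reindex` requires the existing index to be unique, which holds for this toy frame.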


2 Comments

user7638008 Over a year ago

Thank you for your fast reply with a solution. Somehow I seem to be doing something wrong as I'm getting the error "ValueError: cannot handle a non-unique multi-index!". As in my real data, every user is basically performing every task with each device. As a side question, is there somehow the possibility to crunch the NaN values a bit to have fewer?

ALollz Over a year ago

@user7638008 If you just need one column, which is what it looks like, you could name the column 'Dev' when you make each DataFrame, i.e. use 'Dev': experiment1_values in the dict that builds the DataFrame.
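Another possible way around the non-unique index, not one proposed in the comments, is to keep the device as an ordinary column and make it part of the key before spreading it into columns. The sketch below uses invented data, and the `Device` and `Value` column names are assumptions.

```python
import pandas as pd

# Invented data: the same user and task measured on two devices, so
# (Timestamp, User, Task) alone no longer identifies a row.
long_df = pd.DataFrame({
    'Timestamp': [1, 2, 1, 2],
    'User': ['Sean'] * 4,
    'Task': ['Ventilation'] * 4,
    'Device': ['Dev1', 'Dev1', 'Dev2', 'Dev2'],
    'Value': [10, 11, 20, 21],
})

# With Device in the index every key is unique; unstacking it then
# produces one column per device.
wide = (long_df.set_index(['Timestamp', 'User', 'Task', 'Device'])['Value']
               .unstack('Device'))
```

Because there is no reindex over the full Cartesian product here, only key combinations that occur in the data appear as rows, which also reduces the number of all-NaN rows.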

Oliver Prislan · Accepted Answer · 2020-12-08 15:38:37Z

Maybe there is a simpler way, but you can create sub-dataframes:

import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, 3, 4],[1, 2, 3, 6]]).T
df1.columns = ['values','timestamps']
df1['dev'] = 'Dev2'
df1['user'] = 'Sean'
df1['task'] = 'Oil Level'

df2 = pd.DataFrame(data=[[5, 6, 7, 8, 9],[1, 2, 3, 4, 6]]).T
df2.columns = ['values','timestamps']
df2['dev'] = 'Dev1'
df2['user'] = 'Martin'
df2['task'] = 'Ventilation'

df3 = pd.DataFrame(data=[[10, 11, 12, 13, 14],[1, 2, 3, 4, 6]]).T
df3.columns = ['values','timestamps']
df3['dev'] = 'Dev1'
df3['user'] = 'Sean'
df3['task'] = 'Ventilation'

merge them into one:

df = df1.merge(df2, on=['timestamps','dev','user','task','values'], how='outer')
df = df.merge(df3, on=['timestamps','dev','user','task','values'], how='outer')

and create the aggregation:

piv = df.groupby(['timestamps','user','task','dev']).sum()
piv = piv.unstack()     # move 'dev' into the columns (Dev1, Dev2)
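As written, the unstacked result has two-level column labels such as ('values', 'Dev1'). A small variation, sketched here on invented data with the same column names, selects the 'values' column before unstacking so the device names become flat column labels.

```python
import pandas as pd

# Invented long-format frame using the column names from this answer.
df = pd.DataFrame({
    'values': [1, 2, 5, 6],
    'timestamps': [1, 2, 1, 2],
    'dev': ['Dev2', 'Dev2', 'Dev1', 'Dev1'],
    'user': ['Sean', 'Sean', 'Martin', 'Martin'],
    'task': ['Oil Level', 'Oil Level', 'Ventilation', 'Ventilation'],
})

# Selecting 'values' first yields a Series, so unstack('dev') gives
# plain 'Dev1' / 'Dev2' column labels.
piv = (df.groupby(['timestamps', 'user', 'task', 'dev'])['values']
         .sum()
         .unstack('dev'))
```

Combinations that were never measured on a given device show up as NaN in that device's column.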
