
I have the following experiment data:

experiment1_device = 'Dev2'
experiment1_values = [1, 2, 3, 4]
experiment1_timestamps = [1, 2, 3, 6]
experiment1_task = 'Oil Level'
experiment1_user = 'Sean'

experiment2_device = 'Dev1'
experiment2_values = [5, 6, 7, 8, 9]
experiment2_timestamps = [1, 2, 3, 4, 6]
experiment2_task = 'Ventilation'
experiment2_user = 'Martin'

experiment3_device = 'Dev1'
experiment3_values = [10, 11, 12, 13, 14]
experiment3_timestamps = [1, 2, 3, 4, 6]
experiment3_task = 'Ventilation'
experiment3_user = 'Sean'

Each experiment consists of:

  1. A user who conducted the experiment
  2. The task the user had to do
  3. The device the user was using
  4. A series of timestamps
  5. A series of data values, each read at the corresponding timestamp, so len(experimentx_values) == len(experimentx_timestamps)

The data is currently stored in separate variables, as shown above, but I could change this format if needed, for example by putting everything in a dict.

The expected output format I would like to achieve is the following: [image: expected output table]

Timestamp, User and Task should be a MultiIndex, whereas the device is supposed to be the column name, and empty (grey) cells should just contain NaN.

I tried multiple approaches with pd.DataFrame.from_records but couldn't get the desired output format.

Any help is highly appreciated!

2 Answers

Since the data are stored in all of those different variables, this will take a lot of writing. Ideally, you would store the results of each experiment in a DataFrame (directly from whatever outputs those values) and hold all of those DataFrames in a list, which cuts down on the variables you have floating around.

Given your variables, construct DataFrames as follows:

import pandas as pd

# Scalars (user, task) are broadcast to every row; the device name becomes the value column.
df1 = pd.DataFrame({'Timestamp': experiment1_timestamps,
                    'User': experiment1_user,
                    'Task': experiment1_task,
                    experiment1_device: experiment1_values})

df2 = pd.DataFrame({'Timestamp': experiment2_timestamps,
                    'User': experiment2_user,
                    'Task': experiment2_task,
                    experiment2_device: experiment2_values})

df3 = pd.DataFrame({'Timestamp': experiment3_timestamps,
                    'User': experiment3_user,
                    'Task': experiment3_task,
                    experiment3_device: experiment3_values})
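
The three constructors above differ only in their inputs, so they could be collapsed into a loop. A minimal sketch, assuming the experiments are kept as a list of dicts rather than separate variables; the `experiments` list, its keys, and the `experiment_to_frame` helper are hypothetical names, not part of the original data format:

```python
import pandas as pd

# Hypothetical container: one dict per experiment instead of separate variables.
experiments = [
    {"device": "Dev2", "user": "Sean", "task": "Oil Level",
     "timestamps": [1, 2, 3, 6], "values": [1, 2, 3, 4]},
    {"device": "Dev1", "user": "Martin", "task": "Ventilation",
     "timestamps": [1, 2, 3, 4, 6], "values": [5, 6, 7, 8, 9]},
    {"device": "Dev1", "user": "Sean", "task": "Ventilation",
     "timestamps": [1, 2, 3, 4, 6], "values": [10, 11, 12, 13, 14]},
]


def experiment_to_frame(exp):
    """Build one DataFrame per experiment; the device name becomes the value column."""
    return pd.DataFrame({
        "Timestamp": exp["timestamps"],
        "User": exp["user"],    # scalar, broadcast to every row
        "Task": exp["task"],    # scalar, broadcast to every row
        exp["device"]: exp["values"],
    })


# One DataFrame per experiment, held in a list.
frames = [experiment_to_frame(exp) for exp in experiments]
```

The resulting `frames` list can be passed to pd.concat in place of [df1, df2, df3].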

Now join them together into a single DataFrame, setting the index to the columns you want. It looks like your output wants the Cartesian product of all index values, so we'll reindex to get the fully NaN rows back in:

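# Stack the experiments and move the three key columns into a MultiIndex.
df = pd.concat([df1, df2, df3]).set_index(['Timestamp', 'User', 'Task']).sort_index()

# Build every (Timestamp, User, Task) combination from the unique values of each level.
idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique()
                                  for i in range(df.index.nlevels)])
df = df.reindex(idx)  # combinations without data become all-NaN rows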

                              Dev2  Dev1
Timestamp User   Task                   
1         Martin Ventilation   NaN   5.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  10.0
                 Oil Level     1.0   NaN
2         Martin Ventilation   NaN   6.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  11.0
                 Oil Level     2.0   NaN
3         Martin Ventilation   NaN   7.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  12.0
                 Oil Level     3.0   NaN
4         Martin Ventilation   NaN   8.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  13.0
                 Oil Level     NaN   NaN
6         Martin Ventilation   NaN   9.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  14.0
                 Oil Level     4.0   NaN

2 Comments

Thank you for your fast reply with a solution. Somehow I seem to be doing something wrong, as I'm getting the error "ValueError: cannot handle a non-unique multi-index!". In my real data, every user performs basically every task with each device. As a side question, is there a way to reduce the number of NaN values?
@user7638008 if you just need one column, as it looks like, you could simply name the column 'Dev' when you make each DataFrame, so it would be 'Dev': experiment1_values in the dict that builds the DataFrame.
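
The ValueError in the first comment most likely comes from reindex, which needs unique index entries, while the real data repeats (Timestamp, User, Task) across devices. A possible sketch, assuming the data can be kept in long format with a separate device column; the column names `Device` and `Value` and the sample numbers are made up for illustration. It uses pivot_table, which tolerates duplicates and only creates rows for combinations that actually occur, so the all-NaN rows disappear:

```python
import pandas as pd

# Made-up long-format data: Sean runs 'Ventilation' on both devices,
# so (Timestamp, User, Task) alone does not identify a row.
long_df = pd.DataFrame({
    "Timestamp": [1, 2, 1, 2, 1, 2],
    "User": ["Sean", "Sean", "Sean", "Sean", "Martin", "Martin"],
    "Task": ["Ventilation"] * 6,
    "Device": ["Dev1", "Dev1", "Dev2", "Dev2", "Dev1", "Dev1"],
    "Value": [10, 11, 20, 21, 5, 6],
})

# pivot_table spreads the devices over the columns. Repeated
# (index, column) pairs would be aggregated (mean by default), and only
# index combinations present in the data become rows.
wide = long_df.pivot_table(index=["Timestamp", "User", "Task"],
                           columns="Device", values="Value")
```

Cells stay NaN only where a user has no reading for a device at that timestamp, as for Martin and Dev2 here.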

Maybe there is a simpler way, but you can create sub-dataframes:

import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, 3, 4],[1, 2, 3, 6]]).T
df1.columns = ['values','timestamps']
df1['dev'] = 'Dev2'
df1['user'] = 'Sean'
df1['task'] = 'Oil Level'

df2 = pd.DataFrame(data=[[5, 6, 7, 8, 9],[1, 2, 3, 4, 6]]).T
df2.columns = ['values','timestamps']
df2['dev'] = 'Dev1'
df2['user'] = 'Martin'
df2['task'] = 'Ventilation'

df3 = pd.DataFrame(data=[[10, 11, 12, 13, 14],[1, 2, 3, 4, 6]]).T
df3.columns = ['values','timestamps']
df3['dev'] = 'Dev1'
df3['user'] = 'Sean'
df3['task'] = 'Ventilation'

Merge them into one:

df = df1.merge(df2, on=['timestamps','dev','user','task','values'], how='outer')
df = df.merge(df3, on=['timestamps','dev','user','task','values'], how='outer')

And create the aggregation:

# One row per (timestamp, user, task, dev); sum() collapses any duplicates.
piv = df.groupby(['timestamps','user','task','dev']).sum()
piv = piv.unstack()     # move 'dev' into the columns (Dev1, Dev2)
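
Because the outer merges above join on every column, they amount to stacking the rows when all rows are distinct, as they are here. A minimal sketch of the same pipeline with pd.concat, under that assumption; `make_frame` is a hypothetical helper introduced only for this example:

```python
import pandas as pd


def make_frame(values, timestamps, dev, user, task):
    """Hypothetical helper: one long-format DataFrame per experiment."""
    return pd.DataFrame({"values": values, "timestamps": timestamps,
                         "dev": dev, "user": user, "task": task})


# Stack the three experiments instead of chaining outer merges.
df = pd.concat([
    make_frame([1, 2, 3, 4], [1, 2, 3, 6], "Dev2", "Sean", "Oil Level"),
    make_frame([5, 6, 7, 8, 9], [1, 2, 3, 4, 6], "Dev1", "Martin", "Ventilation"),
    make_frame([10, 11, 12, 13, 14], [1, 2, 3, 4, 6], "Dev1", "Sean", "Ventilation"),
], ignore_index=True)

# Selecting the 'values' column before unstacking keeps the column labels
# flat (Dev1, Dev2) instead of a two-level ('values', dev) header.
piv = (df.groupby(["timestamps", "user", "task", "dev"])["values"]
         .sum()
         .unstack("dev"))
```

Only combinations that occur in the data become rows, so this variant has no rows that are NaN in every device column.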
