How to combine multiple rows into a single row with many columns in pandas using an id (clustering multiple records with same id into one record)

Question

Situation:

all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns. There are multiple rows with the same ID inside the column machine_ID with different values in other columns.

Columns:

'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'

2. clustering code

I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas

3. Output

Output displayed using: with option_context as it allows to better visualise the content

My Aim:

I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.

Something like this.

Don't put your code/data in images. Type the code out in a code block and add a means of reproducing it (code that creates a small sample of your data) or at least a table of your data typed out (e.g. markdown format) — Michael S.
– Michael S., Commented Aug 12, 2022 at 21:57
What happened when you ran your code, what was the expected output, what do you want to do better? — Michael S.
– Michael S., Commented Aug 12, 2022 at 22:09
What operation do you want to perform on multiple values of the same column (with the same machine_ID) to combine them into one value? — Vladimir Fokow
– Vladimir Fokow, Commented Aug 12, 2022 at 22:11
Dear @MichaelS. Thank you for your insights on how to post questions, I would certainly keep this in mind. When I run my code, I get the output: "3. Output" of my question (I have attached a picture) — Ishaan_B
– Ishaan_B, Commented Aug 12, 2022 at 22:19
Dear @VladimirFokow, I want to generate a single row using machine_ID i.e. for eg. If I have multiple rows with the machine_ID "8910" with different values in all the other columns, I would like to cluster all these rows into a single record by creating multiple sub rows on the right inside the machine_ID "8910" on the left. I have added the "My aim" section in my question for a better understanding. — Ishaan_B
– Ishaan_B, Commented Aug 12, 2022 at 22:41

Vladimir Fokow · Accepted Answer · 2022-08-13 16:10:39Z

0

Maybe a Multi-Index is what you're looking for?

df.set_index(['machine_ID', df.index])

Note that by default set_index returns a new dataframe, and does not change the original. To change the original (and return None) you can pass an argument inplace=True.

Example:

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])  # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place

For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:

edited Aug 13, 2022 at 16:10

answered Aug 12, 2022 at 23:02

Vladimir Fokow

3,8932 gold badges8 silver badges32 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Ishaan_B Over a year ago

The output expected by the user in the reference link provided is exactly what I am looking for. In the above case (reference link question) clustering multiple records into a single record based on 'user_id' is exactly what I am aiming to implement with 'machine_ID' in my case.

Ishaan_B Over a year ago

The solution: df.set_index(['machine_ID', df.index]) still fails to achieve what I am aiming for with my data frame.

Vladimir Fokow Over a year ago

@Ishaan_B, Why does it fail?

Ishaan_B Over a year ago

When I implement the following, I get a data frame with no change but 'machine_id' column shifted slightly below the rest of the columns :

all_task_usage_10_19_clustering_through_machine_id.set_index(['machine_ID', all_task_usage_10_19_clustering_through_machine_id.index])

Ishaan_B Over a year ago

Also, I just observed in the (reference link question) that the user expects the columns to be divided into levels i.e. 'user_id' as level 0, 'account_num' as level_1 and so on and so forth but I am expecting a data frame where 'machine_id' is level 0 and everything else is level 1.

|

Ishaan_B · Accepted Answer · 2022-08-15 01:49:29Z

0

The below code worked for me:

all_task_usage_10_19.groupby('machine_ID')[['start_time_of_the_measurement_period','end_time_of_the_measurement_period','job_ID', 'task_index','mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',           'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used','maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()

edited Aug 15, 2022 at 1:49

answered Aug 13, 2022 at 23:25

Ishaan_B

51 silver badge8 bronze badges

Collectives™ on Stack Overflow

How to combine multiple rows into a single row with many columns in pandas using an id (clustering multiple records with same id into one record)

2 Answers 2

7 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

7 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related