-2

Situation:

1. all_task_usage_10_19

all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns. There are multiple rows with the same ID inside the column machine_ID with different values in other columns.

Columns:

'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'

2. clustering code

I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas


3. Output

Output displayed using: with option_context as it allows to better visualise the content


My Aim:

I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.

Something like this.

6
  • 3
    Don't put your code/data in images. Type the code out in a code block and add a means of reproducing it (code that creates a small sample of your data) or at least a table of your data typed out (e.g. markdown format) Commented Aug 12, 2022 at 21:57
  • What happened when you ran your code, what was the expected output, what do you want to do better? Commented Aug 12, 2022 at 22:09
  • What operation do you want to perform on multiple values of the same column (with the same machine_ID) to combine them into one value? Commented Aug 12, 2022 at 22:11
  • Dear @MichaelS. Thank you for your insights on how to post questions, I would certainly keep this in mind. When I run my code, I get the output: "3. Output" of my question (I have attached a picture) Commented Aug 12, 2022 at 22:19
  • Dear @VladimirFokow, I want to generate a single row using machine_ID i.e. for eg. If I have multiple rows with the machine_ID "8910" with different values in all the other columns, I would like to cluster all these rows into a single record by creating multiple sub rows on the right inside the machine_ID "8910" on the left. I have added the "My aim" section in my question for a better understanding. Commented Aug 12, 2022 at 22:41

2 Answers 2

0

Maybe a Multi-Index is what you're looking for?

df.set_index(['machine_ID', df.index])

Note that by default set_index returns a new dataframe, and does not change the original. To change the original (and return None) you can pass an argument inplace=True.

Example:

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])  # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place

For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:

Sign up to request clarification or add additional context in comments.

7 Comments

The output expected by the user in the reference link provided is exactly what I am looking for. In the above case (reference link question) clustering multiple records into a single record based on 'user_id' is exactly what I am aiming to implement with 'machine_ID' in my case.
The solution: df.set_index(['machine_ID', df.index]) still fails to achieve what I am aiming for with my data frame.
@Ishaan_B, Why does it fail?
When I implement the following, I get a data frame with no change but 'machine_id' column shifted slightly below the rest of the columns : all_task_usage_10_19_clustering_through_machine_id.set_index(['machine_ID', all_task_usage_10_19_clustering_through_machine_id.index])
Also, I just observed in the (reference link question) that the user expects the columns to be divided into levels i.e. 'user_id' as level 0, 'account_num' as level_1 and so on and so forth but I am expecting a data frame where 'machine_id' is level 0 and everything else is level 1.
|
0

The below code worked for me:

all_task_usage_10_19.groupby('machine_ID')[['start_time_of_the_measurement_period','end_time_of_the_measurement_period','job_ID', 'task_index','mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',           'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used','maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.