I have a data frame with 9,000+ records (rows) and a Unique_ID column, where each ID can appear in multiple records, as shown below.
df.head(4)
| Unique_ID | Record_1 | Record_2 |
|---|---|---|
| AN5001 | 90.0 | ten |
| AN5002 | 90.0 | five |
| AN5001 | 95.0 | five |
| AN5003 | 60.0 | forty |
There are 360 unique IDs, but about half of them need to be corrected. The corrections are stored in a second data frame; consider df_corrected_ID.head(3):
| Unique_ID_old | Unique_ID_new |
|---|---|
| AN5001 | AN5010 |
| AN5002 | AN5002 |
| AN5003 | AN5011 |
What is the most efficient way to fix the Unique_ID values in the main df (9,000+ records) using the df_corrected_ID data frame?
That is, wherever a value in the df['Unique_ID'] column matches a Unique_ID_old, replace it with the corresponding Unique_ID_new from df_corrected_ID.
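A minimal sketch of the replacement described above, assuming each Unique_ID_old appears only once in df_corrected_ID. The name id_map is hypothetical, and the toy data simply mirrors the tables in this question:

```python
import pandas as pd

# Toy data mirroring the tables above.
df = pd.DataFrame({
    "Unique_ID": ["AN5001", "AN5002", "AN5001", "AN5003"],
    "Record_1": [90.0, 90.0, 95.0, 60.0],
    "Record_2": ["ten", "five", "five", "forty"],
})
df_corrected_ID = pd.DataFrame({
    "Unique_ID_old": ["AN5001", "AN5002", "AN5003"],
    "Unique_ID_new": ["AN5010", "AN5002", "AN5011"],
})

# Build an old -> new lookup Series (id_map is a hypothetical name).
# Assumption: Unique_ID_old has no duplicates, otherwise map() will raise.
id_map = df_corrected_ID.set_index("Unique_ID_old")["Unique_ID_new"]

# map() yields NaN for IDs that are absent from the lookup;
# fillna() then falls back to the original ID for those rows.
df["Unique_ID"] = df["Unique_ID"].map(id_map).fillna(df["Unique_ID"])

print(df)
```

Because map() works element by element on the existing index, the row order and the other columns should be left untouched.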
How would one then check that the change occurred correctly? For example, by showing the difference between the old and new Unique_IDs, say by converting the original and updated columns to lists and then taking list(set(Unique_ID) - set(Unique_ID_new)).
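One possible way to sketch such a check is to write the corrected IDs to a separate column and compare it with the original. The names Unique_ID_corrected, changed, and retired_ids are hypothetical, and the lookup is again assumed to have no duplicate old IDs:

```python
import pandas as pd

# Toy data mirroring the tables above.
df = pd.DataFrame({
    "Unique_ID": ["AN5001", "AN5002", "AN5001", "AN5003"],
    "Record_1": [90.0, 90.0, 95.0, 60.0],
    "Record_2": ["ten", "five", "five", "forty"],
})
df_corrected_ID = pd.DataFrame({
    "Unique_ID_old": ["AN5001", "AN5002", "AN5003"],
    "Unique_ID_new": ["AN5010", "AN5002", "AN5011"],
})

# Old -> new lookup (assumes Unique_ID_old has no duplicates).
id_map = df_corrected_ID.set_index("Unique_ID_old")["Unique_ID_new"]

# Keep the original column and write the corrected IDs next to it.
df["Unique_ID_corrected"] = df["Unique_ID"].map(id_map).fillna(df["Unique_ID"])

# Distinct (old, new) pairs for rows whose ID actually changed.
changed = (
    df.loc[df["Unique_ID"] != df["Unique_ID_corrected"],
           ["Unique_ID", "Unique_ID_corrected"]]
    .drop_duplicates()
)

# Set difference: old IDs that no longer appear after the correction.
retired_ids = set(df["Unique_ID"]) - set(df["Unique_ID_corrected"])

print(changed)
print(sorted(retired_ids))
```

Keeping both columns side by side makes it straightforward to inspect which rows changed without disturbing the row order or the other records.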
It's okay to add a new column with the corrected IDs to the original df if needed, as long as the row order is maintained and none of the records are changed.
Thanks!