0

Given a data frame with 9,000+ records (rows), a column with unique IDs, and each ID can have multiple records as shown below.

df.head(4)

Unique_ID Record_1 Record_2
AN5001 90.0 ten
AN5002 90.0 five
AN5001 95.0 five
AN5003 60.0 forty

There are 360 unique IDs. However, about half of them need to be corrected. Consider below df_corrected_ID.head(3)

Unique_ID_old Unique_ID_new
AN5001 AN5010
AN5002 AN5002
AN5003 AN5011

How would you, most efficiently, fix the Unique ID in the main df with 9,000+ records using the df_corrected_ID data frame?

So, check if Unique_ID_old is present in df['Unique_ID'] column and if it is, replace it with Unique_ID_new from df_corrected_ID.

How would one then check change occurred correctly (for example, just show the difference between the Unique_IDs -- say after converting the original and updated columns to lists and then, list(set(Unique_ID) - set(Unique_ID_new)).

It's okay to add another new column to original df if needed with corrected IDs, as long as the order is maintained an none of the records are changed.

Thanks!

2 Answers 2

2

It is simple mapping. You can add a column or update in place.

df_map = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5003'], uni_update=['AN5010','AN5002','AN5011'])) 
df_record = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5001','AN5003'], rec1=[1,2,3,4], rec2=['a','b','c','d']))
    
df_record['uni_update'] = df_record.uni.map(dict(zip(df_map.uni, df_map.uni_update)))

df w/ new col

|    | uni    |   rec1 | rec2   | uni_update   |
|---:|:-------|-------:|:-------|:-------------|
|  0 | AN5001 |      1 | a      | AN5010       |
|  1 | AN5002 |      2 | b      | AN5002       |
|  2 | AN5001 |      3 | c      | AN5010       |
|  3 | AN5003 |      4 | d      | AN5011       |

to check afterwards

df_record_reduced = df_record.drop_duplicates(subset='uni').loc[:,['uni','uni_update']].reset_index(drop=True)

|    | uni    | uni_update   |
|---:|:-------|:-------------|
|  0 | AN5001 | AN5010       |
|  1 | AN5002 | AN5002       |
|  2 | AN5003 | AN5011       |

df_map.equals(df_record_reduced) #should be true if everything is aligned
Sign up to request clarification or add additional context in comments.

Comments

1

You can use Series.replace for this. Make your corrected ID DataFrame into a dictionary that maps old IDs to new ones, and you can use it to update your existing column or create a new one.

update_dict = dict(zip(df_corrected_ID['Unique_ID_old'], df_corrected_ID['Unique_ID_new'])

# to create a new column
df['Unique_ID_updated'] = df['Unique_ID'].replace(update_dict)

# to update the existing column
df['Unique_ID'] = df['Unique_ID'].replace(update_dict)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.