I have a simple database consisting of two tables (say, Items and Users): Users has a User_ID column, and Items has an Item_ID column plus a column that is a foreign key referencing a User_ID. For instance:
Items                                 Users
Item_ID   Value_A   Its_User_ID ...   User_ID   Name ...
1         35        1                 1         Alice
2         991       1                 2         John
3         20        2
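For reference, the toy tables above can be reproduced as DataFrames like so (same column names as in my schema):

import pandas as pd

# Toy versions of the two tables shown above
items = pd.DataFrame({
    'Item_ID':     [1, 2, 3],
    'Value_A':     [35, 991, 20],
    'Its_User_ID': [1, 1, 2],
})
users = pd.DataFrame({
    'User_ID': [1, 2],
    'Name':    ['Alice', 'John'],
})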
Imagine I want to denormalize this database, i.e. copy the Name column from table Users into table Items, for performance reasons when querying the data. My current solution is the following:
items['User_Name'] = pd.Series([users.loc[users['User_ID'] == x, 'Name'].iloc[0]
                                for x in items['Its_User_ID']])
That is, I'm adding the column as a Pandas Series built from a list comprehension, which uses .loc[] to retrieve the name of the user with a given ID, and .iloc[0] to take the first element of the selection (which is the only one, because user IDs are unique).
But this solution is really slow for large sets of items. I did the following tests:
- For 1000 items and ~200K users: 20 seconds.
- For ~400K items and ~200K users: 2.5 hours (this is the real data size).
Because this approach works one column at a time, the total execution time is multiplied by the number of columns I denormalize this way, so it quickly becomes too expensive. I haven't tried filling the new Series row by row with an explicit for loop (see the sketch below for what I mean), but I expect it would be even more costly. Are there other approaches that I'm overlooking? Is there a solution that takes a few minutes instead of a few hours?
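For concreteness, the row-by-row variant I have in mind (untested, just a sketch using the items and users DataFrames from above) would be something like:

# Fill the new column one row at a time with .at;
# each iteration repeats the same .loc lookup as the comprehension.
items['User_Name'] = None
for i, x in items['Its_User_ID'].items():
    items.at[i, 'User_Name'] = users.loc[users['User_ID'] == x, 'Name'].iloc[0]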