I need to update a DataFrame column with some strings at selected rows, for which I have the index. So far, I have managed to achieve what I need with a list comprehension:

[data.particleIDs.values[idx[i]].append(particlenames[i]) for i in range(len(idx))]

where data.particleIDs is the DataFrame column that needs to be updated, particlenames a list containing the strings and idx an array containing, for each string, the DataFrame row it needs to be written on. Several strings correspond to the same row, and I need to write them all in the DataFrame column.

Let's say I have a DataFrame and the list of strings that I use to update it:

data = pd.DataFrame({'particleIDs': [[] for i in range(20)]})
particlenames = ['c15001'+str(i) for i in range(10)]

I have 10 strings and I need to use them to update the rows [7 8 15 8 11 0 15 1 12 8] in my DataFrame, i.e. I need to add each string to the corresponding row.
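
For reference, the setup described above can be put together as a self-contained sketch. The name idx follows the list comprehension shown earlier; storing it as a NumPy array is an assumption, and the list comprehension is written as a plain loop since it is only used for its side effect.

```python
import numpy as np
import pandas as pd

# One empty list per row, as in the example above.
data = pd.DataFrame({'particleIDs': [[] for _ in range(20)]})
particlenames = ['c15001' + str(i) for i in range(10)]

# Assumption: idx is a NumPy array with the target row of each string.
idx = np.array([7, 8, 15, 8, 11, 0, 15, 1, 12, 8])

# Append each string to the list stored in its target row.
# Rows that appear several times in idx (8 and 15) receive several strings.
for row, name in zip(idx, particlenames):
    data['particleIDs'].values[row].append(name)
```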

This loop is terribly slow, as the actual particlenames list is long and I need to repeat the process several times.

Is there anything I can do to speed this up?

Thank you!

  • Some sample input and output data would help us understand your issue better; please provide a minimal reproducible example. Commented Feb 10, 2020 at 21:04
  • Done! Hope it makes it clearer. Commented Feb 10, 2020 at 21:19
  • So what is the expected result? You are updating the same row multiple times (e.g. row 8 is updated three times with values 'c150011', 'c150013' and 'c150019'). Commented Feb 10, 2020 at 21:41
  • Yes, I need that! I also tried with .loc, but I can't get that result. Commented Feb 11, 2020 at 8:19

1 Answer

I solved my problem by creating another dataframe for the strings and the corresponding indices:

df_strings = pd.DataFrame({'strings': particlenames, 'rows': [7, 8, 15, 8, 11, 0, 15, 1, 12, 8]})

and then by using the groupby method on rows to append the strings with apply(list):

df_strings = df_strings.groupby('rows')['strings'].apply(list).reset_index()

Finally, I join this new DataFrame with the one (data) that needs to be updated with the strings:

data = data.join(df_strings.set_index('rows'))

data=

    particleIDs     strings
0   []  [c150015]
1   []  [c150017]
2   []  NaN
3   []  NaN
4   []  NaN
5   []  NaN
6   []  NaN
7   []  [c150010]
8   []  [c150011, c150013, c150019]
9   []  NaN
10  []  NaN
11  []  [c150014]
12  []  [c150018]
13  []  NaN
14  []  NaN
15  []  [c150012, c150016]
16  []  NaN
17  []  NaN
18  []  NaN
19  []  NaN

This means I no longer need to create the empty particleIDs column when building the data DataFrame (which, in my real case, has other columns), since the joined strings column already contains the information I need.
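
Put together, the steps above might look like the following sketch. The energy column is a hypothetical stand-in for the other columns of the real data, and the final step that replaces NaN with empty lists is an optional addition, not part of the solution as described; it is only useful if every row should hold a list.

```python
import pandas as pd

particlenames = ['c15001' + str(i) for i in range(10)]
rows = [7, 8, 15, 8, 11, 0, 15, 1, 12, 8]

# Hypothetical stand-in for the real DataFrame, built without particleIDs.
data = pd.DataFrame({'energy': [0.0] * 20})

# One row per (string, target row) pair.
df_strings = pd.DataFrame({'strings': particlenames, 'rows': rows})

# Collect all strings that share a target row into one list.
# The result is a Series indexed by the row number.
grouped = df_strings.groupby('rows')['strings'].apply(list)

# Left join on the index; rows without any string get NaN.
data = data.join(grouped.rename('particleIDs'))

# Optional extra step (an assumption about what is wanted):
# turn the NaN entries into empty lists.
data['particleIDs'] = data['particleIDs'].apply(
    lambda v: v if isinstance(v, list) else []
)
```

The grouping and join replace the per-string Python loop with a single pass over the strings, which is where the speed-up comes from.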
