
Given a pandas data frame and a dictionary mapping its values, the values in the data frame can be replaced using the dictionary like so:

In: df
Out:
  Col1 Col2
0    a    c
1    b    c
2    b    c

In: key
Out: {'a': 1, 'b': 2, 'c': 3}

In: df.replace(key)
Out: 
   Col1  Col2
0     1     3
1     2     3
2     2     3

How can a similar transformation be accomplished when the mapping dictionary has lists as values? For example:


In: key
Out: {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}

In: df.replace(key)
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 1 output values where the mask is true

In this example, the end goal would be to have a new data frame that has 3 rows and 6 columns:


1 0 0 0 0 1
0 1 0 0 0 1
0 1 0 0 0 1
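For reference, the setup above can be reproduced with the following sketch (the exact error message may vary across pandas versions):

```python
import pandas as pd

# the example frame and scalar mapping from the question
df = pd.DataFrame({"Col1": ["a", "b", "b"], "Col2": ["c", "c", "c"]})
key = {"a": 1, "b": 2, "c": 3}

# scalar replacement works as shown
print(df.replace(key))

# a mapping to lists does not
list_key = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}
try:
    df.replace(list_key)
except ValueError as e:
    print("replace failed:", e)
```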

4 Answers


IIUC, you can use applymap + explode + reshape:

df2 = df.applymap(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
)

output:

  Col1       Col2      
     0  1  2    0  1  2
0    1  0  0    0  0  1
1    0  1  0    0  0  1
2    0  1  0    0  0  1

NB. to reset the columns: df2.columns = range(df2.shape[1])

   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
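Note: on pandas ≥ 2.1, applymap is deprecated in favor of DataFrame.map. A version-tolerant sketch of the same approach, against the question's example data:

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["a", "b", "b"], "Col2": ["c", "c", "c"]})
key = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}

# DataFrame.map exists on pandas >= 2.1 (renamed from applymap);
# fall back to applymap on older versions
cell_map = getattr(df, "map", df.applymap)

# one list per cell -> one row per list element -> pivot elements to columns
df2 = cell_map(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1))
df2.columns = range(df2.shape[1])
print(df2)
```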

3 Comments

df.applymap(key.get) - cool!
@richardec not efficient though (like a loop), but better than literal_eval I guess :p I haven't had time to look at your answer yet
I came up with something else: converting key to a Series and indexing it with the values of df ;)

You can use a combination of DataFrame.apply and Series.map to perform this substitution. From there, you can use DataFrame.sum to concatenate the lists and then cast your data back into a new DataFrame:

out = pd.DataFrame(
    df.apply(lambda s: s.map(key)).sum(axis=1).tolist()
)

print(out)
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1

Semi-related testing of .sum vs .chain:

In [22]: %timeit tmp_df.sum(axis=1)
77.6 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [23]: %timeit tmp_df.apply(lambda row: list(chain.from_iterable(row)), axis=1)
197 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [24]: tmp_df
Out[24]: 
        Col1       Col2
0  [1, 0, 0]  [0, 0, 1]
1  [0, 1, 0]  [0, 0, 1]
2  [0, 1, 0]  [0, 0, 1]

While I won't say that .sum is the best method for concatenating lists in a Series, .apply with chain.from_iterable doesn't seem to fare much better, at least on a very small sample like this.

5 Comments

Beware of sum on Series of lists. This looks fancy but the efficiency is quite bad. It's more efficient to apply a chain ;)
Curious if my implementation differs from what you had in mind?
you need to test on more columns. Try 2k columns, I got ~90 ms for sum vs ~2-3 ms for chain. The inefficiency of sum comes from the fact that it generates a new list at each step ;)
For 20k rows the sum is not even computing vs 20 ms for chain.
Love it, thanks for the explanation and clarification!
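Putting the comment thread's suggestion into a complete sketch (using the question's example data): chain.from_iterable concatenates each row's lists in a single pass, instead of allocating a fresh intermediate list per column the way sum does.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({"Col1": ["a", "b", "b"], "Col2": ["c", "c", "c"]})
key = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}

tmp = df.apply(lambda s: s.map(key))  # every cell is now a list

# flatten each row of lists into one list, then rebuild a DataFrame
out = pd.DataFrame(
    tmp.apply(lambda row: list(chain.from_iterable(row)), axis=1).tolist()
)
print(out)
```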

Hmm, this is tricky.

One solution I came up with is to convert the lists to their string representation before replacing with them, because pandas treats lists specially. Then you can use itertools.chain.from_iterable on each row to combine all the lists into one big list, and create a dataframe out of that:

import ast
from itertools import chain

n = df.replace({k: str(v) for k, v in key.items()}).applymap(ast.literal_eval)
df = pd.DataFrame(n.apply(lambda x: list(chain.from_iterable(x)), axis=1).tolist())

Output:

>>> df
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1



Here's a method of replacing the items with lists without looping or stringifying:

df[:] = pd.Series(key)[df.to_numpy().flatten()].to_numpy().reshape(df.shape)

Output:

>>> df
        Col1       Col2
0  [1, 0, 0]  [0, 0, 1]
1  [0, 1, 0]  [0, 0, 1]
2  [0, 1, 0]  [0, 0, 1]

Or, you can use explode and reshape to convert the data directly to a numpy array:

# 6 = length of one value of `key` * number of columns in df
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, 6)

Output:

>>> arr
array([[1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1]], dtype=object)
       
>>> pd.DataFrame(arr)
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1

