I want to add new rows and a new column based on an existing column. For example, let's say I have the following DataFrame:

   A          B
   1          a
   2          b
   3          c
   4          b

And a dictionary with the unique values of column B as keys. Each key is associated with a list of values. These values are used for the new rows and the new column: {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}
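For anyone who wants to reproduce this, the input can be built roughly as below. This is only a sketch of the example data; the name `expected_rows` is mine and just checks how many rows the result should have.

```python
import pandas as pd

# Example input from the question
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})

# Mapping from each unique value in B to the values used for the new rows
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

# The result should have one row per mapped value (illustrative check only)
expected_rows = sum(len(d[key]) for key in df["B"])
print(expected_rows)
```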

The transformation should result in the following Dataframe:

   A          C          
   1          x
   1          y
   1          z
   2          x
   2          w
   2          r
   3          x
   3          q
   4          x
   4          w
   4          r

I know how to add a new column, but I'm stuck trying to replicate the rows. What is the most efficient solution to this problem? Should I update the existing DataFrame or create a new one?

Update

The operation will be used on a large DataFrame (20 million+ rows) using Dask.

4 Answers


I suggest creating a new DataFrame with map, NumPy's repeat and chain.from_iterable:

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}

s = df['B'].map(d)
lens = [len(x) for x in s]

from itertools import chain

df = pd.DataFrame({
    'A' : df['A'].values.repeat(lens),
    'C' : list(chain.from_iterable(s.values.tolist()))
})
print (df)
    A  C
0   1  x
1   1  y
2   1  z
3   2  x
4   2  w
5   2  r
6   3  x
7   3  q
8   4  x
9   4  w
10  4  r
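The same steps as one self-contained script, in case it helps. This is a minimal sketch that assumes every value in B has a key in the dictionary; `expand_rows` is a hypothetical helper name, not something from pandas.

```python
from itertools import chain

import numpy as np
import pandas as pd


def expand_rows(df, mapping):
    """Repeat each A value once per item mapped from B (hypothetical helper)."""
    lists = df["B"].map(mapping)            # one list per original row
    lens = [len(values) for values in lists]
    return pd.DataFrame({
        "A": np.repeat(df["A"].to_numpy(), lens),   # repeat A to match list lengths
        "C": list(chain.from_iterable(lists)),      # flatten the lists in row order
    })


df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

result = expand_rows(df, d)
print(result)
```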

A more general solution that also works when some values in B have no matching key in the dictionary:

The first solution raises an error in that case, because map returns a missing value for the unmatched key:

TypeError: object of type 'NoneType' has no len()

print (df)
   A  B
0  1  d <- change data
1  2  b
2  3  c
3  4  b

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}

s = [d.get(x, [x]) for x in df['B']]
print (s)
[['d'], ['x', 'w', 'r'], ['x', 'q'], ['x', 'w', 'r']]

lens = [len(x) for x in s]

from itertools import chain

df = pd.DataFrame({
    'A' : df['A'].values.repeat(lens),
    'B' : list(chain.from_iterable(s))
})
print (df)
   A  B
0  1  d
1  2  x
2  2  w
3  2  r
4  3  x
5  3  q
6  4  x
7  4  w
8  4  r
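If unmatched rows should be dropped rather than kept, the fallback passed to d.get can be an empty list instead of [x]. A rough sketch of both options follows; the `keep_unmatched` flag and the helper name are my own additions for illustration.

```python
from itertools import chain

import numpy as np
import pandas as pd


def expand_rows(df, mapping, keep_unmatched=True):
    """Expand rows via dict lookup (hypothetical helper).

    Unmatched keys either fall back to the key itself or produce no rows.
    """
    lists = [mapping.get(key, [key] if keep_unmatched else []) for key in df["B"]]
    lens = [len(values) for values in lists]   # a length of 0 drops the row
    return pd.DataFrame({
        "A": np.repeat(df["A"].to_numpy(), lens),
        "B": list(chain.from_iterable(lists)),
    })


df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["d", "b", "c", "b"]})  # 'd' is not a key
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

kept = expand_rows(df, d)                           # row A=1 survives with B='d'
dropped = expand_rows(df, d, keep_unmatched=False)  # row A=1 disappears
```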

Because you are using Dask, another solution is to build a lookup DataFrame from the dictionary and merge:

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}
df1 = pd.DataFrame([(k, y) for k, v in d.items() for y in v], columns=['B','C'])
print (df1)
   B  C
0  a  x
1  a  y
2  a  z
3  b  x
4  b  w
5  b  r
6  c  x
7  c  q

df = df.merge(df1, on='B', how='left')
print (df)
    A  B  C
0   1  a  x
1   1  a  y
2   1  a  z
3   2  b  x
4   2  b  w
5   2  b  r
6   3  c  x
7   3  c  q
8   4  b  x
9   4  b  w
10  4  b  r
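One thing to watch with the merge is the `how` argument when some B values have no key in the dictionary. The sketch below uses plain pandas (not Dask) to show the difference; the names `lookup`, `left` and `inner` are only for illustration. As far as I know, Dask's merge also accepts a small pandas DataFrame as the right-hand side, but that part is not exercised here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["d", "b", "c", "b"]})  # 'd' is not a key
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

# Long-format lookup table: one (B, C) pair per dictionary value
lookup = pd.DataFrame(
    [(key, value) for key, values in d.items() for value in values],
    columns=["B", "C"],
)

left = df.merge(lookup, on="B", how="left")    # keeps A=1, with C missing
inner = df.merge(lookup, on="B", how="inner")  # drops A=1 entirely

print(len(left), len(inner))
```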

1 Comment

I've used your Dask solution and it worked very efficiently, thank you.

You can convert the dict into a DataFrame with columns called B and C:

df2 = pd.DataFrame.from_dict(d, orient='index').stack().reset_index().iloc[:, [0, -1]]
df2.columns = ['B', 'C']

Merge this new df2 with your initial df and select the data you want:

df.merge(df2, on='B').set_index('A')['C'].sort_index()

1 Comment

Nice answer! I'd just replace the last line with pd.DataFrame(df.merge(df2, on='B').set_index('A')['C'].sort_index()).reset_index() to get a df. +1
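Putting this answer and the comment together as a runnable sketch. Two things here are my own assumptions rather than part of the answer: the extra `dropna()` after `stack()`, since newer pandas versions may keep the padded missing cells from the ragged lists, and the stable sort on A to keep the output order predictable.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

# Wide table: one row per key, shorter lists are padded with missing values
wide = pd.DataFrame.from_dict(d, orient="index")

# Long table: stack the columns, drop the padding, keep key and value only
df2 = wide.stack().dropna().reset_index().iloc[:, [0, -1]]
df2.columns = ["B", "C"]

result = (
    df.merge(df2, on="B")[["A", "C"]]
      .sort_values("A", kind="stable")
      .reset_index(drop=True)
)
print(result)
```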

One more method using sum() and map():

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}
df_new= pd.DataFrame({'A': np.repeat(df.A,df.B.map(d).apply(len)).\
              reset_index(drop=True),'B':df.B.map(d).sum()})

Or, better for large DataFrames, using functools.reduce with operator.iadd, which extends a single list in place instead of building a new list at every step:

import functools,operator
df_new= pd.DataFrame({'A': np.repeat(df.A,df.B.map(d).apply(len)).\
                  reset_index(drop=True),'B':functools.reduce(operator.iadd, df.B.map(d),[])})
print(df_new)

    A  B
0   1  x
1   1  y
2   1  z
3   2  x
4   2  w
5   2  r
6   3  x
7   3  q
8   4  x
9   4  w
10  4  r
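To illustrate why the reduce variant scales better, here is a small sketch comparing three ways of flattening the mapped lists. Note that it uses Python's builtin sum(lists, []) as a stand-in for the Series sum() above, and chain.from_iterable from the first answer; all three should give the same list, and none of them should modify the lists stored in d.

```python
import functools
import operator
from itertools import chain

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

lists = df["B"].map(d)

# list + list builds a brand-new list at every step (quadratic overall)
flat_sum = sum(lists, [])

# iadd extends the initial empty list in place (linear overall)
flat_reduce = functools.reduce(operator.iadd, lists, [])

# chain.from_iterable is also linear and never touches the inputs
flat_chain = list(chain.from_iterable(lists))

print(flat_sum == flat_reduce == flat_chain)
```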

My answer, which builds a new DataFrame:

di = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}
x = df.to_dict()
temp = list(zip(df.A, [di[z] for z in x['B'].values()]))
A = [[x[0]] * len(x[1]) for x in temp]
B = [x[1] for x in temp]

A = [item for sublist in A for item in sublist]
B = [item for sublist in B for item in sublist]

pd.DataFrame({'A':A, 'B':B})
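For comparison, the manual flattening above can likely be replaced by DataFrame.explode, which is available in pandas 0.25 and later. This is a different technique from the one in this answer, shown only as a sketch; it assumes every value in B is a key in the dictionary.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})
di = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

result = (
    df.assign(C=df["B"].map(di))   # attach the list for each row
      .explode("C")                # one output row per list element
      .drop(columns="B")
      .reset_index(drop=True)
)
print(result)
```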
