Add multiple rows AND single column to Dataframe based on existing column

Question

I want to add new rows and a new column based on an existing column. For example, let's say I have the following DataFrame:

   A          B
   1          a
   2          b
   3          c
   4          b

I also have a dictionary whose keys are the unique values of column B. Each key maps to a list of values, which supply the new rows and the new column C: {a: [x, y, z], b: [x, w, r], c: [x, q]}

The transformation should result in the following Dataframe:

   A          C          
   1          x
   1          y
   1          z
   2          x
   2          w
   2          r
   3          x
   3          q
   4          x
   4          w
   4          r
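
For anyone who wants to reproduce this, the example inputs can be built roughly as follows. This is only a sketch: the string values and the name `d` for the dictionary are assumptions, since the dictionary above is written unquoted.

```python
import pandas as pd

# Example input from the question
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})

# Lookup: each unique value of B maps to the values that become rows of C
# (string values are assumed; the question writes them unquoted)
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

# Number of rows expected after the transformation
expected_rows = sum(len(d[b]) for b in df["B"])
print(expected_rows)
```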

I know how to add a new column, but I'm stuck trying to replicate the rows. What is the most efficient solution to this problem? Should I update the existing DataFrame or create a new one?

Update

The operation will run on a large DataFrame (20 million+ rows) using Dask.

jezrael · Accepted Answer · 2019-02-11 10:19:14Z

I suggest creating a new DataFrame with map, np.repeat and chain.from_iterable:

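from itertools import chain
import pandas as pd

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}

# map each B value to its list; lens holds how many rows each original row expands to
s = df['B'].map(d)
lens = [len(x) for x in s]

# repeat each A value lens times and flatten the lists into column C
df = pd.DataFrame({
    'A' : df['A'].values.repeat(lens),
    'C' : list(chain.from_iterable(s.values.tolist()))
})
print (df)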
    A  C
0   1  x
1   1  y
2   1  z
3   2  x
4   2  w
5   2  r
6   3  x
7   3  q
8   4  x
9   4  w
10  4  r
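
If pandas 0.25 or later is available, `DataFrame.explode` may give the same result more compactly. The following is a minimal sketch, not part of the original answer: it swaps in `explode` for the `repeat`/`chain` approach above and rebuilds the example `df` and `d` from the question so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

# Put each row's list into column C, then expand every list element into its own row
out = (
    df.assign(C=df["B"].map(d))
      .explode("C")[["A", "C"]]
      .reset_index(drop=True)
)
print(out)
```

Whether this is faster than the `repeat`/`chain` version on a large frame is not something this sketch establishes; it is worth timing both on real data.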

A more general solution, which also works when some values of B are not keys of the dictionary:

The first solution raises an error in that case, because map returns a missing value for unmatched keys:

TypeError: object of type 'NoneType' has no len()

print (df)
   A  B
0  1  d <- change data
1  2  b
2  3  c
3  4  b

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}

s = [d.get(x, [x]) for x in df['B']]
print (s)
[['d'], ['x', 'w', 'r'], ['x', 'q'], ['x', 'w', 'r']]

lens = [len(x) for x in s]

from itertools import chain

df = pd.DataFrame({
    'A' : df['A'].values.repeat(lens),
    'B' : list(chain.from_iterable(s))
})
print (df)
   A  B
0  1  d
1  2  x
2  2  w
3  2  r
4  3  x
5  3  q
6  4  x
7  4  w
8  4  r
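
The general version can be wrapped in a small helper. This is a sketch under assumptions: `expand_rows` and its parameter names are hypothetical, and unmatched keys fall back to their own value, as in the `d.get(x, [x])` line above.

```python
from itertools import chain

import numpy as np
import pandas as pd


def expand_rows(df, mapping, key="B", keep="A", new="C"):
    """Hypothetical helper: repeat `keep` once per value mapped from `key`.

    Keys missing from `mapping` fall back to a one-element list holding the key itself.
    """
    lists = [mapping.get(k, [k]) for k in df[key]]
    lens = [len(v) for v in lists]
    return pd.DataFrame({
        keep: np.repeat(df[keep].to_numpy(), lens),
        new: list(chain.from_iterable(lists)),
    })


df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["d", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

out = expand_rows(df, d)
print(out)
```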

Because you use Dask, another solution could be to build a lookup DataFrame from the dictionary and merge it:

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}
df1 = pd.DataFrame([(k, y) for k, v in d.items() for y in v], columns=['B','C'])
print (df1)
   B  C
0  a  x
1  a  y
2  a  z
3  b  x
4  b  w
5  b  r
6  c  x
7  c  q

df = df.merge(df1, on='B', how='left')
print (df)
    A  B  C
0   1  a  x
1   1  a  y
2   1  a  z
3   2  b  x
4   2  b  w
5   2  b  r
6   3  c  x
7   3  c  q
8   4  b  x
9   4  b  w
10  4  b  r
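
The choice of how='left' matters when some B values have no entry in the dictionary. Here is a small pandas-only sketch, with a deliberately unmatched key 'd'; the names `lookup`, `left` and `inner` are only illustrative. With Dask, the left frame would presumably be a Dask DataFrame while the small lookup table stays a plain pandas DataFrame, but that setup is an assumption and is not shown here.

```python
import pandas as pd

# 'd' has no entry in the dictionary
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["d", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

# Long-format lookup table: one row per (key, value) pair
lookup = pd.DataFrame(
    [(k, v) for k, values in d.items() for v in values],
    columns=["B", "C"],
)

# how='left' keeps the unmatched row (its C is missing); how='inner' drops it
left = df.merge(lookup, on="B", how="left")
inner = df.merge(lookup, on="B", how="inner")
print(left)
print(len(left), len(inner))
```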

I've used your Dask solution and it worked very efficiently, thank you.

JoergVanAken · Answer · 2019-02-11 09:19:34Z

2

You can convert the dict into a DataFrame with columns called B and C:

df2 = pd.DataFrame.from_dict(d, orient='index').stack().reset_index().iloc[:, [0, -1]]
df2.columns = ['B', 'C']

Merge this new df2 with your initial df and select the data you want to have:

df.merge(df2, on='B').set_index('A')['C'].sort_index()
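
To make the intermediate steps visible, here is a hedged sketch of the same idea that ends with a DataFrame rather than a Series. The extra `dropna()` and the use of `set_axis` are assumptions added for robustness: the lists have different lengths, so the wide frame is padded with missing values, and whether `stack` drops that padding by default may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

# Lists of unequal length become a wide frame padded with missing values
wide = pd.DataFrame.from_dict(d, orient="index")

# Stack to long format; dropna() removes any padding that stack did not drop itself
stacked = wide.stack().dropna().reset_index()

# Keep the key column (first) and the value column (last)
df2 = stacked.iloc[:, [0, -1]].set_axis(["B", "C"], axis=1)

# Merge, keep A and C, and order by A with a stable sort
out = (
    df.merge(df2, on="B")[["A", "C"]]
      .sort_values("A", kind="mergesort")
      .reset_index(drop=True)
)
print(out)
```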

1 Comment

Josh Friedlander Over a year ago

Nice answer! I'd just replace the last line with pd.DataFrame(df.merge(df2, on='B').set_index('A')['C'].sort_index()).reset_index() to get a df. +1

anky · Answer · 2019-02-11 10:10:35Z

2

One more method using sum() and map():

d = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}
df_new = pd.DataFrame({
    'A': np.repeat(df.A, df.B.map(d).apply(len)).reset_index(drop=True),
    'B': df.B.map(d).sum(),
})

Or, even better for large DataFrames, using functools.reduce with operator.iadd:

import functools, operator
df_new = pd.DataFrame({
    'A': np.repeat(df.A, df.B.map(d).apply(len)).reset_index(drop=True),
    'B': functools.reduce(operator.iadd, df.B.map(d), []),
})
print(df_new)

    A  B
0   1  x
1   1  y
2   1  z
3   2  x
4   2  w
5   2  r
6   3  x
7   3  q
8   4  x
9   4  w
10  4  r
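
The two variants differ only in how the lists are flattened. A small sketch comparing them, together with `itertools.chain.from_iterable`, which is added here for comparison and is not part of this answer; the variable names are illustrative:

```python
import functools
import operator
from itertools import chain

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "b"]})
d = {"a": ["x", "y", "z"], "b": ["x", "w", "r"], "c": ["x", "q"]}

lists = df["B"].map(d)

# sum() concatenates with list + list, building a new list at every step
flat_sum = lists.sum()

# reduce with iadd extends a single accumulator list in place
flat_reduce = functools.reduce(operator.iadd, lists, [])

# chain.from_iterable walks the lists once without intermediate copies
flat_chain = list(chain.from_iterable(lists))

print(flat_sum == flat_reduce == flat_chain)
```

Because the accumulator passed to reduce is a fresh empty list, the lists stored in `d` are not modified by the in-place addition.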

Josh Friedlander · Answer · 2019-02-11 08:44:40Z

1

My answer, which builds a new DataFrame:

di = {'a': ['x', 'y', 'z'], 'b': ['x', 'w', 'r'], 'c': ['x', 'q']}
# pair each A value with the list that its B value maps to
temp = list(zip(df.A, [di[b] for b in df.B]))
A = [[a] * len(vals) for a, vals in temp]
B = [vals for _, vals in temp]

# flatten the nested lists
A = [item for sublist in A for item in sublist]
B = [item for sublist in B for item in sublist]

pd.DataFrame({'A':A, 'B':B})
