2

I am looking to perform the following operation on a DataFrame efficiently. The DataFrame has a special column, containing strings, where some rows have a formatting problem. Namely, in my case it has a + sign separating what should be entries of two separate rows.

In particular, consider:

import pandas as pd
pd.DataFrame([ ['a',   0, 1  ], ['b+c', 2, 3  ], 
               ['d+e', 4, 5  ], ['f',   6, 7  ] ])

which prints:

     0  1  2
0    a  0  1
1  b+c  2  3
2  d+e  4  5
3    f  6  7

I want to transform this into:

   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  6  7

That is, I want to "spread out" rows containing a + sign, duplicating the values in the other columns. This can be done by looping over the rows and assigning to a new DataFrame with a regex, but I am looking for a simpler and more efficient way.
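For concreteness, the loop-based baseline mentioned above might look something like this sketch (the approach I'm hoping to avoid):

```python
import pandas as pd

df = pd.DataFrame([['a',   0, 1], ['b+c', 2, 3],
                   ['d+e', 4, 5], ['f',   6, 7]])

# Naive approach: loop over rows, split column 0 on '+', and
# emit one output row per piece, duplicating the other columns.
rows = []
for _, row in df.iterrows():
    for part in str(row.iloc[0]).split('+'):
        rows.append([part] + list(row.iloc[1:]))
result = pd.DataFrame(rows)
```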

Edit: Ideally, the solution would also allow for multiple separators (+ signs) in one cell. That is, transforming also

import pandas as pd
pd.DataFrame([ ['a',   0, 1  ], ['b+c', 2, 3  ], 
               ['d+e+f', 4, 5  ], ['g',   6, 7  ] ])

into

   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  4  5
6  g  6  7

4 Answers

3

One way would be to combine .str.split with stack and then join:

s = df[0].str.split("+", expand=True).stack()
s.index = s.index.droplevel(1)
result = s.to_frame().join(df.drop(0, axis=1)).reset_index(drop=True)

gives me

In [18]: result
Out[18]: 
   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  4  5
6  g  6  7
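As a side note for newer pandas (0.25+), `DataFrame.explode` gets the same result in two steps and handles any number of separators. This is a sketch of an alternative, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([['a',   0, 1], ['b+c', 2, 3],
                   ['d+e+f', 4, 5], ['g',   6, 7]])

# Turn column 0 into lists of pieces, then explode each list
# element onto its own row, duplicating the other columns.
out = df.copy()
out[0] = out[0].str.split('+')
result = out.explode(0).reset_index(drop=True)
```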


2

I like to decompose this into numpy bits and build the dataframe back together.

Plan

  1. pull the first column's values and split them by '+'
  2. count the length of each sub-array and use it to build an index to slice with
  3. rebuild the DataFrame from the concatenated list from step 1, slicing the rest of the data with the slicer from step 2

import numpy as np

v = df.values[:, 1:]
# np.char.split is the public alias for np.core.defchararray.split,
# which was removed from np.core in NumPy 2.0
z = np.char.split(df[0].values.astype(str), '+')
i = np.arange(len(z)).repeat([len(x) for x in z])
pd.DataFrame(np.column_stack([np.concatenate(z), v[i]]))

   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  6  7

is it fast?
sure it is!

[benchmark timing plot omitted]

If you need to ensure that dtypes stay the same, we can do an astype at the end. This incurs a performance penalty, but it's still fast.

import numpy as np

v = df.values[:, 1:]
z = np.char.split(df[0].values.astype(str), '+')
i = np.arange(len(z)).repeat([len(x) for x in z])
pd.DataFrame(np.column_stack([np.concatenate(z), v[i]])).astype(df.dtypes)

[benchmark timing plot omitted]

5 Comments

@jezrael good questions... let me check if np.core.defchararray.split works with NaN... And yes, it is much faster.
You need to cast the array as str, and that converts None to 'None'. If there are null values, I'd have to handle them.
@jezrael confirmed, my method doesn't handle nulls gracefully. However, to be fair, neither do the other methods proposed. I could drop them to begin with and add them back later.
@jezrael unfortunately, I have to handle more than that.
Ya, numpy. But you can simplify it - the solution works nicely if there are no NaNs. ;)
1

You need to split the strings in the first column on the plus sign into lists, recast each list as a Series object, stack the Series objects into a single Series, and reset the index to a single level, keeping only the original row identifier.

Then we need to concatenate this Series back with the original DataFrame on the index, dropping the original column. I have named the columns for convenience:

import pandas as pd

df = pd.DataFrame([['a', 0, 1], ['b+c', 2, 3], ['d+e+f', 4, 5], ['g', 6, 7]], 
                  columns=list('ABC'))

s_A = df.A.str.split('+').apply(pd.Series).stack().reset_index(level=1, drop=True)
s_A.name = 'A_split'
pd.concat([df.drop('A', axis=1), s_A], axis=1)

# returns:
   B  C A_split
0  0  1       a
1  2  3       b
1  2  3       c
2  4  5       d
2  4  5       e
2  4  5       f
3  6  7       g
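If you then want a clean 0..n index and the split column back in front, a small post-processing step works. This is a sketch building on the code above, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([['a', 0, 1], ['b+c', 2, 3], ['d+e+f', 4, 5], ['g', 6, 7]],
                  columns=list('ABC'))

s_A = df.A.str.split('+').apply(pd.Series).stack().reset_index(level=1, drop=True)
s_A.name = 'A_split'
out = pd.concat([df.drop('A', axis=1), s_A], axis=1)

# Replace the duplicated index with a fresh RangeIndex and
# move the split column back to the front.
out = out.reset_index(drop=True)[['A_split', 'B', 'C']]
```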


1

If your problem is specific to each row either being split into two or left alone, you can simply collect the rows you want to split and append them to your DataFrame:

import pandas as pd
df = pd.DataFrame([ ['a',   0, 1  ], ['b+c', 2, 3  ], 
                    ['d+e', 4, 5  ], ['f',   6, 7  ] ])
df_split = df[df[0].str.contains(r'\+')].copy()
df_split['new_col_name'] = df[0].str.extract(r'\+(.*)', expand=False)
df['new_col_name'] = df[0].str.extract(r'([^\+]*)', expand=False)

pd.concat([df, df_split])  # required answer; DataFrame.append was removed in pandas 2.0

If the ordering of the rows is important, you could start by creating a column of row numbers, e.g. df['no'] = list(range(len(df))), and then call sort_values('no') at the end.
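Putting the answer and that suggestion together, an order-preserving version might look like the sketch below. It writes the split pieces back into column 0 (instead of a new column) so the output matches the frame asked for; 'no' is just a throwaway column name:

```python
import pandas as pd

df = pd.DataFrame([['a',   0, 1], ['b+c', 2, 3],
                   ['d+e', 4, 5], ['f',   6, 7]])

df['no'] = range(len(df))  # remember the original row order

# Rows containing '+' get a second copy holding the part after the '+';
# the originals keep only the part before it.
df_split = df[df[0].str.contains(r'\+')].copy()
df_split[0] = df_split[0].str.extract(r'\+(.*)', expand=False)
df[0] = df[0].str.extract(r'([^\+]*)', expand=False)

# A stable sort keeps each 'b' before its matching 'c'.
result = (pd.concat([df, df_split])
          .sort_values('no', kind='stable')
          .drop('no', axis=1)
          .reset_index(drop=True))
```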

