2

I am looking to perform the following operation on a DataFrame efficiently. The DataFrame has a special column, containing strings, where some rows have a formatting problem. Namely, in my case it has a + sign separating what should be entries of two separate rows.

In particular, consider:

import pandas as pd
pd.DataFrame([ ['a',   0, 1  ], ['b+c', 2, 3  ], 
               ['d+e', 4, 5  ], ['f',   6, 7  ] ])

which prints:

     0  1  2
0    a  0  1
1  b+c  2  3
2  d+e  4  5
3    f  6  7

I want to transform this into:

   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  6  7

That is, I want to "spread out" rows containing a + sign, duplicating the values in the other columns. This can be done by looping over the rows and assigning to a new DataFrame with a regex, but I am looking for a simpler and more efficient way.
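For concreteness, the loop-based baseline mentioned above might look something like this sketch (the approach I'm hoping to avoid):

```python
import pandas as pd

df = pd.DataFrame([['a',   0, 1], ['b+c', 2, 3],
                   ['d+e', 4, 5], ['f',   6, 7]])

# Naive approach: loop over rows, split column 0 on '+', and
# emit one output row per piece, duplicating the other columns.
rows = []
for _, row in df.iterrows():
    for part in str(row.iloc[0]).split('+'):
        rows.append([part] + list(row.iloc[1:]))
result = pd.DataFrame(rows)
```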

Edit: Ideally, the solution would also allow for multiple separators (+ signs) in one cell. That is, transforming also

import pandas as pd
pd.DataFrame([ ['a',   0, 1  ], ['b+c', 2, 3  ], 
               ['d+e+f', 4, 5  ], ['g',   6, 7  ] ])

into

   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  4  5
6  g  6  7

4 Answers

3

One way would be to combine .str.split with stack and then join:

s = df[0].str.split("+", expand=True).stack()
s.index = s.index.droplevel(1)
result = s.to_frame().join(df.drop(0, axis=1)).reset_index(drop=True)

gives me

In [18]: result
Out[18]: 
   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  4  5
6  g  6  7
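As a side note for newer pandas (0.25+), `DataFrame.explode` gets the same result in two steps and handles any number of separators. This is a sketch of an alternative, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([['a',   0, 1], ['b+c', 2, 3],
                   ['d+e+f', 4, 5], ['g',   6, 7]])

# Turn column 0 into lists of pieces, then explode each list
# element onto its own row, duplicating the other columns.
out = df.copy()
out[0] = out[0].str.split('+')
result = out.explode(0).reset_index(drop=True)
```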


2

I like to decompose this into numpy bits and build the dataframe back together.

Plan

  1. pull the first column's values and split them by '+'
  2. count the length of each sub-array and use it to build an index to slice with
  3. rebuild the DataFrame from the concatenated list from step 1, slicing the rest of the data with the slicer from step 2

import numpy as np

v = df.values[:, 1:]
# np.char.split is the public alias for np.core.defchararray.split,
# which was removed from np.core in NumPy 2.0
z = np.char.split(df[0].values.astype(str), '+')
i = np.arange(len(z)).repeat([len(x) for x in z])
pd.DataFrame(np.column_stack([np.concatenate(z), v[i]]))

   0  1  2
0  a  0  1
1  b  2  3
2  c  2  3
3  d  4  5
4  e  4  5
5  f  6  7

is it fast?
sure it is!

[benchmark timing plot omitted]

If you need to ensure that dtypes stay the same, we can do an astype at the end. This incurs a performance penalty, but it's still fast.

import numpy as np

v = df.values[:, 1:]
z = np.char.split(df[0].values.astype(str), '+')
i = np.arange(len(z)).repeat([len(x) for x in z])
pd.DataFrame(np.column_stack([np.concatenate(z), v[i]])).astype(df.dtypes)

[benchmark timing plot omitted]

5 Comments

@jezrael good questions... let me check if np.core.defchararray.split works with NaN... And yes, it is much faster.
You need to cast the array as str, and that converts None to 'None'. If there are null values, I'd have to handle them.
@jezrael confirmed, my method doesn't handle nulls gracefully. However, to be fair, neither do the other methods proposed. I could drop them to begin with and add them back later.
@jezrael unfortunately, I have to handle more than that.
Ya, numpy. But you can simplify it - the solution works nicely if there are no NaNs. ;)
1

You need to split the strings in the first column on the plus sign into lists, recast each list as a Series object, stack the Series objects into a single Series, and reset the index to a single level, keeping only the original row identifier.

Then we need to concatenate this Series back with the original DataFrame on the index, dropping the original column. I have named the columns for convenience:

import pandas as pd

df = pd.DataFrame([['a', 0, 1], ['b+c', 2, 3], ['d+e+f', 4, 5], ['g', 6, 7]], 
                  columns=list('ABC'))

s_A = df.A.str.split('+').apply(pd.Series).stack().reset_index(level=1, drop=True)
s_A.name = 'A_split'
pd.concat([df.drop('A', axis=1), s_A], axis=1)

# returns:
   B  C A_split
0  0  1       a
1  2  3       b
1  2  3       c
2  4  5       d
2  4  5       e
2  4  5       f
3  6  7       g
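If you then want a clean 0..n index and the split column back in front, a small post-processing step works. This is a sketch building on the code above, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([['a', 0, 1], ['b+c', 2, 3], ['d+e+f', 4, 5], ['g', 6, 7]],
                  columns=list('ABC'))

s_A = df.A.str.split('+').apply(pd.Series).stack().reset_index(level=1, drop=True)
s_A.name = 'A_split'
out = pd.concat([df.drop('A', axis=1), s_A], axis=1)

# Replace the duplicated index with a fresh RangeIndex and
# move the split column back to the front.
out = out.reset_index(drop=True)[['A_split', 'B', 'C']]
```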


1

If your problem is specific to each row either being split into two or left alone, you can simply collect the rows you want to split and append them to your DataFrame:

import pandas as pd
df = pd.DataFrame([ ['a',   0, 1  ], ['b+c', 2, 3  ], 
                    ['d+e', 4, 5  ], ['f',   6, 7  ] ])
df_split = df[df[0].str.contains(r'\+')].copy()
df_split['new_col_name'] = df[0].str.extract(r'\+(.*)', expand=False)
df['new_col_name'] = df[0].str.extract(r'([^\+]*)', expand=False)

pd.concat([df, df_split])  # required answer; DataFrame.append was removed in pandas 2.0

If the ordering of the rows is important, you could start by creating a column of row numbers, e.g. df['no'] = list(range(len(df))), and then call sort_values('no') at the end.
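Putting the answer and that suggestion together, an order-preserving version might look like the sketch below. It writes the split pieces back into column 0 (instead of a new column) so the output matches the frame asked for; 'no' is just a throwaway column name:

```python
import pandas as pd

df = pd.DataFrame([['a',   0, 1], ['b+c', 2, 3],
                   ['d+e', 4, 5], ['f',   6, 7]])

df['no'] = range(len(df))  # remember the original row order

# Rows containing '+' get a second copy holding the part after the '+';
# the originals keep only the part before it.
df_split = df[df[0].str.contains(r'\+')].copy()
df_split[0] = df_split[0].str.extract(r'\+(.*)', expand=False)
df[0] = df[0].str.extract(r'([^\+]*)', expand=False)

# A stable sort keeps each 'b' before its matching 'c'.
result = (pd.concat([df, df_split])
          .sort_values('no', kind='stable')
          .drop('no', axis=1)
          .reset_index(drop=True))
```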

