0

I am trying to split multiple columns into multiple rows in few lines rather than writing a big 'def function' for it. I have 2 columns that need splitting by ;. I am trying to follow this: Split cell into multiple rows in pandas dataframe

from pandas import DataFrame
import numpy as np
from itertools import chain
import pandas as pd

a = DataFrame([{'var1': 'a;b;c', 'var2': 1,'var3':'apples;mango'},
            {'var1': 'd;e;f', 'var2': 2,'var3':'kiwi;pineapple'},
            {'var1': 'g;e;a', 'var2': 15,'var3':'pinneapple'},
            {'var1': 'm', 'var2': 12,'var3':'orange'}])
a

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(';')))

# calculate lengths of splits
lens = a['var1'].str.split(';').map(len)

# create new dataframe, repeating or chaining as appropriate
new_df = pd.DataFrame({'var1': chainer(a['var1']),
                    'var2': np.repeat(a['var2'], lens),
                    'var3': chainer(a['var3'])
                })

print(new_df)

#ERROR: ValueError: arrays must all be same length

Looking for Output like this: enter image description here

1
  • the main difference you have with the solution of jpp you seem to try is that you want the Cartesian product of the values in both var1 and var3 (for example the first row has 3*2 rows in the output) while the solution is written when any value in package is associated to one value in package code. Commented Dec 13, 2018 at 20:12

3 Answers 3

1

A product is more appropriated here :

import itertools

def splitter(values):
    return list(itertools.product( *[str(v).split(';') for v in values]))

In [6]: splitter(df.iloc[2]) #example
Out[6]: 
[('g', '15', 'pinneapple'),
 ('e', '15', 'pinneapple'),
 ('a', '15', 'pinneapple')]

Now, do it for all rows and rebuild a fresh dataframe:

def expand(df):
    tuples=list()
    for i,row in df.iterrows():
        tuples.extend(splitter(row))
    return  pd.DataFrame.from_records(tuples,columns=df.columns)

Result :

In [7]: expand(df)
   var  var2        var3
0     a    1      apples
1     a    1       mango
2     b    1      apples
3     b    1       mango
4     c    1      apples
5     c    1       mango
6     d    2        kiwi
7     d    2   pineapple
8     e    2        kiwi
9     e    2   pineapple
10    f    2        kiwi
11    f    2   pineapple
12    g   15  pinneapple
13    e   15  pinneapple
14    a   15  pinneapple
15    m   12      orange
Sign up to request clarification or add additional context in comments.

Comments

0

Try the following code:

import pandas as pd

# Source data
a = pd.DataFrame([{'var1': 'a;b;c', 'var2': 1,'var3':'apples;mango'},
    {'var1': 'd;e;f', 'var2':  2,'var3':'kiwi;pineapple'},
    {'var1': 'g;e;a', 'var2': 15,'var3':'pinneapple'},
    {'var1': 'm',     'var2': 12,'var3':'orange'}])
# Split var1
a2 = a.var1.apply(lambda t: pd.Series(t.split(';')))\
    .merge(a, right_index = True, left_index = True)\
    .drop(['var1'], axis = 1)\
    .melt(id_vars = ['var2', 'var3'], value_name = 'var1')\
    .drop('variable', axis = 1).dropna()
# Split var3
a3 = a2.var3.apply(lambda t: pd.Series(t.split(';')))\
    .merge(a2, right_index = True, left_index = True)\
    .drop(['var3'], axis = 1)\
    .melt(id_vars = ['var1', 'var2'], value_name = 'var3')\
    .drop('variable', axis = 1).dropna()
# Sort the result
a3.sort_values(['var2', 'var3'])

a2 contains a table with var1 divided into separate rows.

The same operation for var3 performs the next instruction (similar to the previous with changed names.

The last step is to sort the result.

To understand how this code works, perform separately each step of one of the chained instructions.

If you have more columns to be divided, add analogous "splitting" instructions for them.

Comments

0

You can join each column once you have use str.split and stack on each column. This can be done by first defining a function to get the column in the proper shape:

def split_stack_col (a_col):
    return (a_col.astype(str).str.split(';',expand=True).stack()
                 .reset_index(level=1,name=a_col.name)[[a_col.name]])

Then you map this function to each column before using reduce with join. Add a reset_index if needed

from functools import reduce
new_df = reduce(lambda x,y: x.join(y), 
                map( split_stack_col, 
                     (a[col] for col in a.columns))).reset_index(drop=True)

This method can be used for any number of column in a, and you get as expected:

print (new_df)
   var1 var2        var3
0     a    1      apples
1     a    1       mango
2     b    1      apples
3     b    1       mango
4     c    1      apples
5     c    1       mango
6     d    2        kiwi
7     d    2   pineapple
8     e    2        kiwi
9     e    2   pineapple
10    f    2        kiwi
11    f    2   pineapple
12    g   15  pinneapple
13    e   15  pinneapple
14    a   15  pinneapple
15    m   12      orange

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.