I have the pandas DataFrame below, and I am trying to split col1 into multiple columns based on the split_format string.

Inputs:

import pandas as pd

split_format = 'id-id1_id2|id3'

data = {'col1':['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2':[20, 21, 19, 18]}
df = pd.DataFrame(data)
df

col1        col2
a-a1_a2|a3   20
b-b1_b2|b3   21
c-c1_c2|c3   19
d-d1_d2|d3   18

Expected Output:

id  id1 id2 id3 col2
 a   a1  a2  a3  20
 b   b1  b2  b3  21
 c   c1  c2  c3  19
 d   d1  d2  d3  18

Note: The special characters and column names in split_format can change.

3
  • Explain the split format. Why isn't it id_id1_id2_col2? Commented Jun 3, 2021 at 14:06
  • @GoldenLion I want to split the column based on a user input string. In this example the user input is split_format = 'id-id1_id2|id3', and we should be able to split accordingly. Commented Jun 3, 2021 at 14:10
  • I am parsing split_format for non-alphanumeric symbols to get the column names id, id1, id2, and id3. I will then use a recursive tree to evaluate the value string for the values in the columns. Commented Jun 3, 2021 at 14:15

2 Answers

2

I think I am able to figure it out.

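import re

# Column names: the alphanumeric runs in split_format
col_name = re.split('[^0-9a-zA-Z]+', split_format)
# Split col1 on the same runs of non-alphanumeric characters
df[col_name] = df['col1'].str.split('[^0-9a-zA-Z]+', expand=True)
del df['col1']
df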



   col2 id  id1 id2 id3
0   20  a   a1  a2  a3
1   21  b   b1  b2  b3
2   19  c   c1  c2  c3
3   18  d   d1  d2  d3
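
The same idea can be wrapped in a small reusable function. This is only a sketch: the name split_by_format and its parameters are hypothetical (not part of pandas), and it assumes that field names and values are purely alphanumeric and that the values in the column have the same number of fields as split_format.

```python
import re

import pandas as pd


def split_by_format(df, column, split_format):
    """Hypothetical helper: split `column` into the fields named in `split_format`."""
    delimiters = r'[^0-9a-zA-Z]+'  # any run of non-alphanumeric characters
    names = re.split(delimiters, split_format)
    parts = df[column].str.split(delimiters, expand=True)
    if parts.shape[1] != len(names):
        raise ValueError(
            f"expected {len(names)} fields, found up to {parts.shape[1]}"
        )
    parts.columns = names
    # New columns first, then the remaining original columns
    return pd.concat([parts, df.drop(columns=column)], axis=1)


df = pd.DataFrame({'col1': ['a-a1_a2|a3', 'b-b1_b2|b3'],
                   'col2': [20, 21]})
out = split_by_format(df, 'col1', 'id-id1_id2|id3')
print(out)
```

Returning a new frame rather than modifying df in place keeps the original data available, and it puts the new columns in the order shown in the expected output.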

1

I parse the delimiter symbols out of split_format and then recursively split the value string on each symbol in turn. After each split I flatten the resulting list and recurse on it until all the symbols have been consumed.

import pandas as pd

split_format = 'id-id1_id2|id3'

data = {'col1': ['a-a1_a2|a3', 'b-b1_b2|b3', 'c-c1_c2|c3', 'd-d1_d2|d3'],
        'col2': [20, 21, 19, 18]}
df = pd.DataFrame(data)

# Collect the delimiter symbols in the order they appear in split_format
symbols = []
for x in split_format:
    if not x.isalnum():
        symbols.append(x)

def parseTree(stringlist, symbols, result):
    # Base case: no symbols left, so every string is a final value
    if len(symbols) == 0:
        result.extend(stringlist)
        return
    # Split every string on the next symbol, flatten, and recurse
    token = symbols.pop(0)
    elements = []
    for item in stringlist:
        elements.append(item.split(token))

    flat_list = [item for sublist in elements for item in sublist]
    parseTree(flat_list, symbols, result)

columns = ["id", "id1", "id2", "id3"]
rows = []
for key, item in df.iterrows():
    symbols2 = symbols.copy()  # parseTree consumes the list it is given
    result = []
    parseTree([item['col1']], symbols2, result)
    rows.append(result)

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df2 = pd.DataFrame(rows, columns=columns)
df2['col2'] = df['col2']
print(df2)

output:

  id id1 id2 id3  col2
0  a  a1  a2  a3    20
1  b  b1  b2  b3    21
2  c  c1  c2  c3    19
3  d  d1  d2  d3    18
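
The recursion above is equivalent to splitting on each symbol in turn inside a loop, and the column names can be read from split_format instead of being hard-coded. Here is a minimal sketch under those assumptions. The name split_in_order is hypothetical, and like the recursive version it splits every piece on every symbol, so it assumes a symbol never appears inside a value.

```python
import pandas as pd


def split_in_order(value, symbols):
    """Split `value` on each symbol in turn, flattening after every pass."""
    pieces = [value]
    for token in symbols:
        pieces = [part for piece in pieces for part in piece.split(token)]
    return pieces


split_format = 'id-id1_id2|id3'
symbols = [ch for ch in split_format if not ch.isalnum()]
# Applying the same split to the format string itself yields the column names
columns = split_in_order(split_format, symbols)

df = pd.DataFrame({'col1': ['a-a1_a2|a3', 'b-b1_b2|b3'],
                   'col2': [20, 21]})
rows = [split_in_order(value, symbols) for value in df['col1']]
df2 = pd.DataFrame(rows, columns=columns)
df2['col2'] = df['col2']
print(df2)
```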

2 Comments

Thank you. I ended up using the shorter version with regex.
Great. I took the longer path because of the case analysis I learned in college algebra.
