How to create columns from a string in a dataframe?

Question

WHAT I HAVE:

import pandas as pd
inp = [{'long string':'ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'}, 
{'long string':'hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'}, 
{'long string':'ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2)'}]
df = pd.DataFrame(inp)
df

GIVES

    long string
0   ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho...
1   hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho...
2   ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha...

WHAT I WANT

inp = {'ha-tra':['1', '1', '1'], 'ha-la':['2', '2', '2'], 'hi-tra':['1', '1', '1'], 'hi-la':['2', '2', '2'],'ho-tra':['1', '1', '1'], 'ho-la':['2', '2', '2']}
df = pd.DataFrame(inp)
df

GIVES

    ha-tra  ha-la   hi-tra  hi-la   ho-tra  ho-la
0   1       2       1       2       1       2
1   1       2       1       2       1       2
2   1       2       1       2       1       2

CONTEXT

From a long string, I want to extract every combination of (ha, hi, ho) and (tra, la), together with the score attached to each combination. The problem is that the order of (ha, hi, ho) differs from row to row.

Mustafa Aydın · Accepted Answer · 2021-05-31 08:51:38Z


ndf = (df["long string"]
         .str.extractall(r"(ha|hi|ho):\s\((?:tra|la):\s(\d+)\s(?:tra|la):\s(\d+)\)")
         .droplevel("match")
         .set_index(0, append=True)
         .set_axis(["tra", "la"], axis=1)
         .unstack()
         .swaplevel(axis=1))
ndf.columns = ndf.columns.map("-".join)

Extract the desired parts with a regex
Drop the index level named match, which extractall adds
Append the ha-hi-ho matches to the index (0 is the first capturing group)
Rename the remaining columns to tra and la
Unstack the ha-hi-ho index level into the columns
Swap the order of the ha-hi-ho and tra-la column levels so that ha-hi-ho is the upper one
Finally, join the two levels of each column name with a hyphen

to get

  ha-tra hi-tra ho-tra ha-la hi-la ho-la
0      1      1      1     2     2     2
1      1      1      1     2     2     2
2      1      1      1     2     2     2
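To see what each method in the chain contributes, the same pipeline can be run one step at a time. This is only a sketch on the sample frame from the question; the names step1 to step6 and result are made up for illustration and are not part of the answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"long string": [
    "ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)",
    "hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)",
    "ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2)",
]})

pattern = r"(ha|hi|ho):\s\((?:tra|la):\s(\d+)\s(?:tra|la):\s(\d+)\)"

# One row per match; index levels are (original row, "match"), columns are 0, 1, 2.
step1 = df["long string"].str.extractall(pattern)
# Keep only the original row number in the index.
step2 = step1.droplevel("match")
# Move the ha/hi/ho group (column 0) into the index as a second level.
step3 = step2.set_index(0, append=True)
# Name the two remaining value columns.
step4 = step3.set_axis(["tra", "la"], axis=1)
# Pivot ha/hi/ho from the index into a second column level.
step5 = step4.unstack()
# Put ha/hi/ho on top of tra/la in the column MultiIndex.
step6 = step5.swaplevel(axis=1)

# Flatten the two column levels into "ha-tra", "hi-tra", ...
result = step6.copy()
result.columns = result.columns.map("-".join)
print(result)
```

Printing each intermediate frame (for example `print(step3)`) shows how the index and columns change from one step to the next.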


5 Comments

Charles Over a year ago

This answer is perfect. A follow-up question for those interested: what if there are many more variables than (ha-ho-hi), say 52? They all have the following structure: PA0.01, PB0.02, PA0.03.

Mustafa Aydın Over a year ago

@Charles So the pattern is not the same as in this question, e.g., PA: (tra: 0.01)? If it is different, you could ask a new question, since it would be a somewhat different problem from this one.

Charles Over a year ago

Sorry yes the pattern is similar, only it would be 52 variants of ha/ho/hi and 4 variants of tra/la. An example would be "Sorted zone data list: ZoneData(zoneId=PA3.40, zoneLocationNumber=2672747, occupancyPercentage=61, numberOfHitsOnItemsFromTote=0, spurIsFull=false, isOnHold=false)\n" instead of "ha: (tra: 1 la: 2) \n"

Mustafa Aydın Over a year ago

@Charles I see, but you only want to extract certain fields, e.g., those that start with PA? Also, can you please edit the question with these new samples and desired output?

Charles Over a year ago

I adapted this solution slightly for my own data. In case anyone is interested, the final regex became: .str.extractall(r"(PA0.02|PB0.09|PB3.47|PA1.14|PB1.21|PA3.44|PB1.24|PB3.45|PB2.34|PA0.03|PA2.30|PA0.01|PB2.33|PA3.40|PA0.06|PB1.23|PB3.49|PB1.20|PB2.31|PA1.13|PA3.42|PA3.39|PA1.18|PB3.48|PB1.19|PB1.22|PB0.10|PA1.17|PA2.28|PA1.16|PA2.26|PA3.41|PA2.25|PA1.15|PA0.05|PA2.29|PB3.50|PB2.35|PB3.52|PB2.36|PB0.08|PB3.51|PB0.12|PB2.38|PA3.43|PB3.46|PA2.27|PB2.32|PB0.07|PA0.04|PB2.37),\s(?:zoneLocationNumber)=(\d+),\soccupancyPercentage=(\d+), numberOfHitsOnItemsFromTote=(\d+)")
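A long alternation like this can also be built from a list instead of being typed out. The sketch below assumes the zone ids are available as a Python list (zone_ids is a hypothetical two-item subset) and uses re.escape so that the dot in an id such as PA3.40 is matched literally rather than as "any character". The sample log line is the one quoted in the comment above.

```python
import re
import pandas as pd

# Hypothetical subset of zone ids; the real list would hold all of them.
zone_ids = ["PA3.40", "PB0.09"]

# re.escape turns "." into "\." so "PA3.40" cannot match e.g. "PA3x40".
alternation = "|".join(re.escape(z) for z in zone_ids)
pattern = (
    rf"({alternation}),\szoneLocationNumber=(\d+),"
    r"\soccupancyPercentage=(\d+),\snumberOfHitsOnItemsFromTote=(\d+)"
)

logs = pd.Series([
    "Sorted zone data list: ZoneData(zoneId=PA3.40, zoneLocationNumber=2672747, "
    "occupancyPercentage=61, numberOfHitsOnItemsFromTote=0, "
    "spurIsFull=false, isOnHold=false)\n"
])

# One row per match, one column per capturing group.
extracted = logs.str.extractall(pattern)
print(extracted)
```

The remaining steps (droplevel, set_index, unstack) would then apply as in the answer, with the column names adjusted to the three extracted fields.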

Nk03 · Accepted Answer · 2021-05-31 09:44:46Z

One way to solve:

df1 = (
    df['long string'].str.extractall(
        r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
    .droplevel("match")
    .set_index(0, append=True)
)

d1 = df1.iloc[:, :2]
d2 = df1.iloc[:, 2:]
d2.columns = d1.columns

df2 = pd.concat([d1, d2]).reset_index()
df2 = df2.pivot(index='level_0', columns=[0, 1], values=2)
df2.columns = df2.columns.map('-'.join)
df2 = df2.reset_index(drop=True)

ALTERNATIVE:

df2 = (
    (
        df['long string'].str.extractall(
            r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
        .droplevel("match")
        .set_index(0, append=True)
        .apply(lambda x: x.values.reshape(-1, 2), axis=1)
        .explode()
        .apply(pd.Series)
        .add_prefix('val')
        .reset_index()
    ).pivot(index=['level_0'], columns=[0, 'val0'], values='val1')
).reset_index(drop=True)
df2.columns = df2.columns.map('-'.join)

OUTPUT:

  ha-la ha-tra hi-la hi-tra ho-la ho-tra
0     2      1     2      1     2      1
1     2      1     2      1     2      1
2     2      1     2      1     2      1
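The values here are right, but the columns come out in sorted order rather than the order shown in the question. If that order matters, one option is to build the wanted names and select them. In the sketch below, df2 is a hand-written stand-in for the pivot result (with the hyphenated names that '-'.join produces), and wanted, ordered and as_numbers are illustrative names. The values are strings because extractall returns text, so the last line shows an optional conversion.

```python
import pandas as pd

# Stand-in for the pivoted result: sorted column order, values as strings.
df2 = pd.DataFrame({
    "ha-la": ["2", "2", "2"], "ha-tra": ["1", "1", "1"],
    "hi-la": ["2", "2", "2"], "hi-tra": ["1", "1", "1"],
    "ho-la": ["2", "2", "2"], "ho-tra": ["1", "1", "1"],
})

# Column order requested in the question: ha-tra, ha-la, hi-tra, hi-la, ...
wanted = [f"{outer}-{inner}"
          for outer in ("ha", "hi", "ho")
          for inner in ("tra", "la")]
ordered = df2[wanted]

# Optional: convert the extracted digit strings to integers.
as_numbers = ordered.astype(int)
print(as_numbers)
```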
