
WHAT I HAVE:

import pandas as pd
inp = [{'long string':'ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'}, 
{'long string':'hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'}, 
{'long string':'ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2)'}]
df = pd.DataFrame(inp)
df

GIVES

    long string
0   ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho...
1   hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho...
2   ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha...

WHAT I WANT

inp = {'ha-tra':['1', '1', '1'], 'ha-la':['2', '2', '2'], 'hi-tra':['1', '1', '1'], 'hi-la':['2', '2', '2'],'ho-tra':['1', '1', '1'], 'ho-la':['2', '2', '2']}
df = pd.DataFrame(inp)
df

GIVES

    ha-tra  ha-la   hi-tra  hi-la   ho-tra  ho-la
0   1       2       1       2       1       2
1   1       2       1       2       1       2
2   1       2       1       2       1       2

CONTEXT

From each long string, I want to extract every combination of (ha, hi, ho) and (tra, la), together with the score attached to that combination. The problem is that the order of (ha, hi, ho) differs from row to row.

2 Answers

ndf = (df["long string"]
         .str.extractall(r"(ha|hi|ho):\s\((?:tra|la):\s(\d+)\s(?:tra|la):\s(\d+)\)")
         .droplevel("match")
         .set_index(0, append=True)
         .set_axis(["tra", "la"], axis=1)
         .unstack()
         .swaplevel(axis=1))
ndf.columns = ndf.columns.map("-".join)
  • Extract the desired parts with a regex
  • Drop the index level named match, which extractall adds
  • Append the ha-hi-ho matches to the index (0 is the first capturing group)
  • Rename the remaining columns to tra and la
  • Unstack the ha-hi-ho index level into the columns
  • Swap the order of the ha-hi-ho and tra-la column levels so that ha-hi-ho is the upper one
  • Lastly, join the names of these column levels with a hyphen

to get

  ha-tra hi-tra ho-tra ha-la hi-la ho-la
0      1      1      1     2     2     2
1      1      1      1     2     2     2
2      1      1      1     2     2     2
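
The steps above can be sketched end to end as follows. This is a minimal sketch using the sample data from the question; the final reordering step and the name `wanted` are additions that are not part of the answer, included only to match the column order asked for in the question. Note that the pattern labels the first number tra and the second la, so it assumes tra always comes before la inside the parentheses.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "long string": [
        "ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)",
        "hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)",
        "ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2)",
    ]
})

pattern = r"(ha|hi|ho):\s\((?:tra|la):\s(\d+)\s(?:tra|la):\s(\d+)\)"

# extractall returns one row per match, indexed by (original row, match),
# with one integer-labelled column per capturing group: 0, 1, 2.
extracted = df["long string"].str.extractall(pattern)

ndf = (extracted
       .droplevel("match")               # keep only the original row label
       .set_index(0, append=True)        # index becomes (row, ha/hi/ho)
       .set_axis(["tra", "la"], axis=1)  # name the two score columns
       .unstack()                        # move ha/hi/ho into the columns
       .swaplevel(axis=1))               # put ha/hi/ho above tra/la
ndf.columns = ndf.columns.map("-".join)

# Assumption: reorder to the layout shown in the question (not in the answer).
wanted = [f"{name}-{field}" for name in ("ha", "hi", "ho") for field in ("tra", "la")]
ndf = ndf[wanted]
print(ndf)
```

The extracted values are strings; calling `ndf.astype(int)` afterwards would convert them if numeric scores are needed.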

5 Comments

This answer is perfect. Follow-up question for those interested: what if there are 52 variables instead of just (ha, hi, ho)? They all have the following structure: PA0.01, PB0.02, PA0.03.
@Charles So the pattern is not the same as in this question, e.g., PA: (tra: 0.01)? If not, maybe you can ask another question, since it would be a somewhat different question from this one.
Sorry, yes, the pattern is similar, only with 52 variants of ha/ho/hi and 4 variants of tra/la. An example would be "Sorted zone data list: ZoneData(zoneId=PA3.40, zoneLocationNumber=2672747, occupancyPercentage=61, numberOfHitsOnItemsFromTote=0, spurIsFull=false, isOnHold=false)\n" instead of "ha: (tra: 1 la: 2) \n"
@Charles I see, but you only want to extract certain fields, e.g., those that start with PA? Also, can you please edit the question with these new samples and desired output?
I adapted this solution a little for my own case. If you are interested, the final regex became: .str.extractall(r"(PA0.02|PB0.09|PB3.47|PA1.14|PB1.21|PA3.44|PB1.24|PB3.45|PB2.34|PA0.03|PA2.30|PA0.01|PB2.33|PA3.40|PA0.06|PB1.23|PB3.49|PB1.20|PB2.31|PA1.13|PA3.42|PA3.39|PA1.18|PB3.48|PB1.19|PB1.22|PB0.10|PA1.17|PA2.28|PA1.16|PA2.26|PA3.41|PA2.25|PA1.15|PA0.05|PA2.29|PB3.50|PB2.35|PB3.52|PB2.36|PB0.08|PB3.51|PB0.12|PB2.38|PA3.43|PB3.46|PA2.27|PB2.32|PB0.07|PA0.04|PB2.37),\s(?:zoneLocationNumber)=(\d+),\soccupancyPercentage=(\d+), numberOfHitsOnItemsFromTote=(\d+)")
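
A long alternation like the one in the last comment can also be built programmatically. The following is only a sketch under stated assumptions: `zone_ids` is a hypothetical two-item list (the real case has many more), the second log line is invented for illustration, and the output column names are my own choice, since the comments do not show the desired output. `re.escape` is used so that the dot in an ID such as PA3.40 matches only a literal dot.

```python
import re
import pandas as pd

# First line is the example quoted in the comments; the second is made up.
logs = pd.DataFrame({
    "long string": [
        "ZoneData(zoneId=PA3.40, zoneLocationNumber=2672747, occupancyPercentage=61, "
        "numberOfHitsOnItemsFromTote=0, spurIsFull=false, isOnHold=false)\n"
        "ZoneData(zoneId=PB0.09, zoneLocationNumber=2672748, occupancyPercentage=12, "
        "numberOfHitsOnItemsFromTote=3, spurIsFull=false, isOnHold=false)\n",
    ]
})

# Hypothetical list of zone IDs; escape each one so "." is literal.
zone_ids = ["PA3.40", "PB0.09"]
zone_alt = "|".join(re.escape(z) for z in zone_ids)

pattern = (
    rf"zoneId=({zone_alt}),\szoneLocationNumber=(\d+),"
    r"\soccupancyPercentage=(\d+),\snumberOfHitsOnItemsFromTote=(\d+)"
)

fields = ["zoneLocationNumber", "occupancyPercentage", "numberOfHitsOnItemsFromTote"]

# Same reshaping chain as in the answer, with three value columns instead of two.
wide = (logs["long string"]
        .str.extractall(pattern)
        .droplevel("match")
        .set_index(0, append=True)
        .set_axis(fields, axis=1)
        .unstack()
        .swaplevel(axis=1))
wide.columns = wide.columns.map("-".join)
print(wide)
```

If every ID follows the same shape, a single generic group could replace the list, but that would also match IDs that are not wanted, so the explicit list is kept here.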

One way to solve:

df1 = (
    df['long string'].str.extractall(
        r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
    .droplevel("match")
    .set_index(0, append=True)
)

d1 = df1.iloc[:, :2]
d2 = df1.iloc[:, 2:]
d2.columns = d1.columns

df2 = pd.concat([d1, d2]).reset_index()
df2 = df2.pivot(index='level_0', columns=[0, 1], values=2)
df2.columns = df2.columns.map('-'.join)
df2 = df2.reset_index(drop=True)

ALTERNATIVE:

df2 = (
    (
        df['long string'].str.extractall(
            r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
        .droplevel("match")
        .set_index(0, append=True)
        .apply(lambda x: x.values.reshape(-1, 2), axis=1)
        .explode()
        .apply(pd.Series)
        .add_prefix('val')
        .reset_index()
    ).pivot(index=['level_0'], columns=[0, 'val0'], values='val1')
).reset_index(drop=True)
df2.columns = df2.columns.map('-'.join)

OUTPUT:

  ha-la ha-tra hi-la hi-tra ho-la ho-tra
0     2      1     2      1     2      1
1     2      1     2      1     2      1
2     2      1     2      1     2      1
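
For comparison, the same table can be built without pivoting, by parsing each row into a dict first. This is a different technique from the one used in this answer, sketched under the assumption that every block looks like `name: (field: number field: number ...)`; the names `OUTER`, `INNER`, and `parse_row` are hypothetical helpers introduced here. Because each field name is read from the string itself, it does not depend on tra and la appearing in a fixed order.

```python
import re
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "long string": [
        "ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)",
        "hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)",
        "ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2)",
    ]
})

OUTER = re.compile(r"([a-z]+)\s*:\s*\(([^)]*)\)")  # name: ( ...body... )
INNER = re.compile(r"([a-z]+):\s*(\d+)")           # field: number


def parse_row(text):
    """Turn one long string into a flat {'name-field': value} dict."""
    record = {}
    for name, body in OUTER.findall(text):
        for field, value in INNER.findall(body):
            record[f"{name}-{field}"] = value
    return record


wide = pd.DataFrame([parse_row(s) for s in df["long string"]], index=df.index)
wide = wide.sort_index(axis=1)  # alphabetical column order, as in the output above
print(wide)
```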
