2

My question is similar to this one and this one but I can't get their solutions to work for my problem.

I have a dataframe that looks like this:

    study_id    fuzzy_market
0   study1  [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
1   study2  [Country: Germany], [Management experience: Yes]
2   study3  [Country: United Kingdom], [Language: English]
3   study4  [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
4   study5  [Age: 48-99]

I'd like it to look like this:

study_id Age Country of Birth Country Language Management experience
study1 18-67 Austria, Germany Austria, Germany German None
study2 None None Germany None Yes
study3 None None United Kingdom English None
study4 18-67 Austria, Germany Austria, Germany German None
study5 48-99 None None None None

So one row per study_id, the text before each colon in the fuzzy_market column as the column title, and the text after each colon as the data in the cell. Where there is no relevant data for a column, I'd like to fill it with None. All the columns can be strings. I don't know how many columns there will be, so I need this to be dynamic.

Here's the setup and data:

import pandas as pd
import numpy as np
import re

np.random.seed(12345)

df = pd.DataFrame.from_dict({'study_id': {0: 'study1',
  1: 'study2',
  2: 'study3',
  3: 'study4',
  4: 'study5'},
 'fuzzy_market': {0: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
  1: '[Country: Germany], [Management experience: Yes]',
  2: '[Country: United Kingdom], [Language: English]',
  3: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
  4: '[Age: 48-99]'}})

So far I have tried manipulating the strings in the fuzzy_markets column, but I don't think this approach is correct.

# a function to strip the square brackets, as I'm not sure this is really a list in here
def remove_square_brackets(x):
    return re.sub(r"[\[\]]", "", x)

# make a new dataframe where there are new columns for data after every comma
df2 = df.join(df['fuzzy_market'].apply(remove_square_brackets).str.split(',', expand=True))

# rename the columns arbitrarily - these will need to be the question titles eventually e.g. Age rather than A, Country of Birth rather than B etc.
df2.columns = ('study_id', 'fuzzy_market', 'A', 'B', 'C', 'D', 'E', 'F')

# try and split again
df3 = df2[['study_id','A', 'B']].join(df2['A'].str.split(":", expand=True).rename(columns={0:'A1', 1:'A2'})).join(df2['B'].str.split(":", expand=True).rename(columns={0:'B1', 1:'B2'}))

# this isn't quite there yet
df3

    study_id    A   B   A1  A2  B1  B2
0   study1  Age: 18-67  Country of Birth: Austria   Age 18-67   Country of Birth    Austria
1   study2  Country: Germany    Management experience: Yes  Country Germany Management experience   Yes
2   study3  Country: United Kingdom Language: English   Country United Kingdom  Language    English
3   study4  Age: 18-67  Country of Birth: Austria   Age 18-67   Country of Birth    Austria
4   study5  Age: 48-99  None    Age 48-99   None    None

Thanks for any help or hints!

1 Answer 1

5

We can use findall to extract all the matching key-value pairs from each row, then map these pairs to to dict and create a dataframe

p = df['fuzzy_market'].str.findall(r'([^:\[]+): ([^\]]+)')
df[['study_id']].join(pd.DataFrame(map(dict, p)))

  study_id    Age  Country of Birth           Country Language Management experience
0   study1  18-67  Austria, Germany  Austria, Germany   German                   NaN
1   study2    NaN               NaN           Germany      NaN                   Yes
2   study3    NaN               NaN    United Kingdom  English                   NaN
3   study4  18-67  Austria, Germany  Austria, Germany   German                   NaN
4   study5  48-99               NaN               NaN      NaN                   NaN
Sign up to request clarification or add additional context in comments.

3 Comments

Perfect, thank you so much!
@meenaparam Happy coding!
@ShubhamSharma: I have just modified this question. Could you please answer it? Thank you. stackoverflow.com/questions/74440094/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.