I have roughly 40,000 rows of people and their complaints. I am trying to sort the complaints into their respective columns, both for my own analysis and so that other analysts at my company, who use other tools, can use this data.

DataFrame Example:

import pandas as pd

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})
df     
╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Desired Output:

╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╦════════════════════════════════════════════════════════════════════════════════╦═══════════════════════╦═══════════════╗
║   ║ person ║                                                     problems                                                     ║                                      body                                      ║         mind          ║     soul      ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╬════════════════════════════════════════════════════════════════════════════════╬═══════════════════════╬═══════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              ║ mind: stressed, tired ║ NaN           ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║ body: feels great(lifts weights), overweight(always bulking), missing a finger ║ mind: can't think     ║ soul: missing ║
║ 2 ║      3 ║ none                                                                                                             ║ NaN                                                                            ║ NaN                   ║ NaN           ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╩════════════════════════════════════════════════════════════════════════════════╩═══════════════════════╩═══════════════╝

Things I've tried / where I'm at:

So I've been able to at least separate these with a regex statement that seems to do the job with my real data.

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")


+---+-------+--------------------------------------------------------------------------------+
|   |       |                                       0                                        |
+---+-------+--------------------------------------------------------------------------------+
|   | match |                                                                                |
| 0 | 0     | body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              |
|   | 1     | mind: stressed, tired                                                          |
| 1 | 0     | soul: missing                                                                  |
|   | 1     | mind: can't think                                                              |
|   | 2     | body: feels great(lifts weights), overweight(always bulking), missing a finger |
| 2 | 0     | none                                                                           |
+---+-------+--------------------------------------------------------------------------------+

I'm a regex beginner, so I expect this could probably be done better. My original regex pattern was r'([^;]+)', but I was trying to exclude the space after the semi-colons.
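One way to drop the space after each semicolon, sketched here on the example data only, is to split on a semicolon followed by optional whitespace instead of extracting everything that is not a semicolon. The variable name pieces is just for illustration, and Series.explode needs pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "person": [1, 2, 3],
    "problems": [
        "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired",
        "soul: missing; mind: can't think; body: feels great(lifts weights), "
        "overweight(always bulking), missing a finger",
        "none",
    ],
})

# ';' followed by any amount of whitespace is the separator, so the pieces
# come out without a leading space. explode() puts each piece on its own
# row and repeats the original row label in the index.
pieces = df["problems"].str.split(r";\s*").explode()
print(pieces)
```

The index of pieces still points back to the original rows, so the result can be grouped or joined against df later.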

So I'm at a loss. I played with:

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)").unstack(), which "works"(doesn't error out) with my example here.

But with my real data, I get an error: "ValueError: Index contains duplicate entries, cannot reshape"

Even if it worked with my real data, I'd still have to figure out how to get these 'categories'(body, mind, soul) into assigned columns.

I'd probably have better luck if I could word this question better. I'm trying to really self-learn here, so I'll appreciate any leads even if they're not a complete solution.

I'm kind of sniffing a trail that maybe I can do this somehow with a groupby or multiIndex know-how. Kind of new to programming, so I'm still feeling my way around in the dark. I would appreciate any tips or ideas anyone has to offer. Thank you!

EDIT: I just want to come back and mention the error I was getting in my real data "ValueError: Index contains duplicate entries, cannot reshape" when using @WeNYoBen's solution:

(df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")[0]
.str.split(':',expand=True)
.set_index(0,append=True)[1]
.unstack()
.groupby(level=0)
.first())

It turned out I had some groups with multiple colons. For example:

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})




╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired                                 ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Note the updated first row, which reflects the edge case I discovered: "; mind: stressed, energy: tired".

I was able to fix this by altering my regex so that a match must start either at the beginning of the string or right after a semicolon.

splits = [r'(^)(.+?)[:]', r'(;)(.+?)[:]']
str.split('|'.join(splits))

After that I just had to re-tweak the set_index portion to get @WeNYoBen's helpful solution to work, so I'll stick with this one.
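For anyone landing here later, this is a rough sketch of the same idea written as a single extractall call. The pattern and the names parts, wide, and out are my own illustration, not the exact code I ended up with, and it assumes a category label never contains a colon or semicolon. It also stores only the text after the label, without the "body: " prefix shown in the desired output.

```python
import pandas as pd

df = pd.DataFrame({
    "person": [1, 2, 3],
    "problems": [
        "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired",
        "soul: missing; mind: can't think; body: feels great(lifts weights), "
        "overweight(always bulking), missing a finger",
        "none",
    ],
})

# A label only counts if it sits at the start of the string or right after
# a ';'. Everything up to the next ';' is that label's text, so the extra
# colon in "energy: tired" stays inside the "mind" text.
pattern = r"(?:^|;)\s*(?P<category>[^:;]+?)\s*:\s*(?P<text>[^;]*)"

parts = df["problems"].str.extractall(pattern)
parts["text"] = parts["text"].str.strip()

# Replace the (row, match) index with a plain "row" column.
parts = parts.reset_index(level="match", drop=True).rename_axis("row").reset_index()

# Joining repeated labels within a row keeps unstack() from hitting
# duplicate (row, category) pairs.
wide = (parts.groupby(["row", "category"])["text"]
             .agg(lambda s: "; ".join(s))
             .unstack("category"))

out = df.join(wide)  # rows without any label, such as "none", get NaN
print(out)
```

Grouping before unstacking is a deliberate choice: if a row repeats a category, the texts are joined with "; " rather than raising a reshape error.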

  • I should add that yesterday I did figure out how to get the full list of all these groupings in my real data, which is around 10 total. sorted(df.problems.str.extractall(r'(^|;)(.+?)[:]').reset_index()[1].str.strip().unique()) Commented Aug 9, 2019 at 19:33
  • Are the three categories always the same? Body, mind, soul... Commented Aug 9, 2019 at 19:38
  • If there are always three categories, then you could probably try matching from body ... ; and likewise for each category. Commented Aug 9, 2019 at 19:39
  • df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")[0].str.split(':',expand=True).set_index(0,append=True)[1].unstack().groupby(level=0).first() Your own function, after some polish, is better than apply + lambda. Commented Aug 9, 2019 at 20:11
  • @sobek I'm actually using both of your solutions as a stepping stone. Trying to work out another edge case I discovered where a semi-colon is used within a phrase, just two occurrences out of tens of thousands of rows :/ You guys rock though. Learned a lot this weekend Commented Aug 12, 2019 at 16:38

1 Answer

It's not elegant but it gets the job done:

df['split'] = df.problems.str.split(';')
df['mind'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'mind' in category]))
df['body'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'body' in category]))
df['soul'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'soul' in category]))
df.drop(columns='split', inplace=True)

You could probably wrap

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if cat in category])) 

in a function and run it on your dataframe for each cat (e.g. cats=['mind', 'body', 'soul', 'whathaveyou', 'etc.']).


Edit:

As @ifly6 has pointed out, there may be intersections of keywords in the strings that users enter. To be safe, the function should be altered to

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if category.strip().startswith(cat)])) 
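
A minimal sketch of that wrapper, under a few assumptions: the helper name add_category_columns and its parameters are made up for illustration, the source column has no missing values, and each piece is stripped before the prefix check because splitting on ';' leaves a leading space on every piece after the first.

```python
import pandas as pd


def add_category_columns(df, cats, source="problems"):
    """Return a copy of df with one new column per category label in cats."""
    out = df.copy()
    # Split on ';' and trim each piece so the label is the first thing in it.
    pieces = out[source].str.split(";").apply(lambda parts: [p.strip() for p in parts])
    for cat in cats:
        prefix = cat + ":"
        matched = pieces.apply(
            lambda parts, prefix=prefix: "; ".join(p for p in parts if p.startswith(prefix))
        )
        # Rows without this category get NaN instead of an empty string.
        out[cat] = matched.mask(matched == "")
    return out


df = pd.DataFrame({
    "person": [1, 2, 3, 4],
    "problems": [
        "body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired",
        "soul: missing; mind: can't think; body: feels great(lifts weights)",
        "none",
        "mind: he thinks his body is out of shape",
    ],
})

result = add_category_columns(df, ["body", "mind", "soul"])
print(result)
```

The fourth row is the edge case from the comments: the word "body" appears inside a mind complaint, and because the check is on the prefix "body:" it is not copied into the body column.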

7 Comments

Thank you so much! I can work with this! Man I really need to spend time understanding lambda.
If there's a specific thing you don't understand, ask and I'll try to clarify my answer.
I appreciate the offer! I'm still so new to programming in general I just need to revisit this from square one.
What if the string is like "mind: he thinks his body is out of shape"?
Note that I have edited my answer to deal with this edge case (see bottom).