Creating new variable based on substring in another variable in Python

Question

I am trying to create a binary (yes/no) variable based on what is in a particular text string (in Python).

The data looks something like:

Person ID	Test Result
87	No exercise induced ischaemia
88	Treadmill test induced increased BP
89	NORMAL test on treadmill

and so on.

I need to pick out all the people who have "No exercise induced ischaemia". Can anybody shed some light on how to do this? The real data set has about 20 columns and about 14,000 rows that need to be searched.

Here's an example dataframe for convenience:

import pandas as pd

d = {'ID': [87, 88, 89, 90, 91, 92], 'TestResult': ["No exercise induced ischaemia", "NORMAL test on treadmill", "No exercise induced ischaemia", "treadmill induced ischaemia", "NORMAL test on treadmill", "No exercise induced ischaemia"]}
df = pd.DataFrame(data=d)

I've tried things like

df['NegTest'] = df[df.TestResult.str.contains('No exercise induced ischaemia',case=True)]

with no luck.

Thanks for any help!

Do you just need to drop the outer df[] on the right hand side? Just assign the result of .str.contains(). That's a bool Series.
– Andras Deak -- Слава Україні, Commented Feb 2, 2022 at 0:31
Just df['NegTest'] = df.TestResult.str.contains('No exercise induced ischaemia',case=True) will give you a new column of bools.
– Mark, Commented Feb 2, 2022 at 0:32
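
The point made in the two comments above can be illustrated with a short sketch. Indexing a DataFrame with a boolean mask filters rows and returns a smaller DataFrame, while the mask itself is a boolean Series aligned with the original index, which is what a new column needs. The names `mask` and `matches` are introduced here for illustration only, and the data is a shortened version of the example in the question.

```python
import pandas as pd

d = {
    "ID": [87, 88, 89],
    "TestResult": [
        "No exercise induced ischaemia",
        "NORMAL test on treadmill",
        "No exercise induced ischaemia",
    ],
}
df = pd.DataFrame(data=d)

# Boolean Series: one True/False per row, sharing df's index.
mask = df.TestResult.str.contains("No exercise induced ischaemia", case=True)

# Indexing with the mask filters rows. The result is a DataFrame holding
# only the matching rows, not a single column of flags.
matches = df[mask]

# Assigning the mask itself creates the new column.
df["NegTest"] = mask
```

Here `matches` holds only the two matching rows, whereas `df["NegTest"]` has one value for every row of `df`.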

user17242583 · Accepted Answer · 2022-02-02 00:32:28Z


You're very close. Just use np.where to actually generate the yes/no:

import numpy as np
df['NegTest'] = np.where(df.TestResult.str.contains('No exercise induced ischaemia', case=True), 'yes', 'no')

Output:

>>> df
   ID                     TestResult NegTest
0  87  No exercise induced ischaemia     yes
1  88       NORMAL test on treadmill      no
2  89  No exercise induced ischaemia     yes
3  90    treadmill induced ischaemia      no
4  91       NORMAL test on treadmill      no
5  92  No exercise induced ischaemia     yes

If you want it to just be True/False, you can even skip np.where:

df['NegTest'] = df.TestResult.str.contains('No exercise induced ischaemia', case=True)

Output:

>>> df
   ID                     TestResult  NegTest
0  87  No exercise induced ischaemia     True
1  88       NORMAL test on treadmill    False
2  89  No exercise induced ischaemia     True
3  90    treadmill induced ischaemia    False
4  91       NORMAL test on treadmill    False
5  92  No exercise induced ischaemia     True
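
With a larger real data set, some rows may have no test result at all. The following is a minimal sketch of one way to handle that, assuming missing values should count as "no match"; the `None` entry, the `phrase` and `is_neg` names, and the `NegTestFlag` column are hypothetical additions, not part of the original question or answer. It uses the `na` and `regex` parameters of `str.contains`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [87, 88, 89, 90],
    "TestResult": [
        "No exercise induced ischaemia",
        "NORMAL test on treadmill",
        None,  # hypothetical missing result
        "treadmill induced ischaemia",
    ],
})

phrase = "No exercise induced ischaemia"

# regex=False treats the phrase as a literal substring rather than a pattern.
# na=False makes missing values evaluate to False instead of propagating NaN.
is_neg = df["TestResult"].str.contains(phrase, case=True, regex=False, na=False)

df["NegTest"] = np.where(is_neg, "yes", "no")  # text labels
df["NegTestFlag"] = is_neg.astype(int)         # 1/0 indicator
```

Whether a missing result should really be treated as "no" depends on the analysis; leaving `na` unset keeps such rows distinguishable as missing.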



1 Comment

imogenrae Over a year ago

Oh brilliant, I knew there had to be something missing!! Thanks :)
