0

I'm trying to subset (retrieve a set of rows) a python pandas data frame by using pd.filter with a regex string to identify the columns of interest before performing a subset based on the values in those columns.

For example, this is my mock data frame:

id status status_drug_use drugA drugA_use    drugB  drugB_use
0  1      analgesic       0     None         1      hypertensive
1  0      analgesic       1     analgesic    1      hypertensive
2  0      analgesic       1     hypertensive 0      None
3  1      analgesic       0     None         1      analgesic

I would like all rows that contain the values in columns drugA_use or drugB_use which match the value in status_drug_use. As per the example, this would return the two rows:

id status status_drug_use drugA drugA_use    drugB  drugB_use
1  0      analgesic       1     analgesic    1      hypertensive
3  1      analgesic       0     None         1      analgesic

There are a few column name conventions to stick with:

  1. status_drug_use is always there.
  2. The matching columns (drugA_use and drugB_use) always follow the template <ANYTHING>_use.

Alteration There is a second scenario, one in which I would like to perform a comparison between a user defined string eg analgesic and the two columns drugA_use and drugB_use. This is different from using the content of status_drug_use.

1 Answer 1

1

Here's a way to do what you've asked:

df2 = df.assign(all_use=df.apply(
    lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]), 
    axis=1)).explode(
    'all_use').query('status_drug_use == all_use').drop_duplicates().drop(columns='all_use')

Input:

  id status status_drug_use drugA     drugA_use drugB     drugB_use
0  0      1       analgesic     0          None     1  hypertensive
1  1      0       analgesic     1     analgesic     1  hypertensive
2  2      0       analgesic     1  hypertensive     0          None
3  3      1       analgesic     0          None     1     analgesic

Output:

  id status status_drug_use drugA  drugA_use drugB     drugB_use
1  1      0       analgesic     1  analgesic     1  hypertensive
3  3      1       analgesic     0       None     1     analgesic

Explanation:

  • find the subset of all columns ending in _use (excluding status_drug_use)
  • add a column named all_use whose value for a given row is a list of the values in the columns ending in _use
  • use explode() to add rows such that for each original row, there are now multiple rows, one for each of the values in all_use for the original row
  • use query() to select only rows where status_drug_use matches the value in all_use
  • use drop_duplicates to eliminate rows in case there were multiple matches for any rows in the original dataframe (for example, if both drugA_use and drugB_use contained "analgesic" and so did status_drug_use)
  • drop the column all_use as we no longer need it.

UPDATE: Addressing OP's question in a comment: 'Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string e.g., "analgesic"?'

You can do this by having the user defined query string (call it user_defined_str) as a variable and changing the contents of query() by replacing the column name status_drug_use with the variable name with @ prepended: @user_defined_str (see the query() docs here for more detail).

user_defined_str = 'analgesic'
df3 = df.assign(all_use=df.apply(
    lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]), 
    axis=1)).explode(
    'all_use').query('@user_defined_str == all_use').drop_duplicates().drop(columns='all_use')
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks. Certainly does what I want. I've updated the original question as I have a closely related question. Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string e.g., "analgesic"? The result should be the same, it only differs when None entries make there way in the status_drug_use column.
See my updated answer to address this. For additional follow-up queries, please ask a new question as that's the usual protocol (feel free to link it in the comments here if you'd like to make sure I see it).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.