I'm trying to visualize some data, but I'm not very experienced with the subject and am having trouble finding the best way to get what I'm looking for. I've searched around and found similar questions, but nothing that answers exactly what I want, so hopefully I'm not duplicating a common question.

Anyway, I have a DataFrame with a column for patient_id (and others, but this is the relevant one). For example:

   patient_id  other_stuff
0      000001          ...
1      000001          ...
2      000001          ...
3      000002          ...
4      000003          ...
5      000003          ...
6      000004          ...
etc

Each row represents a specific episode that patient had. I want to plot the distribution in which the x axis is the number of episodes a patient had, and the y axis is the number of patients who have had that number of episodes. For example, based on the above, there's one patient with three episodes, one patient with two episodes, and two patients with one episode each, i.e. x = [1, 2, 3], y = [2, 1, 1]. Currently, I do the following:

episode_count_distribution = (
    patients.patient_id
    .value_counts() # the number of rows for each patient_id (i.e. episodes per patient)
    .value_counts() # the number of patients for each possible row count above (i.e. distribution of episodes per patient)
    .sort_index()
)
episode_count_distribution.plot()

This method does what I want, but strikes me as a bit opaque and hard to follow, so I'm wondering if there's a better way.
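One way to make the two counting steps easier to follow is to give each intermediate result a name. The following is only a sketch, assuming the sample data shown above; the names `episodes_per_patient` and `patients_per_episode_count` are illustrative and do not come from the original code.

```python
import pandas as pd

# Sample data mirroring the table above (other columns omitted).
patients = pd.DataFrame(
    {"patient_id": ["000001", "000001", "000001",
                    "000002", "000003", "000003", "000004"]}
)

# Step 1: number of rows (episodes) for each patient.
episodes_per_patient = patients.groupby("patient_id").size()

# Step 2: number of patients for each episode count, ordered by episode count.
patients_per_episode_count = episodes_per_patient.value_counts().sort_index()
```

With this sample, the index of `patients_per_episode_count` holds the episode counts (the x values) and its values hold the patient counts (the y values). Calling `patients_per_episode_count.plot.bar()` on the result would then draw one bar per episode count.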

1 Answer

You might be looking for something like

df.procedure_id.groupby(df.patient_id).nunique().hist();

Explanation:

  • df.procedure_id.groupby(df.patient_id).nunique() finds the number of unique procedures per patient.

  • hist() plots a histogram.

Example

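import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'procedure_id': [3, 2, 3, 2, 4, 1, 2, 3], 'patient_id': [1, 2, 3, 2, 1, 2, 3, 2]})
# Histogram of the number of unique procedures per patient.
df.procedure_id.groupby(df.patient_id).nunique().hist()
plt.xlabel('num treatments')
plt.ylabel('num patients')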

3 Comments

I think including procedure_id in my explanation just confused things, since patient_id is all the data that really matters for my problem. A patient can have an entry in more than one row in which the same procedure was performed in each entry, so number of unique procedures per patient wouldn't get me the right numbers. Instead, what I really need – speaking in terms of the data and not what it means – is the number of df rows per patient.
@Wmbuch That is just df.patient_id.value_counts().hist();
Happy it helped.
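As a rough illustration of the row-count approach from the comments: with the default bins, hist() may not line the bars up with whole-number counts, so one option is to place the bin edges at half-integers. This is a sketch under assumptions: the sample frame reuses the patient IDs from the question, the names `rows_per_patient` and `bins` are made up here, and `np.histogram` stands in for the plot so the counts can be inspected directly.

```python
import numpy as np
import pandas as pd

# Sample data: one row per episode, as in the question.
df = pd.DataFrame({"patient_id": [1, 1, 1, 2, 3, 3, 4]})

# Number of rows per patient.
rows_per_patient = df.patient_id.value_counts()

# Bin edges at 0.5, 1.5, 2.5, ... so each integer count falls in its own bin.
bins = np.arange(0.5, rows_per_patient.max() + 1.5)

# Count how many patients fall in each bin.
counts, edges = np.histogram(rows_per_patient, bins=bins)
```

Passing the same edges to the plot, as in `rows_per_patient.hist(bins=bins)`, should give one bar per episode count.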
