pandas - Plot distribution of column variable

Question

I'm trying to visualize some data, but I'm not very experienced with the subject and am having trouble finding the best way to get what I'm looking for. I've searched around and found similar questions, but nothing that answers exactly what I want, so hopefully I'm not duplicating a common question.

Anyway, I have a DataFrame with a column for patient_id (and others, but this is the relevant one). For example:

   patient_id  other_stuff
0      000001          ...
1      000001          ...
2      000001          ...
3      000002          ...
4      000003          ...
5      000003          ...
6      000004          ...
etc

Where each row represents a specific episode that patient had. I want to plot the distribution in which the x axis is the number of episodes a patient had, and the y axis is the number of patients that have had said number of episodes. For example, based on the above, there's one patient with three episodes, one patient with two episodes, and two patients with one episode each, i.e. x = [1, 2, 3], y = [2, 1, 1]. Currently, I do the following:

episode_count_distribution = (
    patients.patient_id
    .value_counts() # the number of rows for each patient_id (i.e. episodes per patient)
    .value_counts() # the number of patients for each possible row count above (i.e. distribution of episodes per patient)
    .sort_index()
)
episode_count_distribution.plot()

This method does what I want, but strikes me as a bit opaque and hard to follow, so I'm wondering if there's a better way.
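For reference, the double value_counts() can be checked against the sample rows above. This is only a minimal sketch: the patients DataFrame is rebuilt by hand from the table, the other columns are left out, and the intermediate name episodes_per_patient is introduced purely for illustration.

```python
import pandas as pd

# Sample rows from the table above; only patient_id matters here.
patients = pd.DataFrame({
    "patient_id": ["000001", "000001", "000001", "000002",
                   "000003", "000003", "000004"],
})

# Step 1: rows (episodes) per patient.
episodes_per_patient = patients["patient_id"].value_counts()

# Step 2: number of patients for each episode count, ordered by episode count.
episode_count_distribution = episodes_per_patient.value_counts().sort_index()

x = episode_count_distribution.index.tolist()
y = episode_count_distribution.tolist()
print(x, y)  # [1, 2, 3] [2, 1, 1]
```

Naming the intermediate result is one way to make the two-step counting easier to follow without changing what is computed.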

Ami Tavory · Accepted Answer · 2018-04-12 16:35:10Z


You might be looking for something like

df.procedure_id.groupby(df.patient_id).nunique().hist();

Explanation:

df.procedure_id.groupby(df.patient_id).nunique() finds the number of unique procedures per patient.
hist() plots a histogram.

Example

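import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'procedure_id': [3, 2, 3, 2, 4, 1, 2, 3], 'patient_id': [1, 2, 3, 2, 1, 2, 3, 2]})
# Histogram of the number of unique procedures per patient
df.procedure_id.groupby(df.patient_id).nunique().hist()
plt.xlabel('num treatments')  # x axis: unique procedures per patient
plt.ylabel('num patients')    # y axis: how many patients fall in each bin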

edited Apr 12, 2018 at 16:35

answered Apr 12, 2018 at 16:29

Ami Tavory


3 Comments

Mike S Over a year ago

I think including procedure_id in my explanation just confused things, since patient_id is all the data that really matters for my problem. A patient can have an entry in more than one row in which the same procedure was performed in each entry, so number of unique procedures per patient wouldn't get me the right numbers. Instead, what I really need – speaking in terms of the data and not what it means – is the number of df rows per patient.

Ami Tavory Over a year ago

@Wmbuch That is just df.patient_id.value_counts().hist();

Ami Tavory Over a year ago

Happy it helped.
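The difference raised in the comments, unique procedures per patient versus rows per patient, can be seen in a small sketch. The data below is hypothetical and chosen so that one patient has the same procedure recorded twice; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: patient 1 has procedure 7 recorded in two rows.
df = pd.DataFrame({
    "patient_id":   [1, 1, 1, 2, 3, 3],
    "procedure_id": [7, 7, 8, 7, 9, 9],
})

# Unique procedures per patient: repeated procedures collapse into one.
unique_procedures = df.groupby("patient_id")["procedure_id"].nunique()

# Rows per patient: every row counts, repeated procedure or not.
rows_per_patient = df.groupby("patient_id").size()

print(unique_procedures.tolist())  # [2, 1, 1]
print(rows_per_patient.tolist())   # [3, 1, 2]
```

df.patient_id.value_counts() yields the same per-patient row counts as size(), only ordered by count rather than by patient_id, so calling .hist() on either should produce the same histogram (assuming matplotlib is installed).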
