Is there a way to only have the 10 most frequent entries for a gtsummary tbl_summary with categorical data?

I'm currently using the following code:

library(forcats)
library(dplyr)
library(magrittr)

table <- fct_count(df$name, sort = TRUE, prop = TRUE) %>%
  slice_head(n = 10)
table$p <- round(table$p, digits = 3)
table$p <- table$p * 100
table %<>% rename(Organism = f, `%` = p)
table

Which produces a beautiful table in the console:

(Image: a table of bacterial names with the frequency and proportion with which they appear in the data frame)

But ideally I would have it in a gtsummary table (because that's what the rest of my report is using). I can make the tbl_summary without problems; I just can't figure out how to limit it to only the 10 most common organisms, and I haven't seen this asked or answered anywhere.

Example dataset:

library(AMR)
df <- data.table::as.data.table(example_isolates)
df$name <- mo_name(df$mo)

3 Answers

You could determine the top 10 beforehand and then convert the name to a factor with "Other" as the last level.

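library(AMR)
library(dplyr)      # case_when(.default = ) needs dplyr >= 1.1.0
library(gtsummary)

df <- data.table::as.data.table(example_isolates)
df$name <- mo_name(df$mo)
df

# Names of the 10 most frequent organisms, most frequent first
top10 <- names(rev(tail(sort(table(df$name)), 10)))

# Keep the top 10 as factor levels and collapse everything else into "(Other)"
df %>%
  mutate(name = factor(case_when(name %in% top10 ~ name,
                                 .default = "(Other)"),
                       levels = c(top10, "(Other)"))) %>%
  tbl_summary(include = name)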

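To see why the rev(tail(sort(table(...)), 10)) expression above returns the most frequent names first, here is a minimal base R sketch on made-up data. The vector x and the cutoff n are illustrative assumptions and not part of the original answer.

```r
# Toy data (hypothetical, not taken from example_isolates)
x <- c(rep("E. coli", 5), rep("S. aureus", 3),
       rep("K. pneumoniae", 2), "P. mirabilis")
n <- 2  # illustrative cutoff; the answer uses 10

counts <- table(x)  # counts per name, in alphabetical order

# sort() is ascending, tail() keeps the n largest, rev() puts the largest first
top_n <- names(rev(tail(sort(counts), n)))

# The same selection written with a descending sort (this toy data has no ties)
top_n_alt <- names(head(sort(counts, decreasing = TRUE), n))

# Collapse the remaining names and fix the level order, "(Other)" last
lumped <- factor(ifelse(x %in% top_n, x, "(Other)"),
                 levels = c(top_n, "(Other)"))
table(lumped)
```

When several names are tied at the cutoff, the two spellings may return the tied names in a different order, so it is worth inspecting top_n on the real data.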

As tbl_summary does not provide this functionality out of the box, you could first lump all levels that are not in the top 10 into a special category and then remove that entry from the table via remove_row_type:

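library(dplyr)
library(forcats)
library(gtsummary)

remove_me <- "(Remove)"
# Lump everything outside the 10 most frequent names into one placeholder level
df <- df %>%
  mutate(name2 = fct_lump(name, 10, other_level = remove_me))

# Summarise first, then drop the placeholder row from the finished table
tbl_summary(df, include = name2,
            sort = all_categorical(FALSE) ~ "frequency") %>%
  remove_row_type(name2, type = "level", level_value = remove_me)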

(Image: table with columns Characteristic and N showing the distribution of name as counts and percentages)

Personally, I would probably keep the lumped level in the table (labeled "(Other)") and place it at the end.

other_category <- "(Other)"
df <- df %>% 
  mutate(name3 = fct_lump(name, 10, other_level = other_category) %>%
                    fct_infreq() %>%
                    fct_relevel(other_category, after = Inf))

tbl_summary(df, include = name3)

(Image: table with columns Characteristic and N showing the distribution of name as counts and percentages)
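As a variation on the code above, the forcats documentation encourages the more explicit fct_lump_n() over fct_lump() in new code. The sketch below applies the same three steps to made-up data; the vector x and the cutoff of 2 are illustrative assumptions, not part of the original answer.

```r
library(forcats)  # fct_lump_n() needs forcats >= 0.5.0

# Toy data (hypothetical)
x <- c(rep("A", 6), rep("B", 3), "C", "D")
other_category <- "(Other)"

# Keep the 2 most frequent levels and lump the rest
f <- fct_lump_n(factor(x), n = 2, other_level = other_category)
# Order levels by frequency, then force the lumped level to the end
f <- fct_infreq(f)
f <- fct_relevel(f, other_category, after = Inf)

levels(f)
table(f)
```

The final fct_relevel() call matters when the lumped level is larger than some of the kept levels, because fct_infreq() would otherwise move it up the table.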

2 Comments

The percentages are not based on the original data.
Good point, I removed solution 1.
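The concern in the first comment can be illustrated with a small base R sketch. It assumes the comment refers to dropping the rare rows before summarising, which the thread does not state explicitly; the data are made up.

```r
# Toy data (hypothetical): 10 observations, "C" is the rare level
x <- c(rep("A", 6), rep("B", 3), "C")
top <- names(head(sort(table(x), decreasing = TRUE), 2))

# Dropping the rare rows first shrinks the denominator to 9
p_filtered <- prop.table(table(x[x %in% top]))

# Lumping keeps all 10 rows, so percentages refer to the original data
lumped <- ifelse(x %in% top, x, "(Other)")
p_lumped <- prop.table(table(lumped))

round(100 * p_filtered["A"], 1)  # 6 of 9 rows
round(100 * p_lumped["A"], 1)    # 6 of 10 rows
```

Removing the lumped row only after the table has been computed keeps the second set of percentages while hiding the extra row.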
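
library(gtsummary)

# df0 is not defined in this answer; it is assumed to be a copy of the question's df
# Relabel every name outside the 10 most frequent as "Other"
df0$name[!df0$name %in% names(head(sort(table(df0$name), decreasing = TRUE), 10))] <- "Other"
tbl_summary(data.frame(name = df0$name),
            sort = all_categorical() ~ "frequency") # optional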

This gives a table of the ten most frequent names plus an "Other" row.
