Create a histogram of aging

Question

I am looking to create a histogram of aging. By aging, I'm referring to the accounting report of tracking an asset like accounts payable into bins of 30 days, 60 days, 90 days, and 120+ days old.

In my case, however, I need to track how many days it has been since statistics were last created for a table in a database, using the same bins that accountants do. I hope that makes sense.

The data collected by my company's script has two variables with an unpredictable number of observations. The two variables are NUM_TABLES (the number of tables updated) and STATS_DATE (the date, stored as YYYYMMDD, on which statistics were last updated for that number of tables).

> head(df)
  STATS_DATE NUM_TABLES
1   20210908          5
2   20240814        193
3   20240815        746

I would like to report this information in a histogram using the bins 30, 60, 90, 120+. The final chart should look something like the following idealized graph that doesn't represent the sample data above:

I am able to calculate the number of days from the target date.

# target date formatted to database's timezone  
DATE <- parse_date_time("2024-08-16", "ymd", tz = "US/Central")

# calculate difference between target date and date column in days
df$DAYS <- as.numeric(DATE - df$STATS_DATE)

What I can't seem to do is bring it all together using NUM_TABLES as the frequency.

Any help is greatly appreciated.

I've tried base R's hist function as well as ggplot2. I've researched sites like StackOverflow, Statology, and others for various features and keywords. Being new to R, though, I struggle with the terminology.

UPDATE: Here is the output from my data as requested by Edward.

> dput(df)
structure(list(STATS_DATE = c(20210908L, 20240814L, 20240815L
), NUM_TABLES = c(5L, 193L, 746L)), class = "data.frame", row.names = c(NA, 3L))
> sapply(df, mode)
STATS_DATE NUM_TABLES 
 "numeric"  "numeric"

I did apply the tag bar chart to this post because I was not sure if a histogram would be possible. Edward's suggestion is perfectly valid, and I really appreciate the input.

At this point I'm having some trouble with applying the histogram.

I believe my issue is with the STATS_DATE column.

When I use the line stats_dates <- stats_dates[stats_dates <= ref_date] it deletes all the data. When I remove that line I get the following error message: Error in hist.default(unclass(x), unclass(breaks), plot = FALSE, warn.unused = FALSE, : some 'x' not counted; maybe 'breaks' do not span range of 'x'.

Here is the code based on the responses. I know I'm doing something wrong before the line ref_date <- as.Date("2024-08-16").

raw <- read.csv("data/stats.del", header=FALSE, sep=",")

df <- data.frame(na.omit(raw))  ## remove rows from temp tables

colnames(df) <- c('STATS_DATE','NUM_TABLES')

# convert date column to correct timezone and format
df$STATS_DATE <- parse_date_time(df$STATS_DATE, "ymd", tz = "US/Central") ## I think my issue is here 

ref_date <- as.Date("2024-08-16")

stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()

# create the bins
bins <- seq.int(0, 120, length.out=5L)
names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))

# check bins
bins

# create the histogram
h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
             ref_date - bins[length(bins)]) |> 
  hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
       xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
       main='')
mtext(text=names(bins)[-1L], side=1, line=1, at=h$mids)

# check bin counts
h$counts |> setNames(names(bins)[-1L])

The terminology for the graph you link to is "barchart" or "barplot", not histogram. Could you provide (a sample of) your data using dput(df)?
– Edward, Commented Oct 3, 2024 at 4:22

jay.sf · Accepted Answer · 2024-10-03 23:30:31Z

First, you could use rep in Map to expand the per-date table counts into a single date vector, stats_dates, with one element per table.

> ref_date <- as.Date("2024-08-16")
> 
> stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()
> stats_dates <- stats_dates[stats_dates <= ref_date]  ## delete everything after ref_date
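A caveat, sketched under the assumption that STATS_DATE is stored as an integer such as 20240814 (as in the dput output in the question): as.Date() does not read a bare number as YYYYMMDD. It is treated as a day offset, so the resulting dates land far in the future and the `<= ref_date` filter removes every row. Converting through character first avoids this. The name raw_df is made up for this illustration.

```r
# Sample rows from the question; STATS_DATE is an integer like 20240814
raw_df <- data.frame(STATS_DATE = c(20210908L, 20240814L, 20240815L),
                     NUM_TABLES = c(5L, 193L, 746L))

# Parse the integers as YYYYMMDD strings to get real Date values
dates <- as.Date(as.character(raw_df$STATS_DATE), format = "%Y%m%d")

# Repeat each date once per table; rep() keeps the Date class
stats_dates <- rep(dates, times = raw_df$NUM_TABLES)

# Drop anything after the reference date
ref_date <- as.Date("2024-08-16")
stats_dates <- stats_dates[stats_dates <= ref_date]

length(stats_dates)  # one element per table
```

Using rep() directly on a Date vector is a small variation on the Map/unlist approach; it should give the same vector while skipping the round trip through numbers.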

Define the bins.

> bins <- seq.int(0, 120, length.out=5L)
> ## naming the bins makes life easier:
> names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))
> bins
   0   30   60   90 120+ 
   0   30   60   90  120

Next, use replace to censor dates earlier than ref_date - 120 days at ref_date - 120, so that they all fall into the same bin. Then pass ref_date - bins as the breaks= argument of hist(). When hist() is called on a "Date" object, graphics:::hist.Date is dispatched.

> h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
+              ref_date - bins[length(bins)]) |> 
+   hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
+        xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat'),
+        main='')
> mtext(text=rev(names(bins)[-1L]), side=1, line=1, at=h$mids)
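To see the censoring step in isolation, here is a minimal sketch with three made-up dates. The names cutoff, example_dates, and censored are only for illustration.

```r
ref_date <- as.Date("2024-08-16")
cutoff   <- ref_date - 120   # oldest break of the histogram

# Hypothetical dates: one very old, one inside the range, one recent
example_dates <- as.Date(c("2021-09-08", "2024-05-01", "2024-08-15"))

# Every date at or before the cutoff is moved onto the cutoff itself,
# so it still lies inside the range spanned by the breaks
censored <- replace(example_dates, example_dates <= cutoff, cutoff)
censored
```

Without this step, dates older than the last break fall outside the range of breaks, and hist() stops with the "some 'x' not counted" error quoted in the question.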

hist() invisibly returns the histogram statistics, which is useful, e.g., for extracting the counts to see how the bins are filled. We already used the 'mids' in mtext above.

> h$counts |> setNames(rev(names(bins)[-1L]))
120+   90   60   30 
6182 2695 2264 4393
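As a small illustration of that return value, this sketch runs hist() on a toy numeric vector (not the date data) with plot=FALSE and inspects two of its components.

```r
# Toy values, chosen only to make the bins easy to check by hand
x <- c(1, 2, 2, 3, 9)

# plot = FALSE skips drawing and just returns the "histogram" object
toy_hist <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

toy_hist$counts  # number of values per bin
toy_hist$mids    # bin midpoints, handy for placing axis labels
```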

Inverse (youngest bin on the left):

> g <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
+              ref_date - bins[length(bins)]) |> 
+   as.integer() |> base::`*`(-1L) |> 
+   hist(breaks=-as.numeric(ref_date - bins), freq=TRUE, xlab='Days Old', ylab='Num. Tables',
+        xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
+        main='')
> mtext(text=names(bins)[-1L], side=1, line=1, at=g$mids)
> g$counts |> setNames(names(bins)[-1L])
  30   60   90 120+ 
4393 2552 2407 6182
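A possible alternative, shown only as a sketch: instead of expanding the dates and calling hist(), bin the age in days with cut() and sum NUM_TABLES per bin, then draw the totals with barplot(). The data frame age_df holds the three sample rows from the question with the dates already converted; the bin labels mirror the four bins above, so the last bin collects everything older than 90 days.

```r
# Sample rows from the question, with STATS_DATE already as Date
age_df <- data.frame(
  STATS_DATE = as.Date(c("2021-09-08", "2024-08-14", "2024-08-15")),
  NUM_TABLES = c(5L, 193L, 746L)
)
ref_date <- as.Date("2024-08-16")

# Age of each row in whole days
days_old <- as.numeric(ref_date - age_df$STATS_DATE)

# Four bins, youngest first; the last one is open-ended
age_bin <- cut(days_old, breaks = c(0, 30, 60, 90, Inf),
               labels = c("30", "60", "90", "120+"),
               include.lowest = TRUE)

# Sum the table counts per bin; empty bins stay in the result as 0
counts <- vapply(split(age_df$NUM_TABLES, age_bin), sum, numeric(1))
counts

barplot(counts, xlab = "Days Old", ylab = "Num. Tables", las = 1,
        col = c("green", "yellow", "orange", "red"))
```

Because the bins are factor levels, the bars come out in the youngest-to-oldest order without any sign flipping.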

Data:

set.seed(42)
n <- 50
df <- data.frame(
  STATS_DATE=sample(seq(as.Date("2024-03-01"), as.Date("2024-10-03"), 
                        by="day"), n, replace=TRUE),
  NUM_TABLES=round(runif(n, 5, 746))
) |> sort_by(~STATS_DATE) |> `rownames<-`(NULL)

edited Oct 3, 2024 at 23:30

answered Oct 3, 2024 at 11:32

jay.sf


4 Comments

mperedithe Over a year ago

deleted original comment to clarify. How can I use specific colors for each bin? Using the code provided, I do not get a graph of the sample data. Instead I get a flat line at zero with a y axis running from -1.0 to 1.0. I'm not getting any errors but I do see that h, prior to piping to the plot, is being finalized as a list() of structure(numeric(0), class = "Date"). I'm not sure if that is the problem. h$counts is showing 0 across all bins.

jay.sf Over a year ago

@mperedithe The hcl.colors(length(bins) - 1L, 'heat', rev=TRUE) call generates the current colors automatically, one per bin (the number of breaks minus one); you may replace it with c('green', 'yellow', 'orange', 'red'). Not sure why you get a list. Did you follow the first part, creating the stats_dates object? It should be a vector, and replacing something in a vector should give a vector again. You may want to let the code sink in a little and then try again. Just compare with the sample data I provided.

mperedithe Over a year ago

Thanks for all your help. I did finally figure out that I was not formatting my data correctly. I did however find an error with your solution. I've spent the day combing through the lines trying to figure it out for myself. Your solution plots the data backwards. Even using the sample data you used, the amount of tables in the 120+ bin should actually be in the 30 day bin, and so on. Any idea on what is causing that?

jay.sf Over a year ago

@mperedithe Great that you worked through the code and even discovered this flaw. You're absolutely right: the older date bins come to the left, the younger to the right, so age is reversed in this case. We can coerce the dates with as.integer and multiply by -1 to get the inverse. I've just added the inverse solution; I'm currently not sure which is more intuitive.

Edward · Accepted Answer · 2024-10-03 07:48:40Z

Assuming your data is in this format:

> head(df)
  STATS_DATE NUM_TABLES
1   20210908          5
2   20240814        193
3   20240815        746

You can "bin" the days using cut with the appropriate arguments:

library(lubridate)
library(dplyr)
library(ggplot2)

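# target date from the question, in the database's timezone
DATE <- parse_date_time("2024-08-16", "ymd", tz = "US/Central")

df <- mutate(df, 
             STATS_DATE=parse_date_time(STATS_DATE, tz = "US/Central", orders = "Ymd"),
             # age in days; explicit units keep difftime from choosing hours for small gaps
             DAYS=as.numeric(difftime(DATE, STATS_DATE, units = "days")),
             DAYS_OLD=cut(DAYS, 
                          breaks=c(0,30,60,90,120,Inf),
                          labels=c("0-30","31-60","61-90","91-120",">120")))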
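One detail of cut() worth checking on a few boundary values: intervals are right-closed by default, so a value equal to the lowest break is not assigned to any bin unless include.lowest=TRUE is set. The sketch below uses made-up day counts with the same breaks and labels.

```r
# Hypothetical ages in days, picked to sit on or next to the breaks
days   <- c(0, 30, 31, 120, 121)
breaks <- c(0, 30, 60, 90, 120, Inf)
labels <- c("0-30", "31-60", "61-90", "91-120", ">120")

# Default: intervals are (0,30], (30,60], ... so 0 falls outside
default_bins <- cut(days, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [0,30]
closed_bins <- cut(days, breaks = breaks, labels = labels,
                   include.lowest = TRUE)

data.frame(days, default_bins, closed_bins)
```

This only matters when statistics were updated on the target date itself, giving an age of exactly 0 days.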

and then summarise this using summarise from dplyr, which you can then pass to ggplot:

summarise(df, NUM_TABLES=sum(NUM_TABLES), .by=DAYS_OLD) |>
  print() |>
  ggplot(aes(x=DAYS_OLD, y=NUM_TABLES, fill=DAYS_OLD)) +
  geom_col(show.legend=FALSE, col=1) +
  labs(x="Days old", y="Number of tables") +
  theme_classic()

Producing:

  DAYS_OLD NUM_TABLES
1   91-120        185
2     >120        554
3    31-60        251
4    61-90        303
5     0-30        240

Toy data:

df <- structure(list(STATS_DATE = c("20240705", "20240220", "20240619", 
"20240710", "20240719", "20240508", "20240513", "20240723", "20240628", 
"20240303", "20240422", "20240313", "20240617", "20240316", "20240208", 
"20240728", "20240611", "20240409", "20240319", "20240428", "20240504", 
"20240601", "20240813", "20240622", "20240415", "20240511", "20240312", 
"20240419", "20240424", "20240727"), NUM_TABLES = c(75, 7, 9, 
11, 41, 98, 17, 51, 21, 21, 19, 28, 14, 86, 65, 61, 97, 38, 82, 
23, 21, 20, 25, 98, 53, 83, 67, 98, 7, 34)), class = "data.frame", row.names = c(NA, 
-30L))

Generated using:

N <- 30
set.seed(2)
df <- data.frame(STATS_DATE=gsub("-", "", sample(seq.Date(as.Date("2024-02-08"), 
                                            as.Date("2024-08-15"), 1), N)),
                 NUM_TABLES=round(runif(N, 1, 100)))
