
I am looking to create a histogram of aging. By aging, I'm referring to the accounting report that sorts a balance like accounts payable into bins of 30 days, 60 days, 90 days, and 120+ days old.

In my case, however, I need to track how many days it has been since statistics were last created for a table in a database, using the same bins as accountants do. I hope that makes sense.

The data collected by my company's script has two variables and an unpredictable number of observations. The two variables are NUM_TABLES (the number of tables updated) and STATS_DATE (the date, stored as YYYYMMDD, on which statistics were last updated for that number of tables).

> head(df)
  STATS_DATE NUM_TABLES
1   20210908          5
2   20240814        193
3   20240815        746

I would like to report this information in a histogram using the bins 30, 60, 90, 120+. The final chart should look something like the following idealized graph that doesn't represent the sample data above:

enter image description here

I am able to calculate the number of days from the target date.

# target date formatted to database's timezone  
DATE <- parse_date_time("2024-08-16", "ymd", tz = "US/Central")

# calculate difference between target date and date column in days
df$DAYS <- as.numeric(DATE - df$STATS_DATE)

What I can't seem to do is bring it all together using NUM_TABLES as the frequency.

Any help is greatly appreciated.

I've tried base R's hist function as well as ggplot2. I've searched sites like StackOverflow, Statology, and others for various features and keywords. Being new to R, though, I struggle with the terminology.

UPDATE: Here is the output from my data as requested by Edward.

> dput(df)
structure(list(STATS_DATE = c(20210908L, 20240814L, 20240815L
), NUM_TABLES = c(5L, 193L, 746L)), class = "data.frame", row.names = c(NA, 3L))
> sapply(df, mode)
STATS_DATE NUM_TABLES 
 "numeric"  "numeric" 

I did apply the tag bar chart to this post because I was not sure if a histogram would be possible. Edward's suggestion is perfectly valid, and I really appreciate the input.

At this point I'm having some trouble with applying the histogram.

I believe my issue is with the STATS_DATE column.

When I use the line stats_dates <- stats_dates[stats_dates <= ref_date] it deletes all the data. When I remove that line I get the following error message: Error in hist.default(unclass(x), unclass(breaks), plot = FALSE, warn.unused = FALSE, : some 'x' not counted; maybe 'breaks' do not span range of 'x'.

Here is the code based on the responses. I know I'm doing something wrong prior to ref_date <- as.Date("2024-08-16").

raw <- read.csv("data/stats.del", header=FALSE, sep=",")

df <- data.frame(na.omit(raw))  ## remove rows from temp tables

colnames(df) <- c('STATS_DATE','NUM_TABLES')

# convert date column to correct timezone and format
df$STATS_DATE <- parse_date_time(df$STATS_DATE, "ymd", tz = "US/Central") ## I think my issue is here 

ref_date <- as.Date("2024-08-16")

stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()

# create the bins
bins <- seq.int(0, 120, length.out=5L)
names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))

# check bins
bins

# create the histogram
h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
             ref_date - bins[length(bins)]) |> 
  hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
       xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
       main='')
mtext(text=names(bins)[-1L], side=1, line=1, at=h$mids)

# check bin counts
h$counts |> setNames(names(bins)[-1L])
    The terminology for the graph you link to is "barchart" or "barplot", not histogram. Could you provide (a sample of) your data using dput(df)? Commented Oct 3, 2024 at 4:22

2 Answers


First, you could use rep in Map to repeat each date NUM_TABLES times, expanding the table into a single date vector stats_dates.

> ref_date <- as.Date("2024-08-16")
> 
> stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()
> stats_dates <- stats_dates[stats_dates <= ref_date]  ## delete everything after ref_date
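If STATS_DATE arrives as integers in YYYYMMDD form, as in the question's dput() output, it has to be parsed before it can be expanded. The following is only a sketch under that assumption; the three rows are copied from the question, and `days_old` is a name introduced here for illustration. It also uses rep() with a vector `times` argument, which keeps the "Date" class and so avoids the unlist()/as.Date() round trip.

```r
# Data as shown in the question's dput(): integer dates in YYYYMMDD form.
df <- data.frame(STATS_DATE = c(20210908L, 20240814L, 20240815L),
                 NUM_TABLES = c(5L, 193L, 746L))

# Parse the integers explicitly; as.Date() on a bare number would treat it
# as a day count rather than as a calendar date.
df$STATS_DATE <- as.Date(as.character(df$STATS_DATE), format = "%Y%m%d")

# rep() with a vector `times` repeats each date NUM_TABLES times and
# preserves the "Date" class.
stats_dates <- rep(df$STATS_DATE, times = df$NUM_TABLES)

ref_date <- as.Date("2024-08-16")
stats_dates <- stats_dates[stats_dates <= ref_date]

# Age in whole days relative to the reference date (illustrative name).
days_old <- as.integer(ref_date - stats_dates)
range(days_old)
```

If the column has instead been converted to a date-time (POSIXct), unlist() strips it down to a count of seconds, and as.Date() would then read that count as days. That may be one way the `stats_dates <= ref_date` filter ends up removing every row, though this is a guess rather than something confirmed in the thread.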

Define the bins.

> bins <- seq.int(0, 120, length.out=5L)
> ## naming the bins makes life easier:
> names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))
> bins
   0   30   60   90 120+ 
   0   30   60   90  120 

Next, use replace to censor dates earlier than ref_date - 120 days at ref_date - 120, so that they all fall into the same (oldest) bin. Then pass ref_date - bins as the breaks= argument of hist(). When hist() is called on a "Date" object, graphics:::hist.Date is dispatched.

> h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
+              ref_date - bins[length(bins)]) |> 
+   hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
+        xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat'),
+        main='')
> mtext(text=rev(names(bins)[-1L]), side=1, line=1, at=h$mids)

enter image description here

hist() invisibly returns an object of class "histogram" holding the computed statistics, which is useful, e.g., for extracting the counts to see how the bins are filled. We already used its 'mids' component in mtext above.

> h$counts |> setNames(rev(names(bins)[-1L]))
120+   90   60   30 
6182 2695 2264 4393  
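As a small, self-contained illustration of that return value, here is a sketch with toy numbers (not the question's data). Calling hist() with plot=FALSE gives the same kind of object without drawing anything.

```r
# Toy values, chosen only to make the bin contents easy to check by hand.
x <- c(1, 2, 2, 3, 7, 8)

# Two bins: (0, 5] and (5, 10]; intervals are right-closed by default.
h <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

class(h)   # "histogram"
h$counts   # number of observations per bin
h$mids     # bin midpoints, handy for placing axis labels
```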

Inverse, with age increasing from left to right:

> g <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
+              ref_date - bins[length(bins)]) |> 
+   as.integer() |> base::`*`(-1L) |> 
+   hist(breaks=-as.numeric(ref_date - bins), freq=TRUE, xlab='Days Old', ylab='Num. Tables',
+        xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
+        main='')
> mtext(text=names(bins)[-1L], side=1, line=1, at=g$mids)
> g$counts |> setNames(names(bins)[-1L])
  30   60   90 120+ 
4393 2552 2407 6182 

enter image description here
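An alternative that skips expanding the dates altogether is to bin the ages with cut() and sum NUM_TABLES per bin, then draw the totals with barplot(). This is only a sketch, not the approach used above. The five rows are made up for illustration, the column AGE_BIN is introduced here, and the bins are labelled by their upper edge with everything older than 90 days pooled into "120+", which is how the four bins above are laid out.

```r
ref_date <- as.Date("2024-08-16")

# Hypothetical data with STATS_DATE already of class "Date".
df <- data.frame(
  STATS_DATE = as.Date(c("2021-09-08", "2024-05-01", "2024-06-30",
                         "2024-08-14", "2024-08-15")),
  NUM_TABLES = c(5L, 40L, 12L, 193L, 746L)
)

days <- as.integer(ref_date - df$STATS_DATE)

# Right-closed bins (0,30], (30,60], (60,90], (90,Inf); include.lowest=TRUE
# also keeps an age of exactly 0 days in the first bin.
df$AGE_BIN <- cut(days, breaks = c(0, 30, 60, 90, Inf),
                  labels = c("30", "60", "90", "120+"),
                  include.lowest = TRUE)

# Sum the table counts per bin; empty bins are reported as 0.
counts <- xtabs(NUM_TABLES ~ AGE_BIN, data = df)

# One explicit colour per bin, youngest to oldest.
barplot(as.vector(counts), names.arg = names(counts),
        col = c("green", "yellow", "orange", "red"),
        xlab = "Days Old", ylab = "Num. Tables", las = 1)
```

Because the totals are computed first, the bar order follows the factor levels of AGE_BIN, so the youngest bin is drawn on the left without any sign flipping.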


Data:

set.seed(42)
n <- 50
df <- data.frame(
  STATS_DATE=sample(seq(as.Date("2024-03-01"), as.Date("2024-10-03"), 
                        by="day"), n, replace=TRUE),
  NUM_TABLES=round(runif(n, 5, 746))
) |> sort_by(~STATS_DATE) |> `rownames<-`(NULL)

4 Comments

deleted original comment to clarify. How can I use specific colors for each bin? Using the code provided, I do not get a graph of the sample data. Instead I get a flat line at zero with a y axis running from -1.0 to 1.0. I'm not getting any errors but I do see that h, prior to piping to the plot, is being finalized as a list() of structure(numeric(0), class = "Date"). I'm not sure if that is the problem. h$counts is showing 0 across all bins.
@mperedithe The hcl.colors(length(bins) - 1L, 'heat', rev=TRUE) call produces the current colors automatically, one per bin (the number of breaks minus one); you may replace it with c('green', 'yellow', 'orange', 'red'). Not sure why you get a list; did you follow the first part, creating the stats_dates object? It should be a vector, and replacing something in a vector should give a vector again. You may want to let the code sink in a little and then try again. Just compare with the sample data I provided.
Thanks for all your help. I did finally figure out that I was not formatting my data correctly. I did, however, find an error with your solution. I've spent the day combing through the lines trying to figure it out for myself. Your solution plots the data backwards. Even using the sample data you used, the number of tables in the 120+ bin should actually be in the 30 day bin, and so on. Any idea what is causing that?
@mperedithe Great that you worked through the code and even discovered this flaw. You're absolutely right: the older date bins come to the left, the younger to the right, so age is reversed. We can coerce the dates with as.integer and multiply by -1 to get the inverse. I've just added the inverse solution; I'm currently not sure which is more intuitive.

Assuming your data is in this format:

> head(df)
  STATS_DATE NUM_TABLES
1   20210908          5
2   20240814        193
3   20240815        746

You can "bin" the days using cut with the appropriate arguments:

library(lubridate)
library(dplyr)
library(ggplot2)

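# reference date in the database's time zone, as defined in the question
DATE <- parse_date_time("2024-08-16", "ymd", tz = "US/Central")

df <- mutate(df, 
             STATS_DATE=parse_date_time(STATS_DATE, tz = "US/Central", orders = "Ymd"),
             DAYS=as.numeric(DATE - STATS_DATE),
             DAYS_OLD=cut(DAYS, 
                          breaks=c(0,30,60,90,120,Inf),
                          labels=c("0-30","31-60","61-90","91-120",">120")))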

and then summarise this using summarise from dplyr, which you can then pass to ggplot:

summarise(df, NUM_TABLES=sum(NUM_TABLES), .by=DAYS_OLD) |>
  print() |>
  ggplot(aes(x=DAYS_OLD, y=NUM_TABLES, fill=DAYS_OLD)) +
  geom_col(show.legend=FALSE, col=1) +
  labs(x="Days old", y="Number of tables") +
  theme_classic()

Producing:

  DAYS_OLD NUM_TABLES
1   91-120        185
2     >120        554
3    31-60        251
4    61-90        303
5     0-30        240

enter image description here
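One detail of the cut() step above is worth checking on your own data: the intervals are right-closed by default, so an age of exactly 0 days falls outside the first bin and becomes NA unless include.lowest=TRUE is passed. A minimal sketch with made-up ages on the bin edges (the names `default_bins` and `closed_bins` are only for illustration):

```r
# Made-up ages sitting on or next to the bin edges.
days   <- c(0, 1, 30, 31, 120, 121)
breaks <- c(0, 30, 60, 90, 120, Inf)
labels <- c("0-30", "31-60", "61-90", "91-120", ">120")

# Default: intervals are (0,30], (30,60], ...; 0 itself is not covered.
default_bins <- cut(days, breaks = breaks, labels = labels)

# include.lowest=TRUE closes the first interval on the left: [0,30].
closed_bins <- cut(days, breaks = breaks, labels = labels,
                   include.lowest = TRUE)

data.frame(days, default_bins, closed_bins)
```

With a reference date later than every STATS_DATE, as in the toy data here, no age is 0 and the two variants agree.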


Toy data:

df <- structure(list(STATS_DATE = c("20240705", "20240220", "20240619", 
"20240710", "20240719", "20240508", "20240513", "20240723", "20240628", 
"20240303", "20240422", "20240313", "20240617", "20240316", "20240208", 
"20240728", "20240611", "20240409", "20240319", "20240428", "20240504", 
"20240601", "20240813", "20240622", "20240415", "20240511", "20240312", 
"20240419", "20240424", "20240727"), NUM_TABLES = c(75, 7, 9, 
11, 41, 98, 17, 51, 21, 21, 19, 28, 14, 86, 65, 61, 97, 38, 82, 
23, 21, 20, 25, 98, 53, 83, 67, 98, 7, 34)), class = "data.frame", row.names = c(NA, 
-30L))

Generated using:

N <- 30
set.seed(2)
df <- data.frame(STATS_DATE=gsub("-", "", sample(seq.Date(as.Date("2024-02-08"), 
                                            as.Date("2024-08-15"), 1), N)),
                 NUM_TABLES=round(runif(N, 1, 100)))
