I am looking to create a histogram of aging. By aging, I'm referring to the accounting report of tracking an asset like accounts payable into bins of 30 days, 60 days, 90 days, and 120+ days old.
In my case, however, I need to track how many days has it been since statistics were last created for a table in a database using the same bins as accountants do. I hope that makes sense.
The data collected by my company's script has 2 variables of unpredictable amounts of observations. The two variables are NUM_TABLES (this is the number of tables updated) and STATS_DATE (this is the YYYY-MM-DD that statistics were last updated on that amount of tables).
> head(df)
STATS_DATE NUM_TABLES
1 20210908 5
2 20240814 193
3 20240815 746
I would like to report this information in a histogram using the bins 30, 60, 90, 120+. The final chart should look something like the following idealized graph that doesn't represent the sample data above:
I am able to calculate the number of days from the target date.
# target date formatted to database's timezone
DATE <- parse_date_time("2024-08-16", "ymd", tz = "US/Central")
# calculate difference between target date and date column in days
df$DAYS <- as.numeric(DATE - df$STATS_DATE)
What I can't seem to do is bring it all together using NUM_TABLES as the frequency.
Any help is greatly appreciated.
I've tried using R base hist function as well as the ggplot2 function. I've researched sites like StackOverflow, Statology, and others for various features and key words. Though, being new I do struggle with my lack of understanding for the R terminology.
UPDATE: Here is the output from my data as requested by Edward.
> dput(df)
structure(list(STATS_DATE = c(20210908L, 20240814L, 20240815L
), NUM_TABLES = c(5L, 193L, 746L)), class = "data.frame", row.names = c(NA, 3L))
> sapply(df, mode)
STATS_DATE NUM_TABLES
"numeric" "numeric"
I did apply the tag bar chart to this post because I was not sure if a histogram would be possible. Edward's suggestion is perfectly valid, and I really appreciate the input.
At this point I'm having some trouble with applying the histogram.
I believe my issue is with the STATS_DATE column.
When I use the line stats_dates <- stats_dates[stats_dates <= ref_date] it deletes all the data. When I remove that line I get the following error message: Error in hist.default(unclass(x), unclass(breaks), plot = FALSE, warn.unused = FALSE, : some 'x' not counted; maybe 'breaks' do not span range of 'x'.
Here is the code based on the responses, I know I'm doing something wrong prior to ref_date <- as.Date("2024-08-16").
raw <- read.csv("data/stats.del", header=FALSE, sep=",")
df <- data.frame(na.omit(raw)) ## remove rows from temp tables
colnames(df) <- c('STATS_DATE','NUM_TABLES')
# convert date column to correct timezone and format
df$STATS_DATE <- parse_date_time(df$STATS_DATE, "ymd", tz = "US/Central") ## I think my issue is here
ref_date <- as.Date("2024-08-16")
stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()
# create the bins
bins <- seq.int(0, 120, length.out=5L)
names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))
# check bins
bins
# create the histogram
h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)],
ref_date - bins[length(bins)]) |>
hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
main='')
mtext(text=names(bins)[-1L], side=1, line=1, at=h$mids)
# check bin counts
h$counts |> setNames(names(bins)[-1L])




dput(df)?