Before I learned that GoAccess was a thing, I wanted an analytics solution that I could run locally on my web server. My solution was to write a bash script that gives me some basic information about how my blog is doing. It attempts to:
- Find all views of a given page (as determined by lines in the access.log)
- Filter out all bots and crawlers
- Filter myself out (as I tend to look over my own site quite often)
- Determine some counts based off the remaining number of lines
What I'm curious about is: are there any faulty assumptions, and are there any glaring inefficiencies in this process?
#!/bin/bash
# Initialize some values
PAGE="$1"
LOG="/var/log/apache2/home-site/access.log"
# Words in user agents that hint that the page view is not a person
BOTS=("bot" "facebookexternalhit" "crawler")
# There are other filtered IPs, but I've removed them for privacy reasons
FILTERED_IPS=("$(curl -s https://icanhazip.com)")
# Most pages that I care about are <url>/blog/<post name>
# But not all of them are
if [ "${PAGE:0:1}" != "/" ]
then
    PAGE="/blog/$PAGE"
fi
echo "$PAGE"
# Get all views for our page
BLOG_VIEWS=$(grep -a "GET $PAGE" $LOG)
BLOG_VIEW_COUNT=$(echo "$BLOG_VIEWS" | wc -l)
BLOG_VIEWS_FILTERED="$BLOG_VIEWS"
# Clear out the bots
for BOT in "${BOTS[@]}"
do
    TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -avi "$BOT")
    BLOG_VIEWS_FILTERED="$TEMP"
done
# Clear out any filtered IP addresses
# Usually they're just me
for IP in "${FILTERED_IPS[@]}"
do
    TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -av "$IP")
    BLOG_VIEWS_FILTERED="$TEMP"
done
TOTAL_VIEWS=$(echo "$BLOG_VIEWS_FILTERED" | wc -l)
UNIQUE_IPS=$(echo "$BLOG_VIEWS_FILTERED" | awk '{print $1}' | sort | uniq)
UNIQUE_IP_COUNT=$(echo "$UNIQUE_IPS" | wc -l)
echo "Total Legitimate Views: $TOTAL_VIEWS"
echo "Legitimate View Percentage:" $( echo "printf('%.2f', ($TOTAL_VIEWS/$BLOG_VIEW_COUNT) * 100)" | perl)
echo "Across $UNIQUE_IP_COUNT unique IP addresses"
An example of use looks like the following:
:~$ .scripts/analytics hello-world
/blog/hello-world
Total Legitimate Views: 64
Legitimate View Percentage: 47.06
Across 36 unique IP addresses
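For comparison, the same four steps could probably be collapsed into a single awk pass over the log. What follows is only a rough sketch, not a drop-in replacement: `count_views` is a hypothetical helper name, it assumes the client IP is the first whitespace-separated field (as in Apache's common and combined log formats), and it matches filtered IPs exactly against that field rather than anywhere in the line as the grep version does. It also prints 0 rather than 1 when nothing matches.

```shell
#!/bin/bash
# Sketch: count page views in one awk pass over an access log.
# count_views is a hypothetical helper. Arguments: page path, log file,
# then zero or more IP addresses to ignore.
count_views() {
    local page="$1" log="$2"
    shift 2
    local ips="$*"   # space-separated list of IPs to ignore

    awk -v page="$page" -v ips="$ips" '
        BEGIN {
            # Build a lookup table of filtered IPs.
            n = split(ips, list, " ")
            for (i = 1; i <= n; i++) skip[list[i]] = 1
        }
        # Keep only GET requests whose path starts with the page.
        index($0, "\"GET " page) == 0 { next }
        { all++ }
        # Drop lines that look like bots (case-insensitive, whole line).
        tolower($0) ~ /bot|facebookexternalhit|crawler/ { next }
        # Drop filtered IPs by exact match on the first field.
        ($1 in skip) { next }
        { views++; seen[$1] = 1 }
        END {
            unique = 0
            for (ip in seen) unique++
            printf "Total Legitimate Views: %d\n", views
            if (all > 0)
                printf "Legitimate View Percentage: %.2f\n", views / all * 100
            printf "Across %d unique IP addresses\n", unique
        }
    ' "$log"
}
```

Under those assumptions it could be called as `count_views /blog/hello-world "$LOG" "$(curl -s https://icanhazip.com)"`. The log is read once, and the counts are kept in awk variables instead of being passed between shell variables and repeated `grep` calls.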