2

I am attempting to do statistical tracking. In my database I am storing referring urls. Frequently I have url's that resemble the following:

http://www2.trafficadbar.com/__a4w4
http://trafficadbar.com/__a4w4
http://www.trafficadbar.com/__a4w4
http://4acesmailer.com/credit_click.php?userid=2472&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=2714&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=2723&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=3245&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=3259&openkey=gbyp2vcm

I want to know how I would do a GROUP BY and COUNT on a regex pattern. Basically what I want is as follows returned:

trafficadbar 3
4acesmailer 5 

Currently when I try to do a GROUP BY it only works where the url's are exactly the same. so www.blah.com and blah.com are two different results and further each url variable ?blah=1&blahblah=2 acts as yet anoher unique group,

I have searched for countless solutions, but they mostly seem to be very specific to the problem asked, and almost all seem to show some "non-regex" workaround - which would be fine... if I could find a method I could apply.

2
  • I don't say this often but you are probably better off selecting all and then parsing it in PHP. Commented Jul 23, 2016 at 0:30
  • I agree with @AbraCadaver. I would do the parsing in PHP. Commented Jul 23, 2016 at 1:44

5 Answers 5

1

To retrieve the part immediately preceding the top level domain from the hostnames, you could work along:

SELECT
  REVERSE(SUBSTRING(SUBSTRING_INDEX(rev_hostname, '.', 2),
          LOCATE('.', rev_hostname) + 1)
         ) domain
  , COUNT(id) hits
FROM (
  SELECT
    id
    , CONCAT(REVERSE(SUBSTRING_INDEX(SUBSTRING(referring_site, 8),
                                     '/', 1)), '.') rev_hostname
  FROM TestData
  ) T
GROUP BY domain
;

It:

  • relies on all referring_sites to start off with http://, and
  • will fail - as it stands - for, e.g., 4acesmailer.co.uk.

Either one could be addressed (to some degree) if required.

See it in action SQL Fiddle (with your data somewhat adjusted/extended to cover some more cases).

Please comment if and as this requires adjustment / further detail.

Sign up to request clarification or add additional context in comments.

Comments

0

If you just care about those 2 values, something like this would work:

select case when yourcolumn like '%trafficadbar%' then 'trafficadbar' 
            when yourcolumn like '%4acesmailer%' then '4acesmailer' 
       end,
    count(*)
from yourtable 
group by 1

Edit, given your comments, this could be a little more dynamic and relatively easy to extend:

select 
  replace(replace(replace(
     left(yourcolumn, locate('.com', yourcolumn) - 1), 
     'http://', ''), 
     'www.', ''), 
     'www2.', ''),
  count(*)
from yourtable 
group by 1

5 Comments

I'm pretty sure those are examples and they don't know all of the domain names up front.
@AbraCadaver -- Good point, could be, not completely clear if the OP is searching for specific domains or wants to somehow aggregate similar ones. Don't think the later is possible with sql alone...
That word "resembling" in my post, definitely was intended to say I am not specifically searching for those - that would be easy :)
@Bruce -- fair enough, this isn't trivial with sql alone. See edits though for a potential "working" solution...
@sgeddes this answer perhaps? idk
0

I'm not skilled enough to do this reliably all in SQL; there are too many moving parts: lots of subdomains, lots of TLDs possible (not just .com), malformed domains possible etc...

My approach: Select everything and parse in PHP.

In the example below, I assume the URLs are in a urls column, and that you have a date_added column with the datetime when each url was added to the DB. Adjust your query accordingly.

Select all urls added within the last 30 days:

SELECT `urls` FROM `myTable`
WHERE `date_added` >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)

Place all results in a $rows array, then process that to generate the report you want

$rows = [...];//Associative array of all rows returned by the query above
$results = []; //will hold aggregate counts

foreach($rows as $row){   
    $host = parse_url($row['urls'],PHP_URL_HOST); //eg: www2.trafficadbar.com
    $matches = [];

    //find top level domain or skip to next row
    if(!preg_match('/[^\.]*\.[^\.]+$/',$host,$matches)) continue;        

    $domain = $matches[0]; //eg: trafficadbar.com

    //increment the count for this domain in results
    if(!isset($results[$domain])) $results[$domain]=0;
    $results[$domain]++;
}

With the input you gave in the OP, the $results will be:

[
    'trafficadbar.com' => 3,
    '4acesmailer.com' => 5,
]

You'll notice that unlike you, I kept the TLD (eg: .com, .net...) because ebay.com and ebay.ph are completely different domains. I would advise against mashing them into one result.

Live demo

2 Comments

Good one. But just add to array in loop and then use array_count values()
I don't think that would be faster because you will be traversing arrays twice (first all urls to add domains, then all results to count values) when I only do it once.
0

Although the solution from @BeetleJuice would have worked, and likely more reliably than the solution I chose, I opted for an SQL solution

SELECT 
   CASE WHEN SUBSTRING(referring_site, 1, 8) = 'http://w' 
      THEN SUBSTRING_INDEX((SUBSTRING_INDEX(referring_site, '.', 2)), '.', -1)
      ELSE SUBSTRING_INDEX((SUBSTRING_INDEX(referring_site, '.', 1)), '://', -1) 
   END 
AS domain 
FROM 
....

Drawbacks are when it doesn't star with a http://w but rather some http://random.sub.domain

2 Comments

If anyone has a better pure mysql solution, that is what I would really like. Ultimately what I want to do is is count the number of "." between :// and / then if 2 do the first substring_index else do the second. I just don't know how to get and apply a proper count on the "."
(i) Is it correct to re-phrase: You want the part preceding the top level domain (com, org, etc.)? (Judging by the number of dots might be misleading: What would you like to get from abc.def.ghi.com?) (ii) Always starting http?
0

Similar to: https://stackoverflow.com/a/72834976/7768504

MySQL introduced REGEXP_SUBSTR to group columns by applying regular expressions. Documentation for REGEXP_SUBSTR

REGEXP_SUBSTR(<column_name>, <regular_expression>, <starting_position>, <match_occurrence>)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.