MYSQL: Group By regex pattern

Question

I am attempting to do statistical tracking. In my database I am storing referring urls. Frequently I have url's that resemble the following:

http://www2.trafficadbar.com/__a4w4
http://trafficadbar.com/__a4w4
http://www.trafficadbar.com/__a4w4
http://4acesmailer.com/credit_click.php?userid=2472&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=2714&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=2723&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=3245&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=3259&openkey=gbyp2vcm

I want to know how I would do a GROUP BY and COUNT on a regex pattern. Basically what I want is as follows returned:

trafficadbar 3
4acesmailer 5

Currently when I try to do a GROUP BY it only works where the url's are exactly the same. so www.blah.com and blah.com are two different results and further each url variable ?blah=1&blahblah=2 acts as yet anoher unique group,

I have searched for countless solutions, but they mostly seem to be very specific to the problem asked, and almost all seem to show some "non-regex" workaround - which would be fine... if I could find a method I could apply.

I don't say this often but you are probably better off selecting all and then parsing it in PHP. — AbraCadaver
– AbraCadaver, Commented Jul 23, 2016 at 0:30

Abecee · Accepted Answer · 2016-07-26 20:50:29Z

1

To retrieve the part immediately preceding the top level domain from the hostnames, you could work along:

SELECT
  REVERSE(SUBSTRING(SUBSTRING_INDEX(rev_hostname, '.', 2),
          LOCATE('.', rev_hostname) + 1)
         ) domain
  , COUNT(id) hits
FROM (
  SELECT
    id
    , CONCAT(REVERSE(SUBSTRING_INDEX(SUBSTRING(referring_site, 8),
                                     '/', 1)), '.') rev_hostname
  FROM TestData
  ) T
GROUP BY domain
;

It:

relies on all referring_sites to start off with http://, and
will fail - as it stands - for, e.g., 4acesmailer.co.uk.

Either one could be addressed (to some degree) if required.

See it in action SQL Fiddle (with your data somewhat adjusted/extended to cover some more cases).

Please comment if and as this requires adjustment / further detail.

answered Jul 26, 2016 at 20:50

Abecee

2,4032 gold badges14 silver badges20 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

sgeddes · Accepted Answer · 2016-07-23 01:17:38Z

0

If you just care about those 2 values, something like this would work:

select case when yourcolumn like '%trafficadbar%' then 'trafficadbar' 
            when yourcolumn like '%4acesmailer%' then '4acesmailer' 
       end,
    count(*)
from yourtable 
group by 1

SQL Fiddle Demo

Edit, given your comments, this could be a little more dynamic and relatively easy to extend:

select 
  replace(replace(replace(
     left(yourcolumn, locate('.com', yourcolumn) - 1), 
     'http://', ''), 
     'www.', ''), 
     'www2.', ''),
  count(*)
from yourtable 
group by 1

More Fiddle

edited Jul 23, 2016 at 1:17

answered Jul 23, 2016 at 0:32

sgeddes

62.9k7 gold badges67 silver badges85 bronze badges

5 Comments

AbraCadaver Over a year ago

I'm pretty sure those are examples and they don't know all of the domain names up front.

sgeddes Over a year ago

@AbraCadaver -- Good point, could be, not completely clear if the OP is searching for specific domains or wants to somehow aggregate similar ones. Don't think the later is possible with sql alone...

Bruce Over a year ago

That word "resembling" in my post, definitely was intended to say I am not specifically searching for those - that would be easy :)

sgeddes Over a year ago

@Bruce -- fair enough, this isn't trivial with sql alone. See edits though for a potential "working" solution...

Drew Over a year ago

@sgeddes this answer perhaps? idk

BeetleJuice · Accepted Answer · 2016-07-23 01:37:41Z

0

I'm not skilled enough to do this reliably all in SQL; there are too many moving parts: lots of subdomains, lots of TLDs possible (not just .com), malformed domains possible etc...

My approach: Select everything and parse in PHP.

In the example below, I assume the URLs are in a urls column, and that you have a date_added column with the datetime when each url was added to the DB. Adjust your query accordingly.

Select all urls added within the last 30 days:

SELECT `urls` FROM `myTable`
WHERE `date_added` >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)

Place all results in a $rows array, then process that to generate the report you want

$rows = [...];//Associative array of all rows returned by the query above
$results = []; //will hold aggregate counts

foreach($rows as $row){   
    $host = parse_url($row['urls'],PHP_URL_HOST); //eg: www2.trafficadbar.com
    $matches = [];

    //find top level domain or skip to next row
    if(!preg_match('/[^\.]*\.[^\.]+$/',$host,$matches)) continue;        

    $domain = $matches[0]; //eg: trafficadbar.com

    //increment the count for this domain in results
    if(!isset($results[$domain])) $results[$domain]=0;
    $results[$domain]++;
}

With the input you gave in the OP, the $results will be:

[
    'trafficadbar.com' => 3,
    '4acesmailer.com' => 5,
]

You'll notice that unlike you, I kept the TLD (eg: .com, .net...) because ebay.com and ebay.ph are completely different domains. I would advise against mashing them into one result.

Live demo

edited Jul 23, 2016 at 1:37

answered Jul 23, 2016 at 1:31

BeetleJuice

41.1k24 gold badges120 silver badges181 bronze badges

2 Comments

AbraCadaver Over a year ago

Good one. But just add to array in loop and then use array_count values()

BeetleJuice Over a year ago

I don't think that would be faster because you will be traversing arrays twice (first all urls to add domains, then all results to count values) when I only do it once.

Bruce · Accepted Answer · 2016-07-23 06:14:50Z

0

Although the solution from @BeetleJuice would have worked, and likely more reliably than the solution I chose, I opted for an SQL solution

SELECT 
   CASE WHEN SUBSTRING(referring_site, 1, 8) = 'http://w' 
      THEN SUBSTRING_INDEX((SUBSTRING_INDEX(referring_site, '.', 2)), '.', -1)
      ELSE SUBSTRING_INDEX((SUBSTRING_INDEX(referring_site, '.', 1)), '://', -1) 
   END 
AS domain 
FROM 
....

Drawbacks are when it doesn't star with a http://w but rather some http://random.sub.domain

answered Jul 23, 2016 at 6:14

Bruce

1,0711 gold badge9 silver badges34 bronze badges

2 Comments

Bruce Over a year ago

If anyone has a better pure mysql solution, that is what I would really like. Ultimately what I want to do is is count the number of "." between :// and / then if 2 do the first substring_index else do the second. I just don't know how to get and apply a proper count on the "."

Abecee Over a year ago

(i) Is it correct to re-phrase: You want the part preceding the top level domain (com, org, etc.)? (Judging by the number of dots might be misleading: What would you like to get from abc.def.ghi.com?) (ii) Always starting http?

Yashashvi · Accepted Answer · 2022-07-01 22:07:29Z

0

Similar to: https://stackoverflow.com/a/72834976/7768504

MySQL introduced REGEXP_SUBSTR to group columns by applying regular expressions. Documentation for REGEXP_SUBSTR

REGEXP_SUBSTR(<column_name>, <regular_expression>, <starting_position>, <match_occurrence>)

answered Jul 1, 2022 at 22:07

Yashashvi

4785 silver badges14 bronze badges

Collectives™ on Stack Overflow

MYSQL: Group By regex pattern

5 Answers 5

Comments

5 Comments

2 Comments

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

5 Comments

2 Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related