
I have a table filled with URLs. The URLs are in all sorts of formats: http://foo.com, http://bar.foo.com, http://foo.com/bar, etc. But I'm only interested in the domain name itself, so in this case: foo.com. What I'd like to do is count how many times each domain name exists in this table. So something like:

SELECT "whatever the domain is in field 'url'", COUNT(*) AS count
FROM table_with_urls
GROUP BY "whatever the domain is in field 'url'"

There are a few similar questions on Stack Overflow, but nothing really answered this. I can't use LIKE or match something with REGEXP, because I'm not (always) looking for specific domain names to match against, but mostly I just want all domain names from the table along with a total count.

Is this possible using MySQL?

  • It's very difficult to define 'domain name', as you have things like 'foo.co.uk', 'mydomain.myhost.com', 'foo.museum', etc. The only accurate way to do it is to have a list of the possible top-level domains, and that list is quite long (hundreds of elements). Can you be more specific about what you mean by 'domain name' in your context? Commented Jan 26, 2010 at 20:35
  • Looking for only part 2 in: 1:[subdomain] 2:[foo.com/foo.co.uk] 3:[/whatever]. But I was afraid it would come down to something like matching against all possible TLDs. Commented Jan 26, 2010 at 20:39

3 Answers


I had the same problem, and this is what I did:

-- Assumes every url starts with a scheme such as 'http://', so the search
-- for '/' can begin at position 10 (past the '://'). Appending a '/' via
-- CONCAT handles urls with no path, where LOCATE would otherwise return 0.
SELECT SUBSTRING(url FROM 1 FOR LOCATE('/', CONCAT(url, '/'), 10) - 1) AS domain,
       COUNT(*) AS count
FROM url_list
GROUP BY domain;
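An alternative sketch using SUBSTRING_INDEX, which strips the scheme as well as the path. This assumes every url includes a scheme like 'http://' or 'https://'; the table and column names (`url_list`, `url`) follow the query above:

```sql
-- SUBSTRING_INDEX(url, '/', 3) keeps everything up to the third '/',
-- i.e. 'http://host'; the second call with -1 drops the 'http:' prefix.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '//', -1) AS domain,
       COUNT(*) AS count
FROM url_list
GROUP BY domain
ORDER BY count DESC;
```

Note this returns the full hostname (e.g. bar.foo.com), not the registrable domain; collapsing subdomains down to foo.com still needs a TLD list, as the comments point out.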


Add another indexed column for 'domain' and when you do an INSERT, store this value separately.
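On MySQL 5.7+ you could also let the server maintain that column with a stored generated column instead of application-side INSERT logic. A hypothetical sketch (table and column names are placeholders, and the extraction expression assumes scheme-prefixed URLs):

```sql
-- The domain column is computed from url on write and kept in sync
-- automatically; STORED makes it indexable.
ALTER TABLE table_with_urls
  ADD COLUMN domain VARCHAR(255)
    AS (SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '//', -1)) STORED,
  ADD INDEX idx_domain (domain);

-- The count then becomes a plain indexed GROUP BY:
SELECT domain, COUNT(*) AS count
FROM table_with_urls
GROUP BY domain;
```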



If you're willing to install a MySQL extension, there's https://github.com/StirlingMarketingGroup/mysql-get-etld-p1

It extracts the registrable domain (eTLD+1), which is essentially what you'd expect:

select `get_etld_p1`('http://a.very.complex-domain.co.uk:8080/foo/bar'); -- 'complex-domain.co.uk'
select `get_etld_p1`('https://www.bbc.co.uk/'); -- 'bbc.co.uk'
select `get_etld_p1`('https://github.com/StirlingMarketingGroup/'); -- 'github.com'
select `get_etld_p1`('https://localhost:10000/index'); -- 'localhost'
select `get_etld_p1`('android-app://com.google.android.gm'); -- 'com.google.android.gm'
select `get_etld_p1`('example.test.domain.com'); -- 'domain.com'
select `get_etld_p1`('postgres://user:pass@host.com:5432/path?k=v#f'); -- 'host.com'
select `get_etld_p1`('exzvk.omsk.so-ups.ru'); -- 'so-ups.ru'
select `get_etld_p1`('http://10.64.3.5/data_check/index.php?r=index/rawdatacheck'); -- '10.64.3.5'
select `get_etld_p1`('not a domain'); -- null

Then, if you want that to be performant, you can add a second, denormalized column that stores just those values, something like:

CREATE TABLE `db`.`sometablewithurls` (
  `SomeTableWithURLsID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `URL` TEXT NOT NULL,
  `_ETLDP1` VARCHAR(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`SomeTableWithURLsID`),
  INDEX `_ETLDP1` (`_ETLDP1` ASC));
DROP TRIGGER IF EXISTS `db`.`sometablewithurls_BEFORE_INSERT`;

DELIMITER $$
USE `db`$$
CREATE DEFINER = CURRENT_USER TRIGGER `db`.`sometablewithurls_BEFORE_INSERT` BEFORE INSERT ON `sometablewithurls` FOR EACH ROW
BEGIN

set new.`_ETLDP1`=ifnull(`get_etld_p1`(new.`URL`),'');

END$$
DELIMITER ;
DROP TRIGGER IF EXISTS `db`.`sometablewithurls_BEFORE_UPDATE`;

DELIMITER $$
USE `db`$$
CREATE DEFINER = CURRENT_USER TRIGGER `db`.`sometablewithurls_BEFORE_UPDATE` BEFORE UPDATE ON `sometablewithurls` FOR EACH ROW
BEGIN

set new.`_ETLDP1`=ifnull(`get_etld_p1`(new.`URL`),'');

END$$
DELIMITER ;

Notice the index on _ETLDP1 (short for effective top-level domain plus one), and the triggers updating it on both insert and update, so it stays correct even if URL changes.
