1

I need to extract the domain name from a list of URLs using PostgreSQL. In my first version, I used REGEXP_REPLACE to strip unwanted prefixes such as www., biz., sports., etc. to get the domain name.

 SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain", 
 COUNT(DISTINCT(user)) AS "Unique Users"
 FROM db
 GROUP BY 1
 ORDER BY 2 DESC;

This approach seems impractical, as the query must be updated every time a new unwanted prefix appears.

I tried https://stackoverflow.com/a/21174423/10174021 to extract from the end of the string using PostgreSQL REGEXP_SUBSTR, but I get blank rows in return. Is there a better way of doing this?

A dataset sample to try with:

 CREATE TABLE sample (
 url VARCHAR(100) NOT NULL);

 INSERT INTO sample (url) 
 VALUES 
 ('sample.co.uk'),
 ('www.sample.co.uk'),
 ('www3.sample.co.uk'),
 ('biz.sample.co.uk'),
 ('digital.testing.sam.co'),
 ('sam.co'),
 ('m.sam.co');

Desired output

+------------------------+--------------+
|    url                 |  domain      |
+------------------------+--------------+
| sample.co.uk           | sample.co.uk |
| www.sample.co.uk       | sample.co.uk |
| www3.sample.co.uk      | sample.co.uk |
| biz.sample.co.uk       | sample.co.uk |
| digital.testing.sam.co | sam.co       |
| sam.co                 | sam.co       |
| m.sam.co               | sam.co       |
+------------------------+--------------+
7
  • Can you make a list of "doubled TLDs" like co.uk? Commented May 7, 2019 at 9:48
  • Meaning? You want me to create more variation of doubled TLDs in the sample data? Commented May 7, 2019 at 9:53
  • No, not in the sample data; that's OK as far as I'm concerned. But a possible solution I could imagine would match the end of the DNS names. But that may give you just co.uk instead of sample.co.uk. So these "doubled TLDs" need special handling. That's why I asked if you can make a list of them. After all, the computer cannot "know" that co.uk is actually to be treated as one TLD. Commented May 7, 2019 at 9:58
  • This is exactly where I got stuck. The TLDs could either be .co.uk, .co or .uk. Commented May 7, 2019 at 10:02
  • This is far more complicated than it probably seems at first glance (I've tried to do it in the past). Take a look at some of the Python libraries (tld or tldextract) that do this. They generally start with the full list of TLDs available here: publicsuffix.org/list . It's quite long... Commented May 7, 2019 at 12:30

5 Answers

3

So, I've found the solution using Jeremy and Rémy Baron's answer.

  1. Extract all the public suffixes from the Public Suffix List and store them in a table, which I named tlds.

  2. Get the unique URLs in the dataset and match each one to its TLD.

  3. Extract the domain name using regexp_replace (used in this query) or, alternatively, regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld").

The SQL query is as below:

WITH stored_tld AS(
SELECT 
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
                            rows between unbounded preceding and unbounded following) AS "tld" 
FROM sample s 
JOIN tlds t 
ON (s.url like '%%'||domain))

SELECT 
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2') 
END AS "extracted_domain" 
FROM(
    SELECT a.url,st."tld"
    FROM sample a
    LEFT JOIN stored_tld st
    ON a.url = st.url
    )t1
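
Step 1 is not shown in the query, so here is a minimal sketch of what the tlds table could look like. The layout is an assumption (only the column name domain is taken from the query above), and the three rows are a hand-picked subset standing in for the full Public Suffix List:

```sql
-- Assumed layout: one public suffix per row, stored without a leading dot.
CREATE TABLE tlds (
    domain text PRIMARY KEY
);

-- Hand-picked subset that covers the sample data; the real list is much longer.
INSERT INTO tlds (domain)
VALUES ('co.uk'), ('co'), ('uk');
```

Because the join uses LIKE '%' || domain, a suffix stored without a leading dot also matches names that merely end in the same letters (for example, a name ending in "disco" matches "co"), so prepending a dot in the join condition may be worth considering.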

Links to try: SQL Tester


2

I use split_part(url,'/',3) for this :

select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;

output

stackoverflow.com
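
Note that the third field only exists when the URL starts with a scheme such as https://. For bare host names like the ones in the question, split_part(url, '/', 3) returns an empty string. One possible workaround, offered as a sketch rather than a solution for every URL shape, is to strip an optional scheme first and then take the first field:

```sql
-- Strip an optional "scheme://" prefix, then keep everything before the first slash.
SELECT url,
       split_part(
           regexp_replace(url, '^[a-z][a-z0-9+.-]*://', '', 'i'),
           '/',
           1
       ) AS host
FROM (VALUES ('https://stackoverflow.com/questions/56019744'),
             ('www.sample.co.uk')) AS t(url);
-- host: stackoverflow.com and www.sample.co.uk
```

This keeps any port or user info that is part of the host section, and it does not remove subdomains.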

1 Comment

If you want to remove the "www." prefix as well: select regexp_replace(split_part('https://stackoverflow.com/questions/56019744', '/', 3),'^www\.','');

1

You can try this :

with tlds as (
     select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
    select * from (values ('sample.co.uk'),
                          ('www.sample.co.uk'),
                          ('www3.sample.co.uk'),
                          ('biz.sample.co.uk'),
                          ('digital.testing.sam.co'),
                          ('sam.co'),
                          ('m.sam.co')
                   ) a(url)
     ) 
  select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
            select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld 
               from sample join tlds on (url like '%'||tld) 
         ) a

1

Here is my solution (a little bit more complex)

WITH
fqdn AS (
    SELECT
        row_number() over () as id,
        url,
        FQDN(url) AS "fqdn"
    FROM urls
),
stored_tld AS (
    SELECT DISTINCT ON (id)
        id,
        url,
        tld,
        fqdn
    FROM fqdn
    LEFT JOIN tlds
           ON reverse(fqdn(url)) LIKE 
              replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"
    ORDER BY id, -- for correct distinct on
             tld LIKE '%*%' DESC, -- prefer tld with wildcard
             length(tld) DESC -- prefer longer tld
),  extrated_domain AS (
    SELECT
        id,
        url,
        fqdn,
        reverse(
                substring(
                        reverse(fqdn), 
                        '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)', 
                        '#'
                )
        ) AS "extracted_domain"
    FROM stored_tld
)
SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain

Fiddle with comments : https://dbfiddle.uk/QSDKx2-t

FQDN extracting

To extract the FQDN from a URL, you can use a more complex regexp (https://regex101.com/r/vT9k3d/2):

/^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)/igm

Also, you can store this regexp as function

CREATE OR REPLACE FUNCTION fqdn(url TEXT)
  RETURNS TEXT
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
select (regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1]
$function$;

Save row order

Duplicates are possible, especially after extracting domains, so it is better to preserve the row order with an id from row_number() over ():

SELECT
    row_number() over () as id,
    url,
    (regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1] AS "fqdn"
FROM urls

Pattern matching

First, we need to find all suffix patterns that match the domain:

fqdn LIKE '%.' || replace(lower(tld), '*', '%') COLLATE "C"

or, better yet, use reversed strings to speed up the process later using indexes with prefix matching

reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"
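
To see what the reversed pattern does, here is a small self-contained check. The host names and suffixes are made-up sample values, not rows from the real tlds table:

```sql
SELECT fqdn,
       tld,
       replace(lower(reverse(tld)), '*', '%') || '.%' AS like_pattern,
       reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%' AS is_match
FROM (VALUES ('www.sample.co.uk',  'co.uk'),  -- pattern ku.oc.%  -> true
             ('www.example.co.ck', '*.ck'),   -- pattern kc.%.%   -> true
             ('co.uk',             'co.uk')   -- bare suffix      -> false
     ) AS t(fqdn, tld);
```

The trailing '.%' means a name only matches when at least one label precedes the suffix, which is why the bare suffix in the last row is rejected.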

Result ranking

To pick the best-matching tld, the following ordering rules are used:

ORDER BY id, -- for correct distinct on
         tld LIKE '%*%' DESC, -- prefer tld with wildcard
         length(tld) DESC -- prefer longer tld
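
As an illustration of how DISTINCT ON and this ordering interact, here is a minimal sketch with one made-up host name and two candidate suffixes (both values are assumptions chosen for the example):

```sql
SELECT DISTINCT ON (url) url, tld
FROM (VALUES ('www.sample.co.uk')) AS u(url)
JOIN (VALUES ('uk'), ('co.uk')) AS t(tld)
  ON reverse(url) LIKE reverse(tld) || '.%'
ORDER BY url,                  -- must lead the ORDER BY for DISTINCT ON (url)
         tld LIKE '%*%' DESC,  -- wildcard rules first
         length(tld) DESC;     -- then the longest suffix
-- keeps the row with tld = 'co.uk'
```

Both suffixes match the host name, and DISTINCT ON keeps only the first row per url after sorting, so the longer suffix wins.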

Extracting domain with suffix name without subdomain

We will use the PostgreSQL substring function with SQL-style pattern matching.

After some experiments, I found that this works for me (for reverse suffix abc.def):

select substring('db.abc.def.fsdfsd', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');
select substring('db.abc.def', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');

the resulting extract is

reverse(
        substring(
                reverse(fqdn), 
                '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)', 
                '#'
        )
) AS "extracted_domain"
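
For a concrete feel, here is the same expression with the pattern for the suffix co.uk written out by hand. The host names are sample values from the question, and the hard-coded pattern is only for illustration:

```sql
SELECT fqdn,
       reverse(
           substring(reverse(fqdn), '#"ku.oc.[^.]*#"(.%|)', '#')
       ) AS extracted_domain
FROM (VALUES ('www.sample.co.uk'),  -- sample.co.uk
             ('sample.co.uk'),      -- sample.co.uk
             ('sam.co')             -- NULL, the pattern does not match
     ) AS t(fqdn);
```

The part between the two #" markers is what substring returns, and the trailing (.%|) swallows any remaining subdomain labels or nothing at all.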

Merging results

In the final step, we add coalesce for domains that weren't matched, plus a flag showing whether the domain was extracted or not.

SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain

Fiddle with comments: https://dbfiddle.uk/QSDKx2-t

0

Improve performance

In my case, I need to extract the domain from 1M+ URLs.

First of all, I cached the derived values as generated columns to speed up the calculations:

/* LIKE column and index */

alter table tlds
add tld_like text GENERATED ALWAYS AS (replace(lower(reverse(tld)), '*', '%') || '.%') STORED;

/* Create pattern column */

alter table tlds
add tld_pattern text GENERATED ALWAYS AS (
  '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)' 
) STORED;

Then I added an index for prefix matching:

CREATE INDEX CONCURRENTLY tlds_like_idx
ON tlds (tld_like COLLATE "C");

Unfortunately, it doesn't work. Prefix matching works the other way round: the index has to be on the column being searched (the reversed domain in the domains table), not on the column holding the patterns. In my case, that isn't possible.

Profiling

Profiling also showed that the LEFT JOIN generates 1M x 10k tlds = 10B candidate rows in memory and then filters out 99.99% of them:

LEFT JOIN tlds
           ON reverse(fqdn(url)) LIKE 
              replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"

Dirty hack

There are no top-level domains with fewer than 2 letters, so the last 3 characters of the domain (the first 3 of the reversed string) can serve as a filter key, and the index will be used:

alter table tlds
add tld_3l text GENERATED ALWAYS AS (
    left(replace(lower(reverse(tld)), '*', '') || '.', 3)
) STORED;

CREATE INDEX CONCURRENTLY tlds_3l_idx
ON tlds (tld_3l COLLATE "C");
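
To make the key concrete, this sketch evaluates the same expression for a few sample suffixes (the values are illustrative, not taken from the real table) and for one host name:

```sql
-- Key stored on the tlds side.
SELECT tld,
       left(replace(lower(reverse(tld)), '*', '') || '.', 3) AS tld_3l
FROM (VALUES ('com'), ('uk'), ('co.uk'), ('*.ck')) AS t(tld);
-- com -> moc, uk -> ku., co.uk -> ku., *.ck -> kc.

-- Key computed on the domain side.
SELECT left(reverse('www.sample.co.uk'), 3) AS f3l;
-- ku.
```

Two-letter suffixes keep the trailing dot in the key, so uk and co.uk share the key ku. and the equality join narrows the candidates before the LIKE filter runs.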

Final query:

WITH
fqdn AS (
    SELECT
        row_number() over () as id,
        url,
        fqdn(url) AS "fqdn",
        left(reverse(fqdn(url)), 3) as "f3l",
        reverse(fqdn(url)) as "rev_fqdn"
    FROM technical.domains_sample_test
),
stored_tld AS (
    SELECT DISTINCT ON (id)
        id,
        url,
        fqdn,
        tld,
        tld_like,
        tld_pattern
    FROM fqdn
    LEFT JOIN technical.webisite_tlds
           ON f3l = tld_3l 
          AND rev_fqdn LIKE tld_like COLLATE "C"

    ORDER BY id, 
             tld LIKE '%*%' DESC, -- prefer tld with wildcard
             length(tld) DESC -- prefer longer tld
),  extrated_domain AS (
    SELECT
        id,
        url,
        fqdn,
        reverse(substring(reverse(fqdn), tld_pattern, '#')) AS "extracted_domain"
    FROM stored_tld
)
SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain

New fiddle: https://dbfiddle.uk/GLcG6lQo
