Extract domain from url using PostgreSQL

Question

I need to extract the domain name for a list of URLs using PostgreSQL. In my first version, I used REGEXP_REPLACE to strip unwanted subdomain prefixes such as www., biz., sports., etc. to get the domain name.

 SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
        COUNT(DISTINCT "user") AS "Unique Users"  -- "user" is quoted because USER is a reserved word in PostgreSQL
 FROM db
 GROUP BY 1
 ORDER BY 2 DESC;

This is hard to maintain, because the query has to be updated every time a new unwanted prefix shows up.

I also tried https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR, but I get blank rows in return. Is there a better way of doing this?

A dataset sample to try with:

 CREATE TABLE sample (
     url VARCHAR(100) NOT NULL
 );

 -- String literals take single quotes; double quotes denote identifiers
 INSERT INTO sample (url)
 VALUES
 ('sample.co.uk'),
 ('www.sample.co.uk'),
 ('www3.sample.co.uk'),
 ('biz.sample.co.uk'),
 ('digital.testing.sam.co'),
 ('sam.co'),
 ('m.sam.co');

Desired output

+------------------------+--------------+
|    url                 |  domain      |
+------------------------+--------------+
| sample.co.uk           | sample.co.uk |
| www.sample.co.uk       | sample.co.uk |
| www3.sample.co.uk      | sample.co.uk |
| biz.sample.co.uk       | sample.co.uk |
| digital.testing.sam.co | sam.co       |
| sam.co                 | sam.co       |
| m.sam.co               | sam.co       |
+------------------------+--------------+

Meaning? You want me to create more variations of doubled TLDs in the sample data?
– user123, Commented May 7, 2019 at 9:53
No, not in the sample data; that's OK as far as I'm concerned. But a possible solution I could imagine would match the end of the DNS names, and that may give you just co.uk instead of sample.co.uk. So these "doubled TLDs" need special handling. That's why I asked if you can make a list of them. After all, the computer cannot "know" that co.uk is actually to be treated as one TLD.
– sticky bit, Commented May 7, 2019 at 9:58
This is exactly where I got stuck. The TLDs could be .co.uk, .co or .uk.
– user123, Commented May 7, 2019 at 10:02
This is far more complicated than it probably seems at first glance (I've tried to do it in the past). Take a look at some of the Python libraries (tld or tldextract) that do this. They generally start with the full list of TLDs available here: publicsuffix.org/list. It's quite long...
– Jeremy, Commented May 7, 2019 at 12:30

user123 · Accepted Answer · 2019-05-14 03:16:02Z

I found a solution by combining Jeremy's comment with Rémy Baron's answer.

1. Extract all the public suffixes from the Public Suffix List (publicsuffix.org/list) and store them in a table, which I named tlds.
2. Get the unique URLs in the dataset and match each one to its TLD.
3. Extract the domain name using regexp_replace (used in this query) or, alternatively, regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld").

The SQL query is as below:

WITH stored_tld AS (
    -- For every url, keep the longest public suffix that the url ends with
    SELECT DISTINCT
        s.url,
        FIRST_VALUE(t.domain) OVER (
            PARTITION BY s.url
            ORDER BY length(t.domain) DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS "tld"
    FROM sample s
    JOIN tlds t
      ON s.url LIKE '%' || t.domain
)
SELECT
    t1.url,
    -- Urls without a matching suffix are returned unchanged
    CASE
        WHEN t1."tld" IS NULL THEN t1.url
        ELSE regexp_replace(t1.url, '(.*\.)((.[a-z]*).*' || replace(t1."tld", '.', '\.') || ')', '\2')
    END AS "extracted_domain"
FROM (
    SELECT a.url, st."tld"
    FROM sample a
    LEFT JOIN stored_tld st
      ON a.url = st.url
) t1;

Links to try: SQL Tester
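
The query above needs a tlds table with a domain column. Here is a minimal sketch of that setup, so the matching step can be tried in isolation. It assumes three hand-picked suffixes instead of the full Public Suffix List, and it matches on '%.' || domain (with the dot), which is a little stricter than the join in the answer. Both choices are assumptions for illustration, not part of the original answer.

```sql
-- Hypothetical setup: the real tlds table would hold every entry of the Public Suffix List.
CREATE TEMP TABLE tlds (domain text PRIMARY KEY);
INSERT INTO tlds (domain) VALUES ('co.uk'), ('uk'), ('co');

CREATE TEMP TABLE sample (url varchar(100) NOT NULL);
INSERT INTO sample (url) VALUES ('www.sample.co.uk'), ('m.sam.co'), ('sam.co');

-- Keep the longest suffix that each url ends with.
SELECT DISTINCT ON (s.url)
       s.url,
       t.domain AS tld
FROM sample s
JOIN tlds t
  ON s.url LIKE '%.' || t.domain
ORDER BY s.url, length(t.domain) DESC;
-- m.sam.co         | co
-- sam.co           | co
-- www.sample.co.uk | co.uk
```

Requiring the dot in front of the suffix keeps a url such as 'telco' from matching the suffix 'co'.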

vjeantet · Accepted Answer · 2022-02-27 14:15:29Z

I use split_part(url, '/', 3) for this:

select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;

output

stackoverflow.com
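
Note that split_part(url, '/', 3) relies on the URL starting with a scheme: 'https://host/path' splits into 'https:', '' and 'host'. The sample data in the question has no scheme, so the third piece is empty there. Below is a sketch of a fallback, assuming that any value without a leading scheme:// starts directly with the host name (that assumption is mine, not the answer's):

```sql
SELECT url,
       split_part(url, '/', 3) AS third_piece,
       CASE
           -- A scheme is a letter followed by letters, digits, '+', '.' or '-', then '://'
           WHEN url ~ '^[a-z][a-z0-9+.-]*://' THEN split_part(url, '/', 3)
           ELSE split_part(url, '/', 1)
       END AS host
FROM (VALUES ('https://stackoverflow.com/questions/56019744'),
             ('www.sample.co.uk'),
             ('sample.co.uk/path')) AS v(url);
-- url                                          | third_piece       | host
-- https://stackoverflow.com/questions/56019744 | stackoverflow.com | stackoverflow.com
-- www.sample.co.uk                             |                   | www.sample.co.uk
-- sample.co.uk/path                            |                   | sample.co.uk
```

This only isolates the host. Subdomains such as www3. or biz. still have to be stripped separately.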

If you want to remove the "www.": select regexp_replace(split_part('https://stackoverflow.com/questions/56019744', '/', 3), '^www\.', '');
– Alexandre Testu, Commented over a year ago

Rémy Baron · Accepted Answer · 2019-05-07 10:32:01Z

You can try this:

WITH tlds AS (
    SELECT * FROM (VALUES ('.co.uk'), ('.co'), ('.uk')) a(tld)
),
sample AS (
    SELECT * FROM (VALUES ('sample.co.uk'),
                          ('www.sample.co.uk'),
                          ('www3.sample.co.uk'),
                          ('biz.sample.co.uk'),
                          ('digital.testing.sam.co'),
                          ('sam.co'),
                          ('m.sam.co')
                  ) a(url)
)
SELECT url,
       regexp_replace(url, '(.*\.)(.*' || replace(tld, '.', '\.') || ')', '\2') AS "domain"
FROM (
    SELECT DISTINCT url,
           first_value(tld) OVER (PARTITION BY url ORDER BY length(tld) DESC) AS tld
    FROM sample
    JOIN tlds ON url LIKE '%' || tld
) a;

Eugene Chernyavsky · Accepted Answer · 2023-11-24 19:04:52Z

Here is my solution (a little bit more complex):

WITH
fqdn AS (
    SELECT
        row_number() over () as id,
        url,
        FQDN(url) AS "fqdn"
    FROM urls
),
stored_tld AS (
    SELECT DISTINCT ON (id)
        id,
        url,
        tld,
        fqdn
    FROM fqdn
    LEFT JOIN tlds
           ON reverse(fqdn(url)) LIKE 
              replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"
    ORDER BY id, -- for correct distinct on
             tld LIKE '%*%' DESC, -- prefer tld with wildcard
             length(tld) DESC -- prefer longer tld
),  extrated_domain AS (
    SELECT
        id,
        url,
        fqdn,
        reverse(
                substring(
                        reverse(fqdn), 
                        '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)', 
                        '#'
                )
        ) AS "extracted_domain"
    FROM stored_tld
)
SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain

Fiddle with comments : https://dbfiddle.uk/QSDKx2-t

FQDN extracting

To extract the FQDN from a URL, you can use a more complex regexp (https://regex101.com/r/vT9k3d/2):

/^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)/igm

You can also wrap this regexp in a function:

CREATE OR REPLACE FUNCTION fqdn(url TEXT)
  RETURNS TEXT
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
select (regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1]
$function$;
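
To see what this regexp keeps and what it drops, here is a small sketch that applies the same pattern inline. It uses regexp_match (PostgreSQL 10 and later) instead of regexp_matches, and the input URLs are made up for illustration:

```sql
SELECT url,
       (regexp_match(url,
                     '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)',
                     'i'))[1] AS fqdn
FROM (VALUES ('https://www.sample.co.uk/path?x=1'),
             ('http://user@biz.sample.co.uk:8080/'),
             ('www3.sample.co.uk'),
             ('HTTPS://M.SAM.CO')) AS v(url);
-- https://www.sample.co.uk/path?x=1  | sample.co.uk
-- http://user@biz.sample.co.uk:8080/ | biz.sample.co.uk
-- www3.sample.co.uk                  | www3.sample.co.uk
-- HTTPS://M.SAM.CO                   | M.SAM.CO
```

Only a literal "www." prefix is removed, and the captured host keeps its original case, so wrapping the result in lower() may be worth considering before comparing it with lowercased suffixes.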

Save row order

There can be duplicates, especially after extracting domains, so it is better to preserve the row order using row_number() over ():

SELECT
    row_number() over () as id,
    url,
    (regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1] AS "fqdn"
FROM urls

Pattern matching

First, we need to find every suffix pattern that matches the domain:

fqdn LIKE '%.' || replace(lower(tld), '*', '%') COLLATE "C"

or, better yet, compare reversed strings, so that the comparison becomes a prefix match that indexes may speed up later:

reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"
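
A worked example may make the reversed comparison easier to follow. The sketch below evaluates the same expression for a few hand-picked (fqdn, tld) pairs. The pairs are illustrative assumptions, not rows from a real tlds table:

```sql
SELECT fqdn,
       tld,
       reverse(fqdn) AS rev_fqdn,
       replace(lower(reverse(tld)), '*', '%') || '.%' AS like_pattern,
       reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C" AS matches
FROM (VALUES ('biz.sample.co.uk', 'co.uk'),
             ('biz.sample.co.uk', 'uk'),
             ('biz.sample.co.uk', 'co'),
             ('shop.example.ck',  '*.ck')) AS v(fqdn, tld);
-- biz.sample.co.uk | co.uk | ku.oc.elpmas.zib | ku.oc.% | t
-- biz.sample.co.uk | uk    | ku.oc.elpmas.zib | ku.%    | t
-- biz.sample.co.uk | co    | ku.oc.elpmas.zib | oc.%    | f
-- shop.example.ck  | *.ck  | kc.elpmaxe.pohs  | kc.%.%  | t
```

Both co.uk and uk match the first domain, which is why a ranking step is needed afterwards.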

Result ranking

To pick the best-matching TLD, these ordering rules should be used:

ORDER BY id, -- for correct distinct on
         tld LIKE '%*%' DESC, -- prefer tld with wildcard
         length(tld) DESC -- prefer longer tld
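
As a sketch of how these rules interact with DISTINCT ON, the query below ranks a hand-written list of candidate matches. The candidate rows are assumptions chosen for illustration. In the real query they come from the join against tlds.

```sql
SELECT DISTINCT ON (fqdn)
       fqdn,
       tld
FROM (VALUES ('biz.sample.co.uk', 'uk'),
             ('biz.sample.co.uk', 'co.uk'),
             ('a.b.example.ck',   'ck'),
             ('a.b.example.ck',   '*.ck')) AS c(fqdn, tld)
ORDER BY fqdn,                 -- DISTINCT ON keeps the first row per fqdn
         tld LIKE '%*%' DESC,  -- true sorts before false, so wildcard rules win
         length(tld) DESC;     -- then the longest suffix
-- a.b.example.ck   | *.ck
-- biz.sample.co.uk | co.uk
```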

Extracting domain with suffix name without subdomain

We will use PostgreSQL's substring function with SQL regular expression pattern matching.

After some experiments, I found that this works for me (for reverse suffix abc.def):

select substring('db.abc.def.fsdfsd', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');
select substring('db.abc.def', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');

The resulting extraction expression is:

reverse(
        substring(
                reverse(fqdn), 
                '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)', 
                '#'
        )
) AS "extracted_domain"
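
Applied to the sample data from the question, the expression can be checked like this. It is a sketch in which the tld for each row is supplied by hand rather than looked up, and the *.ck row is an added illustration of a wildcard rule:

```sql
SELECT fqdn,
       tld,
       reverse(
           substring(
               reverse(fqdn),
               -- the part between the two #" markers is what substring returns
               '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)',
               '#'
           )
       ) AS extracted_domain
FROM (VALUES ('digital.testing.sam.co', 'co'),
             ('www3.sample.co.uk',      'co.uk'),
             ('sam.co',                 'co'),
             ('shop.example.ck',        '*.ck')) AS v(fqdn, tld);
-- digital.testing.sam.co | co    | sam.co
-- www3.sample.co.uk      | co.uk | sample.co.uk
-- sam.co                 | co    | sam.co
-- shop.example.ck        | *.ck  | shop.example.ck
```

In SQL regular expressions the dot is a literal character (the wildcards are _ and %), so '.[^.]*' means "a dot followed by one more label".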

Merging results

In the final step, we add coalesce as a fallback for domains that weren't matched, plus a flag that shows whether the domain was extracted or not.

SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain


Eugene Chernyavsky · Accepted Answer · 2023-11-24 19:27:36Z

Improve performance

In my case, I need to extract domains from 1M+ URLs.

First of all, I cached the computed patterns in generated columns to speed up the calculations:

/* LIKE column and index */

alter table tlds
add tld_like text GENERATED ALWAYS AS (replace(lower(reverse(tld)), '*', '%') || '.%') STORED;

/* Create pattern column */

alter table tlds
add tld_pattern text GENERATED ALWAYS AS (
  '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)' 
) STORED;

Then I added an index for prefix matching:

CREATE INDEX CONCURRENTLY tlds_like_idx
ON tlds (tld_like COLLATE "C");

Unfortunately, this doesn't work: prefix matching works in the opposite direction. Such an index would have to be on the domains table, and in my case that isn't possible.

Profiling

Profiling also showed that the LEFT JOIN generates 1M x 10k TLDs = 10B rows in memory and then filters out 99.99% of them:

LEFT JOIN tlds
           ON reverse(fqdn(url)) LIKE 
              replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"

Dirty hack

There are no top-level domains with fewer than 2 letters, so the last 3 characters of the domain (the first 3 of the reversed string) can be used as a filter key, and the index on that key will be used:

alter table tlds
add tld_3l text GENERATED ALWAYS AS (
    left(replace(lower(reverse(tld)), '*', '') || '.', 3)
) STORED;

CREATE INDEX CONCURRENTLY tlds_3l_idx
ON tlds (tld_3l COLLATE "C");
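
To illustrate what the key looks like on both sides of the join, here is a sketch with a few hand-picked suffixes and domains (illustrative values, not taken from the real tables):

```sql
SELECT t.tld,
       left(replace(lower(reverse(t.tld)), '*', '') || '.', 3) AS tld_3l,
       f.fqdn,
       left(reverse(f.fqdn), 3) AS f3l
FROM (VALUES ('co'), ('co.uk'), ('com'), ('*.ck')) AS t(tld)
JOIN (VALUES ('m.sam.co'),
             ('www.sample.co.uk'),
             ('example.com'),
             ('shop.example.ck')) AS f(fqdn)
  -- equality on the 3-character key, which is the part an index can serve
  ON left(reverse(f.fqdn), 3) = left(replace(lower(reverse(t.tld)), '*', '') || '.', 3);
-- co    | oc. | m.sam.co         | oc.
-- co.uk | ku. | www.sample.co.uk | ku.
-- com   | moc | example.com      | moc
-- *.ck  | kc. | shop.example.ck  | kc.
```

The key is only a pre-filter. Different suffixes can share it (uk and co.uk both give 'ku.'), so the LIKE condition still has to be checked on the surviving rows.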

Final query:

WITH
fqdn AS (
    SELECT
        row_number() over () as id,
        url,
        fqdn(url) AS "fqdn",
        left(reverse(fqdn(url)), 3) as "f3l",
        reverse(fqdn(url)) as "rev_fqdn"
    FROM technical.domains_sample_test
),
stored_tld AS (
    SELECT DISTINCT ON (id)
        id,
        url,
        fqdn,
        tld,
        tld_like,
        tld_pattern
    FROM fqdn
    LEFT JOIN technical.webisite_tlds
           ON f3l = tld_3l 
          AND rev_fqdn LIKE tld_like COLLATE "C"

    ORDER BY id, 
             tld LIKE '%*%' DESC, -- prefer tld with wildcard
             length(tld) DESC -- prefer longer tld
),  extrated_domain AS (
    SELECT
        id,
        url,
        fqdn,
        -- tld_pattern, as defined in the generated column above, already holds the complete pattern
        reverse(substring(reverse(fqdn), tld_pattern, '#')) AS "extracted_domain"
    FROM stored_tld
)
SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain

New fiddle: https://dbfiddle.uk/GLcG6lQo
