3

I have a database that contains website URLs. From those URLs I'd like to extract the domain name. Here are two (quite different) examples:

http://www.example.com       -> example.com
example.co.uk/dir/index.html -> example.co.uk

To do this I am using a regular expression with the REGEXP_SUBSTR and REGEXP_REPLACE functions that Oracle provides. I use REGEXP_REPLACE to strip the leading http:// or https:// and the www. by replacing them with an empty string. Then I use REGEXP_SUBSTR to take everything from the beginning of the string up to the first /, or the whole string if there is no /. My code looks like this:

REGEXP_SUBSTR(REGEXP_REPLACE(website_url, '^http[s]?://(www\.)?|^www\.', '', 1), '(.+?)(/|$)')

Everything works as expected, except that my regex fails to exclude the /:

example.com/dir/index.html -> example.com/

I would like to get rid of the /. How do I do that?

5 Answers

7

Use this:

WITH tab AS 
 (SELECT 'https://www.example.co.uk/dir/index.html' AS website_url 
    FROM dual)
SELECT REGEXP_SUBSTR(REGEXP_REPLACE(website_url, '^http[s]?://(www\.)?|^www\.', '', 1), '\w+(\.\w+)+') 
  FROM tab;

output:

|REGEXP_SUBSTR(REGEXP_REPLACE(W|
--------------------------------
|example.co.uk                 |
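
The trailing / in the question's result appears because a substring search returns the entire match, and the (/|$) group is part of that match. The Python sketch below reproduces the same logic outside the database, purely as an illustration: Python's re module is not Oracle's regex engine, and the helper names (strip_prefix, whole_match, first_group, word_pattern) are made up for this example. In Oracle 11g and later, REGEXP_SUBSTR also accepts a sixth subexpression argument, which should allow the same "first group only" selection, though that is not tested here.

```python
import re

# Mirrors the REGEXP_REPLACE step: drop a leading scheme and/or "www."
PREFIX = re.compile(r'^https?://(www\.)?|^www\.')

def strip_prefix(url):
    return PREFIX.sub('', url, count=1)

def whole_match(url):
    # The entire match, which includes the "/" consumed by (/|$).
    return re.search(r'(.+?)(/|$)', strip_prefix(url)).group(0)

def first_group(url):
    # Only the first capture group, which ends before the "/".
    return re.search(r'(.+?)(/|$)', strip_prefix(url)).group(1)

def word_pattern(url):
    # The pattern from this answer: "/" can never be part of the match.
    return re.search(r'\w+(\.\w+)+', strip_prefix(url)).group(0)

if __name__ == '__main__':
    print(whole_match('example.com/dir/index.html'))
    print(first_group('example.com/dir/index.html'))
    print(word_pattern('https://www.example.co.uk/dir/index.html'))
```

The \w+(\.\w+)+ pattern sidesteps the problem because it only ever matches word characters and dots, so it stops at the first /.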

3 Comments

This works very nicely! Thank you very much. Sadly, it doesn't work for URLs that include a -. For example, the URL www.top.i-am-a-example.com gives top.i. I tried but I can't fix it. Do you know how?
Adding a permissible range could be one solution to this: REGEXP_SUBSTR(REGEXP_REPLACE(website_url, '^http[s]?://(www\.)?|^www\.', '', 1), '[a-z,A-Z,0-9,-]+(\.\w+)+')
Yes, adding a range seems to be the only option. Using your code I still get top.i. I am not an expert on regex, so I don't know why. It looks correct to me.
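
A likely reason the range version still returns top.i: only the first label uses the new character class, while the repeated group (\.\w+)+ still relies on \w, which does not include -. The commas inside the brackets are also literal characters rather than separators. The Python sketch below illustrates this on the already-stripped host name; it assumes Python's re behaves like Oracle for these particular patterns, and the variable names are hypothetical. An Oracle pattern along the lines of '[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+' would be the direct translation, untested here.

```python
import re

host = 'top.i-am-a-example.com'  # prefix already stripped

# Pattern from the comment: the commas are literal characters inside the
# class, and the repeated group still uses \w, which excludes "-".
from_comment = re.search(r'[a-z,A-Z,0-9,-]+(\.\w+)+', host).group(0)

# Sketch of a fix: allow "-" in every label, not just the first one.
LABEL = r'[A-Za-z0-9-]+'
fixed = re.search(LABEL + r'(\.' + LABEL + r')+', host).group(0)

if __name__ == '__main__':
    print(from_comment)
    print(fixed)
```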
5

Thanks to the hints in the answers I finally got it working!

The code I am using now looks like this:

REGEXP_REPLACE(website_url, '(http[s]?://)?(www\.)?(.*?)((/|:)(.)*|$)', '\3')

Thanks for the help everybody!
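
For readers who want to try the idea outside Oracle, here is a minimal Python sketch of the same pattern: optional scheme, optional www., a lazy capture of the host, then either a / or : followed by the rest, or the end of the string. It is an illustration only. The leading ^ is an addition so that Python performs exactly one substitution, the function name domain is made up, and Python's re is not guaranteed to match Oracle's behaviour in every case.

```python
import re

# Same structure as the REGEXP_REPLACE pattern above, anchored with ^ so
# that only one substitution can happen.
PATTERN = re.compile(r'^(https?://)?(www\.)?(.*?)((/|:)(.)*|$)')

def domain(url):
    # Replace the whole match with capture group 3, the host part.
    return PATTERN.sub(r'\3', url, count=1)

if __name__ == '__main__':
    for url in ('http://www.example.com',
                'example.co.uk/dir/index.html',
                'https://www.example.com:8080/path'):
        print(url, '->', domain(url))
```

The (/|:) alternative means a port such as :8080 is cut off along with the path.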

1

I'm not sure whether Oracle supports ?: for excluding a group from capture.

REGEXP_REPLACE(website_url, '^(?:(?:http[s]?://)?www\.)?(.*?)(?:/.*|$)', '\1')

If it doesn't, then try this one:

REGEXP_REPLACE(website_url, '^((http[s]?://)?www\.)?(.*?)(/.*|$)', '\3')

1 Comment

As far as I can see, Oracle does not support ?:. The second one works as expected in most cases, but somehow it fails for URLs like www.example.com/dir/index.html, where it returns example.comdir/index.html.
0

You could use the following regex, which matches something_without_a_dot.something_without_a_dot at the end of the string. The answer is in the first group. If you also need the TLD, enclose everything except the $ in ().

([^.]+)\.[^.]+$

In SQL, that gives:

SQL> select regexp_replace('sub1.sub2.domain.com', '^.*?([^.]+)\.[^.]+$', '\1') from dual;

REGEXP
------
domain

The non-greedy .*? at the start consumes everything before the captured part, so the leading subdomains are dropped from the result.

To get the domain name plus the TLD:

SQL> select regexp_replace('sub1.sub2.domain.com', '^.*?([^.]+\.[^.]+)$', '\1') from dual;

REGEXP_REP
----------
domain.com

To take into account co.uk:

SQL> select regexp_replace('sub1.sub2.domain.co.uk', '^.*?([^.]+\.(co\.uk|[^.]+))$', '\1') from dual;

REGEXP_REPLA
------------
domain.co.uk
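
The three patterns above differ only in the alternation after the last captured label, so they can be generated from a list of two-part suffixes. The Python sketch below shows that idea under some assumptions: the input is a bare host name with any path already removed, the suffix list (TWO_PART_SUFFIXES) is a hand-maintained example rather than a complete one, and the function name registrable is hypothetical.

```python
import re

# Hand-maintained example list; every multi-part suffix has to be added here.
TWO_PART_SUFFIXES = ('co.uk',)

def registrable(host):
    # Build the alternation, e.g. "co\.uk|[^.]+", from the suffix list.
    alternatives = '|'.join(re.escape(s) for s in TWO_PART_SUFFIXES)
    pattern = r'^.*?([^.]+\.(?:' + alternatives + r'|[^.]+))$'
    # Replace the whole host with the captured "domain.suffix" part.
    return re.sub(pattern, r'\1', host)

if __name__ == '__main__':
    print(registrable('sub1.sub2.domain.com'))
    print(registrable('sub1.sub2.domain.co.uk'))
```

Any suffix missing from the list falls back to the single-label rule, so a host under an unlisted two-part suffix would be cut one label too short.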

0

Why not use (HTTP)URITYPE and extract the host from that?
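
The general idea of letting a URL parser find the host, instead of a regex, can be sketched in Python with the standard urllib.parse module. This is an analogy only and says nothing about which methods Oracle's URI types offer; the function name host_of is made up. The sketch assumes that inputs without a scheme, such as example.co.uk/dir/index.html, should be treated as starting with the host.

```python
from urllib.parse import urlsplit

def host_of(url):
    # urlsplit only recognises a host after "//", so add it when the
    # scheme is missing.
    if '://' not in url:
        url = '//' + url
    host = urlsplit(url).hostname or ''
    # Drop a leading "www." to match the examples in the question.
    return host[4:] if host.startswith('www.') else host

if __name__ == '__main__':
    print(host_of('http://www.example.com'))
    print(host_of('example.co.uk/dir/index.html'))
```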
