Postgres - how to split and join?

Question

Is there a way to split a column into tokens and join them again, as you can in programming languages such as Python, Java, or Ruby?

I have a column with URLs such as "http://www.Yahoo.com", and I want to extract "Yahoo.com" from it (the main domain, NOT the subdomain). The URLs can be of the forms:

I was planning to use a regex to extract everything after http:// and before the next slash, then split the result on the period (.) and join the last two tokens.

With the regex, I can extract www.yahoo.com from http://www.yahoo.com. With the splits and joins, I could get yahoo.com from www.yahoo.com. The problem is that I don't know how to do splits and joins in Postgres.

Anyone know of a way? Or better alternative?

Vinod Kurup · Accepted Answer · 2013-07-27 02:31:59Z

4

This isn't quite the approach you asked for, but should get what you want:

vinod=# select * from "table";
            url                
----------------------------------
 http://www.domain.com
 http://domain.com
 http://domain.com/page/page1
 http://www.domain.com/page/page2
 http://www.domain.com/
(5 rows)

vinod=# select substring(substring(url from 'http[s]*://([^/]+)') from '\w+\.\w+$') from "table";
 substring  
------------
 domain.com
 domain.com
 domain.com
 domain.com
 domain.com
(5 rows)

The inner substring call pulls out the full domain, and the outer substring call pulls out its last two dot-separated fragments. PostgreSQL's split and join functions are not as powerful as those in your average scripting language, so I tend to do this kind of thing after I pull the data out of the DB, if I can.
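For comparison, the split-and-join approach described in the question can also be written directly in SQL. The following is only a sketch: the `urls` CTE and its sample rows are made up for illustration, and it assumes a PostgreSQL version with CTEs, `string_to_array`, array slices, and `array_to_string`. Like the regex version, it naively keeps the last two dot-separated tokens.

```sql
-- Sketch only: "urls" and its rows are hypothetical stand-ins for the real table.
WITH urls(url) AS (
    VALUES ('http://www.yahoo.com'),
           ('http://domain.com/page/page1'),
           ('https://www.domain.com/')
),
hosts AS (
    -- Step 1: the regex keeps everything between the scheme and the next slash.
    SELECT url, substring(url from '^https?://([^/]+)') AS host
    FROM urls
),
tokens AS (
    -- Step 2: split the host on the period.
    SELECT url, string_to_array(host, '.') AS parts
    FROM hosts
)
-- Step 3: slice out the last two tokens and join them back together.
SELECT url,
       array_to_string(
           parts[array_length(parts, 1) - 1 : array_length(parts, 1)],
           '.') AS last_two
FROM tokens;
```

For the three sample rows this should give yahoo.com, domain.com, and domain.com.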

Craig Ringer · Accepted Answer · 2013-07-27 14:28:01Z

Splitting things into tokens can be accomplished in quite a few ways:

regexp_split_to_table / regexp_split_to_array
string_to_array (for simple fixed-delimiter splits)
Manual substring extraction or substring(... from 'pattern')
Full text search's to_tsvector and to_tsquery
Procedural language libraries, like Perl or Python URL libraries, Python + NLTK for natural language processing, etc
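As a rough illustration of the built-in options in this list, the statements below run a few of them against literal strings. The sample values are assumptions chosen for the example, not taken from the question's data.

```sql
-- string_to_array: fixed delimiter, returns a text[]
SELECT string_to_array('www.yahoo.com', '.');               -- {www,yahoo,com}

-- regexp_split_to_array: the delimiter is a regular expression
SELECT regexp_split_to_array('www.yahoo.com/a/b', '[./]');  -- {www,yahoo,com,a,b}

-- regexp_split_to_table: one row per token instead of an array
SELECT regexp_split_to_table('www.yahoo.com', '\.') AS token;

-- substring(... from 'pattern'): returns the first parenthesized group
SELECT substring('http://www.yahoo.com/index' from '://([^/]+)');  -- www.yahoo.com
```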

In this case you could do your URL splitting with a regular expression using regexp_split_.... and that's probably OK for many uses - but probably not this one. Consider:

My domain, ringerc.id.au (that is the "main" domain)
www.ecu.edu.au ("main" domain is ecu.edu.au)
www.transperth.wa.gov.au ("main" domain is transperth.wa.gov.au)
tartarus.uwa.edu.au ("main" domain is uwa.edu.au)

Good luck dealing with all the national registry and sub-registry variations using a regular expression. Use a proper URL parser to extract the domain, then a proper domain-aware library to work out what the "main" domain is for your purposes. I'd recommend starting with plperl and the URI or URI::Split modules, or with the URL parser of whatever supported procedural language (Python, Tcl, whatever) you want. Then find a suitable library for that language that can identify domains and subdomains meaningfully according to the criteria you want, and use that rather than just relying on a regular expression.
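To show what "domain-aware" means in practice, here is a minimal plain-SQL sketch. It is not the procedural-language library approach recommended above: it swaps in a tiny hand-written suffix table, and the `suffixes` and `hosts` names and their contents are assumptions chosen only to match the examples in this answer. A real setup would need a complete, maintained suffix list.

```sql
-- Sketch only: both CTEs are hypothetical, hand-picked sample data.
WITH suffixes(suffix) AS (
    VALUES ('com'), ('id.au'), ('edu.au'), ('gov.au'), ('wa.gov.au')
),
hosts(host) AS (
    VALUES ('ringerc.id.au'),
           ('www.ecu.edu.au'),
           ('www.transperth.wa.gov.au'),
           ('tartarus.uwa.edu.au'),
           ('www.yahoo.com')
)
SELECT DISTINCT ON (h.host)
       h.host,
       -- Keep the one label immediately before the suffix, plus the suffix.
       substring(h.host from
                 '([^.]+\.' || replace(s.suffix, '.', '\.') || ')$') AS main_domain
FROM hosts h
JOIN suffixes s
  ON h.host LIKE ('%.' || s.suffix)        -- host ends with ".suffix"
-- DISTINCT ON keeps the first row per host, so the longest suffix wins.
ORDER BY h.host, length(s.suffix) DESC;
```

Hosts that match no suffix in the table are dropped by the join, which is one more reason to treat this as an illustration rather than a replacement for a proper library.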

When joining you similarly have many options:

array_to_string
string_agg
The || concatenation operator
procedural language string operations and libraries
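A short sketch of the built-in joining options, again using made-up literal values rather than the question's data:

```sql
-- array_to_string: join array elements with a delimiter
SELECT array_to_string(ARRAY['yahoo', 'com'], '.');        -- yahoo.com

-- string_agg: aggregate rows into one string; ORDER BY fixes the token order
SELECT string_agg(token, '.' ORDER BY ord)
FROM (VALUES (1, 'yahoo'), (2, 'com')) AS t(ord, token);    -- yahoo.com

-- ||: plain concatenation of individual values
SELECT 'yahoo' || '.' || 'com';                             -- yahoo.com
```

array_to_string fits when the tokens are already in an array, for example after string_to_array. string_agg fits when they are in separate rows, for example after regexp_split_to_table.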

For URL work, again I'd suggest doing this with a PL that has a proper native URL library.

Snow Blind · Accepted Answer · 2013-07-27 01:22:40Z

0

You can match them with \w+.[^.]+$

http://www.domain.com -> domain.com
http://domain.com -> domain.com
http://domain.com/page/page1 -> domain.com/page/page1
http://www.domain.com/ -> domain.com/
http://www.domain.com/page/page2 -> domain.com/page/page2

Snow Blind Over a year ago

Then what about something like select substring(substring(url from '(?:\w+[.])?\w+[.]\w+') from '\w+[.]\w+$') from "table"
