2

Is there a way to split a column into tokens and join them again (as you can in other programming languages such as Python, Java, or Ruby)?

I have a column with URLs such as "http://www.Yahoo.com", and I want to extract "Yahoo.com" from each (the main domain, NOT the subdomain). The URLs can be of the forms:

I was planning on using a regex to extract everything after http:// and before the next slash, then splitting the result on the period (.) and joining the last 2 tokens.

With the regex, I can extract www.yahoo.com from http://www.yahoo.com. With the splits/joins, I can get yahoo.com from www.yahoo.com. The problem is that I don't know how to do splits and joins in Postgres.

Anyone know of a way? Or better alternative?

3 Answers

4

This isn't quite the approach you asked for, but should get what you want:

vinod=# select * from table;
            url                
----------------------------------
 http://www.domain.com
 http://domain.com
 http://domain.com/page/page1
 http://www.domain.com/page/page2
 http://www.domain.com/
(5 rows)

vinod=# select substring(substring(url from 'http[s]*://([^/]+)') from '\w+\.\w+$') from table;
 substring  
------------
 domain.com
 domain.com
 domain.com
 domain.com
 domain.com
(5 rows)

The inner substring call pulls out the full host name, and the outer substring call keeps the last two dot-separated fragments. PostgreSQL's split and join functions are not as powerful as those in your average scripting language, so I tend to do this kind of work after I pull the data out of the DB, if I can.
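
For comparison, the split-and-join approach from the question can also be written in plain SQL. This is only a sketch: the sample rows and the CTE names (urls, hosts, tokens) are made up for illustration, and it assumes the host contains no port or userinfo.

```sql
-- Hypothetical sample data; table and column names are assumptions.
WITH urls(url) AS (
    VALUES ('http://www.domain.com'),
           ('https://domain.com/page/page1'),
           ('http://www.domain.com/page/page2')
),
hosts AS (
    -- Step 1: capture everything between the scheme and the next slash.
    SELECT url, substring(url from 'https?://([^/]+)') AS host
    FROM urls
),
tokens AS (
    -- Step 2: split the host on periods into a text array.
    SELECT url, string_to_array(host, '.') AS parts
    FROM hosts
)
-- Step 3: slice the last two elements and join them with a period.
SELECT url,
       array_to_string(
           parts[array_length(parts, 1) - 1 : array_length(parts, 1)],
           '.'
       ) AS domain
FROM tokens;
```

Each of the three sample rows should come back with domain.com. Like the regex version, this keeps exactly two labels, so it has the same blind spot for hosts under multi-part suffixes.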


1

Splitting things into tokens can be accomplished in quite a few ways:

  • regexp_split_to_table / regexp_split_to_array
  • string_to_array (for simple fixed delimiter splits)
  • Manual substring extraction or substring(... from 'pattern')
  • Full text search's to_tsvector and to_tsquery
  • Procedural language libraries, like Perl or Python URL libraries, Python + NLTK for natural language processing, etc
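
As a rough illustration of the built-in options above, here is how a few of them behave on a literal host name. split_part is not in the list but is a standard PostgreSQL function; it is added here as a convenience.

```sql
-- string_to_array: split on a fixed delimiter, returns text[].
SELECT string_to_array('www.domain.com', '.');          -- {www,domain,com}

-- regexp_split_to_array: split on a regex (here a literal period).
SELECT regexp_split_to_array('www.domain.com', '[.]');  -- {www,domain,com}

-- regexp_split_to_table: same split, but one row per token.
SELECT regexp_split_to_table('www.domain.com', '[.]') AS token;

-- split_part: pick a single 1-based field without building an array.
SELECT split_part('www.domain.com', '.', 2);            -- domain
```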

In this case you could split the URL with a regular expression using the regexp_split_to_... functions, and that's probably OK for many uses, but probably not this one. Consider:

  • My domain, ringerc.id.au (that is the "main" domain)
  • www.ecu.edu.au ("main" domain is ecu.edu.au)
  • www.transperth.wa.gov.au ("main" domain is transperth.wa.gov.au)
  • tartarus.uwa.edu.au ("main" domain is uwa.edu.au)

Good luck dealing with all the national registry and sub-registry variations using a regular expression. Use a proper URL parser to extract the domain, then a proper domain-aware library to work out what the "main" domain is for your purposes. I'd recommend starting with plperl and the URI or URI::Split modules, or the URL parser of whatever supported procedural language (Python, Tcl, etc.) you prefer. Then find a library for that language that can identify domains and subdomains according to the criteria you want, and use that rather than relying on a regular expression alone.
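
A minimal sketch of the first half of that advice, using PL/Python rather than plperl. It assumes the untrusted plpython3u language is available on the server; the function name url_host is hypothetical. It only extracts the host with Python's standard urllib.parse and does not attempt the "main" domain step.

```sql
-- Assumes plpython3u is installed and you are allowed to use it:
-- CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION url_host(url text)  -- hypothetical helper name
RETURNS text
LANGUAGE plpython3u
IMMUTABLE
AS $$
    from urllib.parse import urlsplit
    if url is None:
        return None
    # hostname is lower-cased and excludes userinfo and port;
    # it is None when the URL has no network location.
    return urlsplit(url).hostname
$$;

SELECT url_host('http://www.Yahoo.com/news?id=1');  -- www.yahoo.com
```

Working out the registrable domain from that host (ecu.edu.au versus transperth.wa.gov.au) would still need a suffix-aware library, for example one based on the Public Suffix List; no particular library is assumed here.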

When joining you similarly have many options:

  • array_to_string
  • string_agg
  • The || concatenation operator
  • procedural language string operations and libraries
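
A short sketch of the built-in joining options, using literal values so it can be run as-is; the pos and token column names are made up for the example.

```sql
-- array_to_string: join array elements with a delimiter.
SELECT array_to_string(ARRAY['domain', 'com'], '.');   -- domain.com

-- string_agg: aggregate rows into one string; the ORDER BY inside the
-- aggregate keeps the token order deterministic.
SELECT string_agg(token, '.' ORDER BY pos)
FROM (VALUES (1, 'domain'), (2, 'com')) AS t(pos, token);   -- domain.com

-- ||: plain concatenation of individual values.
SELECT 'domain' || '.' || 'com';                        -- domain.com
```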

For URL work, again I'd suggest doing this with a PL that has a proper native URL library.


0

You can match them with \w+.[^.]+$ (note that any path after the host is kept):

http://www.domain.com -> domain.com
http://domain.com -> domain.com
http://domain.com/page/page1 -> domain.com/page/page1
http://www.domain.com/ -> domain.com/
http://www.domain.com/page/page2 -> domain.com/page/page2

1 Comment

Then what about something like select substring(substring(url from '(\w+[.])?\w+[.]\w+') from '\w+[.]\w+$') from table?
