
I have been struggling for hours with something that should be quite simple, and I would appreciate any advice. I have a Postgres database of addresses with a field, building_name, which in many cases actually contains building or apartment numbers. These numbers may or may not be suffixed with a letter, e.g. 32A, 24b. They can appear anywhere in the string, including the start or end, and may be followed by whitespace or another non-alphanumeric separator such as a slash or dash. Some examples:

  • '11B' should return '11B'
  • 'BURNFOOT COTTAGE' should return nothing as there are no numbers
  • '2/1' should return '2'
  • '15a' should return '15a'
  • '6 CAROLINA COURT' should return '6'
  • 'PATRICK THOMAS COURT 83B' should return '83B'
  • 'UNIT 51' should return '51'
  • '1/6 NEW ASSEMBLY CLOSE' should return '1'
  • '15E GREENVALE' should return '15E'

I am trying to achieve this using a regular expression. The closest I can get is '(\d+\w+)' which works for some of the above but does not work for:

'2/1' or '6 CAROLINA COURT' or '1/6 NEW ASSEMBLY CLOSE'
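One way to see exactly where the pattern falls short is to run it over the examples above. The sketch below uses Python's re module as a stand-in for the Postgres regex engine, which is an assumption (the two engines differ in some escapes), and failing_samples is a hypothetical helper name, not something from the database.

```python
import re

# Pattern from the question: one or more digits, then ONE OR MORE word characters.
PATTERN = re.compile(r"(\d+\w+)")

SAMPLES = [
    "11B",
    "BURNFOOT COTTAGE",
    "2/1",
    "15a",
    "6 CAROLINA COURT",
    "PATRICK THOMAS COURT 83B",
    "UNIT 51",
    "1/6 NEW ASSEMBLY CLOSE",
    "15E GREENVALE",
]


def failing_samples(samples):
    """Return samples that contain a digit but that PATTERN fails to match."""
    return [s for s in samples if re.search(r"\d", s) and not PATTERN.search(s)]


if __name__ == "__main__":
    print(failing_samples(SAMPLES))
```

On these nine examples it reports '2/1', '6 CAROLINA COURT' and '1/6 NEW ASSEMBLY CLOSE': in each of them the first run of digits is a single digit followed by a non-word character, so \w+ has nothing to consume.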

I have followed the advice in "SQL split string at first occurance of a number", but it does not work for my requirements.

Any advice would be hugely appreciated, I am completely stuck!

Many thanks in advance,

Mark

  • Is this raw Postgres SQL, or are you using a language with an API, and if so, what flavour of regex? Commented Jan 29, 2015 at 16:21
  • Hi - I have been testing out my regex logic using regexr.com and will use the regex in my Postgres query e.g. select substring(building_name from '(\d+\w+)') AS building_num Commented Jan 29, 2015 at 16:38
  • You need to define in English the rules that you want to follow before you can implement them in a regex. Also, this may be illuminating: mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses Commented Jan 29, 2015 at 18:19
  • Thanks for the advice Andy and that link is excellent - will be very useful indeed for my project Commented Jan 30, 2015 at 17:32

3 Answers


Your regexp doesn't quite work because you use the + quantifier, which requires one or more word characters after the digits. If you want zero or one, use the ? quantifier: '\d+\w?'.
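As a rough check, the suggested pattern can be run against the examples from the question. This is a minimal sketch using Python's re module rather than Postgres itself, which is an assumption about engine compatibility for this particular pattern; building_number is a hypothetical helper name.

```python
import re
from typing import Optional

# Digits followed by AT MOST one word character, as suggested above.
NUMBER = re.compile(r"\d+\w?")


def building_number(name: str) -> Optional[str]:
    """Return the first number (with optional one-character suffix), or None."""
    match = NUMBER.search(name)
    return match.group(0) if match else None


if __name__ == "__main__":
    for name in ["11B", "2/1", "6 CAROLINA COURT", "BURNFOOT COTTAGE"]:
        print(repr(name), "->", building_number(name))
```

Note that \w also matches an underscore, so if only letters should count as a suffix, [A-Za-z]? is a stricter alternative. In Postgres the same pattern would go into the substring(building_name from '...') call shown in the comments on the question.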


1 Comment

Thank you all for your help. The suggestion from Ainar-G worked perfectly for my particular problem. I appreciate that it is quite tricky to test pg-flavoured regexes, and I am very grateful for your time.

As mentioned by Nick B, it would be better to specify the RegEx implementation you are using. As a general answer though, you could try something like this:

(^|\s)(\d+[a-zA-Z]?\b)

and take the second group from the result.

(^|\s) matches the line start or a whitespace character. This excludes the number 1 in the 2/1 test case from the output.

Then \d+[a-zA-Z]? matches a sequence of at least one digit followed by at most one letter, and \b requires a word boundary after it.
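A small sketch of the "take the second group" step, assuming Python's re module and with the letter class spelled out as [a-zA-Z]; building_number_anchored is a hypothetical helper name. This only illustrates the general idea. In Postgres, \b stands for a backspace character and the word-boundary escape is \y, and substring() returns the first parenthesized group rather than the second, so the pattern would need adapting there.

```python
import re
from typing import Optional

# Group 1: start of string or a whitespace character.
# Group 2: digits, at most one letter, then a word boundary.
ANCHORED = re.compile(r"(^|\s)(\d+[a-zA-Z]?\b)")


def building_number_anchored(name: str) -> Optional[str]:
    """Return group 2 of the first match, or None if nothing matches."""
    match = ANCHORED.search(name)
    return match.group(2) if match else None


if __name__ == "__main__":
    for name in ["2/1", "PATRICK THOMAS COURT 83B", "BURNFOOT COTTAGE"]:
        print(repr(name), "->", building_number_anchored(name))
```

Because the number must follow the start of the string or whitespace, the 1 after the slash in '2/1' is never a candidate, whereas an unanchored pattern skips it only because the 2 happens to match first.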

Hope this helps!

1 Comment

Thank you very much for your suggestion Daniele, did not quite work for me in pg but much appreciated all the same

You're requiring a word character after the digits when it should be optional, and you're not catering for non-alphanumeric separators.

So, assuming you're using POSIX regexes in Postgres, try something like this:

(\d+\w*)([ /\\\-]|$)

making sure you capture group 1 as your output.

This involved a bit of guesswork, as there aren't many PG-flavoured online testers.

Note that in Postgres regexes \b means a backspace character rather than a word boundary (the word-boundary escape is \y), so your \b won't work here, which is why I avoided it.

1 Comment

Thank you very much for your suggestion Nick B, did not quite work for me in pg but much appreciated all the same
