How can I split the following using regex?

'abcd1234567ef' into 'abcd', '1234567', 'ef'
'abcd1234567.89ef' into 'abcd', '1234567.89', 'ef'

I need to split a string in PostgreSQL that may or may not contain a decimal number. I have tried this, adapted from the Postgres documentation, but it works only for the first case:

SELECT regexp_match('abcd1234567ef', '(?:(.*?)(\d+)(.*)){1,1}');

EDIT: After getting that working with the answer below, I found some columns with data in the form 'abcd12efg34567hij', which needs to be split as either 'abcd', '12', 'efg', '34567', 'hij' or 'abcd', '12efg', '34567', 'hij'. Either will work for me.

1 Answer

You need to escape the dot, otherwise it means "any character".

The regexp to match and capture the numeric part is:

(\d+(?:\.\d+)?)

Notice the optional "dot and digits" that is grouped with a non-capturing pair of parentheses.
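
As a quick check of that pattern, here is a minimal sketch that runs it against both strings from the question. It assumes standard_conforming_strings is on (the default in current PostgreSQL versions), so the backslashes need no doubling. The string 'ab12x34cd' is made up for illustration and does not come from the question.

```sql
-- regexp_match returns a text[] holding the captured groups of the first match.
SELECT regexp_match('abcd1234567ef',    '(\d+(?:\.\d+)?)');  -- {1234567}
SELECT regexp_match('abcd1234567.89ef', '(\d+(?:\.\d+)?)');  -- {1234567.89}

-- Why the dot must be escaped: unescaped, "." matches any character,
-- so the "x" is accepted as if it were a decimal point.
SELECT regexp_match('ab12x34cd', '(\d+(?:.\d+)?)');   -- {12x34}
SELECT regexp_match('ab12x34cd', '(\d+(?:\.\d+)?)');  -- {12}
```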

To match one substring, you can use regexp_match. If there are multiple substrings, you will need regexp_matches(data, regexp, 'g'), which returns one row per match. If you need a single row, combine it with ARRAY_AGG.

The example abcd12efg34567hij will give:

SELECT regexp_matches('abcd12efg34567hij', '\d+(?:\.\d+)?', 'g');

{12}
{34567}

SELECT array_agg(parts[1]) FROM regexp_matches('abcd12efg34567hij', '\d+(?:\.\d+)?', 'g') AS m(parts);

{12,34567}
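
Since the question mentions columns, the same idea can be sketched against a set of rows. The inline VALUES list and the names t, data, parts and ord are placeholders for illustration, not the asker's actual schema. WITH ORDINALITY is used so that the aggregated array keeps the matches in the order they appear in the string.

```sql
-- Collect every number in each row's string into one array per row.
SELECT t.data,
       array_agg(m.parts[1] ORDER BY m.ord) AS numbers
FROM (VALUES ('abcd12efg34567hij'),
             ('abcd1234567.89ef')) AS t(data)
CROSS JOIN LATERAL
     regexp_matches(t.data, '\d+(?:\.\d+)?', 'g')
     WITH ORDINALITY AS m(parts, ord)
GROUP BY t.data;
-- Expected rows (row order is not guaranteed):
--   abcd12efg34567hij | {12,34567}
--   abcd1234567.89ef  | {1234567.89}
```

Note that CROSS JOIN LATERAL drops rows whose string contains no number at all; a LEFT JOIN LATERAL ... ON true would keep them with a NULL array.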

If you also need the intermediate pieces, you need to use more capturing groups:

([^\d]*)(?:(\d+(?:\.\d+)?)([^\d]*))*

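([^\d]*) // group 1: leading non-digits
        (?: // non-capturing group
           (\d+(?:\.\d+)?) // group 2: digits with an optional decimal part
                          ([^\d]*) // group 3: non-digits
                                  )* // end group, repeat 0 or more times

(Spaces, newlines and comments added for clarity; they should not be in the regexp. Note that group numbers are fixed by the parentheses in the pattern, so repeating the non-capturing group does not create groups 4, 5, 6, …: groups 2 and 3 hold the text from only one iteration. For strings with several numbers, regexp_matches with the 'g' flag is the way to get every piece.)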
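
As an alternative to the grouped pattern above (a different technique, not the one the answer spells out), a sketch that gets every piece in order is to match either a number or a run of non-digits with an alternation and let the 'g' flag walk the string. The aliases parts and ord are placeholders.

```sql
-- Each match is either a number (with an optional decimal part) or a run of non-digits.
SELECT array_agg(m.parts[1] ORDER BY m.ord) AS pieces
FROM regexp_matches('abcd12efg34567hij', '\d+(?:\.\d+)?|\D+', 'g')
     WITH ORDINALITY AS m(parts, ord);
-- {abcd,12,efg,34567,hij}

SELECT array_agg(m.parts[1] ORDER BY m.ord) AS pieces
FROM regexp_matches('abcd1234567.89ef', '\d+(?:\.\d+)?|\D+', 'g')
     WITH ORDINALITY AS m(parts, ord);
-- {abcd,1234567.89,ef}
```

One caveat of this sketch: a dot that is not sitting between digits counts as non-digit text, so it ends up in the neighbouring letter piece rather than in a number.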

6 Comments

If you can help once more: after getting that working, some columns have data in the form 'abcd12efg34567hij', which needs to be split as either 'abcd', '12', 'efg', '34567', 'hij' or 'abcd', '12efg', '34567', 'hij'. Either will work for me.
You should ask a separate question, with a list of test cases, so it will be easier to answer
Just added the part to the question.
Answered in edit
SELECT (regexp_matches('abcd12efg34567hij', '(?:(.*?)(\d+(?:\.\d+)?)(.*)){1,1}', 'g')) gives {abcd,12,efg34567hij}, but it needs to be either 'abcd', '12', 'efg', '34567', 'hij' or 'abcd', '12efg', '34567', 'hij'. SELECT (regexp_matches('abcd1234567hij', '(?:(.*?)(\d+(?:\.\d+)?)(.*)){1,1}', 'g')) correctly gives {abcd,1234567,hij}.