How can I extract the data I want in Oracle using REGEXP_SUBSTR?

SPRINTMVNO_PM_CDR_IWIRELESS_20121110_0813.csv get '08' from the last four digits
RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt  get '0043722' in the middle (between the '_' characters)
wireless_201211120015_201211120515            get '0515' (the last four digits)

I have tried many times, but some expressions that work fine in PHP or other languages do not work in Oracle. Maybe the syntax is different.

For example, for the second one I can use /(?<=_)[0-9]*(?=_)/ to get the number in PHP, but this does not work in Oracle.
I tried:

SELECT REGEXP_SUBSTR('RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt','(?<=_)[0-9]*(?=_)') 
  FROM dual;

No output, so the two slashes are not the problem.

An alternate formulation of this question would be: "How do I get the content between two characters, or following a character, without including those characters, using Oracle's regex?"

I know I can get this data easily by using string functions. The problem is that there are tons of different strings to handle, and each of them has different data to retrieve. So I want to store the patterns in the database and use one regexp_substr call to get all the data. Otherwise I need to hard-code those rules.

  • Can you explain what rules you're trying to apply? You present three strings, each with a different output. Presumably you want three different search patterns. Commented Nov 17, 2012 at 8:09
  • Your question appears to be: how can I get non-random strings out of a regular expression when the size, location and meaning of these strings are completely random? Commented Nov 17, 2012 at 10:11
  • @APC Yes, I want three different patterns to extract specific data from these three strings. Each example stands for a group of similar strings I need to search. For example, RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt may have siblings such as RK_IPDR_RKMSG2_0043724_DT_20121113162712.txt and RK_IPDR_RKMSG2_0043725_DT_20121113162711.txt. The part that differs is what I want to get, which is the 7-digit number in this example, but its length may change. Commented Nov 17, 2012 at 20:42
  • @Ben Sorry for the confusion; these are three separate groups of strings. Commented Nov 17, 2012 at 20:44
  • What's your use case? You get a random string and apply all these patterns to see which one matches? You get a bunch of strings and apply just one pattern to see which match? Or some other permutation? What sort of data volumes are you handling in a single search? Also, what version of the database? Commented Nov 18, 2012 at 4:00

3 Answers

Oracle practitioners survived for years without regular expressions because Oracle provides some simple string functions which we can combine for some nifty manipulation.

For instance, to find the first two characters after the last underscore in a string use SUBSTR() and INSTR() like this:

with t as (select 'SPRINTMVNO_PM_CDR_IWIRELESS_20121110_0813.csv' str from dual)
select substr(str, instr(str, '_', -1)+1, 2)
from t
/

Note the INSTR() call has a negative offset to start counting from the back. Getting the last four characters of a string employs the same trick:

with t as (select 'iwireless_201211120015_201211120515' str from dual)
select substr(str, -4)
from t
/

The easiest way to identify a pattern of underscore followed by digits followed by underscore is with a regex but we can use a TRIM() to remove the underscores from the result.

with t as (select 'RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt' str from dual)
select trim('_' from regexp_substr(str, '_([0-9]+)_'))
from t
/
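
If the wanted number always sits in the same underscore-delimited position, the same value can be pulled out with INSTR() and SUBSTR() alone, without touching the regex engine. The following is only a sketch under that assumption (here the target is taken to be the text between the third and fourth underscores); the column alias middle_part is made up for illustration.

```sql
-- Assumption: the target is always the text between the 3rd and 4th underscores.
with t as (select 'RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt' str from dual)
select substr(str,
              instr(str, '_', 1, 3) + 1,                          -- first character after the 3rd underscore
              instr(str, '_', 1, 4) - instr(str, '_', 1, 3) - 1)  -- number of characters up to the 4th underscore
         as middle_part
from t
/
```

For the sample string the third underscore is at position 15 and the fourth at position 23, so this returns the seven characters 0043722. If the number can move to a different position in other file names, this approach does not apply and the regex is the better fit.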

Here's a SQL Fiddle to prove that these techniques work.

Oracle has a vast array of functions, which are described in the documentation. Find out more.


"Please ignore the cases, I just need a solution for this: 'how to get content between or start with a character but not include it, with Oracle's regex?'"

There is a way to exclude characters from the start or end of the result, and that is to break up the search pattern into sub-expressions. This will work for the string you provide, because we can separate the leading and trailing underscores from the required numbers. Unfortunately, the subexpressions parameter is the last parameter in the REGEXP_SUBSTR() signature, and as SQL functions don't accept named parameters this means we have to explicitly pass default values for all the other parameters.

Anyway, this call will return the second subexpression, which is the desired string, 0043722:

with t as (select 'RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt' str from dual)
select regexp_substr(str, '(_)([0-9]+)(_)', 1,1,'i',2)
from t
/
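
On releases where REGEXP_SUBSTR() does not accept the subexpression argument (as far as I know it arrived in 11g), a similar result may be possible with REGEXP_REPLACE() and a backreference. This is a sketch rather than a drop-in replacement: the pattern is anchored to the whole string so that everything except the first group is discarded.

```sql
-- Sketch: keep only the digits found between two underscores.
-- Caveat: when the pattern does not match, REGEXP_REPLACE returns the input unchanged.
with t as (select 'RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt' str from dual)
select regexp_replace(str, '^.*_([0-9]+)_.*$', '\1') as digits_between
from t
/
```

For the sample string the only run of digits enclosed by underscores is 0043722, so that is what the query returns. Because a non-matching input comes back untouched, it may be worth guarding the call with REGEXP_LIKE() when the input is not guaranteed to fit the pattern.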

The use cases do matter. The REGEXP functions perform slower than the simpler equivalents. In 10gR2 REGEXP_SUBSTR() is at least an order of magnitude slower than SUBSTR(). The difference is noticeable when searching large numbers of strings, and crippling when that number becomes millions (disclosure: recent pain).
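
For the stated goal of storing the patterns in the database, one possible shape is a rule table that pairs a pattern identifying the file-name family with the extraction pattern and the number of the subexpression to return. The sketch below is an assumption about how that could look, not a tested design: extract_rules, file_names and all column names are hypothetical, both tables are inlined as subqueries so the example is self-contained, and the subexpression argument requires a release that supports it.

```sql
-- Hypothetical rule table: name_rx identifies the family, extract_rx pulls out the value.
with extract_rules as (
  select '^SPRINTMVNO_' name_rx, '_([0-9]{2})[0-9]{2}\.csv$' extract_rx, 1 subexpr_no from dual
  union all
  select '^RK_IPDR_', '_([0-9]+)_', 1 from dual
  union all
  select '^wireless_', '([0-9]{4})$', 1 from dual
),
file_names as (
  select 'SPRINTMVNO_PM_CDR_IWIRELESS_20121110_0813.csv' str from dual
  union all
  select 'RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt' from dual
  union all
  select 'wireless_201211120015_201211120515' from dual
)
select f.str,
       -- Return only the requested subexpression, so the delimiters are excluded.
       regexp_substr(f.str, r.extract_rx, 1, 1, null, r.subexpr_no) as extracted
from file_names f
join extract_rules r
  on regexp_like(f.str, r.name_rx)
/
```

With the three sample strings this yields 08, 0043722 and 0515 respectively. Given the performance point above, it is worth measuring this against plain SUBSTR()/INSTR() rules before applying it to millions of rows.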

2 Comments

I know I can get those data easily by using string functions, the problem is there are tons of different strings to handle, each of them have different data to retrieve, so I want to store the patterns into database, and use one regexp_substr to get all data. Otherwise I need to hard code those rules, it's not a good solution.
You are so patient, buddy. Thank you for your solution. trim('_' from regexp_substr(str, '_([0-9]+)_')) inspired me: I just need to add this trim function in my code to strip all the '_' characters.

The leading and trailing slashes around your regex have nothing to do with regex.

They are a perl/javascript language artefact.

Try without the slashes

1 Comment

I tried SELECT REGEXP_SUBSTR('RK_IPDR_RKMSG2_0043722_DT_20121113162710.txt','(?<=_)[0-9]*(?=_)') FROM dual, no output.

Oracle uses POSIX ERE (Extended Regular Expressions), with the notable exception that it adds backreferences. But POSIX ERE is very limited: it supports only a small set of constructs, and lookahead and lookbehind are not among them. Try the following regular expressions:

/([0-9]{2}80|[0-9]80[0-9]|80[0-9]{2})$/

That'll get you 80 in the last four digits.

/0515$/

That'll get you 0515 as the last four digits.

Now, I've never used Oracle, so I don't know if you need the delimiters, but those two will work. The middle one is a bit trickier. If you can live with just "yes it's there", you should be able to get away with

/_0043722_/

But if you need to extract it, you should be able to find some trim function that will let you specify what to trim. You can't do it with regular expressions in Oracle.

Oh, and if you need to combine all three of those into one regular expression:

/([0-9]{2}80|[0-9]80[0-9]|80[0-9]{2}|0515)$|_0043722_/

And if you need a Regex reference in the future, try this site.

1 Comment

These strings are just examples, the numbers are dynamically generated and I want to extract them out. Thanks for your answer anyway.
