Substring using Oracle SQL regex

Question

I've created a regex which would capture the string I need. When I am testing regex on websites such as rubular.com then everything works, however when I put the same regex into REGEXP_SUBSTR function then it doesn't work.

Here are 2 SQL examples (one with text in English and another one in Kristaps Porzingis' language):

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">This is the text I would like to substr! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

I am trying to select text between <li class="b"> and next html tag, which in this case is <p class="tutorial">.

Any advice on what am I doing wrong?

Wiktor Stribiżew · Accepted Answer · 2017-06-15 08:26:00Z

1

Parsing HTML with regex is not advisable, you'd better fetch the strings and parse them with a language that has convenient means to parse HTML.

If you only have Oracle DBMS at hand, for a one-off job, you may consider using the following regexp_substr:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>',
      '<li\s+class="b">([^<]+)', 1, 1, NULL, 1) as RESULT from dual

See the REXTESTER demo:

Here,

<li\s+class="b"> - matches <li, 1+ whitespaces, class="b">literal substring
([^<]+) - captures into Group #1 one or more chars other than <

The last 1 argument lets you access the contents of this Group 1.

answered Jun 15, 2017 at 8:26

Wiktor Stribiżew

631k41 gold badges502 silver badges632 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Alexander Gusev Over a year ago

Thank you. Unfortunately, for this job I have only Oracle DBMS. I completely agree with you that parsing HTML with regex must be done in other, more appropriate languages.

LukStorms Over a year ago

@Alex FYI, depending on the version, the Oracle DBMS has build in methods to work with XML. For example EXTRACT which allows you to use an xpath. But yeah, HTML isn't exactly XML.

LukStorms · Accepted Answer · 2017-06-15 13:35:25Z

You can simplify that regex.
Instead of looking for specific characters, look for characters that are not < or >

For example:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">([^<>]+)',1,1,'i',1) as b_class
FROM dual

Where [^<>] matches any character that's not < or >

Or you could lazy match the characters till the first <

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">(.*?)<',1,1,'ni',1) as b_class
FROM dual

The .*? will try to consume characters till the first <
And by adding the match parameter n it'll also match if there's a multiline text after the tag.

'n' allows the period (.), which is the match-any-character character, to match the newline character. If you omit this parameter, the period does not match the newline character.

RayCW · Accepted Answer · 2017-06-22 15:05:00Z

0

I would use Instr to search for the position of the first html tag and do a substring after that position to get the tail of the text. The next step is to search for '<' on this tail and to use substring again.

Something like

select substring(mytext, 1, instr(mytext, '<')) from 
(
 select substring(text, instr(text, '<li class="b">') + 
 length('<li class="b">') +1) as mytext from table
)

edited Jun 22, 2017 at 15:05

answered Jun 22, 2017 at 14:59

RayCW

1842 silver badges10 bronze badges

Collectives™ on Stack Overflow

Substring using Oracle SQL regex

3 Answers 3

2 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related