1

I've created a regex which would capture the string I need. When I am testing regex on websites such as rubular.com then everything works, however when I put the same regex into REGEXP_SUBSTR function then it doesn't work.

Here are 2 SQL examples (one with text in English and another one in Kristaps Porzingis' language):

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">This is the text I would like to substr! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

I am trying to select text between <li class="b"> and next html tag, which in this case is <p class="tutorial">.

Any advice on what am I doing wrong?

3 Answers 3

1

Parsing HTML with regex is not advisable, you'd better fetch the strings and parse them with a language that has convenient means to parse HTML.

If you only have Oracle DBMS at hand, for a one-off job, you may consider using the following regexp_substr:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>',
      '<li\s+class="b">([^<]+)', 1, 1, NULL, 1) as RESULT from dual

See the REXTESTER demo:

enter image description here

Here,

  • <li\s+class="b"> - matches <li, 1+ whitespaces, class="b">literal substring
  • ([^<]+) - captures into Group #1 one or more chars other than <

The last 1 argument lets you access the contents of this Group 1.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you. Unfortunately, for this job I have only Oracle DBMS. I completely agree with you that parsing HTML with regex must be done in other, more appropriate languages.
@Alex FYI, depending on the version, the Oracle DBMS has build in methods to work with XML. For example EXTRACT which allows you to use an xpath. But yeah, HTML isn't exactly XML.
1

You can simplify that regex.
Instead of looking for specific characters, look for characters that are not < or >

For example:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">([^<>]+)',1,1,'i',1) as b_class
FROM dual

Where [^<>] matches any character that's not < or >

Or you could lazy match the characters till the first <

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">(.*?)<',1,1,'ni',1) as b_class
FROM dual

The .*? will try to consume characters till the first <
And by adding the match parameter n it'll also match if there's a multiline text after the tag.

'n' allows the period (.), which is the match-any-character character, to match the newline character. If you omit this parameter, the period does not match the newline character.

Comments

0

I would use Instr to search for the position of the first html tag and do a substring after that position to get the tail of the text. The next step is to search for '<' on this tail and to use substring again.

Something like

select substring(mytext, 1, instr(mytext, '<')) from 
(
 select substring(text, instr(text, '<li class="b">') + 
 length('<li class="b">') +1) as mytext from table
)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.