4

How can I get the text between two constant strings?

Example:

<rate curr="KRW" unit="100">19,94</rate>

19,94

is between

"<rate curr="KRW" unit="100">"

and

"</rate>"

Other example:

ABCDEF

The substring between AB and EF is CD.

2
  • 2
    he com̡e̶s Commented Jan 30, 2012 at 12:45
  • 1
    What language/tool are you using? Commented Jan 30, 2012 at 12:48

6 Answers

5

Try with:

/<rate[^>]*>(.*?)<\/rate>/

However, it is better NOT TO USE REGEX WITH HTML.
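As a rough illustration of how the pattern above could be applied, here is a minimal Java sketch. The class name `RateTagExample` and the helper `rates` are made up for this example, and the input is assumed to be flat (no nested `rate` elements).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RateTagExample {
    // Any <rate ...> opening tag, then a lazy capture up to the first </rate>.
    private static final Pattern RATE = Pattern.compile("<rate[^>]*>(.*?)</rate>");

    // Hypothetical helper: collects the captured text of every match.
    static List<String> rates(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = RATE.matcher(input);
        while (m.find()) {
            out.add(m.group(1)); // group 1 is the text between the tags
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rates("<rate curr=\"KRW\" unit=\"100\">19,94</rate>"));
    }
}
```

Because `[^>]*` accepts any attributes, this form matches every `rate` element regardless of `curr` or `unit`, which may or may not be what you want.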


1 Comment

I'm expecting something like /<rate curr="KRW" unit="100">(.*?)<\/rate>/ rather than the general form, even though the general form may work.
2

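The way I do it is with Regex.Matches, which returns every match:

MatchCollection matched = Regex.Matches(result, @"(?<=<rate curr=""KRW"" unit=""100"">)(.*?)(?=</rate>)");

Then read the results one by one with matched[i].Groups[1].Value. (In a C# verbatim string, a double quote is escaped by doubling it, not with a backslash.)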


1

If you're analyzing HTML, you're probably better off using JavaScript and the .innerHTML property. Regex is a bit overkill.

1 Comment

+1 and I would express the same sentiment with PHP and strip_tags.
0

If you want a generic solution, i.e. to find a string between two strings, you can use Pattern.quote() (or wrap each string in \Q and \E) to quote the start and end strings, and use (.*?) for a non-greedy match.

See an example of its use in the snippet below:

@Test
public void quoteText(){
    String str1 = "<rate curr=\"KRW\" unit=\"100\">";
    String str2 = "</rate>";

    String input = "<rate curr=\"KRW\" unit=\"100\">19,94</rate>"
                      +"<rate curr=\"KRW\" unit=\"100\"></rate>"
                      +"<rate curr=\"KRW\" unit=\"100\">19,96</rate>";

    String regex = Pattern.quote(str1)+"(.*?)"+Pattern.quote(str2);
    System.out.println("regex:"+regex);

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    while(m.find()){
        String group = m.group(1);
        System.out.println("--"+group);
    }
}

Output

regex:\Q<rate curr="KRW" unit="100">\E(.*?)\Q</rate>\E
--19,94
--
--19,96

Note: Although it's not recommended to use regex to parse entire HTML documents, I think there is no harm in the deliberate use of regex when treating HTML as plain text.
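Since both delimiters are constant, treating the input as plain text also allows a version with no regex at all. The following is only a sketch of that idea: the class `Between` and its method are hypothetical names, and it returns only the first occurrence.

```java
public class Between {
    // Hypothetical helper: text between the first `start` and the next `end`,
    // or null if either delimiter is missing.
    static String between(String input, String start, String end) {
        int from = input.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();            // skip past the start delimiter
        int to = input.indexOf(end, from); // search for end only after start
        if (to < 0) {
            return null;
        }
        return input.substring(from, to);
    }

    public static void main(String[] args) {
        System.out.println(between("ABCDEF", "AB", "EF"));
        System.out.println(between(
                "<rate curr=\"KRW\" unit=\"100\">19,94</rate>",
                "<rate curr=\"KRW\" unit=\"100\">",
                "</rate>"));
    }
}
```

This avoids any quoting concerns, at the cost of having to loop manually if every occurrence is needed.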


0

The simple regex you're looking for is:

(?<=<rate curr="KRW" unit="100">)(.*?)(?=</rate>)

In Ruby, for example, this would translate to:

string = '<rate curr="KRW" unit="100">19,94</rate>'

string.match("(?<=<rate curr=\"KRW\" unit=\"100\">)(.*?)(?=</rate>)").to_s
# => "19,94"

Thanks to Will Yu.
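For comparison, the same lookaround idea might look like this in Java. This is a sketch, with `LookaroundExample` and `matches` as invented names. Because the delimiters sit in a lookbehind and a lookahead, they are not part of the match, so the whole match (`m.group()`) is already the text in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundExample {
    // The lookbehind and lookahead assert the delimiters without consuming them.
    private static final Pattern BETWEEN = Pattern.compile(
            "(?<=<rate curr=\"KRW\" unit=\"100\">).*?(?=</rate>)");

    // Hypothetical helper: returns every piece of text found between the delimiters.
    static List<String> matches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = BETWEEN.matcher(input);
        while (m.find()) {
            out.add(m.group()); // whole match; no capture group needed
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches("<rate curr=\"KRW\" unit=\"100\">19,94</rate>"));
    }
}
```

Note that Java requires the lookbehind to have a bounded length, which holds here because the start delimiter is a fixed string.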


-1

I suggest that you use an HTML parser. The grammar that defines HTML is a context-free grammar, which is fundamentally too complex to be parsed by regular expressions. Even if you manage to write a regular expression that achieves what you want, it will probably fail on some corner cases.

For instance, what if you are expected to parse the following HTML?

<rate curr="KRW" unit="100"><rate curr="KRW" unit="100">19,94</rate></rate>

A regular expression may not handle this corner case properly.
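To make the parser suggestion concrete, here is a minimal sketch using the XML parser that ships with the JDK. It assumes the fragment is well-formed XML, as in the question's example; real-world HTML would need a dedicated HTML parser instead. The class `RateParserExample` and the method `firstRate` are illustrative names only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RateParserExample {
    // Hypothetical helper: text content of the first <rate> element, or null if none.
    // Assumes `xml` is a well-formed XML fragment with a single root element.
    static String firstRate(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList rates = doc.getElementsByTagName("rate");
            return rates.getLength() > 0 ? rates.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String nested = "<rate curr=\"KRW\" unit=\"100\">"
                + "<rate curr=\"KRW\" unit=\"100\">19,94</rate></rate>";
        System.out.println(firstRate(nested));
    }
}
```

With the nested input, the parser reports the text as 19,94, whereas a lazy regex such as `<rate[^>]*>(.*?)</rate>` would capture the inner opening tag along with the number.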

1 Comment

It will never look like that in this case, and if it did, I wouldn't care. I just want to know how to use a regexp to find the text between two texts.
