Java Regex for String extraction

Question

I want to extract "Little League World Series" from the input below:

<li><span class="Spicy new"><a href="http://www.google.com/trends/hottrends#a=20120825-Little%2BLeague%2BWorld%2BSeries">Little League World Series</a></span></li>

I can either replace the strings before and after it with "", or I can extract the string. I am not able to get the right regex to do this. I am using line.replace(" <li><span class=\"[\\w]+\"", ""); to replace the part before "Little League World Series", but it does not match correctly.

Would appreciate any help.

Because I just want the terms (one of the values), it's easier to use regex string parsing than to include an extra library.
– user441170, Commented Aug 30, 2012 at 18:40

MJB · Accepted Answer · 2012-08-30 18:49:47Z

1

If this is not a well-formed, trusted HTML source, use an HTML parser such as jsoup. Regex cannot protect you against many malformed-HTML issues.


zb226 · Accepted Answer · 2012-08-31 02:09:20Z

1

You can use this to remove the stuff in front of the line:

line = line.replaceFirst("<li><span class=\"[^\"]+\"><a href=\"[^\"]+\">", "");

Try it on regexr

Edit: String.replace does not accept regexes, String.replaceFirst does.
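To make the difference concrete, here is a minimal sketch (the class and method names are made up for illustration). String.replace treats its first argument as literal text, so the pattern never matches and the line comes back unchanged, while String.replaceFirst compiles the same argument as a regex. Both return a new string, so the result has to be assigned or returned.

```java
class ReplaceDemo {
    // Regex from the answer above: the opening <li>, <span> and <a> tags.
    static final String PREFIX =
            "<li><span class=\"[^\"]+\"><a href=\"[^\"]+\">";

    // String.replace searches for PREFIX as literal text, so nothing is removed.
    static String literalReplace(String line) {
        return line.replace(PREFIX, "");
    }

    // String.replaceFirst compiles PREFIX as a regex and strips the first match.
    static String regexReplace(String line) {
        return line.replaceFirst(PREFIX, "");
    }

    public static void main(String[] args) {
        String line = "<li><span class=\"Spicy new\"><a href=\"http://www.google.com/trends/hottrends#a=20120825-Little%2BLeague%2BWorld%2BSeries\">Little League World Series</a></span></li>";
        System.out.println(literalReplace(line)); // prints the input unchanged
        System.out.println(regexReplace(line));   // Little League World Series</a></span></li>
    }
}
```

The closing tags are still there after this step; they contain no regex metacharacters, so a plain replace("</a></span></li>", "") is enough to drop them.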

edited Aug 31, 2012 at 2:09

answered Aug 30, 2012 at 18:45


zb226 Over a year ago

Damn, String.replace doesn't accept regexes, you need to use String.replaceFirst. Well, that's what I get for only trying it on regexr, I suppose :)

Musfiqur rahman · Accepted Answer · 2012-08-30 19:13:39Z

0

Use

<li><span class="[^"]+"><a href="[^"]+">[^>]+</a></span></li>

to get the whole line. Then replace

<li><span class="[^"]+"><a href="[^"]+">

with "" and replace

</a></span></li>

with ""

Try the link below; it also shows the Java string required: http://www.regexplanet.com/advanced/java/index.html

For how to use the Java method, see: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceFirst(java.lang.String)
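Put together, the steps above might look like the following sketch. The class name, the method name and the choice to return an Optional are assumptions for illustration, not part of the answer; the three patterns are the ones given above.

```java
import java.util.Optional;

class WholeLineExtract {
    // Opening tags, as in the answer.
    static final String PREFIX = "<li><span class=\"[^\"]+\"><a href=\"[^\"]+\">";
    // Closing tags; no regex metacharacters in here.
    static final String SUFFIX = "</a></span></li>";

    // Returns the link text if the whole line has the expected shape.
    static Optional<String> extract(String line) {
        String trimmed = line.trim();
        // String.matches only succeeds when the entire string matches.
        if (!trimmed.matches(PREFIX + "[^>]+" + SUFFIX)) {
            return Optional.empty();
        }
        String withoutPrefix = trimmed.replaceFirst(PREFIX, "");
        return Optional.of(withoutPrefix.replaceFirst(SUFFIX, ""));
    }
}
```

Calling WholeLineExtract.extract(line) on the sample input should give an Optional holding "Little League World Series"; a line with a different shape gives an empty Optional rather than a half-stripped string.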

edited Aug 30, 2012 at 19:13

answered Aug 30, 2012 at 18:48


user441170 Over a year ago

I can't use a full-string match; I want to match multiple strings in this format returned from google.com/trends/hottrends/atom/hourly?country=usa

user441170 Over a year ago

Actually, I was able to use line = line.replace(line.substring(line.indexOf("</a>")), ""); line = line.replace(line.substring(0, line.lastIndexOf(">") + 1), "");

user441170 Over a year ago

It's a hack, not pretty, but it serves the purpose for me.
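For what it's worth, the same index-based idea can be written without the intermediate replace calls. This is only a sketch under the assumption that each line contains exactly one link; the class and method names are hypothetical.

```java
import java.util.Optional;

class SubstringExtract {
    // Returns the text between the last '>' before "</a>" and "</a>" itself.
    static Optional<String> linkText(String line) {
        int end = line.indexOf("</a>");
        if (end < 0) {
            return Optional.empty(); // no closing anchor tag on this line
        }
        // Search backwards from the closing tag for the '>' that ends <a ...>.
        int start = line.lastIndexOf('>', end) + 1;
        return Optional.of(line.substring(start, end));
    }
}
```

It is still a hack in the sense that it knows nothing about HTML, but it avoids building and re-searching temporary substrings.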

Maciej · Accepted Answer · 2012-08-30 19:03:43Z

0

This one seems to pass:

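    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.junit.Test; // JUnit 4 assumed

    public class PatternTest {
        @Test
        public void patternTest() {
            final String text = "<li><span class=\"Spicy new\"><a href=\"http://www.google.com/trends/hottrends#a=20120825-Little%2BLeague%2BWorld%2BSeries\">Little League World Series</a></span></li>";
            final Pattern pattern = Pattern.compile("^.*>([^<>]+)<.*$");
            final Matcher matcher = pattern.matcher(text);
            assertTrue(matcher.matches());
            assertEquals("Little League World Series", matcher.group(1));
        }
    }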

It extracts the last non-empty run of text that sits between a ">" and a "<".
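Because matches() with a greedy ^.* only ever yields that last run, it handles one entry per call. If the input holds several entries at once, as the asker's comment about the hourly feed suggests, a find() loop may be a better fit. The sketch below uses a different, narrower pattern than the one above (text directly inside an <a ...> tag), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTexts {
    // Text between an opening <a ...> tag and its closing </a>.
    private static final Pattern LINK =
            Pattern.compile("<a\\s[^>]*>([^<]+)</a>");

    // Collects the text of every link found in the input, in order.
    static List<String> extractAll(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            terms.add(m.group(1));
        }
        return terms;
    }
}
```

This assumes the link text contains no nested tags; anything messier than that is where an HTML parser starts to pay off.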
