2

I want to extract "Little League World Series" from the input below:

<li><span class="Spicy new"><a href="http://www.google.com/trends/hottrends#a=20120825-Little%2BLeague%2BWorld%2BSeries">Little League World Series</a></span></li>

I can either replace the text before and after it with "", or extract the string directly, but I can't get the regex right. I am using line.replace(" <li><span class=\"[\\w]+\"", ""); to remove the part before "Little League World Series", but it does not match.

Would appreciate any help.

  • 3
    Any reason you're using RegEx and not a DOM or XML parser? Commented Aug 30, 2012 at 18:35
  • 1
    Because I just want the terms (one of the values), it's easier to use regex string parsing than to include an extra library. Commented Aug 30, 2012 at 18:40

4 Answers

1

If this is not well-formed HTML from a trusted source, use an HTML parser such as jsoup. Regex cannot protect you against the many ways HTML can be malformed.


1

You can use this to remove the stuff in front of the line:

line.replaceFirst("<li><span class=\"[^\"]+\"><a href=\"[^\"]+\">", "");

Try it on regexr

Edit: String.replace treats its first argument as a literal string, not a regex; String.replaceFirst (and String.replaceAll) take a regex.
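To make the difference concrete, here is a minimal sketch using the sample line from the question. The class and method names (PrefixStripper, stripPrefix) are made up for illustration; only String.replace and String.replaceFirst come from the JDK.

```java
class PrefixStripper {
    // Regex for the markup in front of the link text, taken from the answer above.
    static final String PREFIX_REGEX = "<li><span class=\"[^\"]+\"><a href=\"[^\"]+\">";

    // replaceFirst interprets its first argument as a regex.
    static String stripPrefix(String line) {
        return line.replaceFirst(PREFIX_REGEX, "");
    }

    public static void main(String[] args) {
        String line = "<li><span class=\"Spicy new\"><a href=\"http://www.google.com/trends/hottrends#a=20120825-Little%2BLeague%2BWorld%2BSeries\">Little League World Series</a></span></li>";

        // replace looks for the pattern text literally, finds nothing, and returns the line unchanged.
        System.out.println(line.replace(PREFIX_REGEX, "").equals(line));

        // replaceFirst removes the leading tags and leaves the text plus the closing tags.
        System.out.println(stripPrefix(line));
    }
}
```

The closing </a></span></li> is still there after this step, so a second replacement (or a capturing group) is needed to get the bare text.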

1 Comment

Damn, String.replace doesn't accept regexes, you need to use String.replaceFirst. Well, that's what I get for only trying it on regexr, I suppose :)
0

Use

<li><span class="[^"]+"><a href="[^"]+">[^>]+</a></span></li> 

to get the whole line. Then replace

<li><span class="[^"]+"><a href="[^"]+"> 

with "" and replace

</a></span></li> 

with ""

Try the link below; it also shows the Java string required: http://www.regexplanet.com/advanced/java/index.html

For how to use the Java method, see: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceFirst(java.lang.String)
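As an alternative to the two replacements, the same full-line pattern can carry a capturing group around the link text, so one match yields the value directly. This is only a sketch: the class name ItemText and the method extract are hypothetical, and the text part is written as [^<]+ rather than [^>]+ so that it stops at the closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItemText {
    // Same shape as the full-line pattern above, with the link text in group 1.
    private static final Pattern ITEM = Pattern.compile(
        "<li><span class=\"[^\"]+\"><a href=\"[^\"]+\">([^<]+)</a></span></li>");

    // Returns the link text, or null if the line does not have the expected shape.
    static String extract(String line) {
        Matcher m = ITEM.matcher(line.trim());
        return m.matches() ? m.group(1) : null;
    }
}
```

Calling ItemText.extract on the line from the question should give "Little League World Series"; any line that deviates from the expected markup gives null instead of a partial result.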

3 Comments

I can't use a full-string match; I want to match multiple strings in this format, returned from google.com/trends/hottrends/atom/hourly?country=usa
Actually, I was able to use line = line.replace(line.substring(line.indexOf("</a>")), ""); line = line.replace(line.substring(0, line.lastIndexOf(">") + 1), "");
It's a hack, not pretty, but it serves the purpose for me.
0

This one seems to pass:

    @Test
    public void patternTest() {
        final String text = "<li><span class=\"Spicy new\"><a href=\"http://www.google.com/trends/hottrends#a=20120825-Little%2BLeague%2BWorld%2BSeries\">Little League World Series</a></span></li>";
        final Pattern pattern = Pattern.compile("^.*>([^<>]+)<.*$");
        final Matcher matcher = pattern.matcher(text);
        assertTrue(matcher.matches());
        assertEquals("Little League World Series", matcher.group(1));
    }

It extracts the last non-empty run of text between a ">" and a "<"; the leading .* is greedy, so the group lands on the final such run in the line.
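Since the asker mentions wanting every term from a feed with many such items, a loop over Matcher.find() may fit better than matches(), which has to consume the whole input (and "." does not cross line breaks by default). A rough sketch follows, assuming each term is the text of an <a ...> element and contains no nested tags; TrendTerms and extractAll are names invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrendTerms {
    // Text between an opening <a ...> tag and its closing </a>, captured in group 1.
    private static final Pattern LINK_TEXT = Pattern.compile("<a\\s[^>]*>([^<]+)</a>");

    // Collects the text of every link found in the input, in document order.
    static List<String> extractAll(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            terms.add(m.group(1));
        }
        return terms;
    }
}
```

This still carries the usual caveat from the other answers: it relies on the markup staying in this simple shape, and a parser is the safer choice if it does not.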
