Java Regex Not Matching When it Should

Question

I'm looking to pull out a specific HTML a tag from some HTML that contains a specific date.

Here is the unit test in question, with the HTML under test passed inline:

public void testParseBasePage(){
    defenseGovContractsParser a = new defenseGovContractsParser("060613");
    String expected = "http://www.defense.gov/contracts/contract.aspx?contractid=5059";
    String result = a.parseBasePage("<td><a id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lnkPressItem\" title=\"Click for Contracts for June 06, 2013\" class=\"Link12\" href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">Contracts for June 06, 2013</a><span id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lblSubTitle\" class=\"MoreNews3a\"></span></td>");
    assertEquals(expected,result);
}

Here's the code in question:

public String parseBasePage(String HTML) {
    String contractUrl;
    String yr = date.substring(4, 6);
    String day = date.substring(2, 4);
    String month = getMonthForInt(Integer.parseInt(date.substring(0, 2)));
    Pattern getLink = Pattern.compile("<a.*?" + month + ".*?" + day + ".*?20" + yr + ".*?>");
    Matcher match = getLink.matcher(HTML);
    String link = match.group();
    contractUrl = link.substring(link.indexOf("href") + 6);
    contractUrl = contractUrl.replaceFirst("\">", "");
    return contractUrl;
}

private String getMonthForInt(int m) {
    String month = "invalid";
    m = m - 1;
    DateFormatSymbols dfs = new DateFormatSymbols();
    String[] months = dfs.getMonths();
    if (m >= 0 && m <= 11) {
        month = months[m];
    }
    return month;
}

The resulting regex is:

<a.*?June.*?06.*?2013.*?>

which matches as expected in every online regex tester I have tried.

Getting to read that monologue was worth this question without an introduction to jsoup. I'll use jsoup and not consume all living tissue in the world.
– NolanPower, commented Jun 7, 2013 at 16:33

Brian Agnew · Accepted Answer · 2013-06-07 16:03:18Z

4

I would really recommend a decent HTML parser such as JSoup or JTidy (perhaps confusingly named in this scenario), rather than using regexps for this purpose.

For all but the simplest cases, regexps will not work on HTML, and a decent HTML parser is going to be a much more reliable solution.

NolanPower Over a year ago

Just for anybody who sees this. The actual mistake in this code that caused it not to work is that I never invoked match.find() before calling match.group().
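
For reference, a minimal sketch of the fix described in the comment above. The class name FindBeforeGroup and the helper extractHref are hypothetical and not from the original post, the pattern is hard-coded to the one the question reports for the date 060613, and the sample HTML is a shortened version of the test input. The only real change is that find() is called, and its result checked, before group(); calling group() on a Matcher that has not yet attempted a match throws IllegalStateException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindBeforeGroup {

    // Hypothetical helper (not from the original post): returns the href of the
    // first matching <a> tag, or null when nothing matches.
    static String extractHref(String html) {
        // Same pattern the question reports building for the date 060613.
        Pattern getLink = Pattern.compile("<a.*?June.*?06.*?2013.*?>");
        Matcher match = getLink.matcher(html);

        // find() performs the search; group() is only valid after it returns true.
        if (!match.find()) {
            return null;
        }
        String link = match.group();

        // Same string slicing as the question: skip past href=" and drop the trailing ">.
        String contractUrl = link.substring(link.indexOf("href") + 6);
        return contractUrl.replaceFirst("\">", "");
    }

    public static void main(String[] args) {
        // Shortened sample input; the id value "x" is a placeholder.
        String html = "<td><a id=\"x\" title=\"Click for Contracts for June 06, 2013\" "
                + "class=\"Link12\" href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">"
                + "Contracts for June 06, 2013</a></td>";
        // prints http://www.defense.gov/contracts/contract.aspx?contractid=5059
        System.out.println(extractHref(html));
    }
}
```

Returning null on no match is just one choice for the sketch; throwing an exception or returning an Optional would work equally well. This only addresses the find()/group() mistake, and the accepted answer's advice to use an HTML parser still applies.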
