I'm trying to extract a specific HTML a tag, the one that contains a given date, from a larger piece of HTML.

The HTML supplied to the method is the string passed to parseBasePage in this unit test:

public void testParseBasePage(){
    defenseGovContractsParser a = new defenseGovContractsParser("060613");
    String expected = "http://www.defense.gov/contracts/contract.aspx?contractid=5059";
    String result = a.parseBasePage("<td><a id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lnkPressItem\" title=\"Click for Contracts for June 06, 2013\" class=\"Link12\" href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">Contracts for June 06, 2013</a><span id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lblSubTitle\" class=\"MoreNews3a\"></span></td>");
    assertEquals(expected,result);
}

Here's the code in question:

public String parseBasePage(String HTML) {
    String contractUrl;
    String yr = date.substring(4, 6);
    String day = date.substring(2, 4);
    String month = getMonthForInt(Integer.parseInt(date.substring(0, 2)));
    Pattern getLink = Pattern.compile("<a.*?" + month + ".*?" + day + ".*?20" + yr + ".*?>");
    Matcher match = getLink.matcher(HTML);
    String link = match.group();
    contractUrl = link.substring(link.indexOf("href") + 6);
    contractUrl = contractUrl.replaceFirst("\">", "");
    return contractUrl;
}

private String getMonthForInt(int m) {
    String month = "invalid";
    m = m - 1;
    DateFormatSymbols dfs = new DateFormatSymbols();
    String[] months = dfs.getMonths();
    if (m >= 0 && m <= 11) {
        month = months[m];
    }
    return month;
}

The resulting regex is:

<a.*?June.*?06.*?2013.*?>

which matches as expected in every online regex tester I have tried.

  • Have you seen this and/or this? Commented Jun 7, 2013 at 16:03
  • Getting to read that monologue was worth this question without an introduction to jsoup. I'll use jsoup and not consume all living tissue in the world. Commented Jun 7, 2013 at 16:33

1 Answer

I would really recommend a decent HTML parser such as JSoup or JTidy (perhaps confusingly named in this scenario), rather than using regexps for this purpose.

For all but the simplest cases regexps will not work on HTML, and a decent HTML parser is going to be a much more reliable solution.
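
As a rough illustration of how such a pattern can go wrong, here is a small sketch using the regex from the question. The class name RegexOverreach and the example.com URLs are made up for this example. The pattern starts matching at the first <a it finds, so an unrelated link earlier in the markup is swallowed into the match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexOverreach {
    // The regex produced by the question's code for the date 060613.
    static final Pattern LINK = Pattern.compile("<a.*?June.*?06.*?2013.*?>");

    // Returns the first match in html, or null if nothing matches.
    static String match(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical fragment: an unrelated link precedes the wanted one.
        String html = "<a href=\"http://example.com/other\">Other</a>"
                + "<a title=\"Contracts for June 06, 2013\" href=\"http://example.com/wanted\">Contracts</a>";
        String matched = match(html);
        // The match begins at the unrelated link, not at the dated one.
        System.out.println(matched);
        System.out.println(matched.startsWith("<a href=\"http://example.com/other\">"));
    }
}
```

Taking the first href out of that match would return the wrong URL, which is the kind of failure an HTML parser avoids by working on elements rather than raw characters.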

1 Comment

For anybody who finds this later: the actual mistake that kept this code from working is that I never invoked match.find() before calling match.group().
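
A minimal sketch of that fix, using only java.util.regex. The class and method names (FindBeforeGroup, firstMatch) and the shortened HTML string are illustrative and are not the original code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindBeforeGroup {
    // Returns the first substring of html matching regex, or null if none.
    static String firstMatch(String regex, String html) {
        Matcher match = Pattern.compile(regex).matcher(html);
        // A Matcher holds no result until a match operation such as find()
        // succeeds; calling group() before that throws IllegalStateException.
        if (!match.find()) {
            return null;
        }
        return match.group();
    }

    public static void main(String[] args) {
        // Shortened version of the anchor tag from the unit test.
        String html = "<td><a title=\"Click for Contracts for June 06, 2013\" "
                + "href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">"
                + "Contracts for June 06, 2013</a></td>";
        System.out.println(firstMatch("<a.*?June.*?06.*?2013.*?>", html));
    }
}
```

Checking the return value of find() also gives the caller a clean way to handle a page that contains no link for the requested date.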
