Let me get straight to my problem.

public static final String EXAMPLE_TEST = "<span id=\"lblObject\"><a href=\"http://www.guideline.gov/content.aspx?id=15135\" alt=\"View object\">Manual medicine guidelines for musculoskeletal injuries.</a></span>";

    //public static final String EXAMPLE_TEST ="<a href=\"http://www.guideline.gov/content.aspx?id=1112\"></a>";
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("<a href=\"http://www.guideline.gov/content.aspx?id=(\\d+)\"");
        // in case you would like to ignore case sensitivity,
        // you could use this statement:
        // Pattern pattern = Pattern.compile("\\s+", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(EXAMPLE_TEST);
        // check all occurrences
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }


    }

Something is wrong with my regex. The example string above is just a dummy. In practice I will have an HTML file containing many links of the form http://www.guideline.gov/content.aspx?id=some_number, and I need to extract those links from the file. Can you help me find what's wrong with my regex?

4 Answers

2

The problem is that the question mark ? is a regex quantifier meaning "zero or one", but you are using it as a literal character. You must escape the question mark:

Pattern pattern = Pattern.compile("<a href=\"http://www.guideline.gov/content.aspx\\?id=(\\d+)\"");

The key difference here is:

...content.aspx\\?id...

Notice the double backslash before the question mark. That is how a Java string literal encodes a single backslash, so the pattern the regex engine actually sees is ...content.aspx\?id...

Your regex doesn't match a literal question mark. Instead it matches zero or one x, followed by id.

You should probably escape your dots too, but it's probably close enough as is.
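
To see the difference concretely, here is a minimal sketch that runs both the original pattern and an escaped one against a shortened version of the sample link. The class and method names (EscapeDemo, firstId) are made up for illustration, and the dots are escaped as well, as suggested above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns the captured id from the first match, or null if the pattern never matches.
    static String firstId(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.guideline.gov/content.aspx?id=15135\" alt=\"View object\">";

        // Unescaped '?': makes the preceding 'x' optional, so the literal '?' in the input is never matched.
        String unescaped = "<a href=\"http://www.guideline.gov/content.aspx?id=(\\d+)\"";
        // Escaped '?' and '.': both are matched literally.
        String escaped = "<a href=\"http://www\\.guideline\\.gov/content\\.aspx\\?id=(\\d+)\"";

        System.out.println(firstId(unescaped, html)); // null
        System.out.println(firstId(escaped, html));   // 15135
    }
}
```

The unescaped pattern would happily match a URL ending in content.aspid=15135 or content.aspxid=15135, which is why it finds nothing in the real input.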


2 Comments

. is also used by OP as a literal character here. Hopefully, it matches in the EXAMPLE_TEST a ... dot. :)
@Alex see the last sentence of my answer :)
2

You can quote your regex like this:

Pattern pattern = Pattern.compile("<a href=\"\\Qhttp://www.guideline.gov/content.aspx?id=\\E(\\d+)\"");

\Q tells the regex engine to quote the next part of the regex (i.e. treat every metacharacter in it as a literal character).
\E tells the regex engine that the quoted part has ended.
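
If the literal part of the URL lives in a variable rather than being typed into the pattern, Pattern.quote from the standard library produces the same kind of quoted literal for you. A minimal sketch, where the class name QuoteDemo and the constant PREFIX are illustrative choices and the link text is shortened:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Literal URL prefix; contains '.' and '?', which would otherwise be metacharacters.
    static final String PREFIX = "http://www.guideline.gov/content.aspx?id=";

    // Builds the pattern: quoted literal prefix followed by a captured run of digits.
    static Pattern linkPattern() {
        return Pattern.compile("<a href=\"" + Pattern.quote(PREFIX) + "(\\d+)\"");
    }

    public static void main(String[] args) {
        String html = "<span id=\"lblObject\"><a href=\"http://www.guideline.gov/content.aspx?id=15135\""
                + " alt=\"View object\">text</a></span>";
        Matcher m = linkPattern().matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // the numeric id, here 15135
        }
    }
}
```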

1 Comment

+1 for the often neglected \Q and \E. I find myself far too often escaping everything in sight when \Q ... \E would've been clearer.
1

Use the program below.

String htmlText = "<span id=\"lblObject\"><a href=\"http://www.guideline.gov/content.aspx?id=15135\" alt=\"View object\">Manual medicine guidelines for musculoskeletal injuries.</a></span>";
// The question mark is escaped; the parentheses capture just the URL.
Pattern pattern = Pattern.compile("href=\"(http://www.guideline.gov/content.aspx\\?id=.*?)\"");

Matcher matcher = pattern.matcher(htmlText);
while (matcher.find()) {
    // Group 1 already holds the URL, so no second regex is needed.
    String url = matcher.group(1);
    System.out.println(url);
}

// output : http://www.guideline.gov/content.aspx?id=15135


0

Your attempt was almost correct. The only error is that you did not escape the ? in .aspx?id=. If you only want the URLs, your pattern also contains a bit too much (the leading <a href=\" and the trailing \"). The correct pattern to match only the URLs would be

"http://www.guideline.gov/content.aspx\\?id=\\d+"

So using the following code snippet you should be able to extract all URLs

Pattern pattern = 
              Pattern.compile("http://www.guideline.gov/content.aspx\\?id=\\d+");

Matcher matcher = pattern.matcher(htmlText);
while (matcher.find()) {
    // do something
}
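
As a fuller sketch of the loop above, the matches can be collected into a list. The class name LinkExtractor, the method extract, and the example.com link in the sample input are assumptions made for illustration; the dots are escaped here as well, which the pattern above leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Compiled once and reused; '.' and '?' are escaped so they match literally.
    private static final Pattern LINK =
            Pattern.compile("http://www\\.guideline\\.gov/content\\.aspx\\?id=\\d+");

    // Collects every matching URL in the order it appears in the text.
    static List<String> extract(String htmlText) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = LINK.matcher(htmlText);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String htmlText = "<a href=\"http://www.guideline.gov/content.aspx?id=15135\">A</a>"
                + "<a href=\"http://example.com/other\">B</a>"
                + "<a href=\"http://www.guideline.gov/content.aspx?id=1112\">C</a>";
        // Prints the two guideline.gov URLs and skips the unrelated link.
        System.out.println(extract(htmlText));
    }
}
```

For a real HTML file, the text passed to extract could be loaded first, for example with java.nio.file.Files.readString on Java 11 or later.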
