I'm trying to use Regex in Java for the first time. I want to get some parts of a string. The string is a little complex:

<description>
  &lt;a href='http://testlink.html' alt='some text'&gt;&lt;img border='0'
  src='http://s2.glbimg.com/zzag70iNYX-QK24sUp0YXQmmXhx7yb8j2Sq2YK7tvX3A6vCwEUOFnFTBONQFT-
  ni/s.glbimg.com/es/ge/f/original/2012/04/25/image.jpg' 
  alt='some' title='text' /&gt;&lt;/a&gt;&lt;br /&gt;some text; some text
</description>

I need to get the strings inside href and alt. This is the code I have so far:

for(Element element : elements)
{
    //Elements children = element.children();
    Pattern pattern = Pattern.compile("a\\bhref=*(.html|.htm)>");
    String[] data = pattern.split(element.text()); ...
}

And so on. At the moment I'm only trying to get href, without success. The result is always the whole string. Is the pattern incorrect? I added the html extension to make the match more specific, but nothing changes.

If you're going to be parsing HTML, why not use an existing HTML parser? Commented Aug 8, 2012 at 20:36

3 Answers

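import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefAltExtractor {
  public static void main(String[] args) {
    // A Java string literal cannot span lines, so the sample is concatenated.
    String sourcestring = "<description>&lt;a href='http://testlink.html' alt='some text'&gt;&lt;img border='0' "
        + "src='http://s2.glbimg.com/zzag70iNYX-QK24sUp0YXQmmXhx7yb8j2Sq2YK7tvX3A6vCwEUOFnFTBONQFT-"
        + "ni/s.glbimg.com/es/ge/f/original/2012/04/25/image.jpg' "
        + "alt='some' title='text' /&gt;&lt;/a&gt;&lt;br /&gt;some text; some text</description>";
    // The lookbehind starts each match right after href= or alt=, for either quote style.
    Pattern re = Pattern.compile("(?<=href='|alt=')[^']*|(?<=href=\"|alt=\")[^\"]*");
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // The pattern has no capturing groups, so only group 0 (the whole match) is printed.
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}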

5 Comments

that is not quite what I was working towards but yeah I think that looks pretty good.
What were you looking for, more exactly?
Me? I was just commenting - not my question ;-) I was only solving the problem of the href grab and not the alt grab.
Sharing a link to this awesome tool by Doug Drudik, called My Regex Tester: myregextester.com
Algomorph, I'm trying to avoid any kind of loop for performance reasons. But thanks for your reply.

Do not use regular expressions for this task, unless you absolutely know that the text format will not change. You seem to want to parse (X|HT)ML using regexps, and that is a bad thing. I'd suggest parsing as XML and using XPath.
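
As a rough sketch of that suggestion, the snippet below uses only the JDK's built-in DOM parser and XPath API. It assumes the escaped markup inside description is itself well-formed XML (true for the sample in the question, but not guaranteed for arbitrary HTML). The class name XPathAttributeSketch, the method linkAttributes, the wrapping root element, and the shortened image URL are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathAttributeSketch {

    // Parses a string as XML; fails if the markup is not well-formed.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns {href, alt} of the first <a> element found inside the description.
    public static String[] linkAttributes(String descriptionXml) {
        try {
            // First pass: the XML parser turns &lt; and &gt; back into < and >.
            String inner = parse(descriptionXml).getDocumentElement().getTextContent();
            // Second pass: wrap the fragment in a hypothetical <root> so it has one root element.
            Document fragment = parse("<root>" + inner + "</root>");
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the first matching node.
            String href = xpath.evaluate("//a/@href", fragment);
            String alt = xpath.evaluate("//a/@alt", fragment);
            return new String[] { href, alt };
        } catch (Exception e) {
            throw new IllegalStateException("Input was not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample in the question; the image URL is a placeholder.
        String description = "<description>&lt;a href='http://testlink.html' alt='some text'&gt;"
                + "&lt;img border='0' src='http://example.invalid/image.jpg' alt='some' title='text' /&gt;"
                + "&lt;/a&gt;&lt;br /&gt;some text; some text</description>";
        String[] result = linkAttributes(description);
        System.out.println(result[0]); // prints http://testlink.html
        System.out.println(result[1]); // prints some text
    }
}
```

No explicit loop is needed here, since each XPath expression returns the first matching attribute directly. If the embedded markup is real-world HTML rather than well-formed XML, the second parse will throw, and a dedicated HTML parser is probably the safer choice.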

1 Comment

Tassos, I didn't know about this. I'll definitely try this approach. Thanks for your reply.

Your regular expression will not find anything useful to you, and it may even be broken.

The following are true in regular expressions:

* matches 0 or more of the preceding character

. matches any single character (by default, anything except a line terminator)

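So your current regex is trying to locate strings that match this pattern: an a, a word boundary, the string href, 0 or more = characters, then any character followed by html or any character followed by htm, and then a > character. A word boundary can never occur between a and h, because both are word characters, so the pattern cannot match anything; that is why split returns the whole string. If you want special characters such as . to be matched literally, you will need to escape them.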
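
To make these points concrete, here is a small sketch using java.util.regex. The class name RegexPitfallsSketch, the HREF pattern, and the firstHref helper are hypothetical and only one of several ways to capture the value; the ORIGINAL pattern is copied from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfallsSketch {
    // The pattern from the question, unchanged.
    public static final Pattern ORIGINAL = Pattern.compile("a\\bhref=*(.html|.htm)>");
    // Hypothetical alternative: capture whatever sits between single quotes after href=.
    public static final Pattern HREF = Pattern.compile("href='([^']*)'");

    // Returns the first href value in the input, or null when there is none.
    public static String firstHref(String input) {
        Matcher m = HREF.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<a href='http://testlink.html' alt='some text'>link</a>";

        // \b cannot match between 'a' and 'h', so nothing matches and
        // split hands back the whole input as a single element.
        String[] parts = ORIGINAL.split(input);
        System.out.println(parts.length);           // prints 1
        System.out.println(parts[0].equals(input)); // prints true

        // An unescaped dot matches any character; an escaped dot matches only a literal dot.
        System.out.println(Pattern.matches(".html", "xhtml"));   // prints true
        System.out.println(Pattern.matches("\\.html", "xhtml")); // prints false

        System.out.println(firstHref(input)); // prints http://testlink.html
    }
}
```

Note that split is meant for cutting a string at delimiters; to pull a value out of a string, a Matcher with find and a capturing group is usually the better fit.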

A better way of forming your regex is shown in Algomorph's example above.

Please look at the Java documentation for regular expressions for more information on what is allowed: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

There are also plenty of other tutorials and examples available on the web.

1 Comment

I'm trying to avoid loops. Like I said, this is my first try with regex. Thanks for your precise observations.
