Java regex: get some parts of a string

Question

I'm trying to use Regex in Java for the first time. I want to get some parts of a string. The string is a little complex:

<description>
  &lt;a href='http://testlink.html' alt='some text'&gt;&lt;img border='0'
  src='http://s2.glbimg.com/zzag70iNYX-QK24sUp0YXQmmXhx7yb8j2Sq2YK7tvX3A6vCwEUOFnFTBONQFT-
  ni/s.glbimg.com/es/ge/f/original/2012/04/25/image.jpg' 
  alt='some' title='text' /&gt;&lt;/a&gt;&lt;br /&gt;some text; some text
</description>

I need to get the strings inside the href and alt attributes. This is the code I have so far:

for (Element element : elements)
{
    //Elements children = element.children();
    Pattern pattern = Pattern.compile("a\\bhref=*(.html|.htm)>");
    String[] data = pattern.split(element.text()); ...
}

And so on. At the moment I'm trying to get only the href, without success: the result is always the whole string. Is my pattern incorrect? I added the html extension to make the match more specific, but nothing changed.

If you're going to be parsing HTML, why not use an existing HTML parser?
– Thomas, Commented Aug 8, 2012 at 20:36

Greg Kramida · Accepted Answer · 2012-08-08 20:50:32Z

1

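import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeExtractor {
  public static void main(String[] args) {
    // The entity-encoded markup from the question, joined into one valid string literal.
    String sourcestring = "<description>&lt;a href='http://testlink.html' alt='some text'&gt;&lt;img border='0' "
        + "src='http://s2.glbimg.com/zzag70iNYX-QK24sUp0YXQmmXhx7yb8j2Sq2YK7tvX3A6vCwEUOFnFTBONQFT-"
        + "ni/s.glbimg.com/es/ge/f/original/2012/04/25/image.jpg' "
        + "alt='some' title='text' /&gt;&lt;/a&gt;&lt;br /&gt;some text; some text</description>";
    // The lookbehind anchors each match just after href= or alt= and the opening quote;
    // the character class then consumes everything up to the closing quote.
    Pattern re = Pattern.compile("(?<=href='|alt=')[^']*|(?<=href=\"|alt=\")[^\"]*");
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // The pattern has no capturing groups, so this loop only prints group 0 (the whole match).
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}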
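
As a possible variant of the approach above, capturing groups can record which attribute each value came from. This is only a sketch: the class name AttributeGroups, the method extract, and the shortened sample string (the src attribute is left out) are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeGroups {

    // Group 1: attribute name, group 2: opening quote,
    // group 3: the value, read lazily up to the same kind of quote (\2).
    static final Pattern ATTRIBUTE = Pattern.compile("\\b(href|alt)=(['\"])(.*?)\\2");

    // Hypothetical helper: returns "name -> value" for every href/alt found.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ATTRIBUTE.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + " -> " + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        // Shortened version of the question's markup (assumption: src omitted for brevity).
        String source = "&lt;a href='http://testlink.html' alt='some text'&gt;"
                + "&lt;img border='0' alt='some' title='text' /&gt;&lt;/a&gt;";
        for (String entry : extract(source)) {
            System.out.println(entry);
        }
    }
}
```

Like any regex over markup, this assumes the attribute values never contain the quote character that delimits them.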

answered Aug 8, 2012 at 20:50

Greg Kramida


5 Comments

Matt Over a year ago

That is not quite what I was working towards, but yeah, I think that looks pretty good.

Greg Kramida Over a year ago

What were you looking for, more exactly?

Matt Over a year ago

Me? I was just commenting - not my question ;-) I was only solving the problem of the href grab and not the alt grab.

Greg Kramida Over a year ago

Sharing a link to this awesome tool by Doug Drudik, called My Regex Tester: myregextester.com

learner Over a year ago

Algomorph, I'm trying to avoid any kind of loop for performance reasons. But thanks for your reply.

Community · Accepted Answer · 2017-05-23 12:12:18Z

1

Do not use regular expressions for this task, unless you absolutely know that the text format will not change. You seem to want to parse (X|HT)ML using regexps, and that is a bad thing. I'd suggest parsing as XML and using XPath.
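
A minimal sketch of the XML-plus-XPath route, using only the JDK's javax.xml classes. It rests on several assumptions: the escaped markup inside the description element happens to be well-formed XML once unescaped (true for the sample in the question, not for HTML in general), it is wrapped in a hypothetical root element so the second parse sees a single root, and the image URL is shortened to a placeholder. The class and method names are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAttributes {

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Returns the values of all attribute nodes selected by the XPath expression.
    static List<String> attributeValues(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    // Reads the unescaped text of <description>, wraps it in a hypothetical
    // <root> element, and parses that text as XML in its own right.
    static Document innerMarkup(String descriptionXml) throws Exception {
        String unescaped = parse(descriptionXml).getDocumentElement().getTextContent();
        return parse("<root>" + unescaped + "</root>");
    }

    public static void main(String[] args) throws Exception {
        // Sample from the question; the src URL is a shortened placeholder (assumption).
        String description = "<description>&lt;a href='http://testlink.html' alt='some text'&gt;"
                + "&lt;img border='0' src='http://example.com/image.jpg' alt='some' title='text' /&gt;"
                + "&lt;/a&gt;&lt;br /&gt;some text; some text</description>";
        Document inner = innerMarkup(description);
        System.out.println(attributeValues(inner, "//@href"));
        System.out.println(attributeValues(inner, "//@alt"));
    }
}
```

If the embedded markup is real-world HTML rather than well-formed XML, the second parse will throw, and a dedicated HTML parser is the safer choice.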

edited May 23, 2017 at 12:12

CommunityBot


answered Aug 8, 2012 at 21:03

Tassos Bassoukos


1 Comment

learner Over a year ago

Tassos, I didn't know about this. I'll definitely try this approach. Thanks for your reply.

BoltClock · Accepted Answer · 2012-08-10 06:47:16Z

1

Your regular expression will not find anything useful to you, and as written it cannot match at all.

The following are true in regular expressions:

* matches 0 or more occurrences of the preceding element

. matches any single character

So your current regex is trying to locate strings that match a pattern where there is an a, a word boundary, the string href, 0 or more = characters, and then any character followed by html or any character followed by htm, and then a > character. A word boundary can never occur between a and h, because both are word characters, so the pattern never matches and split returns the whole input as a single element. If you want to match those special characters literally, you need to escape them.

A better way of forming your regex is like Algomorph's example above.
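
A small sketch that checks these points against the pattern from the question. The class name SplitDemo, the helper splitWithQuestionPattern, and the shortened input string are illustrative assumptions. It shows that \b cannot match between two word characters such as a and h, so split finds nothing to split on, and how * and . behave with and without escaping.

```java
import java.util.regex.Pattern;

public class SplitDemo {

    // The pattern from the question, unchanged.
    static final Pattern QUESTION_PATTERN = Pattern.compile("a\\bhref=*(.html|.htm)>");

    // Splits the input around matches of the question's pattern.
    static String[] splitWithQuestionPattern(String input) {
        return QUESTION_PATTERN.split(input);
    }

    public static void main(String[] args) {
        String input = "<a href='http://testlink.html' alt='some text'>";

        // \b cannot match between 'a' and 'h' (both are word characters),
        // so there is nothing to split on and the input comes back whole.
        String[] parts = splitWithQuestionPattern(input);
        System.out.println(parts.length);            // 1
        System.out.println(parts[0].equals(input));  // true

        // * means "zero or more of the preceding element".
        System.out.println(Pattern.matches("href=*", "href"));     // true
        System.out.println(Pattern.matches("href=*", "href==="));  // true

        // An unescaped . matches any character; \\. matches only a literal dot.
        System.out.println(Pattern.matches(".html", "xhtml"));     // true
        System.out.println(Pattern.matches("\\.html", "xhtml"));   // false
        System.out.println(Pattern.matches("\\.html", ".html"));   // true
    }
}
```

Note also that split returns the text around matches rather than the matches themselves, so even a working pattern would need a Matcher to pull the values out.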

Please look at the Java documentation for regular expressions for more information on what is allowed: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

There are also plenty of other tutorials and examples available on the web.

edited Aug 10, 2012 at 6:47

BoltClock


answered Aug 8, 2012 at 20:54

Matt


1 Comment

learner Over a year ago

I'm trying to avoid loops. Like I said, this is my first try with regex. Thanks for your precise observations.
