java: regular expression

Question

I have an HTML string that includes lots of image tags. I need to get each tag and change it. For example:

String imageRegex = "(<img.+(src=\".+\").+/>){1}";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
Matcher matcher = Pattern.compile(imageRegex, Pattern.CASE_INSENSITIVE).matcher(str);
int i = 0;
while (matcher.find()) {
    i++;
    Log.i("TAG", matcher.group());
}

The result is:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />hello world<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />

But that's not what I want. I want the result to be:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />

What's wrong with my regular expression?

Can I refer you to this answer: stackoverflow.com/a/1732454/83109
– David M, Commented Jul 10, 2012 at 13:14
Is there anything wrong with regexing out only <img> tags though?
– namenamename, Commented Jul 10, 2012 at 13:20
Yes, there is. The problem is that HTML isn't a regular language, and so it's not a good candidate for analysis with a regular expression. Sometimes you can make it work in a pinch (this may be one of those cases), but it's a little like driving nails with an old shoe. It may get the job done, but it's not really the right tool.
– Ian McLaird, Commented Jul 10, 2012 at 13:23
As the comments to the question I've linked to say, there is a big difference between PARSING and MATCHING. I just like that answer.
– David M, Commented Jul 10, 2012 at 13:24
Regular expressions handle strings, and HTML is made of strings, so why can't a regular expression handle HTML? "HTML isn't a regular language": this has nothing to do with languages, it's just strings, so why not?
– Mejonzhan, Commented Jul 10, 2012 at 13:36

GrayFox374 · Accepted Answer · 2012-07-10 13:32:29Z

Try (<img)(.*?)(/>). The reluctant .*? should do the trick, although yes, you shouldn't use regex for parsing HTML, as people will tell you over and over.

I don't have Eclipse installed, but I have VS2010, and this works for me in C#:

        String imageRegex = "(<img)(.*?)(/>)";
        String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
        System.Text.RegularExpressions.MatchCollection match = System.Text.RegularExpressions.Regex.Matches(str, imageRegex, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        StringBuilder sb = new StringBuilder();
        foreach (System.Text.RegularExpressions.Match m in match)
        {
            sb.AppendLine(m.Value);
        }
        System.Windows.MessageBox.Show(sb.ToString());

Result:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" /> 
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
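
Since the question is about Java and the snippet above is C#, here is a rough Java equivalent using java.util.regex. It is only a sketch: the class name ImgTagFinder and the helper findImgTags are made up for illustration, and it assumes each tag sits on one line and ends with "/>" (by default "." does not match line terminators).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagFinder {
    // Reluctant (.*?) stops at the first "/>" after "<img" instead of the last one.
    private static final Pattern IMG_TAG =
            Pattern.compile("(<img)(.*?)(/>)", Pattern.CASE_INSENSITIVE);

    /** Hypothetical helper: returns every substring of html matching IMG_TAG, in order. */
    public static List<String> findImgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = IMG_TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />"
                + "hello world"
                + "<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
        // Prints each <img ... /> tag on its own line.
        for (String tag : findImgTags(str)) {
            System.out.println(tag);
        }
    }
}
```

Collecting the matches into a list keeps the matching separate from whatever you then do with each tag (logging, rewriting, and so on).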

Ian McLaird · Accepted Answer · 2012-07-10 13:21:46Z

David M is correct: you really shouldn't try to do this. Your specific problem, though, is that the + quantifier in your regex is greedy, so it matches the longest possible substring that still lets the overall pattern match.

See The regex tutorial for more details on the quantifiers.
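
To make the greedy behaviour concrete, here is a small Java sketch comparing a greedy quantifier with its reluctant counterpart. The patterns are simplified stand-ins for the one in the question, and the class name GreedyVsReluctant and helper firstMatch are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    /** Returns the first match of regex in input, or null if there is none. */
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<img src=\"a.gif\" />hello world<img src=\"b.gif\" />";

        // Greedy: .+ consumes as much as it can, so the match ends at the LAST "/>".
        // Prints the whole input string.
        System.out.println(firstMatch("<img.+/>", input));

        // Reluctant: .+? consumes as little as it can, so the match ends at the FIRST "/>".
        // Prints: <img src="a.gif" />
        System.out.println(firstMatch("<img.+?/>", input));
    }
}
```

Appending ? to a quantifier (+?, *?) is all it takes to switch from greedy to reluctant matching.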


Anton · Accepted Answer · 2012-07-10 13:38:27Z

I would NOT recommend using regex to parse HTML. Please consider jsoup or a similar library:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements images = doc.select("img");

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.
