I have an HTML string that contains many image tags. I need to extract each tag and change it. For example:

String imageRegex = "(<img.+(src=\".+\").+/>){1}";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
Matcher matcher = Pattern.compile(imageRegex, Pattern.CASE_INSENSITIVE).matcher(str);
int i = 0;
while (matcher.find()) {
    i++;
    Log.i("TAG", matcher.group());
}

The result is:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />hello world<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />

But that's not what I want. I want the result to be:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" /> 

What's wrong with my regular expression?

  • Can I refer you to this answer: stackoverflow.com/a/1732454/83109 Commented Jul 10, 2012 at 13:14
  • Is there anything wrong with regexing out only <img> tags though? Commented Jul 10, 2012 at 13:20
  • Yes, there is. The problem is that HTML isn't a regular language, and so it's not a good candidate for analysis with a regular expression. Sometimes you can make it work in a pinch (this may be one of those cases), but it's a little like driving nails with an old shoe. It may get the job done, but it's not really the right tool. Commented Jul 10, 2012 at 13:23
  • As the comments to the question I've linked to say, there is a big difference between PARSING and MATCHING. I just like that answer. Commented Jul 10, 2012 at 13:24
  • Regular expressions handle strings, and HTML is made of strings, so why can't a regular expression handle HTML? As for "HTML isn't a regular language": this has nothing to do with language, it's just strings, so why not? Commented Jul 10, 2012 at 13:36

3 Answers

Try (<img)(.*?)(/>); this should do the trick. That said, you shouldn't use regex for parsing HTML, as people will tell you over and over.

I don't have Eclipse installed, but I have VS2010, so I tested it in C#, and this works for me.

        String imageRegex = "(<img)(.*?)(/>)";
        String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
        System.Text.RegularExpressions.MatchCollection match = System.Text.RegularExpressions.Regex.Matches(str, imageRegex, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        StringBuilder sb = new StringBuilder();
        foreach (System.Text.RegularExpressions.Match m in match)
        {
            sb.AppendLine(m.Value);
        }
        System.Windows.MessageBox.Show(sb.ToString());

Result:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" /> 
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
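
Since the question is about Java and the test above was run in C#, here is a minimal sketch of the same pattern using java.util.regex. The class name ImgTagMatcher and the helper findImgTags are made up for illustration; the pattern is the one suggested above, and it assumes every image tag is self-closing with "/>".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagMatcher {
    // The reluctant .*? stops at the first "/>" that follows each "<img".
    private static final Pattern IMG_TAG =
            Pattern.compile("(<img)(.*?)(/>)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every matched <img ... /> tag in order.
    static List<String> findImgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = IMG_TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group()); // group() is the whole match
        }
        return tags;
    }

    public static void main(String[] args) {
        String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />"
                + "hello world"
                + "<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
        // Prints each tag on its own line.
        for (String tag : findImgTags(str)) {
            System.out.println(tag);
        }
    }
}
```

On Android, the println call can be swapped for Log.i("TAG", tag) as in the question.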

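David M is correct: you really shouldn't try to do this. Your specific problem, though, is that the + quantifier in your regex is greedy, so each .+ matches the longest possible substring that still lets the overall pattern match. That is why a single match runs from the first <img to the last />.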

See the regex tutorial for more details on quantifiers.
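
To make the greedy behaviour concrete, here is a small sketch that counts matches for the question's pattern and for a variant where each + is followed by ? to make it reluctant. The class name GreedyVsReluctant, the helper countMatches, and the shortened sample string are assumptions for illustration only. The reluctant variant works on this input, but it can still run across tag boundaries on other HTML (for example, when an image tag has no src attribute).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: counts how many times regex matches in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<img src=\"9.gif\" alt=\"\" />hello world<img src=\"7.gif\" alt=\"\" />";

        // Greedy: each .+ grabs as much as it can, so one match spans both tags.
        String greedy = "<img.+(src=\".+\").+/>";
        // Reluctant: each .+? grabs as little as it can, so each tag matches separately.
        String reluctant = "<img.+?(src=\".+?\").+?/>";

        System.out.println("greedy:    " + countMatches(greedy, html));    // 1
        System.out.println("reluctant: " + countMatches(reluctant, html)); // 2
    }
}
```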

I would NOT recommend using regex to parse HTML. Please consider jsoup or a similar library:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements images = doc.select("img");

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.
