0

I have this regular expression:

(\S+)=[""']?((?:.(?![""']?\s+(?:\S+)=|[>""']))+.)[""']?

This regex extracts attribute names and their values from an HTML string. It works fine except when a value is a single character: in that case the captured value contains the opening quote as well as the character.

This is my string:

<select title="Campo" id="6:7" style="width: auto; cursor: pointer;" runat="server" controltype="DropDownList" column="Dummy_6"><option value="0">Value:0</option><option selected="selected" value='1'>Value:1Selected!</option></select>

I don't know how to modify this regex so that it captures the value correctly even when it is only one character long.

7
  • 2
    What language are you using and what exactly are you trying to match? I would consider using a parser instead of regular expression for this task. Commented Jun 25, 2015 at 13:50
  • 7
    Don't parse HTML with regular expressions! Commented Jun 25, 2015 at 13:52
  • For everyone blithely tossing out "don't parse html with regex", it's completely fine to retrieve single tags or content from html with regex. As it would from any other type of text. This is not parsing. Commented Jun 25, 2015 at 14:09
  • I'm getting matches on attributes and the attribute value, is that what you wanted to match? Could you provide examples of desired output, as well as what you are getting right now? Commented Jun 25, 2015 at 14:12
  • With a single-character value I'm getting something like this: "1, but I want to get this: 1. When the value is longer than one character, everything works. Commented Jun 25, 2015 at 14:13

3 Answers

1

You should be using an HTML parser for this task; regex cannot handle HTML properly.

To collect all tag names and their attribute names and values, I recommend the following HtmlAgilityPack-based solution:

// Assumes: using System; using System.Collections.Generic; using System.Linq;
// 'html' is a string holding either markup or an absolute URL.
var tags = new List<string>();
var result = new List<KeyValuePair<string, string>>();
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
    && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
{ // html is a URL: download and parse the page
    var web = new HtmlAgilityPack.HtmlWeb();
    hap = web.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string of markup
    hap = new HtmlAgilityPack.HtmlDocument();
    hap.LoadHtml(html);
}
// Keep element nodes only (skips text and comment nodes).
var nodes = hap.DocumentNode.Descendants().Where(p => p.NodeType == HtmlAgilityPack.HtmlNodeType.Element);
foreach (var node in nodes)
{
    tags.Add(node.Name);                       // tag name, e.g. "select"
    foreach (var attribute in node.Attributes) // attribute name/value pairs
        result.Add(new KeyValuePair<string, string>(attribute.Name, attribute.Value));
}

1 Comment

I myself like regex, but it is not a correct solution for your case. You could even use your regex with a slight modification, but it is not the right tool.
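For reference, the slight modification could look like the sketch below. The value group in the question, ((?:.(?!…))+.), needs at least two characters: the + consumes one and the trailing . another. For a one-character value the engine can only satisfy that by leaving the optional opening quote unmatched and pulling it into the value, which is where "1 comes from. Changing + to * lets the repeated part match zero times. The class name and the shortened sample input are made up for this example, and the pattern remains a naive attribute matcher, not an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeRegexDemo
{
    // The question's pattern with one change: the inner repetition is * instead of +.
    // In a C# verbatim string, "" is an escaped double quote.
    public const string Pattern =
        @"(\S+)=[""']?((?:.(?![""']?\s+(?:\S+)=|[>""']))*.)[""']?";

    public static void Main()
    {
        // Shortened sample input, made up for this example.
        string html = "<option selected=\"selected\" value='1'>Value:1Selected!</option>";

        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // Groups[1] is the attribute name, Groups[2] the attribute value.
            Console.WriteLine($"{m.Groups[1].Value} -> {m.Groups[2].Value}");
        }
    }
}
```

With this input the sketch should print selected -> selected and value -> 1, so the single-character value no longer carries the quote.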
0

I think you're trying something overly intricate and, ultimately, incorrect, with your regex.

If you want to naively parse an HTML attribute, this regex should do the trick:

(\S+)=(?:"([^"]+)"|'([^']+)')

Note that it parses single-quoted and double-quoted values in different legs of the regex. Your regex would find that in the following code:

<foo bar='fu"bar'>

the attribute's value is fu when it really is fu"bar.
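A minimal C# sketch of how this pattern could be used. Because double-quoted and single-quoted values land in different groups (2 and 3), the calling code has to pick whichever one took part in the match. The class name, the Extract helper, and the sample input are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedAttributeDemo
{
    // The pattern from this answer as a C# verbatim string ("" is an escaped ").
    public const string Pattern = @"(\S+)=(?:""([^""]+)""|'([^']+)')";

    // Hypothetical helper: returns the (name, value) pairs found in the input.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // Only one of the two value groups participates in any given match.
            string value = m.Groups[2].Success ? m.Groups[2].Value : m.Groups[3].Value;
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, value));
        }
        return result;
    }

    public static void Main()
    {
        // Sample tag made up for this example.
        foreach (var pair in Extract("<foo bar='fu\"bar' baz=\"1\">"))
            Console.WriteLine($"{pair.Key} = {pair.Value}");
    }
}
```

For the sample tag this should yield bar with the value fu"bar and baz with the value 1. Note that [^"]+ and [^']+ require at least one character, so empty attribute values are skipped.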

2 Comments

How can I use only two capturing groups?
You can't, because regular expressions are regular. Mathematically speaking, you can't use them to parse balanced expressions.
0

There are better ways to parse HTML, but here's my take on your question anyway.

(?<attr>(?<=\s).+?(?==['"]))|(?<val>(?<=\s.+?=['"]).+?(?=['"]))

Without capture group names:

((?<=\s).+?(?==['"]))|((?<=\s.+?=['"]).+?(?=['"]))

quotes included:

((?<=\s).+?(?==['"]))|((?<=\s.+?=)['"].+?['"])
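A short C# sketch of how the named-group version could be consumed. It relies on variable-length lookbehind, which .NET supports but many other regex engines reject. The class name and the sample tag are made up for this example, and values that contain a quote character would still be cut short.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Named-group pattern from this answer; "" is an escaped " in a verbatim string.
    public const string Pattern =
        @"(?<attr>(?<=\s).+?(?==['""]))|(?<val>(?<=\s.+?=['""]).+?(?=['""]))";

    public static void Main()
    {
        // Sample tag made up for this example.
        string html = "<option selected=\"selected\" value='1'>";

        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // Each match fills exactly one of the two named groups.
            if (m.Groups["attr"].Success)
                Console.WriteLine("attr: " + m.Groups["attr"].Value);
            else
                Console.WriteLine("val:  " + m.Groups["val"].Value);
        }
    }
}
```

Names and values arrive as separate, alternating matches, so the caller has to pair each attr match with the val match that follows it.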

Update: For more in-depth usage, do give HTML Agility Pack a try.
