I need to parse HTML that is formatted like the code sample below. The problem is that the field name can be wrapped in tags with varying background or color styles. The pattern I am looking for is: a <br /> tag, then the key name followed by a colon, ignoring any span that wraps the key name (the plain form is id: with no span tag wrapping it). Matching this pattern should give me the key name, and whatever follows the key name is the key's value, up to the next key name. Below is a sample of the HTML I need to parse.

string source = @"
<br />id: Value here
        <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">community</SPAN>: Value here
        <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">content</SPAN><SPAN style=""background-color: #A0FFFF; color: #000000"">title</SPAN>: Value here
";
// Split the source into key/value pairs based on the pattern match.

Thanks for any help.

  • take a look here: stackoverflow.com/a/1732454/3227403 Commented Aug 23, 2014 at 13:24
  • @pid, he's just trying to parse a well defined structure where the delimiters happen to be shaped like HTML elements, so I don't think we need to worry about accidentally summoning Cthulhu. In other words: stackoverflow.com/a/1733489/2611587 Commented Aug 23, 2014 at 13:37
  • @SteveRuble in fact mine was not an answer but a comment :) Commented Aug 23, 2014 at 13:50

1 Answer

Here's some code that'll parse it, assuming that your example HTML should have another <br /> element after "content".

string source = @"
  <br />id: Value here
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">community</SPAN>: Value here
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">content</SPAN>
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">title</SPAN>: Value here";

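// Requires: using System.Linq; using System.Text.RegularExpressions;
// Group 1 is the key (an optional wrapping SPAN is skipped); group 2 is the rest of the line.
// [ \t]* and [^\r\n]* keep each match on one line, so an empty value cannot swallow the next line.
var items = Regex.Matches(source, @"<br />(?:<SPAN[^>]*>)?([^<:]+)(?:</SPAN>)?:?[ \t]*([^\r\n]*)")
         .OfType<Match>()
         .ToDictionary(m => m.Groups[1].Value, m => m.Groups[2].Value)
         .ToList();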
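
For anyone who wants to drop this into a project, here is a rough sketch of the same idea as a small, self-contained program. The names `KeyValueHtml` and `Parse` are hypothetical and not part of the answer above. It assumes, as the answer does, that every key follows a <br /> and that each pair sits on its own line. It returns a list of pairs instead of a dictionary, since `ToDictionary` throws if the same key appears twice, and it limits each value to the current line so that line-ending differences do not leak into the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyValueHtml
{
    // Hypothetical helper for illustration only.
    // Group 1: key text after <br />, with one optional wrapping SPAN skipped.
    // Group 2: everything after the optional colon, up to the end of the line.
    private static readonly Regex Pair = new Regex(
        @"<br />(?:<SPAN[^>]*>)?([^<:]+)(?:</SPAN>)?:?[ \t]*([^\r\n]*)",
        RegexOptions.IgnoreCase);

    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        return Pair.Matches(html)
            .Cast<Match>()
            .Select(m => new KeyValuePair<string, string>(
                m.Groups[1].Value.Trim(),
                m.Groups[2].Value.Trim()))
            .ToList();
    }

    public static void Main()
    {
        // Same sample as in the answer: one <br /> per key.
        string source = @"
  <br />id: Value here
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">community</SPAN>: Value here
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">content</SPAN>
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">title</SPAN>: Value here";

        foreach (var pair in Parse(source))
        {
            Console.WriteLine("{0} = \"{1}\"", pair.Key, pair.Value);
        }
    }
}
```

With the sample above this should list id, community, content (with an empty value) and title, in that order. If the real input puts several keys on one line with no <br /> between them, as in the question's original sample, this sketch will not separate them and the pattern would need to be adapted.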

5 Comments

Thanks. This doesn't look like it is parsing all the key value pairs as I expect. I was hoping this would be generic enough to have it return the key/val pairs based on the pattern. In my example string it would be parsed as: items[0].Key = "id" items[0].Value = "Value here" items[1].Key = "community" items[1].Value = "Value here" items[2].Key = "content" items[2].Value = "" items[3].Key = "title" items[3].Value = "Value here"
@user971823, a Dictionary<K,V> is a list of key/val pairs. I've added a ToList() call so that the value of items will conform to the example result in your comment.
Thanks again. Is there a way to make this more generic so that the key name is the value based on the pattern (without having to specify the key names)? For example: instead of .ToDictionary (m => m.Groups[Use Pattern to Derive Key Name and Values].Value.
@user971823, I'm not sure if I understand your question. I've updated my answer code to remove the named captures, if that's what you were worried about.
I needed to copy your entire code block with the string source. That's exactly what I needed! Thanks Steve!
