Regex for matching attribute values in invalid xml file

Question

I have some invalid XML files (with <, >, & and " characters inside the attribute values). I need to convert them into correct XML files in C#.

The only way I can think of is escaping the invalid characters inside the attributes. This works fine for <, > and & (&lt;, &gt;, &amp;). However, I have problems detecting and replacing the " characters inside the attributes.

Right now I am using this regex for matching attribute values:

/="(.*?)"

My test case is this:

<add sqlQuery="select blaat from test where count == "1"" test="dfsdf"/>
<add sqlQuery="select blaat from test where count == "1"" test="dfsdf" />
<add sqlQuery="select blaat from test where count == "1" and blaat > 3" test="dfsdf"/>
<add xmlDiff_action="MoveNodeFrom('1')" alias="jkhkjh" />
<add xmlDiff_action="MoveNodeFrom('1')" />

RegEx test link with non-greedy

As you can see in the test, the match stops at the first inner quote, the one that opens "1".

If I change the regex to the greedy /="(.*)", I match the whole line (including the other attributes on the same line).

RegEx test link with greedy
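The difference between the two patterns can be reproduced outside the online tester. The sketch below uses Python's re module purely for illustration (the question targets C#, but both patterns behave the same way there); the variable names are made up for this example.

```python
import re

# First line of the test case from the question
line = '<add sqlQuery="select blaat from test where count == "1"" test="dfsdf"/>'

# Non-greedy: each value ends at the first quote that follows ="
lazy_values = re.findall(r'="(.*?)"', line)

# Greedy: the value runs up to the last quote on the line
greedy_values = re.findall(r'="(.*)"', line)

print(lazy_values)
print(greedy_values)
```

The non-greedy list holds the truncated query plus the value of test, while the greedy list holds a single entry that swallows the test attribute as well.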

It is hard to define the "end quote" of an xml attribute. In my test cases it can end in:

" (space)
"/>
"
" otherAttribute="value"
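One way to encode these endings is a lookahead: a quote only counts as the closing quote when it is followed by the next attribute, by the end of the tag, or by the end of the line. The sketch below is a heuristic under that assumption, not a general solution. It is written in Python for illustration, and the names ATTR_VALUE, BARE_AMP, escape_value and repair_attributes are hypothetical. It will misfire if a value itself contains something like " name=" or a quote directly before >, and it does not handle values that span several lines.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a quote closes the value only if what follows looks like
# another attribute (name="), the end of the tag (> or />), or end of line.
ATTR_VALUE = re.compile(
    r'="(.*?)"(?=\s+[\w:.-]+\s*="|\s*/?>|\s*$)',
    re.MULTILINE,
)

# An ampersand that does not already start a character or entity reference
BARE_AMP = re.compile(r'&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)')


def escape_value(match):
    """Escape the characters that are not allowed inside an attribute value."""
    value = BARE_AMP.sub("&amp;", match.group(1))
    value = value.replace("<", "&lt;").replace(">", "&gt;")
    value = value.replace('"', "&quot;")
    return '="' + value + '"'


def repair_attributes(text):
    """Rewrite every attribute value found by the heuristic pattern."""
    return ATTR_VALUE.sub(escape_value, text)


broken = ('<add sqlQuery="select blaat from test where count == "1" '
          'and blaat > 3" test="dfsdf"/>')
fixed = repair_attributes(broken)
print(fixed)

# A real XML parser now accepts the line and returns the original query text
element = ET.fromstring(fixed)
print(element.get("sqlQuery"))
```

Passing a replacement function to the substitution keeps the matching and the escaping separate; in C# the same shape would be Regex.Replace with a MatchEvaluator.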

I know that it looks unnecessary that I want to parse this invalid XML (and even an invalid SQL query, because it uses double spaces and quotes for == "1"). That is because it comes from another application, which saves all the data in a CDATA section. But for what I am doing I need to parse that CDATA section into correct XML (escaping the invalid characters).

Huge thanks in advance if somebody could solve this in RegEx or combination of RegEx and C#!

Where do these not-quite-XML files come from? If at all possible, I'd try to fix the problem at its source rather than trying to handle the outcome. If the problem is due to someone else's code, you should at least express your displeasure with them. It's worrying when you see things like this, as it suggests they're producing XML by hand (rather than using an XML API) - it makes me wonder what else they're doing that's a really bad idea.
– Jon Skeet, Commented Nov 27, 2014 at 9:43
Maybe it would be easier to just match the invalid values? What cases other than "1" are possible? If it's not ="1" but == "1" then it's invalid and should be fixed... mhmmm
– t3chb0t, Commented Nov 27, 2014 at 10:10
Not that easy. I cannot change where the source XML comes from. Furthermore, the XML is totally variable with no real logic. I even found attributes that have a whole new XML document inside them, including the XML declaration... Also, we are talking about "XML" files of over 100.000 lines.
– Wouter van Slooten, Commented Nov 27, 2014 at 10:21
For starters, please don't refer to this stuff as XML. It will only get people confused.
– Michael Kay, Commented Nov 27, 2014 at 12:05

Aleksei Matiushkin · Accepted Answer · 2014-11-27 11:30:25Z

Considering that SQL statements are expected inside the parameters (attribute values), we could arrive at the following regexp, which uses a named group that calls itself recursively:

(?<match>"((\g<match>|[^"]*))*?")(?=\s|\/|>)/gm

The proof somehow works, but it’s insane to even try regexps like these.

edited Nov 27, 2014 at 11:30

answered Nov 27, 2014 at 10:04

Aleksei Matiushkin

6 Comments

Wouter van Slooten Over a year ago

Thanks a lot for this. It seems to work in the tests. I know this is pretty insane but right now I don't see another way. I cannot change the source where the invalid XML comes from.

Aleksei Matiushkin Over a year ago

Was glad to help, but, please, never mention I shared the code catching SQL inside invalid XML parameters with you :)

Wouter van Slooten Over a year ago

Haha, OK. Yeah, it is not nice coding, but sometimes you can't change it, especially in a short time. One question, about something I have not experienced before: my Regex parse (Matches) takes forever (it never ends) in .NET, while if I parse the whole document online it finds and marks all the attributes very fast. Any idea where to start looking? I escaped the RegEx string in C# like this:

string regexCondition = "(?<match>\"((\\g<match>|[^\"]*))*?\")(?=\\s|\\/)/gm";
Regex regex = new Regex(regexCondition, RegexOptions.Compiled);

Aleksei Matiushkin Over a year ago

Sorry, I have never worked with regexps in C#, and I can’t find in the documentation whether the /gm modifiers are allowed. Shouldn’t you use RegexOptions.Multiline etc. instead? The escaping looks reasonable to me.
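On the /gm point: the slashes and trailing letters are the tester's own delimiter notation, and engines that take the pattern as a plain string treat them as literal characters. The sketch below shows this with Python's re module, used only for illustration; in .NET the counterpart of the flag argument is the RegexOptions enum. The sample text and pattern are made up for this example.

```python
import re

text = 'a="x"\nb="y"'
pattern = r'^\w+="([^"]*)"$'

# "/gm" appended to the pattern is matched literally, so nothing is found
with_suffix = re.findall(pattern + '/gm', text)

# Without a multiline flag, ^ and $ only anchor at the ends of the text
no_flag = re.findall(pattern, text)

# The flag is passed separately and lets ^ and $ match at every line
multiline = re.findall(pattern, text, re.MULTILINE)

print(with_suffix, no_flag, multiline)
```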

Wouter van Slooten Over a year ago

I found out why it takes so long. Your RegEx works perfectly, except that it goes wrong when the tag ends with > instead of "space" >: regex101.com/r/kY0bS8/3. I am trying to fix that problem, but it is hard for me (meaning: I am trying without real understanding). If you had the time to take a quick look, I would be very thankful!
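The effect of the missing > alternative can be seen with a much simpler pattern. The sketch below is not the answer's regexp: Python's re module has no \g<name> subroutine calls, so a plain "[^"]*" stands in for the recursive group, and only the lookahead differs between the two hypothetical patterns.

```python
import re

# Simplified stand-in for the quoted-value part of the answer's pattern
without_gt = re.compile(r'"[^"]*"(?=\s|/)')
with_gt = re.compile(r'"[^"]*"(?=\s|/|>)')

tight = '<add test="dfsdf">'    # closing quote directly followed by >
spaced = '<add test="dfsdf" >'  # closing quote followed by a space

print(without_gt.findall(tight))   # the value is missed
print(with_gt.findall(tight))      # the value is found
print(without_gt.findall(spaced))  # a space satisfies both lookaheads
```

With the recursive original, a lookahead that cannot succeed also forces the engine to retry many ways of splitting the value, which is a plausible reason for the run time observed above.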
