I have some invalid XML files (with unescaped <, >, & and " characters inside attribute values). I need to convert them into well-formed XML in C#.

The only way I can think of is escaping the invalid characters inside the attributes. This works fine for <, > and & (&lt;, &gt;, &amp;). However, I have problems detecting and replacing the " characters inside the attributes.

Right now I am using this regex for matching attribute values:

/="(.*?)"

My test case is this:

<add sqlQuery="select blaat from test where count == "1"" test="dfsdf"/>
<add sqlQuery="select blaat from test where count == "1"" test="dfsdf" />
<add sqlQuery="select blaat from test where count == "1" and blaat > 3" test="dfsdf"/>
<add xmlDiff_action="MoveNodeFrom('1')" alias="jkhkjh" />
<add xmlDiff_action="MoveNodeFrom('1')" />

RegEx test link with not greedy

As you can see in the test, the match stops at the first quote of "1".

If I change the regex to greedy, /="(.*)", it matches up to the last quote on the line (so including the other attributes on the same line).

RegEx test link with greedy:
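
To make the difference between the two quantifiers concrete, here is a minimal C# sketch that runs a pattern against a test line and returns what group 1 captured. The names QuantifierDemo and Values are made up for illustration, and the patterns are passed without the /.../ delimiters because .NET takes the bare pattern string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text captured by group 1 for every match of the pattern.
    public static string[] Values(string line, string pattern)
    {
        MatchCollection matches = Regex.Matches(line, pattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Groups[1].Value;
        }
        return result;
    }
}
```

Called with the first test line, the lazy pattern "=\"(.*?)\"" should cut the sqlQuery value off at the quote before 1, while the greedy pattern "=\"(.*)\"" should return a single value that swallows the test attribute as well.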

It is hard to define the "end quote" of an xml attribute. In my test cases it can end in:

  • " (space)
  • "/>
  • "
  • " otherAttribute="value"
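
One way to turn this list of endings into code is a lazy match with a lookahead that only accepts a closing quote when it is followed by the next attribute, the end of the tag, or the end of the line. The following is a heuristic sketch, not a real parser. It assumes that every attribute sits on one line, that values contain no already-escaped entities (they would be escaped a second time), and that the SQL never contains text that itself looks like a quote followed by name=". The names AttributeRepair, EscapeValue and EscapeAttributeValues are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeRepair
{
    // name="value", where the closing quote must be followed by one of:
    //   whitespace and another name="      (next attribute)
    //   optional whitespace and /> or >    (end of the tag)
    //   optional whitespace and end of line
    private static readonly Regex AttributePattern = new Regex(
        @"(?<name>[\w:.-]+)\s*=\s*""(?<value>.*?)""(?=\s+[\w:.-]+\s*=\s*""|\s*/?>|\s*$)",
        RegexOptions.Multiline);

    private static string EscapeValue(string value)
    {
        // '&' is replaced first so the entities added afterwards are not escaped again.
        return value.Replace("&", "&amp;")
                    .Replace("<", "&lt;")
                    .Replace(">", "&gt;")
                    .Replace("\"", "&quot;");
    }

    public static string EscapeAttributeValues(string input)
    {
        // The lookahead is not consumed, so only name="value" itself is rewritten.
        return AttributePattern.Replace(input, m =>
            m.Groups["name"].Value + "=\"" + EscapeValue(m.Groups["value"].Value) + "\"");
    }
}
```

A bare quote followed by a single space is deliberately not accepted as an ending, because the quote after 1 in count == "1" and blaat > 3 is also followed by a space.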

I know it looks unnecessary to want to parse this invalid XML (the SQL query is invalid as well, see the == "1"). That is because it comes from another application, which saves all the data in a CDATA section. But for what I am doing, I need to turn that CDATA section into correct XML (escaping the invalid characters).

Huge thanks in advance if somebody could solve this in RegEx or combination of RegEx and C#!

  • Where do these not-quite-XML files come from? If at all possible, I'd try to fix the problem at its source rather than trying to handle the outcome. If the problem is due to someone else's code, you should at least express your displeasure with them. It's worrying when you see things like this, as it suggests they're producing XML by hand (rather than using an XML API) - it makes me wonder what else they're doing that's a really bad idea. Commented Nov 27, 2014 at 9:43
  • Maybe it would be easier to just match the invalid values? What other cases besides "1" are possible? If it's not ="1" but == "1", then it's invalid and should be fixed... mhmmm Commented Nov 27, 2014 at 10:10
  • Not that easy. I cannot change where the source XML comes from. Beyond that, the XML is totally variable, with no real logic. I even found attributes that have a whole new XML document inside them, including the XML declaration... Also, we are talking about "XML" files of over 100,000 lines... Commented Nov 27, 2014 at 10:21
  • For starters, please don't refer to this stuff as XML. It will only get people confused. Commented Nov 27, 2014 at 12:05

1 Answer

Given that SQL statements are expected inside the attribute values, we could arrive at the following regexp, which uses a recursive named group:

(?<match>"((\g<match>|[^"]*))*?")(?=\s|\/|>)/gm

Proof somehow works, but it’s insane to even try those regexps.

6 Comments

Thanks a lot for this. It seems to work in the tests. I know this is pretty insane but right now I don't see another way. I cannot change the source where the invalid XML comes from.
Was glad to help, but, please, never mention I shared the code catching SQL inside invalid XML parameters with you :)
Haha, ok. Yeah, is not nice coding but sometimes you can't change it. Especially in a short time. One question, something I didn't experience before: My Regex parse (Matches) takes forever (never ending) in .NET while if I parse the whole document online it goes very fast finding and marking all the attributes. Any idea where to start looking? I escaped the RegEx string in C# like: string regexCondition = "(?<match>\"((\\g<match>|[^\"]*))*?\")(?=\\s|\\/)/gm"; Regex regex = new Regex(regexCondition, RegexOptions.Compiled);
Sorry, I never worked with regexps in C#, but I can’t find in the documentation whether /gm modifiers are allowed. Shouldn’t you use Regex.Multiline etc instead? Escaping looks reasonable to me.
I found the problem why it takes so long. Your regEx works perfectly except it goes wrong when the tag ends on > instead of "space" > regex101.com/r/kY0bS8/3 I am trying to fix that problem but it is hard for me. (meaning: trying without real understanding). If you would have the time to take a fast look I would be very thankful!
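
On the /gm question raised in the comments: a .NET pattern string has no /.../ delimiters or trailing flags, so a trailing /gm is matched as three literal characters. Finding all matches is what Regex.Matches does, and multiline behaviour is requested with RegexOptions.Multiline. As far as I know, .NET also has no \g<name> subroutine call, so the sketch below swaps in a plain lazy match with the lookahead from the answer, extended by $ (my own addition, to cover a quote at the very end of a line or of the input). The class and method names are hypothetical, and the two-second match timeout is an arbitrary value chosen so that a runaway match throws a RegexMatchTimeoutException instead of hanging.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNetRegexOptionsDemo
{
    // The bare pattern, without "/.../gm" around it.
    // This is a simplified stand-in, not the recursive pattern from the answer.
    private const string Pattern = @"=""(?<value>.*?)""(?=\s|/|>|$)";

    public static string[] FindValues(string input)
    {
        // Multiline replaces the "m" flag; the timeout guards against runaway backtracking.
        var regex = new Regex(Pattern, RegexOptions.Multiline, TimeSpan.FromSeconds(2));

        // Matches replaces the "g" flag: it returns every match, not only the first.
        // A RegexMatchTimeoutException, if any, surfaces while the collection is read.
        MatchCollection matches = regex.Matches(input);
        var values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Groups["value"].Value;
        }
        return values;
    }
}
```

Note that this simplified lookahead still stops too early on values such as count == "1" and blaat > 3, because the quote after 1 is followed by a space.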