Does anybody know of a JavaScript function anywhere which takes a string and returns it stripped of invalid XML 1.0 characters?

I'm trying to create valid XML 1.0 from content extracted from a database containing UTF-8 data, but some of the data contains invalid characters, so the XML I create won't validate.

The language used for accessing the data and creating the XML is server-side JavaScript.

  • Define "invalid characters". Commented Mar 12, 2013 at 11:08
  • "The language used…" — Show us some code. What libraries are you using to build the XML? Commented Mar 12, 2013 at 11:08
  • Invalid characters are whatever an XML 1.0 reader, such as Chrome or Firefox, defines as invalid. In a browser they look like ? characters, as in this snippet: when working with the poorest of the poor, 70% are women.� Unless we target these beneficiaries and. The language is JavaScript. I'm not sure the code itself would help, because the XML structure is not the issue. The problem is that some characters in the content are considered "invalid" by XML 1.0. Here is a line of code if that helps: latestPosts += '<body><![CDATA[' + body + ']]></body>' + crlf;. Commented Mar 12, 2013 at 13:47
  • In the above code sample, the variable body contains text. Sometimes the text will have a character which is not allowed by XML 1.0. Commented Mar 12, 2013 at 13:48
  • Build XML using an XML library, not with string concatenation. The library will probably take care of properly encoding those characters. Commented Mar 12, 2013 at 13:56

1 Answer

I found a way to strip out at least those characters that were making the XML 1.0 invalid. It is rather a kludge, the earlier replacements look somewhat redundant with the last line, and I'm sure there must be a better way of doing it. But it works.

I may revisit this when I have more time. If somebody has a better answer, please let me know. Thanks.

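// Remove the specific characters that were breaking the XML in my data.
str = str.replace(/\u00B7/g, '');
str = str.replace(/\u00C2/g, '');
str = str.replace(/\u00A0/g, '');
str = str.replace(/\u00A2/g, '');
str = str.replace(/\u00A3/g, '');
// Keep only CR, U+00B7, printable ASCII and U+00A2-U+00A4; everything else
// (including tab, line feed and all other non-ASCII text) is removed.
str = str.replace(/[^\u000D\u00B7\u0020-\u007E\u00A2-\u00A4]/g, '');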
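A more general alternative is to keep every code point that the XML 1.0 specification allows in its Char production, rather than listing individual characters. The sketch below is not from the original answer: the name stripInvalidXml10Chars is hypothetical, and it assumes a runtime that supports the ES2015 `u` regex flag. It only removes code points that XML 1.0 forbids. It does not repair text that was decoded with the wrong encoding, and U+FFFD (the replacement character) is itself allowed by XML 1.0, so it is left in place.

```javascript
// Hypothetical helper (the name is illustrative, not from the thread).
// Keeps only code points matched by the XML 1.0 "Char" production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Assumes a runtime with ES2015 support for the `u` regex flag.
const INVALID_XML10_CHARS =
  /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

function stripInvalidXml10Chars(str) {
  // With the `u` flag, a well-formed surrogate pair is matched as one
  // code point (and kept), while a lone surrogate is removed.
  return String(str).replace(INVALID_XML10_CHARS, '');
}

// Usage: control characters are dropped, ordinary text (including
// non-ASCII letters) is kept.
const cleaned = stripInvalidXml10Chars('70% are women.\u0000\u000B caf\u00E9');
console.log(cleaned); // prints "70% are women. café"
```

Unlike the replacements above, this keeps tabs, line feeds and non-ASCII text, so it is less likely to damage legitimate content. If the result is placed inside a CDATA section, the sequence ]]> in the text still needs separate handling.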