Parsing xml string to an xml document fails if the string begins with <?xml... ?> section

Question

I have an XML file beginning like this:

<?xml version="1.0" encoding="utf-8"?>
<Report xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner" xmlns="http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition">
  <DataSources>

When I run the following code:

byte[] fileContent = //gets bytes
string stringContent = Encoding.UTF8.GetString(fileContent);
XDocument xml = XDocument.Parse(stringContent);

I get the following XmlException:

Data at the root level is invalid. Line 1, position 1.

Cutting out the version and encoding declaration fixes the problem. Why? How do I process this XML correctly?

Ian Kemp - SO dead by AI greed · Accepted Answer · 2017-01-03 10:36:14Z

27

My first thought was that the encoding is Unicode (UTF-16) when parsing XML from a .NET string, so the encoding="utf-8" declaration would not match. It seems, though, that XDocument's parsing is quite forgiving in this respect.

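The problem is actually related to the UTF-8 preamble/byte order mark (BOM), which is a three-byte signature optionally present at the start of a UTF-8 stream. These three bytes are a hint as to the encoding being used in the stream. Encoding.UTF8.GetString does not strip them, so the decoded string starts with the character U+FEFF, and that is the invalid data the parser reports at line 1, position 1.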

You can determine the preamble of an encoding by calling the GetPreamble method on an instance of the System.Text.Encoding class. For example:

// returns { 0xEF, 0xBB, 0xBF }
byte[] preamble = Encoding.UTF8.GetPreamble();
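
To see the effect in isolation, here is a minimal sketch. The WithPreamble helper and the sample <Report /> document are made up for illustration; the point is only that the decoded string keeps the BOM as its first character, which is what XDocument.Parse then trips over in the scenario from the question:

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class BomDemo
{
    // Hypothetical helper: builds UTF-8 bytes that start with the preamble,
    // the way many editors and tools save XML files.
    public static byte[] WithPreamble(string xml)
    {
        return Encoding.UTF8.GetPreamble()
            .Concat(Encoding.UTF8.GetBytes(xml))
            .ToArray();
    }

    static void Main()
    {
        byte[] fileContent = WithPreamble(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><Report />");

        // GetString decodes the three preamble bytes into the single char U+FEFF.
        string stringContent = Encoding.UTF8.GetString(fileContent);
        Console.WriteLine(stringContent[0] == '\uFEFF'); // True

        try
        {
            XDocument.Parse(stringContent);
        }
        catch (XmlException ex)
        {
            // Report whatever the parser complains about.
            Console.WriteLine(ex.Message);
        }
    }
}
```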

The preamble should be handled correctly by XmlTextReader, so simply load your XDocument from an XmlTextReader:

XDocument xml;
using (var xmlStream = new MemoryStream(fileContent))
using (var xmlReader = new XmlTextReader(xmlStream))
{
    xml = XDocument.Load(xmlReader);
}
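
As a variation on the snippet above, and only as a sketch: XmlReader.Create is generally recommended over constructing XmlTextReader directly, and it also detects the encoding from the byte stream. The class and method names here (XmlBytes, LoadFromBytes) are my own, not part of any library:

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class XmlBytes
{
    // Parses raw bytes as XML. Because the reader works on the stream,
    // it sees the BOM as an encoding signature rather than as document text.
    public static XDocument LoadFromBytes(byte[] fileContent)
    {
        using (var xmlStream = new MemoryStream(fileContent))
        using (XmlReader xmlReader = XmlReader.Create(xmlStream))
        {
            return XDocument.Load(xmlReader);
        }
    }
}
```

Usage would be XDocument xml = XmlBytes.LoadFromBytes(fileContent); with the same fileContent array as in the question.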

edited Jan 3, 2017 at 10:36

Ian Kemp - SO dead by AI greed

answered Jan 21, 2010 at 18:04

Dave Cluderay

3 Comments

bobince Over a year ago

Note that the UTF-8 ‘pre-amble’ is a Microsoft invention that is not endorsed by any Unicode standard, unlike the normal UTF-16 BOMs. It should never be used on writing, though you will have to handle it on reading as you will often meet the pesky blighter in the wild.

Dave Cluderay Over a year ago

@bobince - I agree (although it is allowed for by the Unicode standard, but its use is discouraged - see page 36 of unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273 for more information).

Dave Cluderay Over a year ago

I've amended the answer - see the last paragraph.

stevehipwell · Accepted Answer · 2010-01-22 15:55:33Z

17

If you only have bytes you could either load the bytes into a stream:

XmlDocument oXML;

using (MemoryStream oStream = new MemoryStream(oBytes))
{
  oXML = new XmlDocument();
  oXML.Load(oStream);
}

Or you could convert the bytes into a string (presuming that you know the encoding) before loading the XML:

string sXml;
XmlDocument oXml;

sXml = Encoding.UTF8.GetString(oBytes);
oXml = new XmlDocument();
oXml.LoadXml(sXml);

I've shown my example as .NET 2.0 compatible. If you're using .NET 3.5, you can use XDocument instead of XmlDocument.

Load the bytes into a stream:

XDocument oXML;

using (MemoryStream oStream = new MemoryStream(oBytes))
using (XmlTextReader oReader = new XmlTextReader(oStream))
{
  oXML = XDocument.Load(oReader);
}

Convert the bytes into a string:

string sXml;
XDocument oXml;

sXml = Encoding.UTF8.GetString(oBytes);
oXml = XDocument.Parse(sXml);
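
If the bytes may start with a BOM, the string route needs the BOM removed before parsing. One way to get that, sketched here under the assumption that the content is UTF-8, is to decode through a StreamReader, which consumes the preamble instead of returning it as a character. BytesToXml and DecodeWithoutBom are illustrative names:

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

static class BytesToXml
{
    // Decodes the bytes to a string; StreamReader skips a leading BOM.
    public static string DecodeWithoutBom(byte[] oBytes)
    {
        using (var oStream = new MemoryStream(oBytes))
        using (var oReader = new StreamReader(oStream, Encoding.UTF8, true))
        {
            return oReader.ReadToEnd();
        }
    }

    // Same string-based approach as above, but BOM-safe.
    public static XDocument ParseBytes(byte[] oBytes)
    {
        return XDocument.Parse(DecodeWithoutBom(oBytes));
    }
}
```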

edited Jan 22, 2010 at 15:55

answered Jan 22, 2010 at 9:12

stevehipwell

3 Comments

agnieszka Over a year ago

the problem is I need to use XDocument

stevehipwell Over a year ago

@agnieszka - I've updated my answer to walk you through how to use the XDocument.

oleksa Over a year ago

The string has to be modified if the original oBytes contains a byte order mark. I had to call sXml = sXml.Substring(1); otherwise the error "Data at the root level is invalid. Line 1, position 1." is thrown by XDocument.Parse. The BOM is not visible, so it can be checked with .WriteLine("first char '{0}'", sXml[0])

Brian Agnew · Accepted Answer · 2010-01-21 18:00:10Z

7

Do you have a byte-order-mark (BOM) at the beginning of your XML, and does it match your encoding? If you chop out your header, you'll also chop out the BOM, and if the BOM was the problem, subsequent parsing may then work.

You may need to inspect your document at the byte level to see the BOM.

answered Jan 21, 2010 at 18:00

Brian Agnew

2 Comments

agnieszka Over a year ago

what is a byte-order-mark...? and how can I find out document's encoding? I just suspect it is utf-8 (read text is readable)

Brian Agnew Over a year ago

See the link I posted. It's a sequence of bytes before the header that acts as a directive to the encoding of the document.

Darin Dimitrov · Accepted Answer · 2010-01-21 18:02:13Z

7

Why bother reading the file as a byte sequence and then converting it to a string when it is an XML file? Just let the framework do the loading for you and cope with the encodings:

var xml = XDocument.Load("test.xml");

answered Jan 21, 2010 at 18:02

Darin Dimitrov

2 Comments

agnieszka Over a year ago

Because I don't get the xml from a path. I just have bytes content

Darin Dimitrov Over a year ago

And where are those bytes coming from? Database, network stream, ...?

Filburt · Accepted Answer · 2016-07-29 09:54:19Z

2

Try this:

int startIndex = xmlString.IndexOf('<');
if (startIndex > 0)
{
    xmlString = xmlString.Remove(0, startIndex);
}
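
The snippet above drops everything before the first '<', which also removes a leading BOM. A narrower variant, shown here only as a sketch with hypothetical names (BomStripper, StripBom), removes just a leading U+FEFF and leaves any other unexpected leading text in place so that it still surfaces as a parse error:

```csharp
using System;

static class BomStripper
{
    // Removes a single leading byte order mark character, if present.
    public static string StripBom(string xmlString)
    {
        if (!string.IsNullOrEmpty(xmlString) && xmlString[0] == '\uFEFF')
        {
            return xmlString.Substring(1);
        }
        return xmlString;
    }

    static void Main()
    {
        Console.WriteLine(StripBom("\uFEFF<Report />")); // <Report />
    }
}
```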

edited Jul 29, 2016 at 9:54

Filburt

answered Jul 9, 2013 at 15:38

eugene.sushilnikov

1 Comment

binki Over a year ago

Would help if you explained that this was to forcefully skip the preamble/BOM.

Alexei - check Codidact · Accepted Answer · 2021-05-27 09:55:40Z

1

I have also encountered this error because the source XML was a string that somehow got some non-printable characters that seemed to break XmlDocument or XDocument parsing. Stripping them fixed the issue:

string sanitized = Regex.Replace(part, @"\p{C}+", string.Empty);
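
Worth knowing before using this: in .NET regular expressions \p{C} covers the whole "Other" category, so it removes the BOM (U+FEFF is a format character) but also tabs, carriage returns and line feeds. The sketch below, with a made-up sample string and helper name, shows what gets stripped; if whitespace inside the document matters, this may be broader than intended:

```csharp
using System;
using System.Text.RegularExpressions;

static class SanitizeDemo
{
    // Removes every character in the Unicode "Other" (C) categories.
    public static string RemoveOtherCategory(string part)
    {
        return Regex.Replace(part, @"\p{C}+", string.Empty);
    }

    static void Main()
    {
        string part = "\uFEFF<a>\r\n\t<b />\r\n</a>";
        // The BOM, the line breaks and the tab are removed; the ordinary space stays.
        Console.WriteLine(RemoveOtherCategory(part)); // <a><b /></a>
    }
}
```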

Credit: C# regex to remove non - printable characters, and control characters, in a text that has a mix of many different languages, unicode letters

answered May 27, 2021 at 9:55

Alexei - check Codidact
