DATA AT THE ROOT LEVEL IS INVALID. LINE 1, POSITION 1.

[Originally Posted By]: http://www.ipreferjim.com/2014/09/data-at-the-root-level-is-invalid-line-1-position-1/

Recently, I encountered a really weird problem with an XML document. I was trying to load a document from a string:

var doc = XDocument.parse(someString);

I received this unhelpful exception message:

Data at the root level is invalid. Line 1, position 1.

I verified the XML document and retried two or three times with and without the XML declaration (both of which should work with XDocument). Nothing helped, so I googled for an answer. I found the following answer on StackOverflow by James Brankin:

I eventually figured out there was a byte mark exception and removed it using this code:

string _byteOrderMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
if (xml.StartsWith(_byteOrderMarkUtf8))
{
    xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
}

This solution worked. I was happy. I discussed it with a coworker and he had never heard of a BOM character before, so I thought “I should blog about this”.

Byte-Order Mark

The BOM is the character returned by Encoding.UTF8.GetPreamble(). Microsoft’s documentation explains:

The Unicode byte order mark (BOM) is serialized as follows (in hexadecimal):

  • UTF-8: EF BB BF
  • UTF-16 big endian byte order: FE FF
  • UTF-16 little endian byte order: FF FE
  • UTF-32 big endian byte order: 00 00 FE FF
  • UTF-32 little endian byte order: FF FE 00 00

Converting these bytes to a string (Encoding.UTF8.GetString) allows us to check if the xml string starts with the BOM or not. The code then removes that BOM from the xml string.

A BOM is a bunch of characters, so what? What does it do?

From Wikipedia:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

This explanation is better than the explanation from Microsoft. The BOM is (1) an indicator that a stream of bytes is Unicode and (2) a reference to the endianess of the encoding. UTF8 is agnostic of endianness (reference), so the fact that the BOM is there and causing problems in C# code is annoying. I didn’t research why the UTF8 BOM wasn’t stripped from the string (XML is coming directly from SQL Server).

What is ‘endianness’?

Text is a string of bytes, where one or more bytes represents a single character. When text is transferred from one medium to another (from a flash drive to a hard drive, across the internet, between web services, etc.), it is transferred as stream of bytes. Not all machines understand bytes in the same way, though. Some machines are ‘little-endian’ and some are ‘big-endian’.

Wikipedia explains the etymology of ‘endianness’:

In 1726, Jonathan Swift described in his satirical novel Gulliver’s Travels tensions in Lilliput and Blefuscu: whereas royal edict in Lilliput requires cracking open one’s soft-boiled egg at the small end, inhabitants of the rival kingdom of Blefuscu crack theirs at the big end (giving them the moniker Big-endians).

For text encoding, ‘endianness’ simply means ‘which end goes first into memory’. Think of this as a direction for a set of bytes. The word ‘Example’ can be represented by the following bytes (example taken from StackOverflow):

45 78 61 6d 70 6c 65

‘Big Endian’ means the first bytes go first into memory:

45 78 61 6d 70 6c 65
<-------------------

‘Little Endian’ means the text goes into memory with the small-end first:

45 78 61 6d 70 6c 65
------------------->

So, when ‘Example’ is transferred as ‘Big-Endian’, it looks exactly as the bytes in the above examples:

45 78 61 6d 70 6c 65

But, when it’s transferred in ‘Little Endian’, it looks like this:

65 6c 70 6d 61 78 45

Users of digital technologies don’t need to care about this, as long as they see ‘Example’ where they should see ‘Example’. Many engineers don’t need to worry about endianness because it is abstracted away by many frameworks to the point of only needing to know which type of encoding (UTF8 vs UTF16, for example). If you’re into network communications or dabbling in device programming, you’ll almost definitely need to be aware of endianness.

In fact, the endianness of text isn’t constrained by the system interacting with the text. You can work on a Big Endian operating system and install VoIP software that transmits Little Endian data. Understanding endianness also makes you cool.

Summary

I don’t have any code to accompany this post, but I hope the discussion of BOM and endianness made for a great read!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s