WHY DOES “É” BECOME “é”?

[Origin]: http://www.weblogism.com/item/270/why-does-e-become-a

As I said before, encoding issues are quite common, and yet, they can be very tricky to debug: the reason is that any link in the long chain between the data storage (sql or not) and the client can be the culprit and has to be investigated. I have recently experienced this first hand, and it was tricky enough to be the object of a future post.

In short, the problem was that a PDF document produced by PDFLaTeX in iso-8859-1 was incorrectly forced into UTF-8, therefore corrupting the binary file as a result. The sure sign of this was that single characters were “converted” into 2 or more characters, for example: “é” was displayed as “é”. Anybody who’s worked on non-ASCII projects (probably 98% of the non English-speaking world) has had a similar problem, I’m sure.

But why does “é” become “é”, why that particular sequence:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
é

?

The reason lies in the UTF-8 representation. Characters below or equal to 127 (0x7F) are represented with 1 byte only, and this is equivalent to the ASCII value. Characters below or equal to 2047 are written on two bytes of the form 110yyyyy 10xxxxxx where the scalar representation of the character is: 0000000000yyyyyxxxxxx (see here for more details).

“é” is U+00E9 (LATIN SMALLER LETTER E WITH ACUTE), which in binary representation is: 00000000 11101001. “é” is therefore between 127 and 2027 (233), so it will be coded on 2 bytes. Therefore its UTF-8 representation is 11000011 10101001.

Now let’s imagine that this “é” sits in a document that’s believed to be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters are coded on 8 bits, so the 2-byte character “é” will become 2 1-byte-long latin-1 characters. The first character is 11000011, i.e. C3, which, when checking the table corresponds to “Ô (U+00C3); the second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).

What happens if you convert “é” to UTF-8… again? You get something like “Ã?©” (the second character can vary). Why? Exactly the same reason: “Ô (U+00C3) is represented on 2 bytes, so it becomes 11000011 10000010 (C3 82), and “©” (U+00A9) becomes 11000010 10101001(C2 A9). U+00C3 is, as we saw Ã, U+0082 is BPH (“Break Permitted Here”, which does not represent a graphic character), U+00C2 is Â, and U+00A9 is, as we saw, ©.

Update:

Just a few points to clarify the above, as the use of iconv above may be slightly confusing.

  • The problem is caused when UTF-8 “é” is literally interpreted as latin-1, that is 11000011 10101001 is read as the two 1-byte latin-1 characters é, rather than the 2-byte UTF-8 character é
  • This only happens when UTF-8 is mistakenly taken as latin-1.
  • iconv converts from one character code to another. This means that an UTF-8 “é” becomes an iso-8859-1 “é” when converting from UTF-8 to another. The sequence is therefore converted from 0xC3 0xA9 to 0xE9. Let’s see this:
sebastien@greystones:~$ echo é > /tmp/test.txt
sebastien@greystones:~$ xxd /tmp/test.txt
0000000: c3a9 0a                                  ...
sebastien@greystones:~$ iconv -f utf8 -t iso-8859-1 /tmp/test.txt --output=/tmp/test_1.txt
sebastien@greystones:~$ xxd /tmp/test_1.txt 
0000000: e90a                                     ..
sebastien@greystones:~$ 

In the example in the post:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
é

I know that the character entered on the console is UTF-8, but I ask iconv to consider it as latin-1, and then to convert it to UTF-8 to illustrate the problem.

I hope this clarifies things a bit.

Update: second part of the article here.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s