WHY DOES “é” BECOME “Ã©”?

[Origin]: http://www.weblogism.com/item/270/why-does-e-become-a

As I said before, encoding issues are quite common, and yet they can be very tricky to debug: any link in the long chain between the data storage (SQL or not) and the client can be the culprit and has to be investigated. I recently experienced this first hand, and it was tricky enough to be the subject of a future post.

In short, the problem was that a PDF document produced by PDFLaTeX in iso-8859-1 was incorrectly forced into UTF-8, thereby corrupting the binary file. The sure sign of this was that single characters were “converted” into 2 or more characters, for example: “é” was displayed as “Ã©”. Anybody who’s worked on non-ASCII projects (probably 98% of the non-English-speaking world) has had a similar problem, I’m sure.

But why does “é” become “Ã©”, and why that particular sequence?

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
Ã©

The reason lies in the UTF-8 representation. Characters at or below 127 (0x7F) are represented with 1 byte only, equal to their ASCII value. Characters between 128 and 2047 (0x7FF) are written on two bytes of the form 110yyyyy 10xxxxxx, where the scalar value of the character is 0000000000yyyyyxxxxxx.

“é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary is 00000000 11101001, i.e. 233 in decimal. “é” therefore falls between 128 and 2047, so it is coded on 2 bytes, and its UTF-8 representation is 11000011 10101001 (0xC3 0xA9).
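
For readers who want to check the bit arithmetic, here is a minimal C# sketch (the class and variable names are mine) that prints the UTF-8 bytes of “é” and also rebuilds them by hand from the 110yyyyy 10xxxxxx pattern described above:

using System;
using System.Text;

class Utf8Bytes
{
    static void Main()
    {
        // U+00E9 is 233 in decimal, i.e. 00000000 11101001 in binary.
        char eAcute = '\u00E9';

        // Let the framework encode it as UTF-8 and print the resulting bytes.
        byte[] utf8 = Encoding.UTF8.GetBytes(eAcute.ToString());
        Console.WriteLine(BitConverter.ToString(utf8));        // C3-A9

        // The same two bytes, rebuilt by hand from 110yyyyy 10xxxxxx:
        // yyyyy = 00011 and xxxxxx = 101001 -> 11000011 10101001 = 0xC3 0xA9.
        byte first  = (byte)(0xC0 | (eAcute >> 6));            // 0xC3
        byte second = (byte)(0x80 | (eAcute & 0x3F));          // 0xA9
        Console.WriteLine($"{first:X2}-{second:X2}");          // C3-A9
    }
}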

Now let’s imagine that this “é” sits in a document that’s believed to be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters are coded on 8 bits, so the 2-byte character “é” will be read as two 1-byte latin-1 characters. The first byte is 11000011, i.e. C3, which according to the latin-1 table corresponds to “Ã” (U+00C3); the second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).

What happens if you convert “Ã©” to UTF-8… again? You get something like “Ã?Â©” (the second character is a non-printing control, so its display can vary). Why? Exactly the same reason: “Ã” (U+00C3) is represented on 2 bytes, so it becomes 11000011 10000011 (C3 83), and “©” (U+00A9) becomes 11000010 10101001 (C2 A9). Read as latin-1, C3 is, as we saw, Ã; 83 is U+0083, NBH (“No Break Here”, which does not represent a graphic character); C2 is Â (U+00C2); and A9 is, as we saw, © (U+00A9).
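
Both rounds of corruption can be reproduced with a short C# sketch (class and variable names are mine; Encoding.GetEncoding("iso-8859-1") plays the role of the latin-1 side):

using System;
using System.Text;

class Mojibake
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Round 1: the UTF-8 bytes of "é" (C3 A9) are read as if they were latin-1.
        byte[] utf8OfEAcute = Encoding.UTF8.GetBytes("\u00E9");     // é -> C3 A9
        string round1 = latin1.GetString(utf8OfEAcute);             // "Ã©"
        Console.WriteLine(round1);

        // Round 2: the damaged text is encoded to UTF-8 and misread as latin-1 again.
        byte[] utf8OfRound1 = Encoding.UTF8.GetBytes(round1);       // C3 83 C2 A9
        Console.WriteLine(BitConverter.ToString(utf8OfRound1));
        string round2 = latin1.GetString(utf8OfRound1);             // "Ã" + U+0083 + "Â" + "©"
        Console.WriteLine(round2);
    }
}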

Update:

Just a few points to clarify the above, as the use of iconv may be slightly confusing.

  • The problem is caused when UTF-8 “é” is literally interpreted as latin-1, that is, 11000011 10101001 is read as the two 1-byte latin-1 characters “Ã©” rather than as the single 2-byte UTF-8 character “é”.
  • This only happens when UTF-8 is mistakenly taken as latin-1.
  • iconv converts text from one character encoding to another. This means that a UTF-8 “é” becomes an iso-8859-1 “é” when converting from UTF-8 to iso-8859-1: the byte sequence is converted from 0xC3 0xA9 to 0xE9. Let’s see this:
sebastien@greystones:~$ echo é > /tmp/test.txt
sebastien@greystones:~$ xxd /tmp/test.txt
0000000: c3a9 0a                                  ...
sebastien@greystones:~$ iconv -f utf8 -t iso-8859-1 /tmp/test.txt --output=/tmp/test_1.txt
sebastien@greystones:~$ xxd /tmp/test_1.txt 
0000000: e90a                                     ..
sebastien@greystones:~$ 
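
A rough C# equivalent of that correct iconv conversion might look like the sketch below; the file paths are the ones from the transcript, and the class name is mine:

using System;
using System.IO;
using System.Text;

class IconvLike
{
    static void Main()
    {
        // Equivalent of: iconv -f utf8 -t iso-8859-1 /tmp/test.txt --output=/tmp/test_1.txt
        byte[] utf8Bytes = File.ReadAllBytes("/tmp/test.txt");      // c3 a9 0a
        byte[] latin1Bytes = Encoding.Convert(Encoding.UTF8,
                                              Encoding.GetEncoding("iso-8859-1"),
                                              utf8Bytes);           // e9 0a
        File.WriteAllBytes("/tmp/test_1.txt", latin1Bytes);

        Console.WriteLine(BitConverter.ToString(utf8Bytes));        // C3-A9-0A
        Console.WriteLine(BitConverter.ToString(latin1Bytes));      // E9-0A
    }
}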

In the example in the post:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
Ã©

I know that the character entered on the console is UTF-8, but I ask iconv to consider it as latin-1, and then to convert it to UTF-8 to illustrate the problem.

I hope this clarifies things a bit.

Using .NET, how do I convert ISO-8859-1 encoded text files that contain Latin-1 accented characters to UTF-8?

[Origin]: http://stackoverflow.com/questions/2595442/using-net-how-to-convert-iso-8859-1-encoded-text-files-that-contain-latin-1-acc

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?

I have tried using a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and calling Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.

What step am I missing?

You need to get the proper Encoding object. ASCII is exactly what its name says: ASCII, meaning that it only supports 7-bit ASCII characters. If what you want to do is convert files, then the approach below is likely easier than dealing with the byte arrays directly.

using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
                                       Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
                                           outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}

However, if you want to have the byte arrays yourself, it’s easy enough to do with Encoding.Convert.

byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
    Encoding.UTF8, data);

It’s important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.

In the interest of fully exploring the issue, something like this would work:

// Read the raw bytes of the input file without any text decoding.
using (System.IO.FileStream input = new System.IO.FileStream(fileName,
                                    System.IO.FileMode.Open, 
                                    System.IO.FileAccess.Read))
{
    byte[] buffer = new byte[input.Length];

    int readLength = 0;

    // FileStream.Read may return fewer bytes than requested, so loop until
    // the whole file has been read into the buffer.
    while (readLength < buffer.Length) 
        readLength += input.Read(buffer, readLength, buffer.Length - readLength);

    // Re-encode the ISO-8859-1 bytes as UTF-8.
    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
                       Encoding.UTF8, buffer);

    // Write the converted bytes out verbatim.
    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
                                         System.IO.FileMode.Create, 
                                         System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}

In this example, the buffer variable is filled with the raw bytes of the file, so no decoding happens while reading. Encoding.Convert takes a source and a destination encoding and returns the converted bytes, which are stored in the variable named… converted. These bytes are then written to the output file directly.

Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you’re doing, but the latter example should give you more of a hint as to what’s actually going on.

Encoding Problem: Treating UTF-8 Bytes as Windows-1252 or ISO-8859-1

[Origin]: http://www.i18nqa.com/debug/bug-utf-8-latin1.html

Symptom

Instead of an expected character, a sequence of Latin characters is shown, typically starting with “Ã” or “Â”. For example, instead of “è” these characters occur: “Ã¨”.

Explanation

A common problem is for characters encoded as UTF-8 to have their individual bytes interpreted as ISO-8859-1 or Windows-1252. For example:

  • A Web page is encoded as UTF-8. The Web server mistakenly declares the charset to be ISO-8859-1 in the HTTP headers it sends with the page. The browser then displays each UTF-8 byte of the page as a Latin-1 character.
  • A file such as a Java properties file, which is encoded in UTF-8, is incorrectly converted as it is imported: Java reads it as if it were ISO-8859-1 and converts each byte from ISO-8859-1 to UTF-8, corrupting the multi-byte characters.

A character such as è (e-grave, U+00E8) consists of two bytes in UTF-8: 0xC3 and 0xA8. If each of these bytes is treated as an ISO-8859-1 or Windows-1252 code point, the displayed characters will be Ã and ¨.

Table 1: Example of treating UTF-8 bytes as Windows-1252 or ISO-8859-1

Character    UTF-8 bytes     Bytes viewed as Latin-1
è            0xC3, 0xA8      Ã, ¨

You can use the Encoding Debug Table to look up any erroneous sequence of Latin characters and find out the UTF-8 character that it corresponds to and that generated it.
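
As a minimal C# sketch of this failure mode (class and variable names are mine), encoding “è” as UTF-8 and then decoding the two bytes with a Latin-1 decoder produces exactly the two characters from Table 1:

using System;
using System.Text;

class Latin1Misread
{
    static void Main()
    {
        // "è" (U+00E8) encodes to the two UTF-8 bytes C3 A8.
        byte[] utf8 = Encoding.UTF8.GetBytes("\u00E8");
        Console.WriteLine(BitConverter.ToString(utf8));                      // C3-A8

        // Decoding those bytes with the wrong decoder yields "Ã" followed by "¨".
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8);
        Console.WriteLine(misread);                                          // Ã¨
    }
}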

Why is this “£” appearing in my C# strings?

[Origin]: http://stackoverflow.com/questions/696659/why-is-this-appearing-in-my-c-sharp-strings-%C3%82%C2%A3

The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.

It’s not the same as UTF-8, and it’s not the same as ASCII, but it does fit one byte per character. The range 0 to 127 is the same as ASCII, and the whole range 0 to 255 is the same as the Unicode range U+0000 to U+00FF.

So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.

(In this character set, the UK pound symbol is 163.)
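
As a quick illustration of the two approaches just mentioned (a hypothetical snippet, not part of the original answer), both the cast and the Encoding object yield 163 for the pound sign:

using System;
using System.Text;

class PoundLatin1
{
    static void Main()
    {
        string pound = "\u00A3";   // "£", the UK pound sign

        // Casting each character to a byte works because ISO-8859-1 maps
        // U+0000..U+00FF directly onto the byte values 0..255.
        byte byCast = (byte)pound[0];
        Console.WriteLine(byCast);                                           // 163

        // The same value via an Encoding object.
        byte[] viaEncoding = Encoding.GetEncoding("iso-8859-1").GetBytes(pound);
        Console.WriteLine(viaEncoding[0]);                                   // 163
    }
}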

Background

The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.

And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistible force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.

ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.

The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.

It’s sometimes described as a “hole” in the standards, but it’s not really; it’s just that HTML/HTTP fill in the blank left by the RFC for URLs.

Hence, for example, the advice on this page:

URL encoding of a character consists of a “%” symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.

(ISO-Latin is another name for ISO-8859-1.)
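
A small illustrative C# helper (my own, not part of the quoted advice) that applies this rule literally, percent-encoding every character by its ISO-Latin code point, could look like this:

using System;
using System.Text;

class Latin1PercentEncoder
{
    // Percent-encodes a string using ISO-8859-1 code points, as the quoted rule
    // describes. Purely illustrative: it encodes every character, and it only
    // handles characters up to U+00FF.
    static string Encode(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            sb.AppendFormat("%{0:X2}", (int)c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Encode("\u00A3"));   // %A3, the pound sign used in the test page below
    }
}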

So much for the theory. Paste this into Notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for the UK pound sign.

<HTML>
  <BODY>
    <A href="http://www.google.com/search?q=%a3">Test</A>
  </BODY>
</HTML>

It works in IE, Firefox, Apple Safari, Google Chrome – I don’t have any others available right now.
