WHY DOES “é” BECOME “Ã©”?

[Origin]: http://www.weblogism.com/item/270/why-does-e-become-a

As I said before, encoding issues are quite common, and yet they can be very tricky to debug: any link in the long chain between the data storage (SQL or not) and the client can be the culprit and has to be investigated. I recently experienced this first hand, and it was tricky enough to be the subject of a future post.

In short, the problem was that a PDF document produced by PDFLaTeX in iso-8859-1 was incorrectly forced into UTF-8, thereby corrupting the binary file. The sure sign of this was that single characters were “converted” into 2 or more characters, for example: “é” was displayed as “Ã©”. Anybody who’s worked on non-ASCII projects (probably 98% of the non-English-speaking world) has had a similar problem, I’m sure.

But why does “é” become “Ã©”, and why that particular sequence?

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
Ã©

The reason lies in the UTF-8 representation. Characters at or below 127 (0x7F) are represented with 1 byte only, equal to their ASCII value. Characters between 128 and 2047 (0x7FF) are written on two bytes of the form 110yyyyy 10xxxxxx, where the scalar value of the character is 0000000000yyyyyxxxxxx.

“é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary is 00000000 11101001, i.e. 233 in decimal. “é” therefore falls between 128 and 2047, so it is coded on 2 bytes, and its UTF-8 representation is 11000011 10101001 (0xC3 0xA9).
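
For readers who want to check the bit arithmetic, here is a minimal C# sketch (the class and variable names are mine) that prints the UTF-8 bytes of “é” and also rebuilds them by hand from the 110yyyyy 10xxxxxx pattern described above:

using System;
using System.Text;

class Utf8Bytes
{
    static void Main()
    {
        // U+00E9 is 233 in decimal, i.e. 00000000 11101001 in binary.
        char eAcute = '\u00E9';

        // Let the framework encode it as UTF-8 and print the resulting bytes.
        byte[] utf8 = Encoding.UTF8.GetBytes(eAcute.ToString());
        Console.WriteLine(BitConverter.ToString(utf8));        // C3-A9

        // The same two bytes, rebuilt by hand from 110yyyyy 10xxxxxx:
        // yyyyy = 00011 and xxxxxx = 101001 -> 11000011 10101001 = 0xC3 0xA9.
        byte first  = (byte)(0xC0 | (eAcute >> 6));            // 0xC3
        byte second = (byte)(0x80 | (eAcute & 0x3F));          // 0xA9
        Console.WriteLine($"{first:X2}-{second:X2}");          // C3-A9
    }
}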

Now let’s imagine that this “é” sits in a document that’s believed to be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters are coded on 8 bits, so the 2-byte character “é” will be read as two 1-byte latin-1 characters. The first byte is 11000011, i.e. C3, which according to the latin-1 table corresponds to “Ã” (U+00C3); the second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).

What happens if you convert “Ã©” to UTF-8… again? You get something like “Ã?Â©” (the second character is a non-printing control, so its display can vary). Why? Exactly the same reason: “Ã” (U+00C3) is represented on 2 bytes, so it becomes 11000011 10000011 (C3 83), and “©” (U+00A9) becomes 11000010 10101001 (C2 A9). Read as latin-1, C3 is, as we saw, Ã; 83 is U+0083, NBH (“No Break Here”, which does not represent a graphic character); C2 is Â (U+00C2); and A9 is, as we saw, © (U+00A9).
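
Both rounds of corruption can be reproduced with a short C# sketch (class and variable names are mine; Encoding.GetEncoding("iso-8859-1") plays the role of the latin-1 side):

using System;
using System.Text;

class Mojibake
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Round 1: the UTF-8 bytes of "é" (C3 A9) are read as if they were latin-1.
        byte[] utf8OfEAcute = Encoding.UTF8.GetBytes("\u00E9");     // é -> C3 A9
        string round1 = latin1.GetString(utf8OfEAcute);             // "Ã©"
        Console.WriteLine(round1);

        // Round 2: the damaged text is encoded to UTF-8 and misread as latin-1 again.
        byte[] utf8OfRound1 = Encoding.UTF8.GetBytes(round1);       // C3 83 C2 A9
        Console.WriteLine(BitConverter.ToString(utf8OfRound1));
        string round2 = latin1.GetString(utf8OfRound1);             // "Ã" + U+0083 + "Â" + "©"
        Console.WriteLine(round2);
    }
}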

Update:

Just a few points to clarify the above, as the use of iconv may be slightly confusing.

  • The problem is caused when UTF-8 “é” is literally interpreted as latin-1, that is, 11000011 10101001 is read as the two 1-byte latin-1 characters “Ã©” rather than as the single 2-byte UTF-8 character “é”.
  • This only happens when UTF-8 is mistakenly taken as latin-1.
  • iconv converts text from one character encoding to another. This means that a UTF-8 “é” becomes an iso-8859-1 “é” when converting from UTF-8 to iso-8859-1: the byte sequence is converted from 0xC3 0xA9 to 0xE9. Let’s see this:
sebastien@greystones:~$ echo é > /tmp/test.txt
sebastien@greystones:~$ xxd /tmp/test.txt
0000000: c3a9 0a                                  ...
sebastien@greystones:~$ iconv -f utf8 -t iso-8859-1 /tmp/test.txt --output=/tmp/test_1.txt
sebastien@greystones:~$ xxd /tmp/test_1.txt 
0000000: e90a                                     ..
sebastien@greystones:~$ 
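
A rough C# equivalent of that correct iconv conversion might look like the sketch below; the file paths are the ones from the transcript, and the class name is mine:

using System;
using System.IO;
using System.Text;

class IconvLike
{
    static void Main()
    {
        // Equivalent of: iconv -f utf8 -t iso-8859-1 /tmp/test.txt --output=/tmp/test_1.txt
        byte[] utf8Bytes = File.ReadAllBytes("/tmp/test.txt");      // c3 a9 0a
        byte[] latin1Bytes = Encoding.Convert(Encoding.UTF8,
                                              Encoding.GetEncoding("iso-8859-1"),
                                              utf8Bytes);           // e9 0a
        File.WriteAllBytes("/tmp/test_1.txt", latin1Bytes);

        Console.WriteLine(BitConverter.ToString(utf8Bytes));        // C3-A9-0A
        Console.WriteLine(BitConverter.ToString(latin1Bytes));      // E9-0A
    }
}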

In the example in the post:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
Ã©

I know that the character entered on the console is UTF-8, but I ask iconv to consider it as latin-1, and then to convert it to UTF-8 to illustrate the problem.

I hope this clarifies things a bit.

Using .NET, how do I convert ISO-8859-1 encoded text files that contain Latin-1 accented characters to UTF-8?

[Origin]: http://stackoverflow.com/questions/2595442/using-net-how-to-convert-iso-8859-1-encoded-text-files-that-contain-latin-1-acc

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?

I have tried using a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and calling Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.

What step am I missing?

You need to get the proper Encoding object. ASCII is exactly what its name says: ASCII, meaning that it only supports 7-bit ASCII characters. If what you want to do is convert files, then the approach below is likely easier than dealing with the byte arrays directly.

using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
                                       Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
                                           outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}

However, if you want to have the byte arrays yourself, it’s easy enough to do with Encoding.Convert.

byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
    Encoding.UTF8, data);

It’s important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.

In the interest of fully exploring the issue, something like this would work:

// Read the raw bytes of the input file without any text decoding.
using (System.IO.FileStream input = new System.IO.FileStream(fileName,
                                    System.IO.FileMode.Open, 
                                    System.IO.FileAccess.Read))
{
    byte[] buffer = new byte[input.Length];

    int readLength = 0;

    // FileStream.Read may return fewer bytes than requested, so loop until
    // the whole file has been read into the buffer.
    while (readLength < buffer.Length) 
        readLength += input.Read(buffer, readLength, buffer.Length - readLength);

    // Re-encode the ISO-8859-1 bytes as UTF-8.
    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
                       Encoding.UTF8, buffer);

    // Write the converted bytes out verbatim.
    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
                                         System.IO.FileMode.Create, 
                                         System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}

In this example, the buffer variable is filled with the raw bytes of the file, so no decoding happens while reading. Encoding.Convert takes a source and a destination encoding and returns the converted bytes, which are stored in the variable named… converted. These bytes are then written to the output file directly.

Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you’re doing, but the latter example should give you more of a hint as to what’s actually going on.

Encoding Problem: Treating UTF-8 Bytes as Windows-1252 or ISO-8859-1

[Origin]: http://www.i18nqa.com/debug/bug-utf-8-latin1.html

Symptom

Instead of an expected character, a sequence of Latin characters is shown, typically starting with “Ã” or “Â”. For example, instead of “è” these characters occur: “Ã¨”.

Explanation

A common problem is for characters encoded as UTF-8 to have their individual bytes interpreted as ISO-8859-1 or Windows-1252. For example:

  • A Web page is encoded as UTF-8. The Web server mistakenly declares the charset to be ISO-8859-1 in the HTTP headers it sends with the page. The browser then displays each UTF-8 byte of the page as a Latin-1 character.
  • A file such as a Java properties file, which is encoded in UTF-8, is incorrectly converted as it is imported: Java reads it as if it were ISO-8859-1 and converts each byte from ISO-8859-1 to UTF-8, corrupting the multi-byte characters.

A character such as è (e-grave, U+00E8) consists of two bytes in UTF-8: 0xC3 and 0xA8. If each of these bytes is treated as an ISO-8859-1 or Windows-1252 code point, the displayed characters will be Ã and ¨.

Table 1: Example of treating UTF-8 bytes as Windows-1252 or ISO-8859-1

Character    UTF-8 bytes     Bytes viewed as Latin-1
è            0xC3, 0xA8      Ã, ¨

You can use the Encoding Debug Table to look up any erroneous sequence of Latin characters and find out the UTF-8 character that it corresponds to and that generated it.
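
As a minimal C# sketch of this failure mode (class and variable names are mine), encoding “è” as UTF-8 and then decoding the two bytes with a Latin-1 decoder produces exactly the two characters from Table 1:

using System;
using System.Text;

class Latin1Misread
{
    static void Main()
    {
        // "è" (U+00E8) encodes to the two UTF-8 bytes C3 A8.
        byte[] utf8 = Encoding.UTF8.GetBytes("\u00E8");
        Console.WriteLine(BitConverter.ToString(utf8));                      // C3-A8

        // Decoding those bytes with the wrong decoder yields "Ã" followed by "¨".
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8);
        Console.WriteLine(misread);                                          // Ã¨
    }
}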

Why is this “£” appearing in my C# strings?

[Origin]: http://stackoverflow.com/questions/696659/why-is-this-appearing-in-my-c-sharp-strings-%C3%82%C2%A3

The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.

It’s not the same as UTF-8, and it’s not the same as ASCII, but it does fit one byte per character. The range 0 to 127 is the same as ASCII, and the whole range 0 to 255 is the same as the Unicode range U+0000 to U+00FF.

So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.

(In this character set, the UK pound symbol is 163.)
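
As a quick illustration of the two approaches just mentioned (a hypothetical snippet, not part of the original answer), both the cast and the Encoding object yield 163 for the pound sign:

using System;
using System.Text;

class PoundLatin1
{
    static void Main()
    {
        string pound = "\u00A3";   // "£", the UK pound sign

        // Casting each character to a byte works because ISO-8859-1 maps
        // U+0000..U+00FF directly onto the byte values 0..255.
        byte byCast = (byte)pound[0];
        Console.WriteLine(byCast);                                           // 163

        // The same value via an Encoding object.
        byte[] viaEncoding = Encoding.GetEncoding("iso-8859-1").GetBytes(pound);
        Console.WriteLine(viaEncoding[0]);                                   // 163
    }
}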

Background

The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.

And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistible force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.

ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.

The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.

It’s sometimes described as a “hole” in the standards, but it’s not really; it’s just that HTML/HTTP fill in the blank left by the RFC for URLs.

Hence, for example, the advice on this page:

URL encoding of a character consists of a “%” symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.

(ISO-Latin is another name for ISO-8859-1.)
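
A small illustrative C# helper (my own, not part of the quoted advice) that applies this rule literally, percent-encoding every character by its ISO-Latin code point, could look like this:

using System;
using System.Text;

class Latin1PercentEncoder
{
    // Percent-encodes a string using ISO-8859-1 code points, as the quoted rule
    // describes. Purely illustrative: it encodes every character, and it only
    // handles characters up to U+00FF.
    static string Encode(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            sb.AppendFormat("%{0:X2}", (int)c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Encode("\u00A3"));   // %A3, the pound sign used in the test page below
    }
}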

So much for the theory. Paste this into Notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for the UK pound sign.

<HTML>
  <BODY>
    <A href="http://www.google.com/search?q=%a3">Test</A>
  </BODY>
</HTML>

It works in IE, Firefox, Apple Safari, Google Chrome – I don’t have any others available right now.
