We are hosting PHP apps on a Debian-based LAMP installation. Everything is quite OK – performance, administration and management wise. However, being somewhat new devs (we’re still in high school), we’ve run into some problems with character encoding for Western charsets.
After doing a lot of research I have come to the conclusion that the information online is somewhat confusing. A lot of it claims that Windows-1252 is ANSI and fully ISO-8859-1 compatible.
So anyway, what is the difference between Windows-1252(1/3/4) and ISO-8859-1? And where does ANSI come into this anyway?
What encoding should we use on our Debian servers (and workstations) in order to ensure that clients get all information in the intended way and that we don’t lose any chars on the way?
I’d like to answer this in a more web-like manner, and in order to do so we need a little history. Joel Spolsky has written a very good introductory article on the absolute minimum every dev should know about Unicode character encoding. Bear with me here, because this is going to be a somewhat looong answer. 🙂
For the history I’ll point to some quotes from there: (Thank you very much, Joel! 🙂 )
The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes.
And all was good, assuming you were an English speaker. Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes.” The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.
So now “OEM character sets” were distributed with PCs, and these were still all different and incompatible. And to our contemporary amazement – it was all fine! They didn’t have the Internet back then, and people rarely exchanged files between systems with different locales.
Joel goes on saying:
In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.
And this is how the “Windows code pages” were born, eventually. They were actually “parented” by the DOS code pages. And then Unicode was born! 🙂 UTF-8 is “another system for storing your string of Unicode code points”, in which “every code point from 0-127 is stored in a single byte” – i.e. the same as ASCII. I will not go into any more specifics of Unicode and UTF-8, but you should read up on the BOM, endianness and character encoding in general.
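You can see that ASCII compatibility for yourself. Here is a small sketch (Python used purely for illustration; the same holds in any language):

```python
# UTF-8 stores code points 0-127 in a single byte, identical to ASCII;
# everything above that range takes two or more bytes.
import codecs

assert "A".encode("utf-8") == "A".encode("ascii") == b"A"  # 1 byte, same as ASCII
assert len("é".encode("utf-8")) == 2   # e with acute accent: 2 bytes
assert len("€".encode("utf-8")) == 3   # euro sign: 3 bytes

# The UTF-8 "byte order mark" is just a fixed 3-byte signature:
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
```

So any pure-ASCII file is already a valid UTF-8 file, which is exactly why the switchover was feasible at all.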
@Jukka K. Korpela is “right on the money” saying that most probably you are referring to Windows-1252.
On “the ANSI conspiracy”, Microsoft actually admits the mislabeling in a glossary of terms:
The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is NOT identical with ISO 8859-1. The Windows character set is often called “ANSI character set”, but this is SERIOUSLY MISLEADING. It has NOT been approved by ANSI.
So, “ANSI”, when referring to Windows character sets, is not ANSI-certified! 🙂
As Jukka pointed out (credits go to you for the nice answer):
Windows-1252 differs from ISO Latin 1, also known as ISO-8859-1, as a character encoding: the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (the so-called C1 controls), whereas in Windows-1252 some of the codes there are assigned to printable characters (mostly punctuation), and others are left undefined.
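That 0x80–0x9F difference is easy to observe directly. A quick Python sketch (the specific bytes are from the published code charts):

```python
# 0x93 is the left double quotation mark (U+201C) in Windows-1252,
# but an unprintable C1 control character in ISO-8859-1.
smart_quote = bytes([0x93])
assert smart_quote.decode("cp1252") == "\u201c"     # printable: “
assert smart_quote.decode("iso-8859-1") == "\x93"   # C1 control

# Below 0x80 the two encodings (and ASCII) agree completely:
ascii_bytes = bytes(range(0x20, 0x7F))
assert ascii_bytes.decode("cp1252") == ascii_bytes.decode("iso-8859-1")

# And some Windows-1252 positions really are undefined - they won't decode:
try:
    bytes([0x81]).decode("cp1252")
except UnicodeDecodeError:
    print("0x81 is undefined in Windows-1252")
```

This is exactly why “smart quotes” produced on Windows turn into garbage when a page is mislabeled as ISO-8859-1.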
However my personal opinion and technical understanding is that both Windows-1252 and ISO-8859-1 ARE NOT WEB ENCODINGS! 🙂 So:
- For web pages, please use UTF-8 as the encoding for the content. So store data as UTF-8 and “spit it out” with the HTTP header:

Content-Type: text/html; charset=utf-8
There is also the HTML content-type meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

When browsers encounter this tag, they start over from the beginning of the HTML document so that they can reinterpret it in the declared encoding. This should happen only if there is no ‘Content-Type’ header.
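As a minimal sketch of the “store as UTF-8, declare it in the header” rule, using only Python’s standard library (the handler class and body text are made up for illustration; your PHP stack does the equivalent with header()):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Produce/store the body as UTF-8 bytes...
        body = "<p>Smörgåsbord – served as UTF-8</p>".encode("utf-8")
        self.send_response(200)
        # ...and declare the charset in the Content-Type header, so the
        # browser never has to guess or fall back to the meta tag.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To try it: HTTPServer(("127.0.0.1", 8000), Utf8Handler).serve_forever()
```

The key point is that the bytes on the wire and the charset in the header must actually match; the header is a promise, not a conversion.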
- Use other specific encodings if the users of your system need files generated from it. For example, some Western users may need Excel-generated files or CSVs in Windows-1252. If this is the case, encode the text in that locale, then store it on the file system and serve it as a downloadable file.
- There is another thing to be aware of in the design of HTTP: the content negotiation mechanism is supposed to work like this:
I. The client requests a web page in specific content types and encodings via the ‘Accept’ and ‘Accept-Charset’ request headers.
II. Then the server (or web application) returns the content transcoded to that encoding and character set.
This is NOT THE CASE in most modern web apps. What actually happens is that web applications serve (force on the client) content as UTF-8. And this works because browsers interpret received documents based on the response headers, not on what they actually requested.
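For the download case from the second bullet, a hedged sketch (the file name, delimiter and data are all made up) of transcoding to Windows-1252 only at the storage/serving boundary:

```python
import csv
import io

rows = [["Name", "City"], ["Müller", "Köln"]]

# Build the CSV as a normal Unicode string first...
buf = io.StringIO()
csv.writer(buf, delimiter=";").writerows(rows)

# ...then transcode to Windows-1252 as the very last step, just before
# writing to disk (or streaming it as a download).
data = buf.getvalue().encode("cp1252")
with open("export.csv", "wb") as f:
    f.write(data)
```

Keeping everything Unicode internally and encoding only at the edge means one code path serves both UTF-8 web output and legacy-encoded exports.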
We should all go Unicode, so please, please, please use UTF-8 to distribute your content wherever possible and most of all applicable. Or else the elders of the Internet will haunt you! 🙂