What is the exact difference between Windows-1252(1/3/4) and ISO-8859-1?

Origin: https://stackoverflow.com/questions/19109899/what-is-the-exact-difference-between-windows-12521-3-4-and-iso-8859-1

We are hosting PHP apps on a Debian based LAMP installation. Everything is quite ok – performance, administrative and management wise. However, being somewhat new devs (we’re still in high school), we’ve run into some problems with character encoding for Western charsets.

After doing a lot of research I have come to the conclusion that the information online is somewhat confusing. It talks about Windows-1252 being ANSI and totally ISO-8859-1 compatible.

So anyway, what is the difference between Windows-1252(1/3/4) and ISO-8859-1? And where does ANSI come into this anyway?

What encoding should we use on our Debian servers (and workstations) in order to ensure that clients get all information in the intended way and that we don’t lose any chars on the way?


I’d like to answer this in a more web-like manner, and in order to do so we need a little history. Joel Spolsky has written a very good introductory article on the absolute minimum every dev should know about Unicode and character encoding. Bear with me here because this is going to be somewhat of a looong answer. 🙂

As a history I’ll point to some quotes from there: (Thank you very much Joel! 🙂 )

The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes.

And all was good, assuming you were an English speaker. Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes.” The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

So now “OEM character sets” were distributed with PCs and these were still all different and incompatible. And to our contemporary amazement, it was all fine! They didn’t have the Internet back then and people rarely exchanged files between systems with different locales.

Joel goes on saying:

In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.

And this is how the “Windows Code pages” were born, eventually. They were actually “parented” by the DOS code pages. And then Unicode was born! 🙂 UTF-8 is “another system for storing your string of Unicode code points”, and in it “every code point from 0-127 is stored in a single byte”, the same as ASCII. I will not go into any more specifics of Unicode and UTF-8, but you should read up on the BOM, endianness and character encoding in general.
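A quick way to see that last point (sketched in C#, the language used by the later answers in this compilation): for text containing only code points below 128, ASCII and UTF-8 produce byte-for-byte identical output.

using System;
using System.Text;

class AsciiUtf8Compat
{
    static void Main()
    {
        // Pure ASCII text: every code point is below 128.
        string text = "Hello, world!";

        byte[] asAscii = Encoding.ASCII.GetBytes(text);
        byte[] asUtf8  = Encoding.UTF8.GetBytes(text);

        // Both encodings store each of these characters in a single, identical byte.
        Console.WriteLine(BitConverter.ToString(asAscii)); // 48-65-6C-6C-6F-2C-20-77-6F-72-6C-64-21
        Console.WriteLine(BitConverter.ToString(asUtf8));  // same sequence
    }
}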

@Jukka K. Korpela is “right on the money” in saying that most probably you are referring to Windows-1252.

On “the ANSI conspiracy”, Microsoft actually admits the mislabeling in a glossary of terms:

The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is NOT identical with ISO 8859-1. The Windows character set is often called “ANSI character set”, but this is SERIOUSLY MISLEADING. It has NOT been approved by ANSI.

So “ANSI”, when referring to Windows character sets, is not ANSI-certified! 🙂

As Jukka pointed out (credits go to you for the nice answer):

Windows-1252 differs from ISO Latin 1, also known as ISO-8859-1, as a character encoding in that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (the so-called C1 Controls), whereas in Windows-1252 some of the codes there are assigned to printable characters (mostly punctuation), and others are left undefined.
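To make the difference concrete, here is a small C# sketch that decodes a few bytes from the 0x80–0x9F range with both encodings. It assumes a .NET Core / .NET 5+ environment, where code-page encodings such as Windows-1252 come from the System.Text.Encoding.CodePages package and must be registered first.

using System;
using System.Text;

class Cp1252VersusLatin1
{
    static void Main()
    {
        // Register the code-page encodings (System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1"); // built in
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // 0x80–0x9F is exactly the range where the two encodings differ.
        byte[] bytes = { 0x80, 0x93, 0x94 };

        // ISO-8859-1 maps these bytes to invisible C1 control characters.
        Console.WriteLine(latin1.GetString(bytes));

        // Windows-1252 maps them to printable characters: € “ ”
        Console.WriteLine(cp1252.GetString(bytes));
    }
}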

However my personal opinion and technical understanding is that both Windows-1252 and ISO-8859-1 ARE NOT WEB ENCODINGS! 🙂 So:

  • For web pages please use UTF-8 as the encoding for the content. Store data as UTF-8 and “spit it out” with the HTTP header: Content-Type: text/html; charset=utf-8.

    There is also a thing called the HTML content-type meta-tag: <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    Now, what browsers actually do when they encounter this tag is start over from the beginning of the HTML document and reinterpret it in the declared encoding. This should happen only if there is no ‘Content-Type’ header.

  • Use other specific encodings if the users of your system need files generated from it. For example, some Western users may need Excel-generated files or CSVs in Windows-1252. If this is the case, encode the text in that locale, store it on the file system and serve it as a downloadable file (see the sketch after this list).
  • There is another thing to be aware of in the design of HTTP: the content-negotiation mechanism is supposed to work like this.

    I. The client requests a web page in specific content types and encodings via the ‘Accept’ and ‘Accept-Charset’ request headers.

    II. Then the server (or web application) returns the content transcoded to that encoding and character set.

This is NOT THE CASE in most modern web apps. What actually happens is that web applications serve (force on the client) content as UTF-8. And this works because browsers interpret received documents based on the response headers and not on what they actually asked for.
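As a sketch of the second bullet above (serving locale-specific downloads), the following hypothetical C# snippet writes a CSV file encoded as Windows-1252 so it can then be offered as a download. The file name and contents are made up for illustration, and the code-page registration again assumes .NET Core / .NET 5+.

using System.IO;
using System.Text;

class ExportCsvAsWindows1252
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ so that code page 1252 is available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // Hypothetical data containing Western European accented characters.
        string csv = "Name;City\nJosé;Málaga\n";

        // Store it on the file system in Windows-1252; a download handler can
        // then serve it with Content-Type: text/csv; charset=windows-1252.
        File.WriteAllText("export.csv", csv, cp1252);
    }
}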

We should all go Unicode, so please, please, please use UTF-8 to distribute your content wherever possible and most of all applicable. Or else the elders of the Internet will haunt you! 🙂

P.S. Some more nice articles on using MS Windows characters in Web Pages can be found here and here.


WHY DOES “é” BECOME “é”?

[Origin]: http://www.weblogism.com/item/270/why-does-e-become-a

As I said before, encoding issues are quite common, and yet, they can be very tricky to debug: the reason is that any link in the long chain between the data storage (sql or not) and the client can be the culprit and has to be investigated. I have recently experienced this first hand, and it was tricky enough to be the object of a future post.

In short, the problem was that a PDF document produced by PDFLaTeX in iso-8859-1 was incorrectly forced into UTF-8, therefore corrupting the binary file as a result. The sure sign of this was that single characters were “converted” into 2 or more characters, for example: “é” was displayed as “é”. Anybody who’s worked on non-ASCII projects (probably 98% of the non English-speaking world) has had a similar problem, I’m sure.

But why does “é” become “é”, why that particular sequence:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
é

?

The reason lies in the UTF-8 representation. Characters below or equal to 127 (0x7F) are represented with 1 byte only, and this is equivalent to the ASCII value. Characters below or equal to 2047 are written on two bytes of the form 110yyyyy 10xxxxxx where the scalar representation of the character is: 0000000000yyyyyxxxxxx (see here for more details).

“é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary representation is: 00000000 11101001. “é” (decimal 233) is therefore between 127 and 2047, so it will be coded on 2 bytes. Its UTF-8 representation is therefore 11000011 10101001.

Now let’s imagine that this “é” sits in a document that’s believed to be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters are coded on 8 bits, so the 2-byte character “é” will become two 1-byte-long latin-1 characters. The first byte is 11000011, i.e. C3, which, checking the table, corresponds to “Ã” (U+00C3); the second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).

What happens if you convert “é” to UTF-8… again? You get something like “Ã?Â©” (the second character can vary). Why? Exactly the same reason: “Ã” (U+00C3) is represented on 2 bytes, so it becomes 11000011 10000011 (C3 83), and “©” (U+00A9) becomes 11000010 10101001 (C2 A9). U+00C3 is, as we saw, Ã; U+0083 is NBH (“No Break Here”, which does not represent a graphic character); U+00C2 is Â; and U+00A9 is, as we saw, ©.
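The same round trip can be reproduced in a few lines of C# (used here only because the later answers are in C#; the iconv example above shows the same thing):

using System;
using System.Text;

class MojibakeRoundTrip
{
    static void Main()
    {
        // "é" (U+00E9) encoded as UTF-8 is the two bytes C3 A9.
        byte[] utf8 = Encoding.UTF8.GetBytes("é");
        Console.WriteLine(BitConverter.ToString(utf8));      // C3-A9

        // Decoding those bytes as ISO-8859-1 yields two characters: Ã and ©.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string mojibake = latin1.GetString(utf8);
        Console.WriteLine(mojibake);                          // é

        // Re-encoding the mojibake as UTF-8 doubles the damage: C3 83 C2 A9.
        byte[] doubled = Encoding.UTF8.GetBytes(mojibake);
        Console.WriteLine(BitConverter.ToString(doubled));    // C3-83-C2-A9
    }
}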

Update:

Just a few points to clarify the above, as the use of iconv above may be slightly confusing.

  • The problem is caused when a UTF-8 “é” is literally interpreted as latin-1, that is, 11000011 10101001 is read as the two 1-byte latin-1 characters “é” rather than the one 2-byte UTF-8 character “é”.
  • This only happens when UTF-8 is mistakenly taken as latin-1.
  • iconv converts from one character encoding to another. This means that a UTF-8 “é” becomes an iso-8859-1 “é” when converting from the former to the latter. The sequence is therefore converted from 0xC3 0xA9 to 0xE9. Let’s see this:
sebastien@greystones:~$ echo é > /tmp/test.txt
sebastien@greystones:~$ xxd /tmp/test.txt
0000000: c3a9 0a                                  ...
sebastien@greystones:~$ iconv -f utf8 -t iso-8859-1 /tmp/test.txt --output=/tmp/test_1.txt
sebastien@greystones:~$ xxd /tmp/test_1.txt 
0000000: e90a                                     ..
sebastien@greystones:~$ 

In the example in the post:

sebastien@greystones:~$ iconv -f latin1 -t utf8
é
é

I know that the character entered on the console is UTF-8, but I ask iconv to consider it as latin-1, and then to convert it to UTF-8 to illustrate the problem.

I hope this clarifies things a bit.

Update: second part of the article here.

How to convert byte[] to string?

[Origin]: http://stackoverflow.com/questions/1003275/how-to-convert-byte-to-string

I have a byte[] array that is loaded from a file that I happen to know contains UTF-8. In some debugging code, I need to convert it to a string. Is there a one-liner that will do this?

Under the covers it should be just an allocation and a memcopy, so even if it is not implemented, it should be possible.

string result = System.Text.Encoding.UTF8.GetString(byteArray);
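For context, a complete round trip might look like this (the file name is hypothetical, and the file is assumed to contain UTF-8 text, as in the question):

using System;
using System.IO;
using System.Text;

class BytesToString
{
    static void Main()
    {
        // Hypothetical file that is known to contain UTF-8 encoded text.
        byte[] byteArray = File.ReadAllBytes("input.txt");

        string result = Encoding.UTF8.GetString(byteArray);
        Console.WriteLine(result);
    }
}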

Using .NET how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8

[Origin]: http://stackoverflow.com/questions/2595442/using-net-how-to-convert-iso-8859-1-encoded-text-files-that-contain-latin-1-acc

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?

I have tried to use a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and then using Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.

What step am I missing?


You need to get the proper Encoding object. ASCII is just as it’s named: ASCII, meaning that it only supports 7-bit ASCII characters. If what you want to do is convert files, then this is likely easier than dealing with the byte arrays directly.

using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
                                       Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
                                           outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}

However, if you want to have the byte arrays yourself, it’s easy enough to do with Encoding.Convert.

byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
    Encoding.UTF8, data);

It’s important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.

In the interest of fully exploring the issue, something like this would work:

using (System.IO.FileStream input = new System.IO.FileStream(fileName,
                                    System.IO.FileMode.Open, 
                                    System.IO.FileAccess.Read))
{
    byte[] buffer = new byte[input.Length];

    int readLength = 0;

    while (readLength < buffer.Length) 
        readLength += input.Read(buffer, readLength, buffer.Length - readLength);

    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
                       Encoding.UTF8, buffer);

    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
                                         System.IO.FileMode.Create, 
                                         System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}

In this example, the buffer variable gets filled with the actual data in the file as a byte[], so no conversion is done. Encoding.Convert specifies a source and destination encoding, then stores the converted bytes in the variable named…converted. This is then written to the output file directly.

Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you’re doing, but the latter example should give you more of a hint as to what’s actually going on.


Encoding Problem: Treating UTF-8 Bytes as Windows-1252 or ISO-8859-1

[Origin]: http://www.i18nqa.com/debug/bug-utf-8-latin1.html

Symptom

Instead of an expected character, a sequence of Latin characters is shown, typically starting with à or Â. For example, instead of “è” these characters occur: “è”.

Explanation

A common problem is for characters encoded as UTF-8 to have their individual bytes interpreted as ISO-8859-1 or Windows-1252. For example:

  • A Web page is encoded as UTF-8 characters. The Web server mistakenly declares the charset to be ISO-8859-1 in the HTTP protocol that delivers the page to the browser. The browser will then display each of the UTF-8 bytes in the Web page as Latin-1 characters.
  • A file, such as a Java properties file, that is encoded in UTF-8 is incorrectly converted as it is imported: Java reads it in assuming ISO-8859-1, so each UTF-8 byte is converted as if it were a separate ISO-8859-1 character.

A character such as è (e-grave, U+00E8) consists of two bytes in UTF-8: 0xC3 and 0xA8. If each of these bytes is treated as either an ISO-8859-1 or a Windows-1252 code point, then the displayed characters will be Ã and ¨.

Table 1. Example: treating UTF-8 bytes as Windows-1252 or ISO-8859-1

Character    UTF-8 bytes    Bytes viewed in Latin-1
è            0xC3, 0xA8     Ã, ¨

You can use the Encoding Debug Table to look up any erroneous sequence of Latin characters and find out the UTF-8 character that it corresponds to and that generated it.
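If you already have such mojibake in memory, one way to repair it is to reverse the mistake: re-encode the garbled string with the single-byte encoding it was wrongly decoded as, and then decode those bytes as UTF-8. A minimal C# sketch, assuming the text was mis-decoded as ISO-8859-1:

using System;
using System.Text;

class RepairMojibake
{
    static void Main()
    {
        // "è" is what "è" (U+00E8) looks like after its UTF-8 bytes
        // were mistakenly decoded as ISO-8859-1.
        string broken = "è";

        // Turn the garbled characters back into the original UTF-8 bytes...
        byte[] originalBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(broken);

        // ...and decode them with the encoding they were really written in.
        string repaired = Encoding.UTF8.GetString(originalBytes);

        Console.WriteLine(repaired); // è
    }
}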


Lock, mutex, semaphore… what’s the difference?

[Originally Posted By]: http://stackoverflow.com/questions/2332765/lock-mutex-semaphore-whats-the-difference

A lock allows only one thread to enter the part that’s locked and the lock is not shared with any other processes.

A mutex is the same as a lock but it can be system wide (shared by multiple processes).

A semaphore does the same as a mutex but allows x number of threads to enter.

You also have read/write locks that allow either an unlimited number of readers or one writer at any given time.
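For concreteness, here is a minimal C# sketch of those four primitives; the names and the critical sections are purely illustrative.

using System;
using System.Threading;

class SyncPrimitives
{
    // lock: mutual exclusion within one process; only one thread at a time.
    private static readonly object _gate = new object();

    // Mutex: like a lock, but it can be named and shared across processes.
    private static readonly Mutex _mutex = new Mutex();

    // Semaphore: lets up to N threads in at once (here, 3).
    private static readonly SemaphoreSlim _semaphore = new SemaphoreSlim(3);

    // Reader/writer lock: many concurrent readers, or exactly one writer.
    private static readonly ReaderWriterLockSlim _rwLock = new ReaderWriterLockSlim();

    static void Main()
    {
        lock (_gate) { /* one thread at a time */ }

        _mutex.WaitOne();
        try { /* one thread (or process, if the mutex is named) at a time */ }
        finally { _mutex.ReleaseMutex(); }

        _semaphore.Wait();
        try { /* at most 3 threads here concurrently */ }
        finally { _semaphore.Release(); }

        _rwLock.EnterReadLock();
        try { /* any number of readers, but no writer */ }
        finally { _rwLock.ExitReadLock(); }
    }
}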


UTF8, UTF16, and UTF32

From http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32

UTF-8 has an advantage where ASCII characters are the most prevalent. In that case most characters occupy only one byte each. It is also advantageous that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file.

UTF-16 is better where ASCII is not predominant; it primarily uses 2 bytes per character. UTF-8 will start to use 3 or more bytes for higher-order characters where UTF-16 remains at just 2 bytes most of the time.

UTF-32 will cover all possible characters in 4 bytes each, which makes it pretty bloated; I can’t think of any advantage to using it.


In short:

  • UTF8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
  • UTF16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
  • UTF32: Fixed-width encoding. All code points take 4 bytes. An enormous memory hog, but fast to operate on. Rarely used.

In long: see Wikipedia: UTF-8, UTF-16, and UTF-32
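A quick C# check of those byte counts (the sample characters are arbitrary picks from the ranges listed above):

using System;
using System.Text;

class UtfByteCounts
{
    static void Main()
    {
        // "A" (U+0041), "é" (U+00E9), "中" (U+4E2D) and "😀" (U+1F600).
        string[] samples = { "A", "é", "中", "😀" };

        foreach (string s in samples)
        {
            Console.WriteLine("{0}  UTF-8: {1} bytes  UTF-16: {2} bytes  UTF-32: {3} bytes",
                s,
                Encoding.UTF8.GetByteCount(s),
                Encoding.Unicode.GetByteCount(s),  // UTF-16, little-endian
                Encoding.UTF32.GetByteCount(s));
        }
        // Prints 1/2/4 for "A", 2/2/4 for "é", 3/2/4 for "中" and 4/4/4 for "😀".
    }
}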


Unicode is a standard; you can think of the UTF-x encodings as technical implementations for different practical purposes:

  • UTF-8 – “size optimized“: best suited for Latin-character-based data (or ASCII); it takes only 1 byte per character, but the size grows with symbol variety (in the worst case up to 4 bytes per character under the current standard; the original design allowed up to 6)
  • UTF-16 – “balance“: it takes a minimum of 2 bytes per character, which is enough for the existing set of mainstream languages and gives them a fixed size that eases character handling (but the size is still variable and can grow up to 4 bytes per character)
  • UTF-32 – “performance“: allows the use of simple algorithms as a result of fixed-size characters (4 bytes), but at a memory disadvantage