October 11, 2003
Character sets, leeches and weird IE behavior...
Joel Spolsky has written an excellent piece describing the history of why there are different character sets in the world and why the true way is spelled Unicode.
Not only is it a great summary and introduction to the importance of using Unicode in internationalized software (which software isn't in this age of networking...), Joel also explains the weird behavior that IE shows when encountering pages with undefined character sets:
What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 byte code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn't exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Larry Wall's quote about "be strict in what you emit and liberal in what you accept" is quite frankly not a good engineering principle.
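To get a feel for the idea, here is a toy sketch in Python. It is emphatically not IE's actual algorithm, just an illustration of frequency-based guessing: decode the bytes above 127 with each candidate code page and see which one yields the most "typical" letters. The candidate table, the letter lists and the name `guess_encoding` are all my own assumptions.

```python
# Toy frequency-based encoding guesser. An illustration only,
# not what Internet Explorer really does.

# Hypothetical table: code page -> letters assumed to be common in its language.
CANDIDATES = {
    "cp1251": "оеаинтсрвл",  # Russian
    "cp1253": "αοιετσνηυρ",  # Greek
    "cp1252": "éèàüöäßñçê",  # Western European accented letters
}


def guess_encoding(data: bytes) -> str:
    """Return the candidate code page whose common letters best match data."""
    # Only bytes 128..255 differ between these code pages.
    high = bytes(b for b in data if b >= 128)
    if not high:
        return "ascii"  # nothing to guess from

    best, best_score = "", -1.0
    for encoding, common in CANDIDATES.items():
        # Undefined byte values become U+FFFD instead of raising.
        text = high.decode(encoding, errors="replace")
        # Score: share of decoded characters that are "typical" letters.
        score = sum(ch in common for ch in text) / len(text)
        if score > best_score:
            best, best_score = encoding, score
    return best


if __name__ == "__main__":
    sample = "Привет, как дела сегодня".encode("cp1251")
    print(guess_encoding(sample))
```

Feed it a short text with unusual letter frequencies and it will happily pick the wrong code page, which is exactly the failure Joel describes.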
Here is another valuable nugget of knowledge: always put your content-type declaration first in the head of your HTML page. Why? To make a classic chicken-and-egg problem easier for the browser to handle:
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
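The "read a bit as ASCII, then start over" trick can be sketched in a few lines of Python. This is only my simplified illustration, not how any real browser does it: the 1024-byte scan window, the cp1252 fallback and the names `META_CHARSET` and `decode_html` are assumptions of mine.

```python
import re

# Match charset=... inside a <meta> tag. Working on raw bytes is safe here
# because the tag itself only uses characters in the 32..127 range.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def decode_html(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode an HTML page using the charset declared in its meta tag."""
    # Assumption: only the start of the page is scanned, which is why
    # the meta tag should come first in <head>.
    match = META_CHARSET.search(raw[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        # "Start over": reinterpret the whole page with the declared encoding.
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back rather than fail.
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    page = (
        '<html> <head> <meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"><title>naïve</title>'
    ).encode("utf-8")
    print(decode_html(page))
```

Without the meta tag, the same UTF-8 bytes get decoded with the fallback and the "ï" turns into two junk characters.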
Great reading on a Friday night with a heavy cold. ;)
Posted by manne at October 11, 2003 12:26 AM