Greg's Internet mail
character sets

The original RFC 822 standard didn't define alternate character sets. The only character set available was US-ASCII, which is sufficient for English and maybe one or two other languages. It's not enough for nearly any other language, including most European languages.

MIME changed that. One of the additional MIME headers specifies the character set to use. Unfortunately, a large number of mailers (including all text-based mailers) aren't able to change their character set to match the message. This is definitely an area that needs improvement.
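As an illustration of that header, here is a minimal sketch using Python's standard `email` package. The helper name `build_message`, the subject line and the sample text are made up for this example; the point is only that the character set travels in the `Content-Type` header.

```python
from email.message import EmailMessage


def build_message(body: str, charset: str) -> EmailMessage:
    """Build a text message whose Content-Type header names the charset.

    build_message is a hypothetical helper for this example only.
    """
    msg = EmailMessage()
    msg["Subject"] = "Character set example"
    # set_content() encodes the body and records the charset parameter
    # in the Content-Type header.
    msg.set_content(body, charset=charset)
    return msg


# German sample text with umlauts and sharp s, written with escapes so
# that this source file itself stays plain ASCII.
msg = build_message("Gr\u00fc\u00dfe aus K\u00f6ln", "iso-8859-1")

print(msg["Content-Type"])        # the header a receiving mailer reads
print(msg.get_content_charset())  # just the charset parameter
```

A receiving mailer that honours this header knows how to interpret the bytes of the body; one that ignores it has to guess.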

If you speak a Western European language, though, there is one thing that you can do: choose the ISO 8859-1 character set. It's a superset of US-ASCII, and it also supports other Western European languages. Check your mailer documentation for how to set it. Once you have, you won't have any more problems.
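The superset relationship is easy to check. The following is a small sketch in Python, not part of any mailer; the helper name `fits_charset` and the sample strings are assumptions made up for the example.

```python
def fits_charset(text: str, charset: str = "iso-8859-1") -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# Pure US-ASCII text produces exactly the same bytes in both encodings,
# which is what "superset of US-ASCII" means in practice.
english = "plain English text"
assert english.encode("ascii") == english.encode("iso-8859-1")

# Western European accented letters fit into ISO 8859-1 ...
print(fits_charset("na\u00efve caf\u00e9"))
# ... but Greek text, for example, does not.
print(fits_charset("\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"))
```

The first call prints `True` and the second prints `False`.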

Microsoft's contribution

As elsewhere, Microsoft has added problems in the area of character sets. Many messages claim to use the ISO 8859-1 character set, but in fact use a number of additional characters in a range which ISO 8859 considers to be control characters. As a result, many MUAs and web browsers display them as ?. For example, Microsoft claims:

[Screenshot: broken Microsoft display]

See the text "today?s"? That's obviously not what was intended. But how do we know this is Microsoft's problem, and not Netscape's? Obviously some software can display these characters as Microsoft intended. To be sure, we need to look at the source for this page, which you can do by saving the document and looking at it with an appropriate tool. Here I use Emacs, which does in fact display the character as intended. The inverse video is because the cursor is positioned on the character in order to display the information on the bottom line.

[Screenshot: the page source displayed in Emacs]

The sequences <p><b> at the beginning of the line are HTML; there's nothing wrong with them in this example. The command C-x = (press Control-X, then the = key) displays more details, on the bottom line of the display, about the character on which the cursor is positioned. Here it's the ’ in today’s. We see that it's the character 0x92, which lies in a range that ISO 8859 reserves for control characters. You'll also notice the \ characters at the end of each line, and the way the text wraps around: like many other Microsoft products, the application that produced this document doesn't believe in line breaks. This line is in fact 341 characters long, ending in a blank space and a carriage return character (^M).
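The same inspection can be done without Emacs. The sketch below, in Python, lists every byte in the range 0x80 to 0x9F and shows how byte 0x92 decodes under two character sets. The helper name `find_c1_bytes` and the sample line are made up for the example, and the page above doesn't name Microsoft's character set: treating it as Windows-1252 (Python's `cp1252` codec) is my assumption.

```python
def find_c1_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte in the range 0x80-0x9F.

    ISO 8859 reserves this range for control characters, so such bytes
    should not appear in text that really is ISO 8859-1.
    """
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Hypothetical sample line containing the offending byte 0x92.
sample = b"<p><b>today\x92s news</b></p>"

for offset, value in find_c1_bytes(sample):
    print(f"offset {offset}: 0x{value:02x}")

# Decoded as ISO 8859-1, byte 0x92 becomes the control character U+0092.
print(repr(sample.decode("iso-8859-1")))

# Decoded as Windows-1252 (assumed here), it becomes U+2019, the
# right single quotation mark that was presumably intended.
print(repr(sample.decode("cp1252")))
```

For this sample the loop reports a single hit, 0x92 at offset 11.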

Avoiding ? breakage

Having ? symbols in your text doesn't convince people of your professionalism. What can you do about it? If you're using Microsoft products, you will have difficulties, because there's no obvious way of recognizing the problem. With a UNIX system, you can search for them with a command like grep and replace them with sed, but wouldn't it just be easier to avoid Microsoft altogether?
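If you'd rather script the search and replace step, here is a rough Python sketch of the same idea. The replacement table is an assumption: it takes the offending bytes to be Windows-1252 punctuation and covers only the most common ones, and the names `REPLACEMENTS` and `to_ascii_punctuation` are made up for the example.

```python
# Assumed mapping from Windows-1252 punctuation bytes to plain ASCII.
# It is deliberately incomplete; extend it as needed.
REPLACEMENTS = {
    0x85: b"...",  # horizontal ellipsis
    0x91: b"'",    # left single quotation mark
    0x92: b"'",    # right single quotation mark (apostrophe)
    0x93: b'"',    # left double quotation mark
    0x94: b'"',    # right double quotation mark
    0x96: b"-",    # en dash
    0x97: b"--",   # em dash
}


def to_ascii_punctuation(data: bytes) -> bytes:
    """Replace the bytes listed in REPLACEMENTS, leave all others alone."""
    out = bytearray()
    for b in data:
        out += REPLACEMENTS.get(b, bytes([b]))
    return bytes(out)


print(to_ascii_punctuation(b"today\x92s \x93news\x94"))
```

Bytes that aren't in the table pass through unchanged, so text that is genuinely ISO 8859-1 isn't damaged.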



$Id: email-charset.php,v 1.5 2009/01/23 02:40:35 grog Exp $