Charsets

Goals

Concepts

Library

Lesson

Storing and working with written information should be easy. It's just working with a lot of letters, right? But it's not easy, and there are many details and complications to handling the symbols that make up human languages. And most developers get it wrong.

First of all, when we deal with text we are dealing with more than letters; we are also dealing with other symbols, such as punctuation. And we must take care to handle letters in other languages, of which there are a great many. But how do we represent them all? Unfortunately, over the years different groups of people have taken different approaches to representing these symbols, usually resorting to simplistic ways of storing them in files. We have to untangle the whole mess, or we'll end up with yet another program that can display Hello World! but has trouble handling even English words borrowed from other languages.

To illustrate a bit of the complication, consider a file containing the following bytes:

0xEF 0xBB 0xBF 0x74 0x6F 0x75 0x63 0x68 0xC3 0xA9

What do these bytes mean? Maybe the numbers represent some letters. But which letters? And what numbers are we dealing with, anyway—are these eight-bit numbers, or are these 16-bit numbers (with each number taking two bytes)?

Characters

To start attacking the problem, we have to first determine what we're dealing with. We're going to call the basic unit of text a character. This could be the letter 'A' or the asterisk * character. Even the spaces between these words could be considered characters.

Character Sets

A character set is simply an identified group of characters. If we assign some code or number to each character, then we can represent them on a computer; the set of characters then becomes a coded character set. But what codes should we assign to the characters? As you might have guessed, others have made proposals on what codes to use. Here are a few interesting ones through history.

ASCII

One of the most famous coded character sets is the American Standard Code for Information Interchange, or ASCII (pronounced ass-kee). It was made in America, and it's as if no one at the time thought anyone other than Americans would ever use computers: it only supports the 26 unaccented letters of the English alphabet, the decimal digits, and some punctuation. In all it maps out 128 codes (ending at 0x7F) to represent characters, some of which are control characters representing non-displayable actions (such as BS, a backspace).

ASCII codes.
row+col 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
0x00 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
0x10 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
0x20 SP ! " # $ % & ' ( ) * + , - . /
0x30 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x40 @ A B C D E F G H I J K L M N O
0x50 P Q R S T U V W X Y Z [ \ ] ^ _
0x60 ` a b c d e f g h i j k l m n o
0x70 p q r s t u v w x y z { | } ~ DEL

The good news is that with 128 combinations, ASCII only uses seven bits of information, which can fit in a single byte. The bad news is that with only 128 combinations, what it can show is extremely limited. There is no c-cedilla ç character, for example, so we can't even represent the word "façade".
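To see this limitation in practice, here is a small sketch (the class name is illustrative) showing what happens when Java's US-ASCII encoder meets a character the charset cannot represent: the encoder substitutes its replacement byte, a question mark.

```java
import java.nio.charset.StandardCharsets;

public class AsciiLimits {
    // Encode a string to US-ASCII and decode it back. Characters with no
    // ASCII code, such as 'ç', are replaced by the encoder's default
    // substitute, '?' (0x3F).
    public static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("façade")); // prints "fa?ade"
    }
}
```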

The official name given by the Internet Assigned Numbers Authority (IANA) is US-ASCII.

ISO-8859-1

The International Organization for Standardization (ISO) came up with a larger set of codes called ISO/IEC 8859-1, which IANA officially refers to as ISO-8859-1. This set comprises 256 different codes, each of which takes up eight bits instead of ASCII's seven, but each of which still fits in a single byte.
ISO-8859-1 codes.
row+col 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
0x00
0x10
0x20 SP ! " # $ % & ' ( ) * + , - . /
0x30 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x40 @ A B C D E F G H I J K L M N O
0x50 P Q R S T U V W X Y Z [ \ ] ^ _
0x60 ` a b c d e f g h i j k l m n o
0x70 p q r s t u v w x y z { | } ~
0x80
0x90
0xA0 NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
0xB0 ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
0xC0 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
0xD0 Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
0xE0 à á â ã ä å æ ç è é ê ë ì í î ï
0xF0 ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Windows-1252

The Microsoft Windows operating system, depending on the country for which it is installed, uses a code page (essentially a coded character set) to represent characters. In the United States Windows uses CP-1252, or simply Windows-1252, a set of 256 codes that is almost exactly the same as ISO-8859-1. Almost, but not quite: Windows-1252 assigns printable characters to the 0x80–0x9F range, which ISO-8859-1 leaves for control codes. Note, for example, the euro € character at code 0x80, which does not exist in ISO-8859-1.

Windows-1252 codes, with differences with ASCII and ISO-8859-1 highlighted.
row+col 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
0x00 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
0x10 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
0x20 SP ! " # $ % & ' ( ) * + , - . /
0x30 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x40 @ A B C D E F G H I J K L M N O
0x50 P Q R S T U V W X Y Z [ \ ] ^ _
0x60 ` a b c d e f g h i j k l m n o
0x70 p q r s t u v w x y z { | } ~ DEL
0x80 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
0x90 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
0xA0 NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
0xB0 ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
0xC0 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
0xD0 Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
0xE0 à á â ã ä å æ ç è é ê ë ì í î ï
0xF0 ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Unicode

For years ISO has been working with the Unicode Consortium to produce ISO/IEC 10646, the Universal Coded Character Set (UCS). This character set is meant to assign codes to all known characters used by all human languages. The Unicode Consortium produces The Unicode Standard, often referred to as simply Unicode, which contains the same code mappings as ISO/IEC 10646, with the addition of rules about how to use and manipulate those codes in applications.

Each code in the UCS is referred to as a code point. Unicode code points range from 0x00 to 0x10FFFF and are divided into 17 different Unicode planes numbered 0–16. The most common plane, covering code points 0x0000–0xFFFF, is called the Basic Multilingual Plane (BMP). Code points in the BMP are frequently represented as U+XXXX, where XXXX is the hexadecimal representation of the code point.

Each plane is subdivided into ranges of related code points called Unicode blocks. These include Basic Latin (U+0000–U+007F), coding the ASCII characters; Devanagari (U+0900–U+097F), coding characters used in Hindi; Hiragana (U+3040–U+309F), coding characters for one component of the Japanese writing system; and CJK Unified Ideographs (U+4E00–U+9FFF), a very large block containing a unified coding of the characters used by Chinese, Japanese, and Korean.

The Unicode Standard not only defines which characters are assigned which code points, it also provides a database of extensive properties about the identified characters. Each character has an official Unicode name, as well as a General Category. Just a few examples of categories include Ll (Letter, lowercase) for lowercase letters, Nd (Number, decimal digit) for numeric digits in various languages, Pd (Punctuation, dash) for different types of dashes with different meanings, Sm (Symbol, math) for mathematical symbols such as operators, and Zs (Separator, space) for many types of spaces such as a nonbreaking space. Other Unicode character properties indicate such things as how a character maps to uppercase or lowercase and which script it belongs to.
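Java exposes much of this character database through static methods on the Character class. As a quick sketch, we can look up the official Unicode name and General Category of the é character (the class name below is illustrative):

```java
public class CharProperties {
    public static void main(String[] args) {
        int codePoint = 0x00E9; // é, LATIN SMALL LETTER E WITH ACUTE

        // The official Unicode name from the character database.
        System.out.println(Character.getName(codePoint));

        // The General Category: é is Ll (Letter, lowercase).
        System.out.println(Character.getType(codePoint) == Character.LOWERCASE_LETTER);

        // A space is Zs (Separator, space).
        System.out.println(Character.getType(' ') == Character.SPACE_SEPARATOR);
    }
}
```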

Character Encodings

Many computer systems have either fully embraced Unicode or are in the process of switching. Microsoft Windows uses Unicode, and Java uses Unicode exclusively to represent characters and strings. So choosing a coded character set for a modern application is simple: use Unicode.

Now that you've determined which codes represent which characters, if you want to store those codes somewhere (such as by reading them from a java.io.InputStream or writing them to a java.io.OutputStream), you'll need some way to convert those codes into individual bytes. The approach used to encode character codes as a byte stream is called a character encoding, and there are several of those as well.

One Byte per Character Code

One byte per character encoding of touché using the ISO-8859-1 character set.
0x74 0x6F 0x75 0x63 0x68 0xE9
t o u c h é

The simplest character encoding is to place each code into a single byte. With ASCII this is simple, as we need only seven bits to store each ASCII code. Even ISO-8859-1 never uses more than eight bits for any character code. We could therefore encode the word touché, using ISO-8859-1, with the one-byte-per-code encoding shown in the figure.
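Java can produce exactly this encoding for us. A minimal sketch (the class name is illustrative) using the standard ISO-8859-1 charset:

```java
import java.nio.charset.StandardCharsets;

public class OneBytePerCode {
    // Encode a string with ISO-8859-1, which maps each character
    // code to exactly one byte.
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        for (byte b : encode("touché")) {
            System.out.printf("0x%02X ", b);
        }
        // prints: 0x74 0x6F 0x75 0x63 0x68 0xE9
    }
}
```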

Two Bytes per Character Code

The Universal Coded Character Set used by Unicode contains many, many more codes than can be represented by a single byte. We could therefore decide to use two bytes to represent each code. With this approach we are essentially using 16 bits to represent each code—the equivalent of a Java short type. Stored with the high-order bits first, our encoding of the word touché using two bytes to encode each Unicode code point would look like this:

Two bytes per character big-endian encoding of touché using the Universal Coded Character Set of Unicode.
0x00 0x74 0x00 0x6F 0x00 0x75 0x00 0x63 0x00 0x68 0x00 0xE9
t o u c h é

Using two bytes to represent each code brings up another question: which of those two bytes should be placed first in the stream? The example above shows, for each pair of bytes, the first byte in the stream as the one containing the high-order bits (which in this case are all 0x00 because the code values are low). This may seem logical; if we write the decimal number 555, for example, the higher-order digits (such as the one representing 500) come before the lower-order digits (such as the one representing 50). But some platforms have traditionally placed the byte containing the low-order bits first in memory.

Two bytes per character little-endian encoding of touché using the Universal Coded Character Set of Unicode.
0x74 0x00 0x6F 0x00 0x75 0x00 0x63 0x00 0x68 0x00 0xE9 0x00
t o u c h é

We call these two approaches the byte order, and we have two names for them based upon which end of the value comes first. If the big part of the value comes first, we call it big-endian byte order; if the little part of the value comes first, we call it little-endian byte order. The byte order is therefore sometimes referred to as endianness.
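Java's ByteBuffer makes the byte-order distinction concrete. This sketch (the class name is illustrative) writes the 16-bit code for é under each byte order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Endianness {
    // Write a single char (a 16-bit value in Java) into two bytes
    // using the requested byte order.
    public static byte[] twoBytes(char c, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putChar(c).array();
    }

    public static void main(String[] args) {
        byte[] be = twoBytes('\u00E9', ByteOrder.BIG_ENDIAN);    // 0x00 0xE9
        byte[] le = twoBytes('\u00E9', ByteOrder.LITTLE_ENDIAN); // 0xE9 0x00
        System.out.printf("BE: 0x%02X 0x%02X%n", be[0], be[1]);
        System.out.printf("LE: 0x%02X 0x%02X%n", le[0], le[1]);
    }
}
```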

UTF-8

So far using two bytes to encode each character code works fine, but you might have noticed a problem: we've doubled the amount of storage space we need! For English and even general Latin-based alphabets, most of the time we don't need more than a single byte for each character code. It's only in those rare instances in which we need to represent non-ASCII characters that we are forced to use more than one byte.

The UTF-8 encoding was invented to solve this problem. It follows a slightly more complicated set of rules, but supports all Unicode code points:

  1. If the code point is less than 0x80 (i.e. if it is a US-ASCII character code), use one byte to represent the character.
  2. If the code point is 0x80 or above (including ISO-8859-1 codes above the ASCII range), use two, three, or more bytes to encode the code point using the UTF-8 algorithm.
UTF-8 encoding of touché using Universal Coded Character Set of Unicode.
0x74 0x6F 0x75 0x63 0x68 0xC3 0xA9
t o u c h é

Thus the character A (U+0041) would be encoded as the single byte 0x41, while the character é (U+00E9) would be encoded as the two bytes 0xC3 and 0xA9. The Hindi letter म (U+092E) would be encoded as three bytes: 0xE0 0xA4 0xAE.

Revisiting the word touché, still representing the characters in Unicode but encoding them in UTF-8 produces the series of bytes in the figure on the side. The figure below provides more in-depth examples of the distribution of the bits of several code points across their multi-byte encoding in UTF-8.

Detailed UTF-8 encoding of several example characters (Wikipedia).
Character Code Point Binary code point Binary UTF-8 Hexadecimal UTF-8
$ U+0024 010 0100 00100100 0x24
¢ U+00A2 000 1010 0010 11000010 10100010 0xC2 0xA2
€ U+20AC 0010 0000 1010 1100 11100010 10000010 10101100 0xE2 0x82 0xAC
𐍈 U+10348 0 0001 0000 0011 0100 1000 11110000 10010000 10001101 10001000 0xF0 0x90 0x8D 0x88
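The two-byte case of the UTF-8 algorithm can be written out directly: the high bits of the code point go into a 110xxxxx lead byte, and the low six bits into a 10xxxxxx continuation byte. A minimal sketch of just this case (the class and method names are illustrative; a full encoder would also handle the one-, three-, and four-byte cases):

```java
public class Utf8TwoByte {
    // Encode a code point in the range 0x80–0x7FF as two UTF-8 bytes.
    public static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("code point needs one or 3+ bytes");
        }
        return new byte[] {
            (byte) (0xC0 | (codePoint >> 6)),   // lead byte: 110 + high 5 bits
            (byte) (0x80 | (codePoint & 0x3F))  // continuation byte: 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] bytes = encodeTwoByte(0x00E9); // é
        System.out.printf("0x%02X 0x%02X%n", bytes[0], bytes[1]); // 0xC3 0xA9
    }
}
```

This agrees with the table above: 0xC0 | (0xE9 >> 6) is 0xC3, and 0x80 | (0xE9 & 0x3F) is 0xA9.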

UTF-16

UTF-8 is a very efficient encoding scheme for text that is likely to be made up primarily of English and Latin words. But if you know that much of your text will use high Unicode code point values, you might as well use two bytes for each character. The UTF-16 encoding is very similar to the two-bytes-per-character-code encoding above, except that it is also a variable-length encoding scheme that uses more than two bytes in some situations.

Like the two-bytes-per-code encoding above, UTF-16 uses at least two bytes to represent each code; code points outside the BMP are encoded as a pair of two-byte values called a surrogate pair. UTF-16 similarly has two possible byte orders. Big-endian UTF-16 is referred to as UTF-16BE, and little-endian UTF-16 is referred to as UTF-16LE.
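We can observe the variable length directly, since Java strings are stored as UTF-16 code units internally. In this sketch (the class name is illustrative), a BMP character like é takes two bytes in UTF-16BE, while 𐍈 (U+10348), which lies outside the BMP, takes four:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Lengths {
    // Count the bytes a string occupies when encoded as UTF-16BE
    // (big-endian, no byte order mark).
    public static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLength("é")); // 2: one BMP code point

        // Character.toChars turns a supplementary code point into
        // its UTF-16 surrogate pair.
        String hwair = new String(Character.toChars(0x10348));
        System.out.println(byteLength(hwair)); // 4: a surrogate pair
    }
}
```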

UTF-32

Unlike UTF-8 and UTF-16, the UTF-32 encoding scheme uses a fixed-length encoding of exactly four bytes per code point. Being fixed-length means that UTF-32 code points can be indexed by their position in a stream of bytes. Its use of four bytes makes it a very inefficient encoding for storing long strings of Latin characters, however.

UTF-32 also comes in UTF-32BE and UTF-32LE big-endian and little-endian variations, respectively.

Byte Order Marks

Common signatures and Byte Order Marks.
Signature / BOM Character Encoding Endianness
0xEF 0xBB 0xBF UTF-8 N/A
0xFE 0xFF UTF-16 BE
0xFF 0xFE UTF-16 LE
0x00 0x00 0xFE 0xFF UTF-32 BE
0xFF 0xFE 0x00 0x00 UTF-32 LE

So if we have a sequence of bytes, even if we know that it represents the Unicode character set, how do we know which character encoding is being used so that we can extract those Unicode code points? For files in a file system, a standard approach exists for placing a special series of bytes called a signature at the beginning of the byte stream.

For encodings that support endianness, the Unicode byte order mark (BOM) character U+FEFF is used to signal not only the character encoding in use, but also the byte order of the encoding (e.g. big-endian or little-endian). When an application reads a text file starting with a byte order mark, the application uses the BOM to determine how the remaining bytes should be interpreted. The BOM itself, however, is not considered part of the file's actual content.

Adding a BOM to our UTF-8 encoded Unicode characters touché now provides us with the bytes that were presented in the lesson's introduction.

UTF-8 encoding of touché with a UTF-8 signature.
0xEF 0xBB 0xBF 0x74 0x6F 0x75 0x63 0x68 0xC3 0xA9
UTF-8 BOM t o u c h é
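Java's plain "UTF-16" charset illustrates BOMs in action: when encoding, it writes a big-endian BOM before the characters (whereas the UTF-16BE and UTF-16LE charsets write no BOM, and Java's UTF-8 encoder writes none either). A small sketch with an illustrative class name:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Encode with the plain UTF-16 charset, which prepends
    // a big-endian byte order mark (0xFE 0xFF).
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        for (byte b : encode("é")) {
            System.out.printf("0x%02X ", b);
        }
        // prints: 0xFE 0xFF 0x00 0xE9
    }
}
```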

Charsets

You've therefore learned that to store text in a file, you need to know both the character set (the codes) you're using, as well as the encoding scheme (how to convert those codes to bytes), including the byte order. Thus UTF-16BE indicates the charset made up of 1) the Unicode character set, 2) the UTF-16 character encoding, and 3) the big-endian byte order.

Java has a class java.nio.charset.Charset to represent a charset. You can ask for a Charset instance using the Charset.forName(String charsetName) static factory method, passing in the charset name. All JVMs are required to support the charsets identified by the following names: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
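Putting it all together, this sketch (the class name is illustrative; a ByteArrayInputStream stands in for a real file or network stream) looks up a Charset by name and uses it to decode a byte stream. Decoding the same bytes with the wrong charset demonstrates why knowing the charset matters:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class CharsetLookup {
    // Decode a byte stream into text using a charset looked up by name.
    public static String decode(byte[] bytes, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for an in-memory stream
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 bytes for "touché" from earlier in the lesson.
        byte[] utf8 = {0x74, 0x6F, 0x75, 0x63, 0x68, (byte) 0xC3, (byte) 0xA9};
        System.out.println(decode(utf8, "UTF-8"));      // touché
        System.out.println(decode(utf8, "ISO-8859-1")); // touchÃ© — the wrong charset mangles the text
    }
}
```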

Review

Summary

In order to store text in a file, you need to know:

  1. The character set (the codes) you're using. (Recommended: Unicode)
  2. The character encoding scheme, or how to convert those character codes to bytes (Recommended: UTF-8)
    1. The byte order of the encoding scheme.

All of this information is encapsulated by a named charset.

Gotchas

In the Real World

Self Evaluation

Task

Add the capability to the Booker application to print the name of the application user, loaded from a configuration file.

See Also

References

Resources

Acknowledgments