Charsets

Goals

Concepts

Library

Lesson

Storing and working with written information should be easy. It's just working with a lot of letters, right? But it's not easy, and there are many details and complications to handling the symbols that make up human languages. And most developers get it wrong.

First of all, when we deal with text we are dealing with more than letters; we are also dealing with other symbols, such as punctuation. And we must take care to handle letters in other languages, of which there are a great many. But how do we represent them all? Unfortunately, over the years different groups of people have taken different approaches to representing these symbols, usually resorting to simplistic ways of storing them in files. We have to untangle the whole mess, or we'll end up with yet another program that can display Hello World! but has trouble handling even English words borrowed from other languages.

To illustrate a bit of the complication, consider a file containing the following bytes:

0xEF 0xBB 0xBF 0x74 0x6F 0x75 0x63 0x68 0xC3 0xA9

What do these bytes mean? Maybe the numbers represent some letters. But which letters? And what numbers are we dealing with, anyway—are these eight-bit numbers, or are these 16-bit numbers (with each number taking two bytes)?

Characters

To start attacking the problem, we have to first determine what we're dealing with. We're going to call the basic unit of text a character. This could be the letter 'A' or the asterisk * character. Even the spaces between these words could be considered characters.

Character Sets

A character set is simply an identified group of characters. If we assign some code or number to each character, then we can represent them on a computer; the set of characters then becomes a coded character set. But what codes should we assign to the characters? As you might have guessed, others have made proposals on what codes to use. Here are a few interesting ones through history.

ASCII

One of the most famous coded character sets is the American Standard Code for Information Interchange, or ASCII (pronounced ass-kee). It was made in America, and it's as if no one at the time thought anyone other than Americans would ever use computers: it only supports the 26 unaccented letters of the English alphabet, the decimal digits, and some punctuation. In all it maps out 128 codes (ending at 0x7F) to represent characters, some of which are control characters representing non-displayable actions (such as BS, a backspace).

ASCII codes.
row+col 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
0x00 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
0x10 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
0x20 SP ! " # $ % & ' ( ) * + , - . /
0x30 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x40 @ A B C D E F G H I J K L M N O
0x50 P Q R S T U V W X Y Z [ \ ] ^ _
0x60 ` a b c d e f g h i j k l m n o
0x70 p q r s t u v w x y z { | } ~ DEL

The good news is that with 128 combinations, ASCII only uses seven bits of information, which can fit in a single byte. The bad news is that with only 128 combinations, what it can show is extremely limited. There is no c-cedilla ç character, for example, so we can't even represent the word "façade".
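To see this limitation in practice, here is a small sketch (the class name is illustrative) showing what happens when Java's US-ASCII encoder meets a character the charset cannot represent: the encoder substitutes its replacement byte, a question mark.

```java
import java.nio.charset.StandardCharsets;

public class AsciiLimits {
    // Encode a string to US-ASCII and decode it back. Characters with no
    // ASCII code, such as 'ç', are replaced by the encoder's default
    // substitute, '?' (0x3F).
    public static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("façade")); // prints "fa?ade"
    }
}
```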

The official name given by the Internet Assigned Numbers Authority (IANA) is US-ASCII.

ISO-8859-1

The International Organization for Standardization (ISO) came up with a larger set of codes called ISO/IEC 8859-1, which IANA officially refers to as ISO-8859-1. This set comprises 256 different codes, each of which takes up eight bits instead of ASCII's seven, but each of which still fits in a single byte.
ISO-8859-1 codes.
row+col 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
0x00
0x10
0x20 SP ! " # $ % & ' ( ) * + , - . /
0x30 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x40 @ A B C D E F G H I J K L M N O
0x50 P Q R S T U V W X Y Z [ \ ] ^ _
0x60 ` a b c d e f g h i j k l m n o
0x70 p q r s t u v w x y z { | } ~
0x80
0x90
0xA0 NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
0xB0 ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
0xC0 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
0xD0 Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
0xE0 à á â ã ä å æ ç è é ê ë ì í î ï
0xF0 ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Windows-1252

The Microsoft Windows operating system, depending on the country for which it is installed, uses a code page (essentially a coded character set) to represent characters. In the United States Windows uses CP-1252, or simply Windows-1252, a set of 256 codes that is almost exactly the same as ISO-8859-1. Almost, but not quite: Windows-1252 assigns printable characters to the 0x80–0x9F range, which ISO-8859-1 leaves for control codes. Note, for example, the euro € character at code 0x80, which does not exist in ISO-8859-1.

Windows-1252 codes, with differences with ASCII and ISO-8859-1 highlighted.
row+col 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F
0x00 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
0x10 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
0x20 SP ! " # $ % & ' ( ) * + , - . /
0x30 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x40 @ A B C D E F G H I J K L M N O
0x50 P Q R S T U V W X Y Z [ \ ] ^ _
0x60 ` a b c d e f g h i j k l m n o
0x70 p q r s t u v w x y z { | } ~ DEL
0x80 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
0x90 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
0xA0 NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
0xB0 ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
0xC0 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
0xD0 Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
0xE0 à á â ã ä å æ ç è é ê ë ì í î ï
0xF0 ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Unicode

For years ISO has been working with the Unicode Consortium to produce ISO/IEC 10646, the Universal Coded Character Set (UCS). This character set is meant to assign codes to all known characters used by all human languages. The Unicode Consortium produces The Unicode Standard, often referred to as simply Unicode, which contains the same code mappings as ISO/IEC 10646, with the addition of rules about how to use and manipulate those codes in applications.

Each code in the UCS is referred to as a code point. Unicode code points range from 0x00 to 0x10FFFF and are divided into 17 different Unicode planes numbered 0–16. The most common plane, covering code points 0x0000–0xFFFF, is called the Basic Multilingual Plane (BMP). Code points in the BMP are frequently represented as U+XXXX, where XXXX is the hexadecimal representation of the code point.

Each plane is subdivided into ranges of related code points called Unicode blocks. These include Basic Latin (U+0000–U+007F), coding the ASCII characters; Devanagari (U+0900–U+097F), coding characters used in Hindi; Hiragana (U+3040–U+309F), coding characters for one component of the Japanese writing system; and CJK Unified Ideographs (U+4E00–U+9FFF), a very large block containing a unified coding of the characters used by Chinese, Japanese, and Korean.

The Unicode Standard not only defines which characters are assigned which code points, it also provides a database of extensive properties about the identified characters. Each character has an official Unicode name, as well as a General Category. Just a few examples of categories include Ll (Letter, lowercase) for lowercase letters, Nd (Number, decimal digit) for numeric digits in various languages, Pd (Punctuation, dash) for different types of dashes with different meanings, Sm (Symbol, math) for mathematical symbols such as operators, and Zs (Separator, space) for many types of spaces such as a nonbreaking space. Other Unicode character properties indicate such things as how a character maps to uppercase or lowercase and which script it belongs to.
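Java exposes much of this character database through static methods on the Character class. As a quick sketch, we can look up the official Unicode name and General Category of the é character (the class name below is illustrative):

```java
public class CharProperties {
    public static void main(String[] args) {
        int codePoint = 0x00E9; // é, LATIN SMALL LETTER E WITH ACUTE

        // The official Unicode name from the character database.
        System.out.println(Character.getName(codePoint));

        // The General Category: é is Ll (Letter, lowercase).
        System.out.println(Character.getType(codePoint) == Character.LOWERCASE_LETTER);

        // A space is Zs (Separator, space).
        System.out.println(Character.getType(' ') == Character.SPACE_SEPARATOR);
    }
}
```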

Character Encodings

Many computer systems have either fully embraced Unicode or are in the process of switching. Microsoft Windows uses Unicode, and Java uses Unicode exclusively to represent characters and strings. So choosing a coded character set for a modern application is simple: use Unicode.

Now that you've determined which codes represent which characters, if you want to store those codes somewhere (such as by reading them from a java.io.InputStream or writing them to a java.io.OutputStream), you'll need some way to convert those codes into individual bytes. The approach used to encode character codes as a byte stream is called a character encoding, and there are several of those as well.

One Byte per Character Code

One byte per character encoding of touché using the ISO-8859-1 character set.
0x74 0x6F 0x75 0x63 0x68 0xE9
t o u c h é

The simplest character encoding is to place each code into a single byte. With ASCII this is simple, as we need only seven bits to store each ASCII code. Even ISO-8859-1 never uses more than eight bits for any character code. We could therefore encode the word touché, using ISO-8859-1, with the one-byte-per-code encoding shown in the figure.
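Java can produce exactly this encoding for us. A minimal sketch (the class name is illustrative) using the standard ISO-8859-1 charset:

```java
import java.nio.charset.StandardCharsets;

public class OneBytePerCode {
    // Encode a string with ISO-8859-1, which maps each character
    // code to exactly one byte.
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        for (byte b : encode("touché")) {
            System.out.printf("0x%02X ", b);
        }
        // prints: 0x74 0x6F 0x75 0x63 0x68 0xE9
    }
}
```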

Two Bytes per Character Code

The Universal Coded Character Set used by Unicode contains many, many more codes than can be represented by a single byte. We could therefore decide to use two bytes to represent each code. With this approach we are essentially using 16 bits to represent each code—the equivalent of a Java short type. Stored with the high-order bits first, our encoding of the word touché using two bytes to encode each Unicode code point would look like this:

Two bytes per character big-endian encoding of touché using the Universal Coded Character Set of Unicode.
0x00 0x74 0x00 0x6F 0x00 0x75 0x00 0x63 0x00 0x68 0x00 0xE9
t o u c h é

Using two bytes to represent each code brings up another question: which of those two bytes should be placed first in the stream? The example above shows, for each pair of bytes, the first byte in the stream as the one containing the high-order bits (which in this case are all 0x00 because the code values are low). This may seem logical; if we write the decimal number 555, for example, the higher-order digits (such as the one representing 500) come before the lower-order digits (such as the one representing 50). But some platforms have traditionally placed the byte containing the low-order bits first in memory.

Two bytes per character little-endian encoding of touché using the Universal Coded Character Set of Unicode.
0x74 0x00 0x6F 0x00 0x75 0x00 0x63 0x00 0x68 0x00 0xE9 0x00
t o u c h é

We call these two approaches the byte order, and we have two names for them based upon which end of the value comes first. If the big part of the value comes first, we call it big-endian byte order; if the little part of the value comes first, we call it little-endian byte order. The byte order is therefore sometimes referred to as endianness.
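Java's ByteBuffer makes the byte-order distinction concrete. This sketch (the class name is illustrative) writes the 16-bit code for é under each byte order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Endianness {
    // Write a single char (a 16-bit value in Java) into two bytes
    // using the requested byte order.
    public static byte[] twoBytes(char c, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putChar(c).array();
    }

    public static void main(String[] args) {
        byte[] be = twoBytes('\u00E9', ByteOrder.BIG_ENDIAN);    // 0x00 0xE9
        byte[] le = twoBytes('\u00E9', ByteOrder.LITTLE_ENDIAN); // 0xE9 0x00
        System.out.printf("BE: 0x%02X 0x%02X%n", be[0], be[1]);
        System.out.printf("LE: 0x%02X 0x%02X%n", le[0], le[1]);
    }
}
```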

UTF-8

So far using two bytes to encode each character code works fine, but you might have noticed a problem: we've doubled the amount of storage space we need! For English and even general Latin-based alphabets, most of the time we don't need more than a single byte for each character code. It's only in those rare instances in which we need to represent non-ASCII characters that we are forced to use more than one byte.

The UTF-8 encoding was invented to solve this problem. It follows a slightly more complicated set of rules, but supports all Unicode code points:

  1. If the code point is less than 0x80 (i.e. if it is a US-ASCII character code), use one byte to represent the character.
  2. If the code point is 0x80 or above (including ISO-8859-1 codes above the ASCII range), use two, three, or more bytes to encode the code point using the UTF-8 algorithm.
UTF-8 encoding of touché using Universal Coded Character Set of Unicode.
0x74 0x6F 0x75 0x63 0x68 0xC3 0xA9
t o u c h é

Thus the character A (U+0041) would be encoded as the single byte 0x41, while the character é (U+00E9) would be encoded as the two bytes 0xC3 and 0xA9. The Hindi letter म (U+092E) would be encoded as three bytes: 0xE0 0xA4 0xAE.

Revisiting the word touché, still representing the characters in Unicode but encoding them in UTF-8 produces the series of bytes in the figure on the side. The figure below provides more in-depth examples of the distribution of the bits of several code points across their multi-byte encoding in UTF-8.

Detailed UTF-8 encoding of several example characters (Wikipedia).
Character Code Point Binary code point Binary UTF-8 Hexadecimal UTF-8
$ U+0024 010 0100 00100100 0x24
¢ U+00A2 000 1010 0010 11000010 10100010 0xC2 0xA2
€ U+20AC 0010 0000 1010 1100 11100010 10000010 10101100 0xE2 0x82 0xAC
𐍈 U+10348 0 0001 0000 0011 0100 1000 11110000 10010000 10001101 10001000 0xF0 0x90 0x8D 0x88
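The two-byte case of the UTF-8 algorithm can be written out directly: the high bits of the code point go into a 110xxxxx lead byte, and the low six bits into a 10xxxxxx continuation byte. A minimal sketch of just this case (the class and method names are illustrative; a full encoder would also handle the one-, three-, and four-byte cases):

```java
public class Utf8TwoByte {
    // Encode a code point in the range 0x80–0x7FF as two UTF-8 bytes.
    public static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("code point needs one or 3+ bytes");
        }
        return new byte[] {
            (byte) (0xC0 | (codePoint >> 6)),   // lead byte: 110 + high 5 bits
            (byte) (0x80 | (codePoint & 0x3F))  // continuation byte: 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] bytes = encodeTwoByte(0x00E9); // é
        System.out.printf("0x%02X 0x%02X%n", bytes[0], bytes[1]); // 0xC3 0xA9
    }
}
```

This agrees with the table above: 0xC0 | (0xE9 >> 6) is 0xC3, and 0x80 | (0xE9 & 0x3F) is 0xA9.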

UTF-16

UTF-8 is a very efficient encoding scheme for text that is likely to be made up primarily of English and Latin words. But if you know that much of your text will use high Unicode code point values, you might as well use two bytes for each character. The UTF-16 encoding is very similar to the two-bytes-per-character-code encoding above, except that it is also a variable-length encoding scheme that uses more than two bytes in some situations.

Like the two-bytes-per-code encoding above, UTF-16 uses at least two bytes to represent each code; code points outside the BMP are encoded as a pair of two-byte values called a surrogate pair. UTF-16 similarly has two possible byte orders. Big-endian UTF-16 is referred to as UTF-16BE, and little-endian UTF-16 is referred to as UTF-16LE.
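We can observe the variable length directly, since Java strings are stored as UTF-16 code units internally. In this sketch (the class name is illustrative), a BMP character like é takes two bytes in UTF-16BE, while 𐍈 (U+10348), which lies outside the BMP, takes four:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Lengths {
    // Count the bytes a string occupies when encoded as UTF-16BE
    // (big-endian, no byte order mark).
    public static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLength("é")); // 2: one BMP code point

        // Character.toChars turns a supplementary code point into
        // its UTF-16 surrogate pair.
        String hwair = new String(Character.toChars(0x10348));
        System.out.println(byteLength(hwair)); // 4: a surrogate pair
    }
}
```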

UTF-32

Unlike UTF-8 and UTF-16, the UTF-32 encoding scheme uses a fixed-length encoding of exactly four bytes per code point. Being fixed-length means that UTF-32 code points can be indexed by their position in a stream of bytes. Its use of four bytes makes it a very inefficient encoding for storing long strings of Latin characters, however.

UTF-32 also comes in UTF-32BE and UTF-32LE big-endian and little-endian variations, respectively.

Byte Order Marks

Common signatures and Byte Order Marks.
Signature / BOM Character Encoding Endianness
0xEF 0xBB 0xBF UTF-8 N/A
0xFE 0xFF UTF-16 BE
0xFF 0xFE UTF-16 LE
0x00 0x00 0xFE 0xFF UTF-32 BE
0xFF 0xFE 0x00 0x00 UTF-32 LE

So if we have a sequence of bytes, even if we know that it represents the Unicode character set, how do we know which character encoding is being used so that we can extract those Unicode code points? For files in a file system, a standard approach exists for placing a special series of bytes called a signature at the beginning of the byte stream.

For encodings that support endianness, the Unicode byte order mark (BOM) character U+FEFF is used to signal not only the character encoding in use, but also the byte order of the encoding (e.g. big-endian or little-endian). When an application reads a text file starting with a byte order mark, the application uses the BOM to determine how the remaining bytes should be interpreted. The BOM itself, however, is not considered part of the file's actual content.

Adding a BOM to our UTF-8 encoded Unicode characters touché now provides us with the bytes that were presented in the lesson's introduction.

UTF-8 encoding of touché with a UTF-8 signature.
0xEF 0xBB 0xBF 0x74 0x6F 0x75 0x63 0x68 0xC3 0xA9
UTF-8 BOM t o u c h é
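Java's plain "UTF-16" charset illustrates BOMs in action: when encoding, it writes a big-endian BOM before the characters (whereas the UTF-16BE and UTF-16LE charsets write no BOM, and Java's UTF-8 encoder writes none either). A small sketch with an illustrative class name:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Encode with the plain UTF-16 charset, which prepends
    // a big-endian byte order mark (0xFE 0xFF).
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        for (byte b : encode("é")) {
            System.out.printf("0x%02X ", b);
        }
        // prints: 0xFE 0xFF 0x00 0xE9
    }
}
```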

Charsets

You've therefore learned that to store text in a file, you need to know both the character set (the codes) you're using, as well as the encoding scheme (how to convert those codes to bytes), including the byte order. Thus UTF-16BE indicates the charset made up of 1) the Unicode character set, 2) the UTF-16 character encoding, and 3) the big-endian byte order.

Java has a class java.nio.charset.Charset to represent a charset. You can ask for a Charset instance using the Charset.forName(String charsetName) static factory method, passing in the charset name. All JVMs are required to support the charsets identified by the following names: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
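Putting it all together, this sketch (the class name is illustrative; a ByteArrayInputStream stands in for a real file or network stream) looks up a Charset by name and uses it to decode a byte stream. Decoding the same bytes with the wrong charset demonstrates why knowing the charset matters:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class CharsetLookup {
    // Decode a byte stream into text using a charset looked up by name.
    public static String decode(byte[] bytes, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for an in-memory stream
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 bytes for "touché" from earlier in the lesson.
        byte[] utf8 = {0x74, 0x6F, 0x75, 0x63, 0x68, (byte) 0xC3, (byte) 0xA9};
        System.out.println(decode(utf8, "UTF-8"));      // touché
        System.out.println(decode(utf8, "ISO-8859-1")); // touchÃ© — the wrong charset mangles the text
    }
}
```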

Review

Summary

In order to store text in a file, you need to know:

  1. The character set (the codes) you're using. (Recommended: Unicode)
  2. The character encoding scheme, or how to convert those character codes to bytes (Recommended: UTF-8)
    1. The byte order of the encoding scheme.

All of this information is encapsulated by a named charset.

Gotchas

In the Real World

Self Evaluation

Task

Add the capability to the Booker application to print the name of the application user, loaded from a configuration file.

See Also

References

Resources

Acknowledgments