You learned in a previous lesson that a major part of internationalization is separating resources from code, placing display strings and related information into resource files to be localized for different locales. Internationalization, however, does not end at being able to substitute different strings for different languages. The processing of those strings requires special cognizance of variations in writing systems and writing direction. You can still run into problems, even using Unicode string resources, if you assume that the strings follow the rules of American English. Many assumptions of a native English-speaking developer do not hold up when applied to other locales, as shown by the following figure.

English-speaking assumptions do not hold globally. Adapted from Java™ Internationalization by Andrew Deitsch and David Czarnecki (O'Reilly, 2001).
The letters A-Z do not always contain all the letters in the alphabet. Languages such as Danish have letters that appear after Z in the alphabet, and languages such as Hindi don't use the Latin script at all.
A script does not always require both consonants and vowels. In Urdu vowels are optional, while in Hindi vowels may change shape based upon position or be written above or below letters.
Some scripts do not have the concept of uppercase. Hebrew letters are always written the same way, for example.
Punctuation symbols differ in other languages. Spanish adds an inverted question mark ¿ at the beginning of a question, and in Greek questions end with a semicolon ; character.
Not all scripts are written left-to-right. Arabic and Hebrew are written right-to-left, while Chinese can be written vertically from right to left.
Languages using the same script may not always sort the same way. French for example uses a multilevel sorting using first base characters and then special marks.

The Unicode Consortium has intensely studied the various writing systems and produced rules for working with Unicode code points to allow for the variations in those systems. Studying those rules first requires a look at Java's own representation of the Unicode character.


You were introduced to the concepts of characters in an earlier lesson on charsets. Each Unicode code point in the Universal Coded Character Set (UCS) represents a specific character, which is a distinct semantic entity without regard to the style in which it is presented. You also learned that Unicode characters with code points in the range 0x0000–0xFFFF make up the Basic Multilingual Plane (BMP). Codes beyond this range are for supplementary characters.

You cannot see a character itself; it is only a concept. A stylistic representation of a character is called a glyph. The presentations E and E are two different glyphs representing the character named LATIN CAPITAL LETTER E. To display a glyph representation, a computer will render the character in some font and style.

In Java a primitive char can be wrapped in a java.lang.Character instance, but the Character class also represents the Unicode concept of a character, and provides access to much useful information about the Unicode character definitions. In this lesson Character is used to indicate the Java class, while character is used to mean a Unicode character in general.

Character Properties

The Unicode Standard provides definitions of all UCS characters in the Unicode Character Database (UCD), including extensive property lists and supplementary information, which it distributes as a series of text files accessible to computer processing. See Unicode Character Database for access to the UCD itself, and UAX #44: Unicode Character Database for a specification of UCD file formats.

Excerpt from the UnicodeData.txt file distributed with the UCD.
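| Code Point | Character | Name | General Category | Canonical Combining Class | Bidi Class | Decomposition Type / Mapping | Numeric Type / Value | Bidi Mirrored | Simple Uppercase Mapping | Simple Lowercase Mapping | Simple Titlecase Mapping |
|---|---|---|---|---|---|---|---|---|---|---|---|
| U+0020 | (space) | SPACE | Zs | 0 | WS | | | N | | | |
| U+0024 | $ | DOLLAR SIGN | Sc | 0 | ET | | | N | | | |
| U+0033 | 3 | DIGIT THREE | Nd | 0 | EN | | 3 | N | | | |
| U+0065 | e | LATIN SMALL LETTER E | Ll | 0 | L | | | N | 0045 | | 0045 |
| U+00E9 | é | LATIN SMALL LETTER E WITH ACUTE | Ll | 0 | L | 0065 0301 | | N | 00C9 | | 00C9 |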
Some columns removed and headers added for better readability.

The official UCS name of a character can be retrieved using Character.getName(int codePoint).
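As a minimal sketch of looking up character names, the following hypothetical helper class (the names CharacterNameDemo and describe are invented for illustration) formats a code point together with the name Java reports for it. Note that Character.getName(int) returns null for unassigned code points.

```java
public class CharacterNameDemo {

  /** Returns a label such as "U+00E9 LATIN SMALL LETTER E WITH ACUTE".
   * @param codePoint The Unicode code point to describe.
   * @return The code point in U+ notation followed by its UCS name.
   */
  public static String describe(final int codePoint) {
    //getName() returns null if the code point is unassigned
    final String name = Character.getName(codePoint);
    return String.format("U+%04X %s", codePoint, name != null ? name : "(unassigned)");
  }

  public static void main(final String[] args) {
    System.out.println(describe(0x0024)); //prints "U+0024 DOLLAR SIGN"
    System.out.println(describe(0x00E9)); //prints "U+00E9 LATIN SMALL LETTER E WITH ACUTE"
  }
}
```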

Character Categories

UCD character categories. (UAX #44 § 5.7.1)
Code Name Description
Lu Uppercase_Letter an uppercase letter
Ll Lowercase_Letter a lowercase letter
Lt Titlecase_Letter a digraphic character, with first part uppercase
LC Cased_Letter Lu | Ll | Lt
Lm Modifier_Letter a modifier letter
Lo Other_Letter other letters, including syllables and ideographs
L Letter Lu | Ll | Lt | Lm | Lo
Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
Mc Spacing_Mark a spacing combining mark (positive advance width)
Me Enclosing_Mark an enclosing combining mark
M Mark Mn | Mc | Me
Nd Decimal_Number a decimal digit
Nl Letter_Number a letterlike numeric character
No Other_Number a numeric character of other type
N Number Nd | Nl | No
Pc Connector_Punctuation a connecting punctuation mark, like a tie
Pd Dash_Punctuation a dash or hyphen punctuation mark
Ps Open_Punctuation an opening punctuation mark (of a pair)
Pe Close_Punctuation a closing punctuation mark (of a pair)
Pi Initial_Punctuation an initial quotation mark
Pf Final_Punctuation a final quotation mark
Po Other_Punctuation a punctuation mark of other type
P Punctuation Pc | Pd | Ps | Pe | Pi | Pf | Po
Sm Math_Symbol a symbol of mathematical use
Sc Currency_Symbol a currency sign
Sk Modifier_Symbol a non-letterlike modifier symbol
So Other_Symbol a symbol of other type
S Symbol Sm | Sc | Sk | So
Zs Space_Separator a space character (of various non-zero widths)
Zl Line_Separator U+2028 LINE SEPARATOR only
Zp Paragraph_Separator U+2029 PARAGRAPH SEPARATOR only
Z Separator Zs | Zl | Zp
Cc Control a C0 or C1 control code
Cf Format a format control character
Cs Surrogate a surrogate code point
Co Private_Use a private-use character
Cn Unassigned a reserved unassigned code point or a noncharacter
C Other Cc | Cf | Cs | Co | Cn
Emphasized rows represent category groups.

Each character is assigned a general category, which is important for determining how characters relate to other characters. A character may be considered e.g. a lowercase or uppercase letter; a number; or punctuation. The general category column in the above table contains codes specifying the category shown in the figure to the side. Unicode also defines category groups that are shorthands for several categories together, but these are not used in the definitions of the individual characters.

Java refers to a character's general category as its type, which can be retrieved using Character.getType(int codePoint). The return value is an int constant, such as Character.LOWERCASE_LETTER, representing one of the general categories in the categories table in the side figure. Rather than querying the type generally, you can use one of the several convenience methods the Character class provides for querying specific categories. The convenience methods for checking the category groups are especially handy, as they encompass several types.

Character.isAlphabetic(int codePoint)
Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), Other_Letter (Lo), Letter_Number (Nl), or contributory property Other_Alphabetic. Equivalent to group Letter (L) with the addition of Letter_Number (Nl) and Other_Alphabetic.
Character.isDigit(int codePoint)
Decimal_Number (Nd).
Character.isLetter(int codePoint)
Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), or Other_Letter (Lo). Equivalent to group Letter (L).
Character.isLetterOrDigit(int codePoint)
Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), Other_Letter (Lo), or Decimal_Number (Nd). Equivalent to group Letter (L) with the addition of Decimal_Number (Nd).
Character.isLowerCase(int codePoint)
Lowercase_Letter (Ll) or contributory property Other_Lowercase.
Character.isMirrored(int codePoint)
Mirrored characters such as categories Open_Punctuation (Ps) and Close_Punctuation (Pe).
Character.isSpaceChar(int codePoint)
Space_Separator (Zs), Line_Separator (Zl), or Paragraph_Separator (Zp). Equivalent to group Separator (Z).
Character.isTitleCase(int codePoint)
Titlecase_Letter (Lt).
Character.isUpperCase(int codePoint)
Uppercase_Letter (Lu) or contributory property Other_Uppercase.
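The difference between querying the type directly and using a convenience method can be sketched as follows. The class and method names here are hypothetical, chosen only for this illustration; the Character constants and methods are those of the standard library.

```java
public class CharacterTypeDemo {

  /** Checks the general category directly using the int type constant.
   * @param codePoint The Unicode code point to check.
   * @return true if the general category is Lowercase_Letter (Ll).
   */
  public static boolean isLowercaseLetterCategory(final int codePoint) {
    return Character.getType(codePoint) == Character.LOWERCASE_LETTER;
  }

  public static void main(final String[] args) {
    System.out.println(isLowercaseLetterCategory('e')); //prints "true"
    System.out.println(Character.getType('$') == Character.CURRENCY_SYMBOL); //prints "true"
    //convenience methods cover one or more categories at once
    System.out.println(Character.isDigit('3')); //prints "true"
    System.out.println(Character.isLetter(0x00E9)); //prints "true"
    System.out.println(Character.isSpaceChar(' ')); //prints "true"
    System.out.println(Character.isUpperCase('e')); //prints "false"
  }
}
```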

Bidi Properties

Some scripts such as Arabic and Hebrew are written from right-to-left rather than left-to-right. Several properties, such as Bidi Class and Bidi Mirrored in the table above, provide extensive information on the bidirectional or bidi characteristics of each character. This information is used when rendering a string of characters to actually display them to the user, which uses a complex set of rules described in UAX #9: Unicode Bidirectional Algorithm. Storing and retrieving internationalized strings can be done in large part without regard to bidi-related issues, which may be left to the rendering code that actually prints or displays the text.


The ISO-8859-1 charset, as you saw in an earlier lesson, increased the ASCII repertoire mostly by adding many accented letters found in European languages, such as the é U+00E9 that appears in touché. Many languages that use these accented characters do not consider them to be different letters as such. Rather the accent or diacritic (from diacritical mark) is added to a letter to ensure that its sound remains the same in the presence of other letters. (This is similar to how English adds a letter e to the end of words such as cane so that the vowel sound of can is changed.) A diacritic is one of several types of marks, as indicated by the category Mark (M) in the character category table above.

The UCS provides separate diacritics and other marks in the category Nonspacing_Mark (Mn), which may be placed after a base character with the understanding that the base character and the mark(s) will be shown to the user as one character. Rather than storing a single code point U+00E9 representing é, you could store the code point U+0065 representing the base character e, followed by the code point U+0301 for the combining acute accent mark. Through character composition this will create what appears to be the same character: é. The letter é U+00E9 is called a precomposed character because its form is already composed of the base character U+0065 and the combining mark U+0301.
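To see that the two forms really are stored differently even though they look alike, consider this small sketch. The class and constant names are hypothetical and exist only for this illustration.

```java
public class CompositionDemo {

  /** The precomposed character é as a single code point. */
  public static final String PRECOMPOSED = "\u00E9";

  /** The base character e followed by the combining acute accent. */
  public static final String DECOMPOSED = "e\u0301";

  public static void main(final String[] args) {
    System.out.println(PRECOMPOSED.length()); //prints "1"
    System.out.println(DECOMPOSED.length()); //prints "2"
    //the strings look the same on screen but do not compare as equal
    System.out.println(PRECOMPOSED.equals(DECOMPOSED)); //prints "false"
    //the second char of the decomposed form is a Nonspacing_Mark (Mn)
    System.out.println(Character.getType(DECOMPOSED.charAt(1)) == Character.NON_SPACING_MARK); //prints "true"
  }
}
```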

Character Sequences

Many times in Java you will need to deal with more than just single characters. As you saw above, at times what may appear as a single glyph is in reality stored as a combination of a base character followed by combining mark characters. The character sequence you are most familiar with is java.lang.String.


A more general interface for dealing with sequences of characters is java.lang.CharSequence. It provides many of the same char access methods you may already have noticed in String, such as CharSequence.length() and CharSequence.charAt(int index). Both String and java.lang.StringBuilder implement the CharSequence interface. 

Counting occurrences of 'E' and 'e' in a general CharSequence.
/** Counts the number of times the characters 'E' and 'e'
 * appear in the given sequence of characters.
 * @param charSequence The sequence of characters to examine.
 * @return The number of times 'E' or 'e' occurs in the sequence.
 */
public static int countE(@Nonnull final CharSequence charSequence) {
  int count = 0;
  for(int i = 0; i < charSequence.length(); i++) {
    final char c = charSequence.charAt(i);
    if(c == 'E' || c == 'e') {
      count++;
    }
  }
  return count;
}

CharSequence can also provide its char values as an IntStream via the CharSequence.chars() method. Counting occurrences of 'E' / 'e' in a CharSequence could be accomplished much more compactly (and potentially more efficiently) using stream filtering and counting.

Counting occurrences of 'E' and 'e' using CharSequence.chars().
/** Counts the number of times the characters 'E' and 'e'
 * appear in the given sequence of characters.
 * @param charSequence The sequence of characters to examine.
 * @return The number of times 'E' or 'e' occurs in the sequence.
 */
public static int countE(@Nonnull final CharSequence charSequence) {
  return (int)charSequence.chars()
      .filter(c -> c == 'E' || c == 'e')
      .count();
}

Another useful method is CharSequence.subSequence(int start, int end), which functions similarly to collection views such as java.util.List.subList(int fromIndex, int toIndex). The end index is exclusive. Thus calling "touché".subSequence(1, 5) would yield the string "ouch".
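Because the parameter type is CharSequence rather than String, the same method accepts a String, a StringBuilder, or a subsequence view. The following is a minimal sketch; the class name CharSequenceDemo is hypothetical, and the counting method is repeated here (returning long for brevity) only so that the example stands alone.

```java
public class CharSequenceDemo {

  /** Counts occurrences of 'E' and 'e' in any character sequence.
   * @param charSequence The sequence of characters to examine.
   * @return The number of times 'E' or 'e' occurs in the sequence.
   */
  public static long countE(final CharSequence charSequence) {
    return charSequence.chars().filter(c -> c == 'E' || c == 'e').count();
  }

  public static void main(final String[] args) {
    System.out.println(countE("Exercise")); //prints "3"
    System.out.println(countE(new StringBuilder("tree"))); //prints "2"
    //the end index is exclusive
    final CharSequence middle = "touch\u00E9".subSequence(1, 5);
    System.out.println(middle); //prints "ouch"
    System.out.println(countE(middle)); //prints "0"
  }
}
```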

Surrogate Pairs

Some UCS code points such as the Linear B symbol U+10014 in the table above are outside the BMP and require more than the 16 bits available in a single char. When working with the Character class, we got around this limitation by using a 32-bit int as a parameter in methods such as Character.isValidCodePoint(int codePoint). CharSequence implementations such as String, however, are conceptually made up of char values. To understand how Java provides access to supplementary characters, you must first understand some of the details of UTF-16 encoding.

When discussing charsets you learned that both UTF-8 and UTF-16 are variable-length encodings, and both cover the entire range of UCS code points. UTF-8 can use one, two, three, or even more bytes to encode a single code point using an involved algorithm. UTF-16 on the other hand uses a simpler approach. For any code points between 0x0000 and 0xFFFF, UTF-16 will use a single 16-bit value to represent the character. For supplementary characters (U+10000 and above) UTF-16 will use two consecutive 16-bit values called a surrogate pair. The first value or high surrogate will be in the range 0xD800–0xDBFF, while the second value or low surrogate will be in the range 0xDC00–0xDFFF. This encoding technique only works because there exist no UCS characters that use code points in the range 0xD800–0xDFFF.

Java uses UTF-16 in CharSequence and its implementations. This has a startling implication: not every char value represents a Unicode code point. Every char in a CharSequence such as String is potentially part of a surrogate pair, which must be decoded to discover the UCS code point! To discover if a char value is part of a surrogate pair, use Character.isSurrogate(char ch). Because surrogate pairs come in a certain order, you can use Character.isHighSurrogate(char ch) to determine if a surrogate pair is starting, followed by Character.isLowSurrogate(char ch) for the subsequent character of any encountered pair. After detecting a surrogate pair, you can determine the encoded Unicode code point using Character.toCodePoint(char high, char low).
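The detection and decoding steps just described can be sketched as follows. The class name SurrogatePairDemo and the helper firstCodePoint are hypothetical, and the helper assumes a non-empty sequence; the sketch uses Character.toChars(int) to produce the UTF-16 form of U+10014.

```java
public class SurrogatePairDemo {

  /** Decodes the code point starting at the beginning of the text.
   * @param text A non-empty sequence of UTF-16 char values.
   * @return The first Unicode code point in the sequence.
   */
  public static int firstCodePoint(final CharSequence text) {
    final char first = text.charAt(0);
    if(Character.isHighSurrogate(first) && text.length() > 1) {
      final char second = text.charAt(1);
      if(Character.isLowSurrogate(second)) {
        return Character.toCodePoint(first, second);
      }
    }
    return first; //not part of a valid pair; use the char value itself
  }

  public static void main(final String[] args) {
    final String linearB = new String(Character.toChars(0x10014));
    System.out.println(linearB.length()); //prints "2"
    System.out.println(Integer.toHexString(linearB.charAt(0))); //prints "d800"
    System.out.println(Integer.toHexString(linearB.charAt(1))); //prints "dc14"
    System.out.println(Integer.toHexString(firstCodePoint(linearB))); //prints "10014"
  }
}
```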

As an example of how not to process a character sequence, the following code naively looks at each char without regard to whether a surrogate pair is present when counting characters in the Letter (L) category.

Incorrectly counting the number of letters in a CharSequence.
/** Counts the number of letters in a sequence of characters.
 * @param charSequence The sequence of characters to examine.
 * @return The number of Unicode letter characters.
 * @see Character#isLetter(char)
 */
public static int countLetters(@Nonnull final CharSequence charSequence) {
  int count = 0;
  for(int i = 0; i < charSequence.length(); i++) {
    final char c = charSequence.charAt(i);
    if(Character.isLetter(c)) {  //WRONG; may be surrogate pair
      count++;
    }
  }
  return count;
}

Unfortunately the above implementation would skip the Linear B symbol LINEAR B SYLLABLE B080 MA U+10014 altogether. Even though Java recognizes it as being in the Unicode category Other_Letter (Lo), it is encoded in the string as the UTF-16 surrogate pair 0xD800 and 0xDC14. Neither of these char values on its own represents a letter. To correct the algorithm, one would need to use the Character surrogate pair detection and conversion methods explained above, as illustrated in the figure below.

Correctly counting the number of letters in a CharSequence.
/** Counts the number of letters in a sequence of characters.
 * @param charSequence The sequence of characters to examine.
 * @return The number of Unicode letter characters.
 * @see Character#isLetter(int)
 */
public static int countLetters(@Nonnull final CharSequence charSequence) {
  final int length = charSequence.length();
  int count = 0;
  for(int i = 0; i < length; i++) {
    final char c = charSequence.charAt(i);
    //store code point as int in case we need to decode surrogate
    int codePoint = c;
    //detect surrogate pairs
    if(Character.isHighSurrogate(c) && i + 1 < length) {
      final char nextChar = charSequence.charAt(i + 1);
      if(Character.isLowSurrogate(nextChar)) {
        codePoint = Character.toCodePoint(c, nextChar);
        i++; //skip the low surrogate the next time through the loop
      }
    }
    //check the character using the int code point value
    if(Character.isLetter(codePoint)) {
      count++;
    }
  }
  return count;
}

Besides CharSequence.chars() there is another method that returns a stream: CharSequence.codePoints(). Although both methods return an IntStream, CharSequence.codePoints() returns a sequence of Unicode code point values instead of char values. Unlike CharSequence.chars(), CharSequence.codePoints() will never return a surrogate pair, and you will never have to check for them! Just be careful not to cast the values returned by CharSequence.codePoints() to char.

Using CharSequence.codePoints() the letter-counting algorithm above could be made much more compact and readable.

Counting the number of letters in a CharSequence using the IntStream from CharSequence.codePoints().
/** Counts the number of letters in a sequence of characters.
 * @param charSequence The sequence of characters to examine.
 * @return The number of Unicode letter characters.
 * @see Character#isLetter(int)
 */
public static int countLetters(@Nonnull final CharSequence charSequence) {
  return (int)charSequence.codePoints()
      .filter(Character::isLetter)
      .count();
}
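To make the difference between the two approaches concrete, the following sketch runs a char-based count and a code point-based count over a string containing a supplementary character. The class and method names are hypothetical, and both counting methods are restated so that the example is self-contained.

```java
public class LetterCountDemo {

  /** Naive count that examines each UTF-16 char value separately. */
  public static int countLettersByChar(final CharSequence charSequence) {
    int count = 0;
    for(int i = 0; i < charSequence.length(); i++) {
      if(Character.isLetter(charSequence.charAt(i))) { //surrogates are not letters
        count++;
      }
    }
    return count;
  }

  /** Count that examines whole Unicode code points. */
  public static int countLettersByCodePoint(final CharSequence charSequence) {
    return (int)charSequence.codePoints().filter(Character::isLetter).count();
  }

  public static void main(final String[] args) {
    //"ma" followed by U+10014, stored as a surrogate pair
    final String text = "ma" + new String(Character.toChars(0x10014));
    System.out.println(countLettersByChar(text)); //prints "2"
    System.out.println(countLettersByCodePoint(text)); //prints "3"
  }
}
```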


You saw above that some characters such as é U+00E9 have both precomposed forms, as well as separate forms with accents and other characters stored as distinct code points. This can cause major problems with even simple operations such as searching for a character or comparing strings. You surely would not want to perform separate searches for U+00E9 and for the sequence U+0065 U+0301, just to cover all composition forms of é. Imagine searching for letters that have decomposed forms consisting of three or more characters! Trying to check for all representations would furthermore become a nightmare when trying to compare strings.

Unicode Normalization Forms. (UAX #15 § 1.2)
Code Name Description
NFD Normalization Form D Canonical Decomposition
NFC Normalization Form C Canonical Decomposition, followed by Canonical Composition
NFKD Normalization Form KD Compatibility Decomposition
NFKC Normalization Form KC Compatibility Decomposition, followed by Canonical Composition

Unicode's solution to the problem of precomposed characters is normalization, the process of converting characters to some normal form. Here normal simply means some common, agreed upon representation; and UAX #15: Unicode Normalization Forms defines four such normalization forms, displayed in the accompanying figure. Normalization occurs by some sequence of decomposing the precomposed characters and optionally composing them into precomposed forms (whether they started that way or not). The most important of these forms use canonical decomposition/composition, using the forms preferred by The Unicode Standard.

Two strings normalized to the same normalization form can be safely compared, because the composition of each character is guaranteed to be the same by the algorithm. Which form you choose depends on your needs. If you wanted simply to compare two strings, you could use form NFD and decompose all the characters to their canonical decomposed forms, with no need to put them back into precomposed forms. If you were searching for a character you knew was in a canonical precomposed form, such as é U+00E9, you might choose form NFC to also convert the characters into their canonical precomposed form for easy matching of the single code point.
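A comparison along these lines might be sketched as follows, using the java.text.Normalizer class from the standard library. The class name NormalizedEqualsDemo and the helper canonicallyEqual are hypothetical names used only for this illustration.

```java
import java.text.Normalizer;

public class NormalizedEqualsDemo {

  /** Compares two sequences after normalizing both to form NFD.
   * @param text1 The first sequence of characters.
   * @param text2 The second sequence of characters.
   * @return true if the sequences are equal once canonically decomposed.
   */
  public static boolean canonicallyEqual(final CharSequence text1, final CharSequence text2) {
    final String normalized1 = Normalizer.normalize(text1, Normalizer.Form.NFD);
    final String normalized2 = Normalizer.normalize(text2, Normalizer.Form.NFD);
    return normalized1.equals(normalized2);
  }

  public static void main(final String[] args) {
    final String precomposed = "touch\u00E9";
    final String decomposed = "touche\u0301";
    System.out.println(precomposed.equals(decomposed)); //prints "false"
    System.out.println(canonicallyEqual(precomposed, decomposed)); //prints "true"
  }
}
```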

Java provides the class java.text.Normalizer which implements the normalization algorithm in UAX #15, representing the various forms using the java.text.Normalizer.Form enum. The following example shows how to search for the canonical precomposed form of é U+00E9 in a string that initially used the decomposed form e U+0065 followed by the combining acute accent ́ U+0301, by first normalizing the string using Normalization Form C (NFC).

Normalizing a string before searching for a character.
String text = "touch\u0065\u0301";  //"touché" in decomposed form
int resultIndex = text.indexOf('\u00E9'); //returns -1; precomposed é not found
text = Normalizer.normalize(text, Normalizer.Form.NFC);
resultIndex = text.indexOf('\u00E9'); //returns correct index 5


The Java Comparator.compare(T o1, T o2) method, the sorting strategy you've used in several lessons, allows any two objects to be compared. The trick is deciding how a certain type of object should be ordered. Sorting numbers could hardly be simpler: simply compare the values arithmetically. It would be a mistake to think that sorting Unicode code points were that simple; human languages have evolved haphazardly for thousands of years, and the result is that sorting text or collation is replete with complications, confusions, and contradictions.

Many a developer has naively tried to sort strings by comparing the Unicode code point value of individual characters, as in the following example:

Incorrectly collating character sequences by comparing code points.
public static final Comparator<CharSequence> textComparator = (text1, text2) -> {
  final int length = Math.min(text1.length(), text2.length());
  int result = 0;
  int index = 0;
  while(result == 0 && index < length) {
    //WRONG! Do NOT do this.
    result = Integer.compare(text1.charAt(index), text2.charAt(index));
    index++;
  }
  if(result == 0) { //handle one string being a prefix of the other
    result = Integer.compare(text1.length(), text2.length());
  }
  return result;
};

For comparing simple words in English using all lowercase letters such as cat and car, this algorithm works! But in real life this approach quickly runs into trouble. You may want to refer to the charts in the lesson on charsets.

From the ASCII table, the uppercase letters A U+0041–Z U+005A appear separately and before the lowercase letters a U+0061–z U+007A. This means that Zanzibar with a capital letter would be sorted before apple!
From the ISO-8859-1 table, the accented letter é U+00E9 appears after all the unaccented letters, both upper and lowercase, so that touching would be sorted before touché!
The accented letter é U+00E9 may be stored in decomposed form as e U+0065 followed by the accent mark  ́ U+0301. The sorting would change based on whether the precomposed form was used. The combining mark(s) would also be compared as if they were letters, so that touched would appear before touché stored in decomposed form.
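These problems are easy to reproduce. The sketch below uses the natural ordering of String, which also compares UTF-16 char values one by one, as a stand-in for the hand-written comparator; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NaiveSortDemo {

  /** Sorts a copy of the words by raw char values.
   * @param words The words to sort.
   * @return A new list ordered by String's natural (code unit) ordering.
   */
  public static List<String> sortByCodeUnits(final List<String> words) {
    final List<String> sorted = new ArrayList<>(words);
    sorted.sort(Comparator.naturalOrder()); //String.compareTo() compares char values
    return sorted;
  }

  public static void main(final String[] args) {
    final List<String> words = Arrays.asList("apple", "Zanzibar", "touch\u00E9", "touching");
    //uppercase sorts before lowercase, and é sorts after every unaccented letter
    System.out.println(sortByCodeUnits(words)); //prints "[Zanzibar, apple, touching, touché]"
  }
}
```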

You might have guessed that getting around the composition problem might involve some form of normalization. Thinking further, you might realize that accents could be ignored if the sequence were first decomposed, perhaps using Normalization Form D (NFD), and then removing the decomposed accent characters before sorting. To ignore differences in case, you could have some way to map the uppercase characters to lowercase characters before sorting.
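The manual accent-removal idea might look something like the following sketch, which decomposes the text and then drops every code point in the Nonspacing_Mark (Mn) category. The names StripAccentsDemo and stripAccents are hypothetical, and this is only an illustration of the idea, not a recommended way to prepare text for sorting.

```java
import java.text.Normalizer;

public class StripAccentsDemo {

  /** Removes combining accents by decomposing and dropping nonspacing marks.
   * @param text The text from which to remove accents.
   * @return The text with all Nonspacing_Mark (Mn) code points removed.
   */
  public static String stripAccents(final CharSequence text) {
    final String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
    final StringBuilder stringBuilder = new StringBuilder(decomposed.length());
    decomposed.codePoints()
        .filter(codePoint -> Character.getType(codePoint) != Character.NON_SPACING_MARK)
        .forEach(stringBuilder::appendCodePoint);
    return stringBuilder.toString();
  }

  public static void main(final String[] args) {
    System.out.println(stripAccents("touch\u00E9")); //prints "touche"
    System.out.println(stripAccents("r\u00F4le")); //prints "role"
  }
}
```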

In fact Unicode provides just such a mapping! Looking at the UCD table at the beginning of this lesson, you'll see for example that the character E U+0045 has a simple lowercase mapping of U+0065, the code point for e. In addition to mapping case, however, you have to decide if the original case should be considered, as in some contexts the user would still expect an uppercase version of a letter to go before (or after) the lowercase version.
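Java exposes the simple case mappings through methods such as Character.toLowerCase(int codePoint). As a rough sketch, a hypothetical helper could apply the mapping to every code point in a sequence; the class and method names below are invented for illustration.

```java
public class CaseMappingDemo {

  /** Applies the simple lowercase mapping to each code point.
   * @param text The text to convert.
   * @return The text with every code point mapped to lowercase.
   */
  public static String toSimpleLowerCase(final CharSequence text) {
    final StringBuilder stringBuilder = new StringBuilder(text.length());
    text.codePoints()
        .map(Character::toLowerCase)
        .forEach(stringBuilder::appendCodePoint);
    return stringBuilder.toString();
  }

  public static void main(final String[] args) {
    System.out.println(Integer.toHexString(Character.toLowerCase(0x0045))); //prints "65"
    System.out.println(toSimpleLowerCase("Zanzibar")); //prints "zanzibar"
  }
}
```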

Rather than doing all this work manually, however, you should use the collation tools Java puts at your disposal in the form of a collator.

Unicode comparison levels. (UTS #10 § 1.1)
Level Description Example
L1 Base characters role < roles < rule
L2 Accents role < rôle < roles
L3 Case / Variants role < Role < rôle
L4 Punctuation role < “role” < Role
Ln Identical role < ro□le < “role”


Unicode provides UTS #10: Unicode Collation Algorithm to specify the steps to take when sorting sequences of characters. Because ordering expectations may differ based upon whether words are appearing in a dictionary or in a user list, for example, UTS #10 prescribes a multilevel comparison algorithm. The table lists these Comparison Levels to choose from when sorting text according to Unicode rules.

The Unicode collation algorithm is difficult; UTS #10 is dense and complicated. To simplify things somewhat Java provides the java.text.Collator class. A Collator is a comparator that knows how to normalize strings and then compare them taking into account accents and case as necessary and appropriate for different locales. You can get a collator for the current locale using Collator.getInstance(), or for a particular locale using Collator.getInstance(Locale desiredLocale). The Collator.compare(String source, String target) method is used as you would for a Comparator<String>.
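Because a Collator is a comparator, it can be passed directly to sorting methods. The following minimal sketch sorts a list using the collator for a given locale; the class name CollatorSortDemo and the helper sortForLocale are hypothetical, and Locale.ENGLISH is chosen here only so that the result does not depend on the default locale of the machine.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorSortDemo {

  /** Sorts a copy of the words using the collation rules of a locale.
   * @param words The words to sort.
   * @param locale The locale indicating which collation rules to use.
   * @return A new list in locale-sensitive order.
   */
  public static List<String> sortForLocale(final List<String> words, final Locale locale) {
    final Collator collator = Collator.getInstance(locale);
    final List<String> sorted = new ArrayList<>(words);
    sorted.sort(collator);
    return sorted;
  }

  public static void main(final String[] args) {
    final List<String> words = Arrays.asList("touching", "Zanzibar", "touch\u00E9", "apple");
    System.out.println(sortForLocale(words, Locale.ENGLISH)); //prints "[apple, touché, touching, Zanzibar]"
  }
}
```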

Collator Strength

Similar to the comparison levels of UTS #10, Java provides four collator strengths, which can be set using Collator.setStrength(int newStrength). The first three correspond roughly to the Unicode L1, L2, and L3 comparison levels.

Collator.PRIMARY
Considers only the base character, and ignores accents and case.
Collator.SECONDARY
Considers the base character with any accents; case is ignored.
Collator.TERTIARY
Considers the base character, accents, and other variations such as case. This is the default strength.
Collator.IDENTICAL
The characters will only be considered equal if they are the exact same character.

PRIMARY and SECONDARY strengths are both case-insensitive, but only PRIMARY is insensitive to accents. The most permissive is therefore PRIMARY, while the most strict is IDENTICAL. As an example consider the word cafe, which sometimes appears as café to reflect the French spelling. Using PRIMARY collator strength, the words cafe, CAFE, and café would all be considered the same word. Using SECONDARY collator strength, cafe and café would be considered different words, but cafe and CAFE would be considered the same. Finally TERTIARY collator strength would consider all three forms as distinct, receiving different orderings. (The IDENTICAL strength would consider the three forms distinct as well, and would take into account any other differences that might appear beyond base characters, accents, and case.)
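The cafe example can be checked with a short sketch like the one below. The class name CollatorStrengthDemo and the helper sameAtStrength are hypothetical, and the English locale is an assumption made so that the results are predictable.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthDemo {

  /** Determines whether two strings collate as equal at a given strength.
   * @param text1 The first string.
   * @param text2 The second string.
   * @param strength One of the Collator strength constants.
   * @return true if an English collator considers the strings the same.
   */
  public static boolean sameAtStrength(final String text1, final String text2, final int strength) {
    final Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(strength);
    return collator.compare(text1, text2) == 0;
  }

  public static void main(final String[] args) {
    System.out.println(sameAtStrength("cafe", "CAFE", Collator.PRIMARY)); //prints "true"
    System.out.println(sameAtStrength("cafe", "caf\u00E9", Collator.PRIMARY)); //prints "true"
    System.out.println(sameAtStrength("cafe", "CAFE", Collator.SECONDARY)); //prints "true"
    System.out.println(sameAtStrength("cafe", "caf\u00E9", Collator.SECONDARY)); //prints "false"
    System.out.println(sameAtStrength("cafe", "CAFE", Collator.TERTIARY)); //prints "false"
  }
}
```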

Collator Decomposition

As you learned in the above sections, comparison of character sequences will not be valid unless the code points have first been normalized. For collation this is best done by first decomposing characters into their base characters, accents, and other marks, using Normalization Form D (NFD) or Normalization Form KD (NFKD), above. By default a Collator will perform no decomposition! If you wish normalization before comparison, you must set the decomposition level using Collator.setDecomposition(int decompositionMode). The Collator.CANONICAL_DECOMPOSITION level corresponds to Normalization Form D (NFD) and should be your first choice for the collator decomposition setting.
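A collator configured this way should treat the precomposed and decomposed spellings of a word as the same text. The following is a small sketch under that assumption; the class name CollatorDecompositionDemo and the factory method newNormalizingCollator are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorDecompositionDemo {

  /** Creates a collator that canonically decomposes text before comparing.
   * @param locale The locale indicating which collation rules to use.
   * @return A collator configured for canonical decomposition.
   */
  public static Collator newNormalizingCollator(final Locale locale) {
    final Collator collator = Collator.getInstance(locale);
    collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
    return collator;
  }

  public static void main(final String[] args) {
    final Collator collator = newNormalizingCollator(Locale.ENGLISH);
    final String precomposed = "touch\u00E9";
    final String decomposed = "touche\u0301";
    //both spellings decompose to the same sequence before comparison
    System.out.println(collator.compare(precomposed, decomposed) == 0); //prints "true"
  }
}
```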

Collator Comparison

To illustrate the use of a collator, suppose you want to sort a list of strings without regard to case or accents. You want to normalize the strings in case some of the strings use different composition forms. You would create and configure a collator as in the following example.

Internationalization-aware case-insensitive sorting using a Collator.
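final List<String> strings = new ArrayList<>(Arrays.asList("TOUCHING", "touch", "touché"));
final Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY); //ignore case and accents
collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION); //normalize before comparing
strings.sort(collator);
System.out.println(strings); //prints "[touch, touché, TOUCHING]"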




In the Real World

Think About It

Self Evaluation


Your Booker application compares locale-sensitive information in various places, such as when sorting publications by title and looking them up by title. You now know that the approach used so far using String comparison methods is incorrect and can yield erroneous results, such as sorting in an incorrect order or failing to find a publication by its title.

Convert your publication title sorting and lookup logic to correctly take internationalization issues into account. Sorting should be performed without regard to case or diacritics. Similarly lookup based on title should work without regard to whether the user's search string contains capital letters or accents. Create unit tests showing that the new sorting and comparison works, using character sequences for testing that would fail had the traditional String comparison methods been used.

See Also