Java in the Unicode of madness
In this post, I will cover common Java knowledge and approaches for dealing with Unicode, using Vietnamese as examples.
Charset
A Charset is a set of characters, and each charset needs a character encoding to convert those characters to numbers that computers can store. The most common Charsets are ASCII, ISO-8859-1, and Unicode.
Code point: a unique number that is assigned to each character in the Charset.
Code unit: the smallest fixed-size unit of data that an encoding uses; one or more code units together encode a code point. For example, UTF-8 uses 8-bit code units.
Code points and code units are often represented in hexadecimal (hex) for better readability.
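The difference between code points and code units is easiest to see by counting both for the same text. The sketch below is only an illustration: the class and method names are made up for this post, and the strings are written with Unicode escapes so the result does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;

final class CodeUnitDemo {
    // Number of code points in the text.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of 8-bit code units when the text is encoded as UTF-8.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units; Java strings expose UTF-16.
    static int utf16Units(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String e = "\u1EBF";            // ế
        String smile = "\uD83D\uDE00";  // 😀 (U+1F600)
        System.out.println(codePoints(e) + " " + utf8Units(e) + " " + utf16Units(e));             // 1 3 1
        System.out.println(codePoints(smile) + " " + utf8Units(smile) + " " + utf16Units(smile)); // 1 4 2
    }
}
```

One code point can therefore take a different number of code units in each encoding.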
ASCII
The ASCII Charset is one of the most widely used Charsets. It has 128 characters in total, including letters, numbers, and other common symbols. ASCII uses a 7-bit encoding.
- Code points: 0x00 - 0x7F
Even though ASCII is widely used, it is not suitable for websites that support customers from multiple countries, as their languages may contain characters that are not in the ASCII charset.
ISO-8859
To make ASCII more versatile, there is a series of ISO-8859 charsets (numbered ISO-8859-1 through ISO-8859-16) that extend ASCII to an 8-bit encoding by adding more characters. There is more than one ISO-8859 Charset because the 128 new slots were not enough for all of the characters in other languages. So, neighboring regions that share common characters may use the same ISO-8859 Charset, while other regions use different ones. The most widely used ISO-8859 Charset is ISO-8859-1.
- Code points: 0x00 - 0xFF
This approach has several problems. First, it does not scale well, as many languages are different enough to require their own charsets. Second, some languages, such as Chinese, Korean, and Japanese, have so many characters that 128 slots are not enough.
Unicode
Unicode is the current standard charset for websites, extending ISO-8859-1 for compatibility. Unicode includes not only a set of characters but also various properties for each character. The Unicode codespace contains 1,114,112 Code points; as of version 15.1, only part of them are assigned to characters.
- Code points: 0x0000 - 0x10FFFF
Plane
A range of 65,536 (10000₁₆) contiguous Unicode code points, where the first code point is an integer multiple of 65,536 (10000₁₆). Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000..U+FFFF, Plane 1 is U+10000..U+1FFFF, ..., and Plane 16 (10₁₆) is U+100000..10FFFF.
[6] Glossary of Unicode Terms
The reason there are only 17 planes is that UTF-16, by design, cannot encode Code points greater than 0x10FFFF.
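Because each plane holds exactly 65,536 code points, the plane number is simply the code point shifted right by 16 bits. Here is a minimal sketch of that arithmetic; PlaneDemo and planeOf are names I made up for illustration.

```java
final class PlaneDemo {
    // The plane number is the code point divided by 65,536 (a 16-bit right shift).
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0x1EBF));   // 0  (ế is in the BMP)
        System.out.println(planeOf(0x1F600));  // 1  (😀)
        System.out.println(planeOf(0x10FFFF)); // 16 (last code point)
    }
}
```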
BMP
The Basic Multilingual Plane (BMP), the first plane of Unicode (Plane 0), can contain almost all modern language characters and symbols. However, to support UTF-16, the Code points from U+D800 to U+DFFF are reserved.
Unicode Encoding
Unlike ASCII, Unicode can be encoded in multiple ways. Depending on the situation, one Unicode encoding may be better than another. The most common Unicode encodings are UTF-8, UTF-16, and UTF-32.
BOM
Encodings like UTF-32 and UTF-16 use multi-byte code units, so the same text produces different byte sequences on big-endian and little-endian systems. To let a decoder tell which byte order was used, the text can start with a byte order mark (BOM), the Code point U+FEFF. In UTF-16, the first two bytes then identify the byte order:
BOM | Description |
0xFEFF | Big-endian |
0xFFFE | Little-endian |
To name the big-endian and little-endian variants of an encoding explicitly, Unicode defines the suffixes -BE and -LE, respectively (for example, UTF-16BE and UTF-16LE).
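The effect of byte order and the BOM can be checked directly in Java. The following is a small sketch, not a complete encoding detector: BomDemo, detectUtf16ByteOrder, and hex are hypothetical helper names, and the detection assumes the input is already known to be UTF-16.

```java
import java.nio.charset.StandardCharsets;

final class BomDemo {
    // Guesses the byte order of UTF-16 data from its first two bytes.
    static String detectUtf16ByteOrder(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        return "unknown"; // no BOM present
    }

    // Formats bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00EA"; // ê
        System.out.println(hex(e.getBytes(StandardCharsets.UTF_16BE))); // 00 EA
        System.out.println(hex(e.getBytes(StandardCharsets.UTF_16LE))); // EA 00
        System.out.println(hex(e.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 EA
        System.out.println(detectUtf16ByteOrder(e.getBytes(StandardCharsets.UTF_16))); // UTF-16BE
    }
}
```

The explicit -BE and -LE charsets write no BOM, because the name already states the byte order.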
UTF-32
UTF-32 is the most straightforward form of Unicode. It is a fixed-length encoding that uses 32 bits to encode each Unicode Code point as its binary. Leading zeros are used to pad shorter characters to 32 bits.
- Code unit: 32 bits
Code point | UTF-32 |
U+0000 | 00000000000000000000000000000000 |
U+10FFFF | 00000000000100001111111111111111 |
The advantage of this approach is that the n-th Code point always starts at byte offset 4n, so any character in a String can be located directly. The clear disadvantage is that at least 11 bits of every Code unit are always zero, because a Code point needs at most 21 bits.
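To make the fixed-width property concrete, here is a rough sketch that encodes a String as big-endian UTF-32 by hand and then reads the n-th code point back with plain offset arithmetic. Utf32Demo and its methods are illustrative names, and unpaired surrogates are not validated.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class Utf32Demo {
    // Writes every code point as one 32-bit big-endian integer.
    static byte[] encodeUtf32BE(String s) {
        int count = s.codePointCount(0, s.length());
        ByteBuffer buf = ByteBuffer.allocate(4 * count); // ByteBuffer is big-endian by default
        s.codePoints().forEach(buf::putInt);
        return buf.array();
    }

    // Reads the n-th code point directly: it always starts at byte offset 4 * n.
    static int codePointAt(byte[] utf32, int n) {
        return ByteBuffer.wrap(utf32).getInt(4 * n);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeUtf32BE("a\uD83D\uDE00b"); // a, 😀, b
        System.out.println(bytes.length);                                 // 12
        System.out.println(Integer.toHexString(codePointAt(bytes, 1)));   // 1f600
        System.out.println(Arrays.toString(encodeUtf32BE("\uD83D\uDE00"))); // [0, 1, -10, 0]
    }
}
```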
UTF-8
UTF-8 is the most common Unicode encoding on the web. Its popularity comes from two reasons. First, UTF-8 is compatible with ASCII, meaning that any ASCII-encoded bytes decode to the same text in UTF-8. Second, UTF-8 is space-efficient for Latin-script text: every ASCII character still uses 1 byte, many other common Latin characters use 2 bytes, and the remaining characters use 3 or 4 bytes.
However, unlike UTF-32, UTF-8 encoding is not straightforward. UTF-8 splits the bits of each Code point across a sequence of 1 to 4 bytes. The leading bits of the first byte indicate how many bytes the sequence has, and every following byte starts with the bits 10.
- Code unit: 8 bits
Code point | UTF-8 |
U+0000 - U+007F | 0xxxxxxx |
U+0080 - U+07FF | 110xxxxx 10xxxxxx |
U+0800 - U+D7FF, U+E000 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
U+10000 - U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
UTF-8 never uses more space than UTF-32, and it uses far less for English and other Latin-script languages. However, not all languages benefit equally; for example, East Asian languages typically require 3 bytes per character in UTF-8.
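The bit patterns in the table above translate almost directly into code. The sketch below encodes a single code point by hand; Utf8Demo.encode is a name I made up, and in real code String.getBytes(StandardCharsets.UTF_8) does the same job.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Demo {
    // Encodes one code point into 1 to 4 bytes following the UTF-8 bit layout.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[]{(byte) cp};                       // 0xxxxxxx
        }
        if (cp <= 0x7FF) {
            return new byte[]{(byte) (0xC0 | (cp >> 6)),        // 110xxxxx
                              (byte) (0x80 | (cp & 0x3F))};     // 10xxxxxx
        }
        if (cp <= 0xFFFF) {
            return new byte[]{(byte) (0xE0 | (cp >> 12)),          // 1110xxxx
                              (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx
                              (byte) (0x80 | (cp & 0x3F))};        // 10xxxxxx
        }
        return new byte[]{(byte) (0xF0 | (cp >> 18)),              // 11110xxx
                          (byte) (0x80 | ((cp >> 12) & 0x3F)),     // 10xxxxxx
                          (byte) (0x80 | ((cp >> 6) & 0x3F)),      // 10xxxxxx
                          (byte) (0x80 | (cp & 0x3F))};            // 10xxxxxx
    }

    public static void main(String[] args) {
        // Compare the hand-written encoder with the JDK for ế (U+1EBF).
        byte[] expected = "\u1EBF".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(encode(0x1EBF), expected)); // true
    }
}
```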
UTF-16
UTF-16 is a variable-length encoding that uses 2 bytes for U+0000 to U+FFFF, and 4 bytes for U+10000 to U+10FFFF.
- Code unit: 16 bits
Code point | UTF-16 |
U+0000 - U+D7FF, U+E000 - U+FFFF | xxxxxxxx xxxxxxxx |
UTF-16 is not compatible with ASCII encoding and requires two bytes to store ASCII characters. However, UTF-16 may be better than UTF-8 for storing East Asian languages because it only needs two bytes, while UTF-8 needs three bytes. (Fact: Most East Asian websites still use UTF-8 instead of UTF-16 because HTML tags, CSS, and JavaScript are ASCII.)
Surrogate pairs
To encode Code points from U+10000 to U+10FFFF, UTF-16 uses two 16-bit Code units (4 bytes) called a surrogate pair. The first Code unit is a high surrogate from 0xD800 to 0xDBFF, and the second is a low surrogate from 0xDC00 to 0xDFFF.
High \ Low | 0xDC00 | ... | 0xDFFF |
0xD800 | U+10000 | ... | U+103FF |
... | ... | ... | ... |
0xDBFF | U+10FC00 | ... | U+10FFFF |
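The table can be reproduced with a little arithmetic: subtract 0x10000 from the code point, then put the upper 10 bits of the result into the high surrogate and the lower 10 bits into the low surrogate. A minimal sketch follows, with made-up names; the JDK already offers Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint for this.

```java
final class SurrogateDemo {
    // Splits a supplementary code point into its high and low surrogates.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point: " + cp);
        }
        int offset = cp - 0x10000;                        // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));    // lower 10 bits
        return new char[]{high, low};
    }

    // Combines a surrogate pair back into the code point it encodes.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600); // 😀
        System.out.println(Integer.toHexString(pair[0]));  // d83d
        System.out.println(Integer.toHexString(pair[1]));  // de00
        System.out.println(pair[0] == Character.highSurrogate(0x1F600)); // true
        System.out.println(Integer.toHexString(fromSurrogates(pair[0], pair[1]))); // 1f600
    }
}
```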
Java
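Java 8 and earlier store String contents as UTF-16. Since Java 9 (JEP 254), strings that contain only characters in the ISO-8859-1 Charset are stored as ISO-8859-1, one byte per character; all other strings are stored as UTF-16. This is an internal optimization: the String API still exposes UTF-16 Code units.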
Char type
In Java, the char type does not store a character but a single UTF-16 Code unit, so its size is fixed at 16 bits. A character outside the BMP therefore needs two char values.
String class
Length method
This method returns the number of Code units in the string.
String text = "Tiếng Việt có đấu";
text.length(); // 17
String textFourBytes = "Tiếng Việt có đấu 😀"; // 😀 is 2 Code units in UTF-16
textFourBytes.length(); // 20
CharAt method
This method returns the Code unit at the given index as a char.
String text = "Tiếng Việt có đấu 😀"; // 😀 is 0xD83D 0xDE00 in UTF-16
text.charAt(16); // \u0075
text.charAt(17); // \u0020
text.charAt(18); // \uD83D
text.charAt(19); // \uDE00
CodePointAt method
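This method returns the Code point at the given index. If the Code units at that index and the next index form a surrogate pair, it returns the Code point from 0x10000 to 0x10FFFF that the pair encodes. If the index points at an unpaired or trailing surrogate, it returns that surrogate's value.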
String text = "Tiếng Việt có đấu 😀"; // 😀 is U+1F600 in Unicode
text.codePointAt(16); // 0x0075
text.codePointAt(17); // 0x0020
text.codePointAt(18); // 0x1F600
text.codePointAt(19); // 0xDE00
CodePointCount method
This method returns the number of Code points between the two input indexes.
String text = "Tiếng Việt có đấu";
text.codePointCount(0, text.length()); // 17
String text1 = "Tiếng Việt có đấu 😀";
text1.codePointCount(0, text1.length()); // 19
GetBytes method
This method encodes a String into bytes with the given Character encoding (the StandardCharsets class only holds constants for the encodings that every Java platform supports). If the String contains a character that the encoding cannot represent, that character is replaced with the byte 0x3F (?) in the encodings shown here.
With StandardCharsets.UTF_16, the encoder writes a big-endian BOM (0xFE 0xFF) as the first 2 bytes; UTF_16BE and UTF_16LE write no BOM.
String text = "ℳê 😀";
text.getBytes(StandardCharsets.US_ASCII); // 0x3F 0x3F 0x20 0x3F
text.getBytes(StandardCharsets.ISO_8859_1); // 0x3F 0xEA 0x20 0x3F
text.getBytes(StandardCharsets.UTF_8); // 0xE2 0x84 0xB3 0xC3 0xAA 0x20 0xF0 0x9F 0x98 0x80
text.getBytes(StandardCharsets.UTF_16); // 0xFE 0xFF 0x21 0x33 0x00 0xEA 0x00 0x20 0xD8 0x3D 0xDE 0x00
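Silent replacement with 0x3F can hide data loss, so it is sometimes useful to check first whether a String fits the target encoding. The sketch below uses CharsetEncoder for that; StrictEncodeDemo and its method names are hypothetical, and this is one possible approach rather than the only one.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StrictEncodeDemo {
    // True when every character of s can be represented in the charset.
    static boolean canEncode(String s, Charset charset) {
        return charset.newEncoder().canEncode(s);
    }

    // Encodes s, but fails instead of substituting 0x3F for unmappable characters.
    static byte[] encodeStrict(String s, Charset charset) {
        try {
            // A fresh encoder reports unmappable characters by default.
            ByteBuffer buf = charset.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Cannot encode as " + charset + ": " + s, e);
        }
    }

    public static void main(String[] args) {
        String latin = "\u00EA";   // ê
        String script = "\u2133";  // ℳ
        System.out.println(canEncode(latin, StandardCharsets.ISO_8859_1));  // true
        System.out.println(canEncode(script, StandardCharsets.ISO_8859_1)); // false
        System.out.println(encodeStrict(latin, StandardCharsets.ISO_8859_1).length); // 1
    }
}
```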
Character class
CharCount method
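This method returns the number of Code units (char values) needed to represent the given Code point: 1 for a Code point in the BMP, 2 for any other.
int codePoint = 128512; // Decimal value of 😀 (U+1F600)
Character.charCount(codePoint); // 2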
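Together, codePointAt and charCount make it possible to walk a String one code point at a time without splitting a surrogate pair. The following sketch shows that loop and a truncation helper built on offsetByCodePoints; the class and method names are made up. Note that it works on code points, so a decomposed ế (e plus two combining marks) still counts as three.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // Splits a String into one String per code point.
    static List<String> characters(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return out;
    }

    // Keeps at most maxCodePoints code points and never cuts a surrogate pair in half.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // a, 😀, b
        System.out.println(characters(text).size());  // 3
        System.out.println(truncate(text, 2).length()); // 3 (a plus both surrogates)
    }
}
```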
Normalization
Canonical
Canonical equivalence is a fundamental equivalency between characters or sequences of characters which represent the same abstract character, and which when correctly displayed should always have the same visual appearance and behavior
[2] Unicode normalization forms
Composition | Decomposition |
ế | e + ◌̂ + ◌́ |
ℊ | ℊ |
đ | đ |
Canonical equivalence covers almost all Vietnamese characters except đ, because Unicode treats đ (U+0111) as a distinct letter with no canonical decomposition, not as the character d plus a combining mark. Canonical equivalence also does not map variant characters such as ℊ to their base letters.
Compatibility
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
[2] Unicode normalization forms
Composition | Decomposition |
ế | e + ◌̂ + ◌́ |
ℊ | g |
đ | đ |
Compatibility equivalence works like Canonical equivalence, but it also maps variant characters to their base letters. However, it still does not decompose the character đ, for the same reason as above.
Forms
Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other
[2] Unicode Normalization Forms
Form | Description |
NFD | Canonical Decomposition |
NFC | Canonical Decomposition, followed by Canonical Composition |
NFKD | Compatibility Decomposition |
NFKC | Compatibility Decomposition, followed by Canonical Composition |
Source | NFD | NFC | NFKD | NFKC |
Tiếnℊ | Tie◌̂◌́nℊ | Tiếnℊ | Tie◌̂◌́ng | Tiếng |
Java
Java supports all normalization forms through the Normalizer class.
Normalizer class
Normalize method
This method returns a string after applying normalization with the specified form.
String text = "Tiếnℊ";
Normalizer.normalize(text, Normalizer.Form.NFD); // Tie◌̂◌́nℊ
Normalizer.normalize(text, Normalizer.Form.NFC); // Tiếnℊ
Normalizer.normalize(text, Normalizer.Form.NFKD); // Tie◌̂◌́ng
Normalizer.normalize(text, Normalizer.Form.NFKC); // Tiếng
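A common reason to normalize is comparison: a precomposed ế and its decomposed form look identical but are different char sequences, so equals() returns false until both sides are normalized to the same form. Here is a small sketch of that check; NormalizeCompare and canonicallyEqual are names chosen for this example.

```java
import java.text.Normalizer;

final class NormalizeCompare {
    // Compares two strings after bringing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        String left = Normalizer.normalize(a, Normalizer.Form.NFC);
        String right = Normalizer.normalize(b, Normalizer.Form.NFC);
        return left.equals(right);
    }

    public static void main(String[] args) {
        String composed = "\u1EBF";          // ế as one code point
        String decomposed = "e\u0302\u0301"; // e + circumflex + acute
        System.out.println(composed.equals(decomposed));           // false
        System.out.println(canonicallyEqual(composed, decomposed)); // true
        // ℊ (U+210A) is only compatibility-equivalent to g, so NFC keeps them apart.
        System.out.println(canonicallyEqual("\u210A", "g"));        // false
    }
}
```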
Folding
A folding function or folding operation removes a distinction between related characters by mapping them to the same target. For example, a case folding may remove the case distinction, by replacing upper and title case variants of a character with the lower case. In other words, foldings define equivalence classes, and chose a representative or target member for each equivalence class. Applying a folding maps all members of the equivalence class to the target.
[3] Character Foldings
Folding is similar to Normalization in that both convert characters to other forms, but Folding may discard information during the conversion. The general folding specification [3] never advanced beyond draft status in Unicode.
Case Folding
Case Folding is an operation that maps characters to a single case so that text can be compared without regard to case. Because of its usefulness, most programming languages support it, and Unicode publishes case mapping and case folding data for it.
Java has no dedicated case folding method, but the toUpperCase() and toLowerCase() methods cover most uses. Conceptually, for most characters these methods work by storing the offset between a character and its upper- or lowercase variant.
Lower case | Upper case | Offset |
a-z | A-Z | 0x20 |
ế ệ đ ấ | Ế Ệ Đ Ấ | 0x01 |
String text = "Tiếnℊ Việt có đấu 😀";
text.toUpperCase(); // TIẾNℊ VIỆT CÓ ĐẤU 😀
text.toLowerCase(); // tiếnℊ việt có đấu 😀
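For caseless comparison, a common approximation is to normalize first and then map to upper case and back to lower case with a fixed locale. The sketch below assumes that this approximation is good enough for Vietnamese text; it is not a full implementation of Unicode case folding, and CaselessDemo, fold, and equalsCaseless are made-up names.

```java
import java.text.Normalizer;
import java.util.Locale;

final class CaselessDemo {
    // Approximates case folding: NFC first, then upper case, then lower case.
    // Locale.ROOT keeps the result independent of the machine's default locale.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC)
                .toUpperCase(Locale.ROOT)
                .toLowerCase(Locale.ROOT);
    }

    static boolean equalsCaseless(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        String upper = "TI\u1EBENG";           // TIẾNG, precomposed
        String lower = "tie\u0302\u0301ng";    // tiếng, decomposed
        System.out.println(upper.equalsIgnoreCase(lower)); // false: different code unit sequences
        System.out.println(equalsCaseless(upper, lower));  // true
    }
}
```

Passing an explicit Locale matters because toUpperCase() and toLowerCase() without arguments use the default locale, which can change the result on some systems.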
Diacritic Folding
Diacritic Folding is the operation of removing diacritics from characters. Java has no built-in method that performs diacritic folding, so the most common way to do it is to convert the String to NFD form and then use a regular expression to remove the combining marks.
String text = "Tiếnℊ Việt có đấu 😀";
Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{M}", ""); // Tienℊ Viet co đau 😀
As mentioned above, NFD does not decompose the character đ, so working with Vietnamese text requires an additional replace() step.
String text = "Tiếnℊ Việt có đấu 😀";
Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{M}", "").replace('đ', 'd'); // Tienℊ Viet co dau 😀
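In practice these steps are often wrapped into a helper that builds an accent-insensitive search key. The following is one possible sketch under that assumption: SearchKey.of is a hypothetical name, it also handles the uppercase Đ, and it lowercases the result so that keys compare without regard to case.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

final class SearchKey {
    // Precompiled pattern for runs of combining marks.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Removes diacritics, maps đ/Đ to d/D, and lowercases the result.
    static String of(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = MARKS.matcher(decomposed).replaceAll("");
        return stripped
                .replace('\u0111', 'd')  // đ
                .replace('\u0110', 'D')  // Đ
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Tiếng Việt có đấu" written with escapes
        String text = "Ti\u1EBFng Vi\u1EC7t c\u00F3 \u0111\u1EA5u";
        System.out.println(of(text)); // tieng viet co dau
    }
}
```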
Collation
Collation is the general term for the process and function of determining the sorting order of strings of characters.
[5] Unicode Collation Algorithm
Collation varies according to language. In the case of Vietnamese, Unicode defines the standard as follows:
a A à À ả Ả ã Ã á Á ạ Ạ ă Ă ằ Ằ ẳ Ẳ ẵ Ẵ ắ Ắ ặ Ặ â Â ầ Ầ ẩ Ẩ ẫ Ẫ ấ Ấ ậ Ậ b B c C d D đ Đ e E è È ẻ Ẻ ẽ Ẽ é É ẹ Ẹ ê Ê ề Ề ể Ể ễ Ễ ế Ế ệ Ệ g G h H i I ì Ì ỉ Ỉ ĩ Ĩ í Í ị Ị k K l L m M n N o O ò Ò ỏ Ỏ õ Õ ó Ó ọ Ọ ô Ô ồ Ồ ổ Ổ ỗ Ỗ ố Ố ộ Ộ ơ Ơ ờ Ờ ở Ở ỡ Ỡ ớ Ớ ợ Ợ p P q Q r R s S t T u U ù Ù ủ Ủ ũ Ũ ú Ú ụ Ụ ư Ư ừ Ừ ử Ử ữ Ữ ứ Ứ ự Ự v V x X y Y ỳ Ỳ ỷ Ỷ ỹ Ỹ ý Ý ỵ Ỵ
Java
String class
CompareTo method
This method returns the difference between the first pair of char values that differ, or the difference between the lengths of the two Strings when one is a prefix of the other. It compares UTF-16 Code unit values and does not follow the Unicode Collation standard.
"ả".compareTo("ã"); // 'ả' (U+1EA3) - 'ã' (U+00E3) = 7616, assuming precomposed characters
Collator class
Compare method
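This method performs a locale-sensitive comparison using the collation rules of the given Locale. It returns a negative number, zero, or a positive number, so the same Strings can sort differently under different locales.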
Collator.getInstance(new Locale("vi")).compare("ả", "ã"); // -1
Collator.getInstance(Locale.US).compare("ả", "ã"); // 1
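Because Collator implements Comparator, it can be passed straight to a sort. The sketch below contrasts it with the natural String order, which compares code unit values; SortDemo and its methods are illustrative names, and the exact ordering of tone marks depends on the collation data available in the running JDK.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SortDemo {
    // Sorts a copy of the list with Vietnamese collation rules.
    static List<String> sortVietnamese(List<String> words) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("vi"));
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    // Sorts a copy of the list by UTF-16 code unit values (String.compareTo).
    static List<String> sortByCodeUnits(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(null); // null means natural ordering
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = List.of("z", "\u00E0", "a"); // z, à, a
        System.out.println(sortByCodeUnits(words)); // [a, z, à]
        System.out.println(sortVietnamese(words));  // [a, à, z]
    }
}
```

The code unit order puts à after z because U+00E0 is numerically larger than any ASCII letter, while the Collator keeps à next to a.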