This post uses many Unicode concepts. If you are not familiar with Unicode, please read Java in the Unicode of Madness [1] before continuing.

Diacritic: a mark added to a base character to create a new character with a different pronunciation.
Accent: a type of diacritic that changes the tone of a character.

Vietnamese

The Vietnamese alphabet has 29 letters, 12 of which are vowels.

Roman character: a b c d e g h i k l m n o p q r s t u v x y
Character with diacritic: ă â đ ê ô ơ ư
Accent: ◌́ (acute), ◌̀ (grave), ◌̉ (hook), ◌̃ (tilde), ◌̣ (dot below)

The acute and grave accents

Unlike the other three accents, the acute and grave accents each have two Unicode code points.

Accent mark	Code point	Name
◌̀	U+0300	Combining Grave Accent
◌̀	U+0340	Combining Grave Tone Mark

Accent mark	Code point	Name
◌́	U+0301	Combining Acute Accent
◌́	U+0341	Combining Acute Tone Mark
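
The two code points in each table are canonically equivalent, so normalization folds the tone mark into the ordinary accent. The following is a minimal sketch of that behaviour using java.text.Normalizer; the class and method names (ToneMarkDemo, toNfd, codePoints) are made up for illustration.

```java
import java.text.Normalizer;

final class ToneMarkDemo {
    // Normalizes to NFD, which replaces U+0340/U+0341 with their
    // canonical equivalents U+0300/U+0301.
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Formats each UTF-16 code unit as U+XXXX so that visually
    // identical strings can be told apart.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withAccent = "a\u0301";   // a + Combining Acute Accent
        String withToneMark = "a\u0341"; // a + Combining Acute Tone Mark

        System.out.println(withAccent.equals(withToneMark));  // false
        System.out.println(codePoints(toNfd(withToneMark)));  // U+0061 U+0301
        System.out.println(toNfd(withAccent).equals(toNfd(withToneMark))); // true
    }
}
```

Comparing raw strings is therefore unreliable for these accents; comparing their normalized forms is not.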

The composed and decomposed forms

Vietnamese characters in strings can be in two forms: composed or decomposed. A composed form is a single code point, while a decomposed form is a base letter followed by one or more combining diacritics and accents. For example:

Character	Composed	Decomposed
á	á or á	a◌́ or a◌́
ẳ	ẳ	a◌̆◌̉
đ	đ	đ

In Vietnamese, a character and its composed form look identical, and each character has exactly one composed form. The decomposed form, however, has two special cases. First, the acute and grave accents have two code points each, so the same character can be decomposed in more than one way. Second, the character đ cannot be decomposed into d plus a stroke diacritic mark.
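
As a rough illustration of the two forms, the sketch below converts between them with java.text.Normalizer (NFC for composed, NFD for decomposed). The class name FormsDemo and its helper methods are hypothetical.

```java
import java.text.Normalizer;

final class FormsDemo {
    // Composed form: as few code points as possible.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decomposed form: base letter followed by combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u1EB3"; // a with breve and hook above, one code point
        String decomposed = decompose(composed);

        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 3: a, U+0306 (breve), U+0309 (hook above)
        System.out.println(compose(decomposed).equals(composed)); // true

        // U+0111 (d with stroke) has no canonical decomposition,
        // so NFD leaves it unchanged.
        System.out.println(decompose("\u0111").equals("\u0111")); // true
    }
}
```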

Regular Expression

Regular expressions support Unicode through the \p syntax. To match diacritics and accents, regex engines provide the \p{M} or \p{Mark} property class. The Java examples in this post use the short form \p{M}.

In Java, the Pattern and Matcher classes are used to work with regular expressions. The String class also supports regular expressions through the replaceAll() method, but this simply compiles a Pattern and creates a Matcher internally on every call. One thing to note is that Pattern instances are immutable and safe to share between threads, whereas Matcher instances are not thread-safe, so a compiled Pattern should be reused and a new Matcher created for each use.
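
A minimal sketch of that usage pattern: one compiled Pattern is shared, and each call creates its own Matcher. The class name MarkCounter and the method countMarks are invented for this example.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkCounter {
    // Pattern is immutable, so one compiled instance can be shared.
    private static final Pattern MARKS = Pattern.compile("\\p{M}");

    // Counts combining marks. A fresh Matcher is created on every call
    // because Matcher instances hold mutable state.
    static int countMarks(String s) {
        Matcher m = MARKS.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Viet" with e carrying a circumflex and a dot below (U+1EC7)
        String decomposed = Normalizer.normalize("Vi\u1EC7t", Normalizer.Form.NFD);
        System.out.println(countMarks(decomposed)); // 2
        System.out.println(countMarks("Viet"));     // 0
    }
}
```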

Removing Vietnamese diacritics

The challenge is to remove the diacritics from Vietnamese characters and return the plain alphabet characters.

Normalization

The most general solution is to decompose Vietnamese characters into their base letters and combining diacritic characters, then use a regular expression to remove the diacritic characters.

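// Requires: import java.text.Normalizer;
String text = "Tiếng Việt có đấu";
// T i e ◌̂ ◌́ ng V i e ◌̣ ◌̂ t c o ◌́ đ a ◌̂ ◌́ u
String decompositedForm = Normalizer.normalize(text, Normalizer.Form.NFD);
// Strings are immutable, so the result of replaceAll() must be kept
String stripped = decompositedForm.replaceAll("\\p{M}", ""); // Tieng Viet co đau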

This solution has several advantages. First, it works for Vietnamese and many other languages. Second, it can also handle both composed and decomposed forms. Finally, Java already has the Normalizer class to decompose characters.

However, this solution is not fast, because every string must go through both the normalization algorithm and a regular expression. Additionally, this approach does not work for the đ character, which has no decomposition, so it requires an additional replacement step to convert it to d.

String withoutDiacritics = decompositedForm.replaceAll("\\p{M}", "")
        .replace('đ', 'd')
        .replace('Đ', 'D'); // Tieng Viet co dau
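
The steps above can be packaged into one reusable method. This is only a sketch under a few assumptions: the names DiacriticStripper and strip are hypothetical, the pattern is compiled once instead of on every call, and the sample sentence is written with Unicode escapes so that the source file stays ASCII-only.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticStripper {
    // Compiled once; "+" removes a run of stacked marks in one match.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String strip(String text) {
        // NFD splits every composed character into base letter + marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("")
                .replace('\u0111', 'd')  // d with stroke, lowercase
                .replace('\u0110', 'D'); // D with stroke, uppercase
    }

    public static void main(String[] args) {
        // The sample sentence from the text, written with Unicode escapes
        String text = "Ti\u1EBFng Vi\u1EC7t c\u00F3 \u0111\u1EA5u";
        System.out.println(strip(text)); // Tieng Viet co dau
    }
}
```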

Replacing

This solution is more familiar and conceptually simpler than Normalization, as it only deals with Vietnamese. The idea is simple: use the replaceAll() method to replace each group of Vietnamese characters that share the same base letter with that letter.

String text = "Tiếng Việt có đấu";

// One chained expression: only the last call ends with a semicolon
text = text.replaceAll("[AÁÀẢÃẠÂẤẦẨẪẬĂẮẰẲẴẶ]", "A")
            .replaceAll("[àáạảãâầấậẩẫăằắặẳẵ]", "a")
            .replaceAll("[EÉÈẺẼẸÊẾỀỂỄỆ]", "E")
            .replaceAll("[èéẹẻẽêềếệểễ]", "e")
            .replaceAll("[IÍÌỈĨỊ]", "I")
            .replaceAll("[ìíịỉĩ]", "i")
            .replaceAll("[OÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢ]", "O")
            .replaceAll("[òóọỏõôồốộổỗơờớợởỡ]", "o")
            .replaceAll("[UÚÙỦŨỤƯỨỪỬỮỰ]", "U")
            .replaceAll("[ùúụủũưừứựửữ]", "u")
            .replaceAll("[YÝỲỶỸỴ]", "Y")
            .replaceAll("[ỳýỵỷỹ]", "y")
            .replaceAll("Đ", "D")
            .replaceAll("đ", "d")
            // Remove leftover combining marks from decomposed input (U+0302 is the combining circumflex)
            .replaceAll("[\u0300\u0301\u0302\u0303\u0306\u0309\u031B\u0323\u0340\u0341\u02C6]", "");

Although this solution may seem straightforward, it is not efficient: each replaceAll() call compiles its regular expression and scans the whole string again, so the text is processed 15 times.

Mapping

This is the most verbose approach: it maps each Vietnamese character to its base alphabet character.

String text = "Tiếng Việt có đấu";

Map<Character, Character> map = new HashMap<>() {{
put('á', 'a');
put('à', 'a');
put('ả', 'a');
put('ã', 'a');
put('ạ', 'a');
put('ắ', 'a');
put('ằ', 'a');
put('ẳ', 'a');
put('ẵ', 'a');
put('ặ', 'a');
put('ấ', 'a');
put('ầ', 'a');
put('ẩ', 'a');
put('ẫ', 'a');
put('ậ', 'a');
put('é', 'e');
put('è', 'e');
put('ẻ', 'e');
put('ẽ', 'e');
put('ẹ', 'e');
put('ế', 'e');
put('ề', 'e');
put('ể', 'e');
put('ễ', 'e');
put('ệ', 'e');
put('í', 'i');
put('ì', 'i');
put('ỉ', 'i');
put('ĩ', 'i');
put('ị', 'i');
put('ó', 'o');
put('ò', 'o');
put('ỏ', 'o');
put('õ', 'o');
put('ọ', 'o');
put('ố', 'o');
put('ồ', 'o');
put('ổ', 'o');
put('ỗ', 'o');
put('ộ', 'o');
put('ớ', 'o');
put('ờ', 'o');
put('ở', 'o');
put('ỡ', 'o');
put('ợ', 'o');
put('ú', 'u');
put('ù', 'u');
put('ủ', 'u');
put('ũ', 'u');
put('ụ', 'u');
put('ứ', 'u');
put('ừ', 'u');
put('ử', 'u');
put('ữ', 'u');
put('ự', 'u');
put('ý', 'y');
put('ỳ', 'y');
put('ỷ', 'y');
put('ỹ', 'y');
put('ỵ', 'y');
put('đ', 'd');
put('Á', 'A');
put('À', 'A');
put('Ả', 'A');
put('Ã', 'A');
put('Ạ', 'A');
put('Ắ', 'A');
put('Ằ', 'A');
put('Ẳ', 'A');
put('Ẵ', 'A');
put('Ặ', 'A');
put('Ấ', 'A');
put('Ầ', 'A');
put('Ẩ', 'A');
put('Ẫ', 'A');
put('Ậ', 'A');
put('É', 'E');
put('È', 'E');
put('Ẻ', 'E');
put('Ẽ', 'E');
put('Ẹ', 'E');
put('Ế', 'E');
put('Ề', 'E');
put('Ể', 'E');
put('Ễ', 'E');
put('Ệ', 'E');
put('Í', 'I');
put('Ì', 'I');
put('Ỉ', 'I');
put('Ĩ', 'I');
put('Ị', 'I');
put('Ó', 'O');
put('Ò', 'O');
put('Ỏ', 'O');
put('Õ', 'O');
put('Ọ', 'O');
put('Ố', 'O');
put('Ồ', 'O');
put('Ổ', 'O');
put('Ỗ', 'O');
put('Ộ', 'O');
put('Ớ', 'O');
put('Ờ', 'O');
put('Ở', 'O');
put('Ỡ', 'O');
put('Ợ', 'O');
put('Ú', 'U');
put('Ù', 'U');
put('Ủ', 'U');
put('Ũ', 'U');
put('Ụ', 'U');
put('Ứ', 'U');
put('Ừ', 'U');
put('Ử', 'U');
put('Ữ', 'U');
put('Ự', 'U');
put('Ý', 'Y');
put('Ỳ', 'Y');
put('Ỷ', 'Y');
put('Ỹ', 'Y');
put('Ỵ', 'Y');
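put('Đ', 'D');
// Letters that carry a diacritic but no accent
put('ă', 'a'); put('â', 'a'); put('ê', 'e'); put('ô', 'o'); put('ơ', 'o'); put('ư', 'u');
put('Ă', 'A'); put('Â', 'A'); put('Ê', 'E'); put('Ô', 'O'); put('Ơ', 'O'); put('Ư', 'U');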
}};

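StringBuilder sb = new StringBuilder(text);
for (int i = 0; i < sb.length(); i++) {
    char c = sb.charAt(i);
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == ' ') {
        continue;
    }

    Character alphabet = map.get(c);
    if (alphabet != null) {
        sb.setCharAt(i, alphabet);
    } else if (Character.getType(c) == Character.NON_SPACING_MARK) {
        // Combining mark from decomposed input: remove it and stay on the
        // same index, because the following characters shift left by one
        sb.deleteCharAt(i);
        i--;
    }
    // Any other character (digits, punctuation) is kept unchanged
}

String result = sb.toString();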

Although it is verbose, it is much faster than the other two solutions. It uses a map to store and look up characters, makes a single pass over the string, and does not use the replaceAll() method.
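
As a variation on the same idea, the table does not have to be typed by hand: it can be generated once from the Normalizer and stored in an array. The sketch below is not the implementation that was benchmarked in this post; the names GeneratedTable and strip and the chosen range U+00C0 to U+1EF9 are assumptions made for illustration. It also folds accented Latin letters from other languages that fall within the same range.

```java
import java.text.Normalizer;

final class GeneratedTable {
    private static final char FIRST = '\u00C0'; // A with grave
    private static final char LAST = '\u1EF9';  // y with tilde
    private static final char[] TABLE = buildTable();

    // Fills the table once: each character maps to its ASCII base letter,
    // or to itself when its NFD form is not "ASCII letter + marks".
    private static char[] buildTable() {
        char[] table = new char[LAST - FIRST + 1];
        for (char c = FIRST; c <= LAST; c++) {
            String nfd = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            char base = nfd.charAt(0);
            boolean asciiLetter = (base >= 'a' && base <= 'z') || (base >= 'A' && base <= 'Z');
            table[c - FIRST] = (asciiLetter && nfd.length() > 1) ? base : c;
        }
        // d with stroke has no decomposition, so it is mapped explicitly.
        table['\u0111' - FIRST] = 'd';
        table['\u0110' - FIRST] = 'D';
        return table;
    }

    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c); // ASCII fast path
            } else if (Character.getType(c) == Character.NON_SPACING_MARK) {
                // Combining mark from decomposed input: skip it.
            } else {
                sb.append(c >= FIRST && c <= LAST ? TABLE[c - FIRST] : c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample sentence from the text, written with Unicode escapes
        System.out.println(strip("Ti\u1EBFng Vi\u1EC7t c\u00F3 \u0111\u1EA5u")); // Tieng Viet co dau
    }
}
```

The trade-off is a one-time cost at class initialization in exchange for a table that cannot miss a character through a typing mistake.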

Benchmark

Tool: JMH 1.36
OS: Ubuntu 22.04
CPU: Intel Core i5-1135G7

Benchmark	Score
Mapping	27966412.833 ± 145145.204 ops/s
Normalization	1486144.004 ± 17256.300 ops/s
Replacing	153442.074 ± 1862.282 ops/s

Higher scores are better. Mapping is roughly 19 times faster than Normalization and roughly 180 times faster than Replacing.

Reference

[1] Java in the Unicode of Madness
