Removing Vietnamese diacritics in Java

This post uses many Unicode concepts. If you are not familiar with Unicode, please read Java in the Unicode of Madness before continuing.

  • Diacritic: a mark added to a base character to create a new character with a different pronunciation.

  • Accent: a type of diacritic that changes the tone of a character.

Vietnamese

The Vietnamese alphabet has 29 letters, 12 of which are vowels.

  • Roman character: a b c d e g h i k l m n o p q r s t u v x y

  • Character with diacritic: ă â đ ê ô ơ ư

  • Accent: ◌́ (acute), ◌̀ (grave), ◌̉ (hook above), ◌̃ (tilde), ◌̣ (dot below)

The acute and grave accent

Unlike the other three accents, the acute and grave accents each have two Unicode code points.

Accent mark | Code point | Name
◌̀ | U+0300 | Combining Grave Accent
◌̀ | U+0340 | Combining Grave Tone Mark
◌́ | U+0301 | Combining Acute Accent
◌́ | U+0341 | Combining Acute Tone Mark
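The two code points for each accent are canonically equivalent, which can be checked with the JDK's Normalizer. The following is a minimal sketch; the class name ToneMarkDemo and its helper methods are made up for illustration.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

final class ToneMarkDemo {
    // Canonical decomposition (NFD) of the input
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Renders each code point as U+XXXX, separated by spaces
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The tone-mark variants normalize to the plain accent code points
        System.out.println(codePoints(nfd("a\u0340"))); // U+0061 U+0300
        System.out.println(codePoints(nfd("a\u0341"))); // U+0061 U+0301
    }
}
```

Because normalization folds U+0340 and U+0341 into U+0300 and U+0301, code that normalizes first only has to deal with one code point per accent.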

The composed and decomposed form

Vietnamese characters in strings can be in two forms: composed or decomposed. A composed form is a single code point, while a decomposed form is a base letter followed by one or more combining diacritics and accents. For example:

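Character | Composed | Decomposed
á | á (U+00E1) | a + ◌́ (U+0301) or a + ◌́ (U+0341)
ẳ | ẳ (U+1EB3) | a + ◌̆ (U+0306) + ◌̉ (U+0309)
đ | đ (U+0111) | đ (no decomposition)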

A Vietnamese character and its composed form are effectively the same thing: one letter, one code point. The decomposed form, however, has two special cases. First, the acute and grave accents have two code points each, so the same letter can be decomposed in more than one way. Second, the character đ has no decomposition: it cannot be split into d plus a stroke diacritic mark.
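A short round trip through the JDK's Normalizer illustrates both forms and the đ special case. This is only a sketch; FormsDemo and its method names are hypothetical.

```java
import java.text.Normalizer;

final class FormsDemo {
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u1EB3";          // a with breve and hook above, one code point
        String decomposed = "a\u0306\u0309"; // a + breve + hook above, three code points

        System.out.println(decompose(composed).equals(decomposed)); // true
        System.out.println(compose(decomposed).equals(composed));   // true
        // d with stroke (U+0111) is left untouched by decomposition
        System.out.println(decompose("\u0111").equals("\u0111"));   // true
    }
}
```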

Regular Expression

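Regular expressions support Unicode properties through the \p{...} syntax. Diacritics and accents belong to the Mark general category, which is matched with \p{M}. Some regex engines also accept the long name \p{Mark}; in Java, \p{M} is the safe spelling.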

In Java, the Pattern and Matcher classes are used to work with regular expressions. The String class also supports regular expressions through the replaceAll() method, but this simply uses the Pattern and Matcher classes internally and compiles a new Pattern on every call. One thing to note is that a compiled Pattern is immutable and safe to share between threads, while a Matcher is not thread-safe and should be created for each use.
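As a small illustration, the sketch below compiles \p{M} once and creates a fresh Matcher for each call. MarkPatternDemo and countMarks are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkPatternDemo {
    // Compiled once and reused for every call
    private static final Pattern MARK = Pattern.compile("\\p{M}");

    // Counts combining marks in the text; a new Matcher is created per call
    static int countMarks(String text) {
        Matcher matcher = MARK.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMarks("\u00E1"));        // composed form: 0
        System.out.println(countMarks("a\u0301"));       // a + acute: 1
        System.out.println(countMarks("a\u0306\u0309")); // a + breve + hook above: 2
    }
}
```

The composed form contains no marks at all, which is why text has to be decomposed before \p{M} can remove anything.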

Removing Vietnamese diacritics

The goal is to strip the diacritics from Vietnamese text and return the plain base letters.

Normalization

The most general solution is to decompose each Vietnamese character into its base letter and combining diacritics, then use a regular expression to remove the diacritics.

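// Requires: import java.text.Normalizer;
String text = "Tiếng Việt có đấu";
// T i e ◌̂ ◌́ ng V i e ◌̣ ◌̂ t c o ◌́ đ a ◌̂ ◌́ u
String decompositedForm = Normalizer.normalize(text, Normalizer.Form.NFD);
// replaceAll() returns a new String, so keep the result
String stripped = decompositedForm.replaceAll("\\p{M}", ""); // Tieng Viet co đau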

This solution has several advantages. First, it works for Vietnamese and many other languages. Second, it can also handle both composed and decomposed forms. Finally, Java already has the Normalizer class to decompose characters.

However, this solution is comparatively slow because every String must pass through both the normalization algorithm and the regular expression engine. Additionally, it does not handle the đ character, which has no decomposed form, so an extra replacement step is needed to convert it to d.

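// đ and Đ have no decomposed form, so they are replaced explicitly
String result = decompositedForm.replaceAll("\\p{M}", "").replace('đ', 'd').replace('Đ', 'D'); // Tieng Viet co dau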
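Put together, the normalization approach might be packaged as a small utility like the one below. This is a sketch under the assumption that precompiling the pattern is acceptable; DiacriticStripper, MARKS, and strip are illustrative names, not part of any library.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticStripper {
    // Compiled once so the regex is not rebuilt on every call
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String strip(String text) {
        // Step 1: split every character into base letter + combining marks
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Step 2: drop the combining marks
        String withoutMarks = MARKS.matcher(decomposed).replaceAll("");
        // Step 3: d with stroke (U+0111, U+0110) does not decompose, so map it by hand
        return withoutMarks.replace('\u0111', 'd').replace('\u0110', 'D');
    }

    public static void main(String[] args) {
        // Same sample sentence as above, written with Unicode escapes
        System.out.println(strip("Ti\u1EBFng Vi\u1EC7t c\u00F3 \u0111\u1EA5u")); // Tieng Viet co dau
    }
}
```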

Replacing

This solution is more familiar than Normalization and less complex, as it only deals with Vietnamese. The idea is simple: for each base letter, use the replaceAll() method to replace every Vietnamese character built on that letter with the letter itself.

String text = "Tiếng Việt có đấu";

// Each call handles one group of characters that share a base letter
text = text.replaceAll("[ÀÁẠẢÃÂẦẤẬẨẪĂẰẮẶẲẴ]", "A")
            .replaceAll("[àáạảãâầấậẩẫăằắặẳẵ]", "a")
            .replaceAll("[ÈÉẸẺẼÊỀẾỆỂỄ]", "E")
            .replaceAll("[èéẹẻẽêềếệểễ]", "e")
            .replaceAll("[ÌÍỊỈĨ]", "I")
            .replaceAll("[ìíịỉĩ]", "i")
            .replaceAll("[ÒÓỌỎÕÔỒỐỘỔỖƠỜỚỢỞỠ]", "O")
            .replaceAll("[òóọỏõôồốộổỗơờớợởỡ]", "o")
            .replaceAll("[ÙÚỤỦŨƯỪỨỰỬỮ]", "U")
            .replaceAll("[ùúụủũưừứựửữ]", "u")
            .replaceAll("[ỲÝỴỶỸ]", "Y")
            .replaceAll("[ỳýỵỷỹ]", "y")
            .replaceAll("Đ", "D")
            .replaceAll("đ", "d")
            // Strip combining marks left over from decomposed input (U+0302 is the combining circumflex)
            .replaceAll("[\\u0300-\\u0303\\u0306\\u0309\\u031B\\u0323\\u0340\\u0341]", ""); // Tieng Viet co dau

Although this solution may seem straightforward, it is not efficient: each replaceAll() call compiles its regular expression, scans the whole string, and creates a new String, so the text is processed 15 times.
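Part of that cost, the repeated pattern compilation, could be avoided by compiling each rule once. The sketch below assumes that trade-off is worth the extra code; PrecompiledReplacer, RULES, and strip are hypothetical names, and the text is still scanned once per rule.

```java
import java.util.regex.Pattern;

final class PrecompiledReplacer {
    // Each row: a character class to match and the base letter that replaces it
    private static final String[][] RULES = {
        {"[àáạảãâầấậẩẫăằắặẳẵ]", "a"}, {"[ÀÁẠẢÃÂẦẤẬẨẪĂẰẮẶẲẴ]", "A"},
        {"[èéẹẻẽêềếệểễ]", "e"},       {"[ÈÉẸẺẼÊỀẾỆỂỄ]", "E"},
        {"[ìíịỉĩ]", "i"},             {"[ÌÍỊỈĨ]", "I"},
        {"[òóọỏõôồốộổỗơờớợởỡ]", "o"}, {"[ÒÓỌỎÕÔỒỐỘỔỖƠỜỚỢỞỠ]", "O"},
        {"[ùúụủũưừứựửữ]", "u"},       {"[ÙÚỤỦŨƯỪỨỰỬỮ]", "U"},
        {"[ỳýỵỷỹ]", "y"},             {"[ỲÝỴỶỸ]", "Y"},
        {"đ", "d"},                   {"Đ", "D"},
        // Combining marks left over from decomposed input
        {"[\\u0300-\\u0303\\u0306\\u0309\\u031B\\u0323\\u0340\\u0341]", ""},
    };

    // Patterns are compiled once, when the class is loaded
    private static final Pattern[] PATTERNS = new Pattern[RULES.length];
    static {
        for (int i = 0; i < RULES.length; i++) {
            PATTERNS[i] = Pattern.compile(RULES[i][0]);
        }
    }

    static String strip(String text) {
        String result = text;
        for (int i = 0; i < PATTERNS.length; i++) {
            result = PATTERNS[i].matcher(result).replaceAll(RULES[i][1]);
        }
        return result;
    }
}
```

Calling PrecompiledReplacer.strip("Tiếng Việt có đấu") applies the same rules as the chained replaceAll() calls, without recompiling any pattern.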

Mapping

This is the most verbose approach: it maps each Vietnamese character directly to its base letter.

// Requires: import java.util.HashMap; import java.util.Map;
String text = "Tiếng Việt có đấu";

Map<Character, Character> map = new HashMap<Character, Character>() {{
put('á', 'a');
put('à', 'a');
put('ả', 'a');
put('ã', 'a');
put('ạ', 'a');
put('ắ', 'a');
put('ằ', 'a');
put('ẳ', 'a');
put('ẵ', 'a');
put('ặ', 'a');
put('ấ', 'a');
put('ầ', 'a');
put('ẩ', 'a');
put('ẫ', 'a');
put('ậ', 'a');
put('é', 'e');
put('è', 'e');
put('ẻ', 'e');
put('ẽ', 'e');
put('ẹ', 'e');
put('ế', 'e');
put('ề', 'e');
put('ể', 'e');
put('ễ', 'e');
put('ệ', 'e');
put('í', 'i');
put('ì', 'i');
put('ỉ', 'i');
put('ĩ', 'i');
put('ị', 'i');
put('ó', 'o');
put('ò', 'o');
put('ỏ', 'o');
put('õ', 'o');
put('ọ', 'o');
put('ố', 'o');
put('ồ', 'o');
put('ổ', 'o');
put('ỗ', 'o');
put('ộ', 'o');
put('ớ', 'o');
put('ờ', 'o');
put('ở', 'o');
put('ỡ', 'o');
put('ợ', 'o');
put('ú', 'u');
put('ù', 'u');
put('ủ', 'u');
put('ũ', 'u');
put('ụ', 'u');
put('ứ', 'u');
put('ừ', 'u');
put('ử', 'u');
put('ữ', 'u');
put('ự', 'u');
put('ý', 'y');
put('ỳ', 'y');
put('ỷ', 'y');
put('ỹ', 'y');
put('ỵ', 'y');
put('đ', 'd');
put('Á', 'A');
put('À', 'A');
put('Ả', 'A');
put('Ã', 'A');
put('Ạ', 'A');
put('Ắ', 'A');
put('Ằ', 'A');
put('Ẳ', 'A');
put('Ẵ', 'A');
put('Ặ', 'A');
put('Ấ', 'A');
put('Ầ', 'A');
put('Ẩ', 'A');
put('Ẫ', 'A');
put('Ậ', 'A');
put('É', 'E');
put('È', 'E');
put('Ẻ', 'E');
put('Ẽ', 'E');
put('Ẹ', 'E');
put('Ế', 'E');
put('Ề', 'E');
put('Ể', 'E');
put('Ễ', 'E');
put('Ệ', 'E');
put('Í', 'I');
put('Ì', 'I');
put('Ỉ', 'I');
put('Ĩ', 'I');
put('Ị', 'I');
put('Ó', 'O');
put('Ò', 'O');
put('Ỏ', 'O');
put('Õ', 'O');
put('Ọ', 'O');
put('Ố', 'O');
put('Ồ', 'O');
put('Ổ', 'O');
put('Ỗ', 'O');
put('Ộ', 'O');
put('Ớ', 'O');
put('Ờ', 'O');
put('Ở', 'O');
put('Ỡ', 'O');
put('Ợ', 'O');
put('Ú', 'U');
put('Ù', 'U');
put('Ủ', 'U');
put('Ũ', 'U');
put('Ụ', 'U');
put('Ứ', 'U');
put('Ừ', 'U');
put('Ử', 'U');
put('Ữ', 'U');
put('Ự', 'U');
put('Ý', 'Y');
put('Ỳ', 'Y');
put('Ỷ', 'Y');
put('Ỹ', 'Y');
put('Ỵ', 'Y');
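put('Đ', 'D');
// Vowels that carry a diacritic but no tone mark
put('ă', 'a');
put('â', 'a');
put('ê', 'e');
put('ô', 'o');
put('ơ', 'o');
put('ư', 'u');
put('Ă', 'A');
put('Â', 'A');
put('Ê', 'E');
put('Ô', 'O');
put('Ơ', 'O');
put('Ư', 'U');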
}};

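StringBuilder sb = new StringBuilder(text);
for (int i = 0; i < sb.length(); i++) {
    char c = sb.charAt(i);
    // ASCII letters and spaces pass through unchanged
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == ' ') {
        continue;
    }

    Character alphabet = map.get(c);
    if (alphabet == null) {
        // Unmapped characters (combining marks, but also digits and punctuation) are dropped
        sb.deleteCharAt(i);
        i--; // stay on this index, because the remaining characters shifted left
    } else {
        sb.setCharAt(i, alphabet);
    }
}

String result = sb.toString(); // Tieng Viet co dau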

Although it is verbose, it is far faster than the other two solutions. It makes a single pass over the text with one constant-time map lookup per character, and it avoids both normalization and regular expressions.
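If typing the table by hand is a concern, one possible variation is to generate it once at startup from the Normalizer and keep the per-character lookup as is. The sketch below is an assumption-laden alternative, not the code that was benchmarked: MappingSketch, TABLE, and strip are invented names, the scanned range also picks up accented letters from other languages, and unmapped characters such as digits are kept rather than deleted.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class MappingSketch {
    // Built once when the class is loaded; lookups afterwards never touch the Normalizer
    private static final Map<Character, Character> TABLE = buildTable();

    private static Map<Character, Character> buildTable() {
        Map<Character, Character> table = new HashMap<>();
        // U+00C0..U+1EF9 covers every precomposed Vietnamese letter
        for (char c = '\u00C0'; c <= '\u1EF9'; c++) {
            String nfd = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            char base = nfd.charAt(0);
            boolean asciiLetter = (base >= 'a' && base <= 'z') || (base >= 'A' && base <= 'Z');
            if (nfd.length() > 1 && asciiLetter) {
                table.put(c, base);
            }
        }
        // d with stroke has no decomposition, so it is added by hand
        table.put('\u0111', 'd');
        table.put('\u0110', 'D');
        return table;
    }

    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character base = TABLE.get(c);
            if (base != null) {
                sb.append(base);          // precomposed letter: emit its base letter
            } else if (!isCombiningMark(c)) {
                sb.append(c);             // anything else except combining marks is kept
            }
        }
        return sb.toString();
    }

    private static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }
}
```

Building into a new StringBuilder instead of deleting in place also sidesteps index bookkeeping when a character is dropped.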

Benchmark

  • Tool: JMH 1.36

  • OS: Ubuntu 22.04

  • CPU: Intel Core i5-1135G7

Benchmark | Score (ops/s)
Mapping | 27966412.833 ± 145145.204
Normalization | 1486144.004 ± 17256.300
Replacing | 153442.074 ± 1862.282

Reference

[1] Java in the Unicode of madness