Imagine you have a web site in a language other than English, one that uses special characters (letters plus modifiers such as accents). Add some SEO requirements (links ending in .html, URLs containing the article title) and you get a big headache. Or you want to provide search that matches the text whether the user types the correct spelling (with special characters) or the simplified, ASCII-like words.

Fortunately Java provides an easy way to extract the base character from an accented Unicode character. Btw, if you didn’t already know, Java uses UTF-16 as the internal representation for char and String, so a single char is not always a complete character.

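All you have to do is “normalize” the text, which splits each character into its base character plus modifiers (combining marks), and then keep only the part you want, the base character. Letters that have no decomposition, such as ø or ß, come through unchanged, so handle those separately if you need pure ASCII. You can store this string in the database alongside the original version and use one or the other depending on the context.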
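
To make the "base character + modifiers" split concrete, here is a minimal sketch that prints the code points of a string after NFD normalization. The class and method names (`DecompositionDemo`, `describeNfd`) are made up for illustration, and the sketch assumes Java 8 or later for `String.codePoints()`.

```java
import java.text.Normalizer;

public class DecompositionDemo {

    // Hypothetical helper: lists the code points of the NFD form of the input,
    // separated by spaces, in U+XXXX notation.
    static String describeNfd(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // e with acute accent: base letter 'e' followed by COMBINING ACUTE ACCENT
        System.out.println(describeNfd("\u00e9")); // U+0065 U+0301
        // s with comma below: base letter 's' followed by COMBINING COMMA BELOW
        System.out.println(describeNfd("\u0219")); // U+0073 U+0326
    }
}
```

The first code point of each pair is the plain ASCII letter; the second is the modifier that gets dropped.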

Below is a sample routine that does the conversion for you:

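    // requires: import java.text.Normalizer;
    public static final String toBaseCharacters(final String sText) {
        if (sText == null || sText.isEmpty())
            return sText;

        // NFD splits every accented letter into its base letter followed by the modifiers
        final String sDecomposed = Normalizer.normalize(sText, Normalizer.Form.NFD);

        final int iSize = sDecomposed.length();

        final StringBuilder sb = new StringBuilder(iSize);

        // walk by code point so that characters outside the BMP (surrogate pairs) stay intact
        for (int i = 0; i < iSize; ) {
            final int cp = sDecomposed.codePointAt(i);
            i += Character.charCount(cp);

            final int iType = Character.getType(cp);

            // drop the modifiers (Unicode categories Mn, Mc, Me), keep everything else
            if (iType != Character.NON_SPACING_MARK
                    && iType != Character.COMBINING_SPACING_MARK
                    && iType != Character.ENCLOSING_MARK)
                sb.appendCodePoint(cp);
        }

        return sb.toString();
    }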
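
As a rough illustration of the two use cases from the introduction (SEO-friendly links and accent-insensitive search), the sketch below builds a `.html` slug from an article title and compares a query against stored text. All names here (`SlugDemo`, `stripMarks`, `toSlug`, `matchesIgnoringMarks`) are hypothetical, and the slug rules (lowercase, dashes between words) are an assumption, not a requirement from any SEO standard. It strips modifiers with the regex class `\p{M}` instead of a loop, which is just a shorter way to do the same thing.

```java
import java.text.Normalizer;
import java.util.Locale;

public class SlugDemo {

    // Hypothetical helper: decompose with NFD, then remove all combining marks.
    static String stripMarks(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Hypothetical slug builder: base characters, lowercase, dashes between words, ".html" suffix.
    static String toSlug(String title) {
        String base = stripMarks(title).toLowerCase(Locale.ROOT);
        String slug = base.replaceAll("[^a-z0-9]+", "-")   // any run of other characters becomes one dash
                          .replaceAll("^-+|-+$", "");      // trim leading and trailing dashes
        return slug + ".html";
    }

    // Hypothetical search check: true if the query occurs in the stored text, ignoring modifiers and case.
    static boolean matchesIgnoringMarks(String stored, String query) {
        String haystack = stripMarks(stored).toLowerCase(Locale.ROOT);
        String needle = stripMarks(query).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }

    public static void main(String[] args) {
        // Romanian title written with Unicode escapes to avoid source-encoding problems
        String title = "\u0218tiri \u0219i nout\u0103\u021bi";
        System.out.println(toSlug(title));                          // stiri-si-noutati.html
        System.out.println(matchesIgnoringMarks(title, "noutati")); // true
    }
}
```

In a real application you would probably store the stripped version in its own database column, so the normalization runs once at write time and not on every search.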

For more details you can check out:
Text Normalization (Core Java Technologies Tech Tips, Feb 2007)