Handling UTF-8 in JavaScript, PHP, and Non-UTF8 Databases

Dealing with characters outside the ASCII range on the web is tough. It’s tough in other environments too, but particularly for web applications since text needs to move through so many places without being mangled — from user input, through JavaScript, into and out of PHP and string manipulation functions, into and out of databases. If you’re not careful, the text you start with isn’t what you’ll end up with after you’re done handling it. That was the case with W3Counter for a long time, but not any longer. I’ll tell you how.

Unicode is the preferred way to represent text outside the ASCII range, which covers virtually all non-English languages. Unicode maps characters to integers (code points) and includes far more characters than Windows-1252 or ISO-8859-1, the character sets most commonly used for English documents. Luckily there’s an encoding of Unicode, UTF-8, that has wide operating system and browser support.

Handling UTF-8 in HTML

The first step in capturing and displaying non-English text is to deliver your webpages with the UTF-8 encoding. This tells the browser to interpret the text of the webpage as UTF-8 sequences, allowing it to display characters UTF-8 can encode that other character sets can’t. There are two places your page tells the browser what encoding to use — the Content-Type HTTP header, and the Content-Type meta tag.

On an Apache 1.3.12 or later server, you can set what content-type header will be sent by default with the AddCharset, AddType, or AddDefaultCharset directives. These can be set in a .htaccess file if you’re on shared hosting and don’t have access to the server’s main configuration file. You can also specify the character set in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

If you’re using IIS, you can find the content-type setting for each file type under the “Headers” menu in the properties of your web site.

Handling UTF-8 in JavaScript

JavaScript works with all text internally as Unicode, so once the browser has decoded a UTF-8 page, your scripts handle the text properly without any extra care. However, in the context of web application development, JavaScript is often used to pass data to server-side scripts. Whether it’s done through rendering HTML (such as constructing an iframe URL) or through AJAX calls, you may need to send text as a parameter in a URL’s query string.

You’ll often see escape() used to prepare the string for use in a URL; it escapes characters like the ampersand that would otherwise result in a malformed URL. However, escape() doesn’t encode characters outside the ASCII range as UTF-8: it emits Latin-1 style %XX sequences or non-standard %uXXXX sequences, which the receiving script usually can’t interpret. You simply can’t rely on escape() for Unicode text.

Luckily, all recent browsers support two newer JavaScript functions, encodeURIComponent() and encodeURI(). Both are safe for Unicode text: they encode each character as percent-escaped UTF-8 bytes, and they also take care of the characters escape() handled to make the text usable in a URL. The encodeURI() function encodes entire URIs, so it leaves delimiters such as : / ? & = # intact. encodeURIComponent() encodes a string for use as an individual parameter of a URI, so it encodes everything except letters, digits, and the characters - _ . ! ~ * ' ( ).
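A side-by-side comparison makes the difference concrete. This is only an illustrative sketch: the sample text and the example.com URL are made up, and the results in the comments are what a standards-compliant JavaScript engine produces.

```javascript
// Sample input, made up for illustration: "café & あ"
var text = "caf\u00e9 & \u3042";

// escape(): Latin-1 style %XX for é, non-standard %uXXXX for あ.
var legacy = escape(text);
// "caf%E9%20%26%20%u3042"

// encodeURIComponent(): percent-encoded UTF-8 bytes, and "&" is escaped too.
var component = encodeURIComponent(text);
// "caf%C3%A9%20%26%20%E3%81%82"

// encodeURI(): same UTF-8 encoding, but URL delimiters are left alone.
var whole = encodeURI("http://example.com/search?q=caf\u00e9&lang=fr");
// "http://example.com/search?q=caf%C3%A9&lang=fr"
```

Note that encodeURI() leaves the & between parameters untouched, which is why it is unsuitable for encoding a single parameter value.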

In short, if you’re using escape(), use encodeURIComponent() instead. If you’re worried about breaking compatibility with very old browsers, you can test for the existence of the function before using it:

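// Test via window: a bare reference to a missing function throws a ReferenceError.
if (window.encodeURIComponent) {
    string = encodeURIComponent(string);
} else {
    // Very old browsers only; escape() is not safe for Unicode text.
    string = escape(string);
}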
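When several values go into one query string, each name and value can be encoded separately before joining. The following is a minimal sketch under stated assumptions: buildQueryString is a hypothetical helper written for this example, and the /track.php endpoint and parameter names are made up.

```javascript
// buildQueryString is a hypothetical helper, not part of any library.
function buildQueryString(params) {
    var pairs = [];
    for (var key in params) {
        if (Object.prototype.hasOwnProperty.call(params, key)) {
            // Encode names and values separately so "&" and "=" inside
            // the data cannot be mistaken for delimiters.
            pairs.push(encodeURIComponent(key) + "=" +
                       encodeURIComponent(params[key]));
        }
    }
    return pairs.join("&");
}

// The /track.php endpoint is made up for this example.
var url = "/track.php?" + buildQueryString({
    title: "Caf\u00e9 & Bar",
    ref: "http://example.com/?a=1"
});
// "/track.php?title=Caf%C3%A9%20%26%20Bar&ref=http%3A%2F%2Fexample.com%2F%3Fa%3D1"
```

The delimiters between parameters are added by the helper itself, so only the data is ever passed through encodeURIComponent().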

Handling UTF-8 in PHP

Internally, PHP treats strings as sequences of bytes, and most of its string functions assume one byte per character, in effect a single-byte encoding such as ISO-8859-1/Latin-1. UTF-8 uses several bytes for each character outside the ASCII range, which makes handling UTF-8 text difficult. Most string functions in PHP will handle the text as if it were Latin-1, and your output can end up looking like garbled junk. PHP provides a multibyte string function library (mbstring) if your host has compiled it into their PHP build, although it’s sometimes difficult to use and doesn’t provide equivalents to all the string functions PHP normally provides.
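The mismatch is easy to see by comparing characters with bytes. The sketch below uses JavaScript purely for illustration (it is not PHP, and it only simulates what a byte-oriented function sees). It assumes an environment that provides TextEncoder, such as a modern browser or a recent Node.js; utf8Bytes and readBytesAsLatin1 are hypothetical helper names.

```javascript
// Encode a string to its UTF-8 bytes (TextEncoder always produces UTF-8).
function utf8Bytes(text) {
    return new TextEncoder().encode(text);
}

// Simulate a byte-oriented function: treat every byte as one Latin-1 character.
function readBytesAsLatin1(bytes) {
    return Array.from(bytes, function (b) {
        return String.fromCharCode(b);
    }).join("");
}

var word = "caf\u00e9";                  // "café", 4 characters
var bytes = utf8Bytes(word);             // 5 bytes: 63 61 66 c3 a9
var garbled = readBytesAsLatin1(bytes);  // "cafÃ©"
```

The single character é becomes two bytes in UTF-8, so any function that counts, truncates, or re-encodes byte by byte can split or misread it.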

PHP handles integers just fine, and Unicode is just a mapping of characters to integers. We can take advantage of that, using some handy functions Scott Reynen has written, to deal with the incoming UTF-8 text. He provides several functions that work well together, allowing you to convert strings to Unicode, convert Unicode to HTML entities for display on a webpage, and do simple string manipulation.

Storing UTF-8 in Non-UTF-8 Databases

The beauty of Scott’s unicode_to_entities_preserving_ascii() function is that it turns UTF-8 encoded text into a string represented entirely with ASCII characters. All of the characters outside the ASCII range are turned into their HTML escape sequences, like &#1575;. That means you don’t need your database tables to be set to the UTF-8 character set, which you may not be able to change on shared hosting and which often isn’t the default.
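The underlying idea can be sketched in a few lines. This is not Scott Reynen’s PHP code; it is a JavaScript illustration of the same technique, and toNumericEntities is a hypothetical name chosen for this example.

```javascript
// toNumericEntities is a hypothetical name; this is not Scott Reynen's code.
function toNumericEntities(text) {
    var out = "";
    // for...of walks the string by code point, so characters outside the
    // Basic Multilingual Plane are not split into surrogate halves.
    for (var ch of text) {
        var codePoint = ch.codePointAt(0);
        // ASCII passes through unchanged; everything else becomes &#NNNN;
        out += codePoint < 128 ? ch : "&#" + codePoint + ";";
    }
    return out;
}

var stored = toNumericEntities("abc \u0627");
// "abc &#1575;"
```

Because ASCII passes through untouched, a literal & already present in the text is not escaped by this sketch, so text that contains something shaped like an entity would be ambiguous when decoded later.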

This is useful even if your output format isn’t HTML. Now that you have a way to get the text into the database without losing non-English characters, you can convert it back after you get it out for use elsewhere in your app. PHP has a built-in function which will handle this part for you: html_entity_decode.

$original_string = html_entity_decode($string, ENT_NOQUOTES, 'UTF-8');
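If the stored text is consumed by a script rather than by PHP, the reverse step might look like the sketch below. It is an assumption-laden illustration: fromNumericEntities is a hypothetical helper, and unlike html_entity_decode it handles only decimal numeric entities, not named ones such as &amp;eacute;.

```javascript
// fromNumericEntities is a hypothetical helper. Unlike PHP's
// html_entity_decode, it handles only decimal numeric entities.
function fromNumericEntities(text) {
    return text.replace(/&#(\d+);/g, function (match, digits) {
        var codePoint = parseInt(digits, 10);
        // Leave out-of-range values untouched instead of throwing.
        return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
    });
}

var restored = fromNumericEntities("abc &#1575;");
// "abc \u0627"
```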

The caveat, of course, is that you can’t search or sort on those strings in the database properly. If you need those abilities, you need to ensure the database, the table, the columns, and the connection are all set to the UTF-8 character set, and that you don’t use any non-multibyte-safe functions on the strings before inserting or after retrieving them.

And there you have it: Handling non-English text in your web applications even in a shared hosting environment.

  • http://www.askapache.com/htaccess/using-http-headers-with-htaccess.html#language-and-content-header-in-htaccess Apache Administrator

    Dan, this is a fantastic post! I’ve always wanted to know about this stuff, and some stuff I didn’t even think about before! Thanks!

  • http://hamstu.com Hamish M

    Excellent post Dan.

    Working in Montreal, it’s a requirement for most corporate websites to be in both English and French, so I’ve had to deal with these issues in the past, and will certainly deal with them in the future. It can be a slippery slope, particularly when it comes to PHP. Thanks for the tips.

  • Olivier

    Hi,

    All my server-side scripts assume the input is latin1 (iso-8859-1), so sending text through javascript gives errors. Can javascript convert from unicode to latin1 on the client side? Man, why are people still using anything other than utf-8?!

  • http://devcha.blogspot.com Slavi

    Hi thanks for your post.

    The trick with escaping non English characters is good but if you have to perform a search like this you have to supply an “escaped” string in the search.

    SELECT *
    FROM table1
    WHERE real_name LIKE '%&ecute;cute%'

    this SQL is just an example.

    Slavi

  • TK

    > it’s going to handle UTF-8 encoded text properly without any extra care

    I think you are confusing UTF-8 with Unicode. In fact, you DO need extra care to handle UTF-8 encoded strings.
    For example, 'あ', when Unicode encoded, becomes '\u3042', whereas 'あ' is '\xe3\x81\x82' when UTF-8 encoded.
    The former works, the latter doesn’t when displayed.

  • http://dh.squidcode.com dH

    Thank you for the idea – mine was the same but it is good to see similar solutions. Usually storing everything in UTF8 makes the RDBMS (MySQL, for example) MUCH slower. If you don’t need sorting, just searching and displaying the values correctly, it’s better to encode as html/base64/anything and store as latin-1 – just faster.

  • pascal

    Extremely useful and clear post, that summarizes utf8 in a great way!!

    I just had to add
    header('Content-Type: text/html; charset=utf-8');

    cos I dont want to edit the apache settings to add the correct charset

  • Maarten

    Note that the html_entity_decode($string, ENT_NOQUOTES, 'UTF-8'); trick doesn’t work in PHP 4 (somehow they fixed it in 5, and it seems they refuse to backport it), but mb_decode_numericentity() might be a good workaround.

  • Mone

    great post.
    Just a little note: the first line of the javascript code must be
    if (window.encodeURIComponent) {
    otherwise browsers without the encodeURIComponent method will throw an exception instead of executing the escape method.

  • Mark

    Hi, if you use encodeURIComponent in javascript to store a text in the database, what do you use to get the same decoded text back in a java program? I have used decodeURI but it does not seem to work. What it returns is in ANSI format. I do know how to change from ANSI to UTF8, but not everything seems to be supported and sometimes parts of the string are lost.

  • Kashif

    Really good post indeed, It helped me find and solve UTF-8 problem with my current project. Thanks Dan

  • Yardboy

    So very helpful – gracias!

  • Johnny

    Complete and better solution:

    function encode_utf8(s)
    {
        // check that the function is present (it is missing in old browsers)
        if (window.encodeURIComponent)
        {
            return unescape(encodeURIComponent(s));
        }
        else
        {
            return escape(s);
        }
    }

    function decode_utf8(s)
    {
        // check that the function is present (it is missing in old browsers)
        if (window.decodeURIComponent)
        {
            return decodeURIComponent(escape(s));
        }
        else
        {
            return unescape(s);
        }
    }

  • Alex

    Thanks Dan, I was stuck wondering why things weren’t working until I read your notes about js escape.

  • Charles Hamel

    Great! thank you very much