Handling UTF-8 in JavaScript, PHP, and Non-UTF8 Databases

Dealing with characters outside the ASCII range on the web is tough. It’s tough in other environments too, but particularly for web applications since text needs to move through so many places without being mangled — from user input, through JavaScript, into and out of PHP and string manipulation functions, into and out of databases. If you’re not careful, the text you start with isn’t what you’ll end up with after you’re done handling it. That was the case with W3Counter for a long time, but not any longer. I’ll tell you how.

Unicode is the preferred method of representing text outside the ASCII range, which includes text from virtually all non-English languages. Unicode maps characters to integers, and covers a far larger range of characters than Windows-1252 or ISO-8859-1, the character sets most commonly used for English documents. Luckily there's an encoding, UTF-8, that can represent every Unicode character and has wide operating system and browser support.

Handling UTF-8 in HTML

The first step in capturing and displaying non-English text is to deliver your webpages with the UTF-8 encoding. This tells the browser to interpret the text of the webpage as UTF-8 sequences, allowing it to display characters UTF-8 can encode that other character sets can’t. There are two places your page tells the browser what encoding to use — the Content-Type HTTP header, and the Content-Type meta tag.

On an Apache 1.3.12 or later server, you can set what content-type header will be sent by default with the AddCharset, AddType, or AddDefaultCharset directives. These can be set in a .htaccess file if you’re on shared hosting and don’t have access to the server’s main configuration file. You can also specify the character set in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
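On the Apache side, a minimal .htaccess sketch (assuming your host allows these directives in override files) might look like this:

```apache
# Send charset=utf-8 in the Content-Type header for text responses
AddDefaultCharset UTF-8

# Or associate the charset with specific file extensions
AddCharset UTF-8 .html .php
```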

If you’re using IIS, you can find the content-type setting for each file type under the “Headers” menu in the properties of your web site.

Handling UTF-8 in JavaScript

JavaScript internally works with all text in Unicode, so it’s going to handle UTF-8 encoded text properly without any extra care. However, in the context of web application development, JavaScript is often used to pass off data to server-side scripts. Whether it’s done through rendering HTML (such as constructing an iframe URL) or through AJAX calls, you may need to send text as a parameter in a URL’s query string.

You’ll often see escape() used to prepare a string for use in a URL; it escapes characters like the ampersand that would otherwise result in a malformed URL. However, escape() encodes characters outside the Latin-1 range as nonstandard %uXXXX sequences, which the receiving script won’t be able to interpret as UTF-8. You simply can’t use escape() on Unicode text.

Luckily, all recent browsers support two newer JavaScript functions, encodeURIComponent() and encodeURI(). Both are safe for Unicode text: they percent-encode non-ASCII characters as their UTF-8 byte sequences, in addition to everything escape() did to make the text usable in a URL. The encodeURI() function encodes entire URIs, so it leaves reserved characters such as :?&/ intact. encodeURIComponent() encodes a string for use as an individual parameter of a URI, escaping everything except letters, digits, and -_.!~*'().
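A quick comparison shows why the distinction matters (the string "José" here is just an example; é is U+00E9, encoded in UTF-8 as the bytes C3 A9):

```javascript
var name = "José";

// escape() emits a Latin-1-style escape the server can't decode as UTF-8
var bad = escape(name);                 // "Jos%E9"

// encodeURIComponent() emits the UTF-8 byte sequence
var good = encodeURIComponent(name);    // "Jos%C3%A9"

// encodeURI() is for whole URLs: it leaves :, /, ? and & alone
var url = encodeURI("http://example.com/track?q=" + name);
// "http://example.com/track?q=Jos%C3%A9"
```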

In short, if you’re using escape(), use encodeURIComponent() instead. If you’re worried about breaking compatibility with very old browsers, you can test for the existence of the function before using it:

if (typeof encodeURIComponent === "function") {
    string = encodeURIComponent(string);
} else {
    string = escape(string);
}

Handling UTF-8 in PHP

Internally, PHP treats strings as sequences of bytes; most of its string functions assume one byte per character, as in ISO-8859-1/Latin-1. That assumption is incompatible with multibyte UTF-8 text: run UTF-8 through those functions and your output can come out as garbled junk. PHP provides a multibyte string function library (mbstring) if your host has compiled it into their PHP build, although it’s sometimes awkward to use and doesn’t provide equivalents for every string function PHP normally offers.
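The core problem is byte counts versus character counts: PHP’s strlen() counts bytes, while mbstring’s mb_strlen() counts characters. The same distinction can be illustrated in Node.js (used here only for demonstration):

```javascript
var text = "José"; // 4 characters, but 5 bytes in UTF-8 (é is C3 A9)

var characters = text.length;                // 4
var bytes = Buffer.byteLength(text, "utf8"); // 5

// A byte-oriented function like PHP's strlen() reports 5 for this
// string, and a naive substr() at a byte boundary can split the
// two-byte é in half, corrupting the text.
```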

PHP handles integers just fine, and Unicode is just a mapping of characters to integers. We can take advantage of that, using some handy functions Scott Reynen has written, to deal with the incoming UTF-8 text. He provides several functions that work well together, allowing you to convert strings to Unicode, convert Unicode to HTML entities for display on a webpage, and do simple string manipulation.

Storing UTF-8 in Non-UTF-8 Databases

The beauty of Scott’s unicode_to_entities_preserving_ascii() function is that it turns UTF-8-encoded text into a string represented entirely with ASCII characters. All of the characters outside the ASCII range are turned into their numeric HTML entities, like &#1575;. That means your database tables don’t need to use the UTF-8 character set — something you often can’t change on shared hosting, and rarely the default.
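The idea behind that function is straightforward; a rough JavaScript sketch of the same transformation (the function name here is mine, not Scott’s) looks like this:

```javascript
// Replace every character outside the ASCII range with a numeric
// HTML entity, leaving plain ASCII untouched. This simple version
// handles characters in the Basic Multilingual Plane; characters
// above U+FFFF would need surrogate-pair handling.
function entitiesPreservingAscii(str) {
    var out = "";
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        out += (code < 128) ? str.charAt(i) : "&#" + code + ";";
    }
    return out;
}

var encoded = entitiesPreservingAscii("José"); // "Jos&#233;"
```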

This is useful even if your output format isn’t HTML. Now that you have a way to get the text into the database without losing non-English characters, you can convert it back after you get it out for use elsewhere in your app. PHP has a built-in function which will handle this part for you: html_entity_decode.

$original_string = html_entity_decode($string, ENT_NOQUOTES, 'UTF-8');

The caveat, of course, is that you can’t search or sort on those strings in the database properly. If you need those abilities, you need to ensure the database, the table, the columns, and the connection are all set to the UTF-8 character set, and that you don’t use any non-multibyte-safe functions on the strings before inserting or after retrieving them.
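If you do control the database and want real UTF-8 searching and sorting, the setup on MySQL 4.1 or later looks something like this (the table name is hypothetical):

```sql
-- Convert an existing table and its text columns to UTF-8
ALTER TABLE visitors CONVERT TO CHARACTER SET utf8;

-- Make sure the connection itself speaks UTF-8, once per session
SET NAMES 'utf8';
```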

And there you have it: Handling non-English text in your web applications even in a shared hosting environment.