Handling UTF-8 in JavaScript, PHP, and Non-UTF8 Databases

Dealing with characters outside the ASCII range on the web is tough. It’s tough in other environments too, but particularly for web applications since text needs to move through so many places without being mangled — from user input, through JavaScript, into and out of PHP and string manipulation functions, into and out of databases. If you’re not careful, the text you start with isn’t what you’ll end up with after you’re done handling it. That was the case with W3Counter for a long time, but not any longer. I’ll tell you how.

Unicode is the standard way to represent text outside the ASCII range, which covers virtually every non-English language. It assigns each character an integer (a code point) and includes far more characters than Windows-1252 or ISO-8859-1, the character sets most commonly used for English documents. Unicode itself doesn’t say how those integers are stored as bytes; that is the job of an encoding. Luckily one encoding of Unicode, UTF-8, has wide operating system and browser support.
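To make the difference between a code point and its encoding concrete, here is a small JavaScript sketch. The helper names codePoints and utf8Bytes are made up for this illustration, and it assumes an environment that provides TextEncoder (current browsers and Node.js do).

```javascript
// Hypothetical helpers, for illustration only.

// Return the Unicode code point (an integer) of every character in a string.
function codePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// Return the UTF-8 bytes that encode a string, as two-digit hex strings.
function utf8Bytes(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0'));
}

// "A" followed by e-acute (U+00E9).
console.log(codePoints('A\u00e9')); // [ 65, 233 ]
console.log(utf8Bytes('A\u00e9'));  // [ '41', 'c3', 'a9' ]

// The Arabic letter alef (U+0627).
console.log(codePoints('\u0627'));  // [ 1575 ]
console.log(utf8Bytes('\u0627'));   // [ 'd8', 'a7' ]
```

ASCII characters keep their single-byte values in UTF-8, while everything else takes two or more bytes.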

Handling UTF-8 in HTML

The first step in capturing and displaying non-English text is to deliver your webpages with the UTF-8 encoding. This tells the browser to interpret the text of the webpage as UTF-8 sequences, allowing it to display characters UTF-8 can encode that other character sets can’t. There are two places your page tells the browser what encoding to use — the Content-Type HTTP header, and the Content-Type meta tag.

On an Apache 1.3.12 or later server, you can set what content-type header will be sent by default with the AddCharset, AddType, or AddDefaultCharset directives. These can be set in a .htaccess file if you’re on shared hosting and don’t have access to the server’s main configuration file. You can also specify the character set in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

If you’re using IIS, you can find the content-type setting for each file type under the “Headers” menu in the properties of your web site.

Handling UTF-8 in JavaScript

JavaScript works with all text as Unicode internally, so once the browser has decoded a UTF-8 page, your scripts handle its text properly without any extra care. However, in the context of web application development, JavaScript is often used to pass data to server-side scripts. Whether it’s done through rendering HTML (such as constructing an iframe URL) or through AJAX calls, you may need to send text as a parameter in a URL’s query string.

You’ll often see escape() used to prepare the string for use in a URL; it escapes characters like the ampersand that would otherwise result in a malformed URL. However, escape() doesn’t produce UTF-8 escape sequences for characters outside the ASCII range: it encodes them either as a single Latin-1 byte or in a non-standard %uXXXX form, so a receiving script that expects UTF-8 won’t be able to interpret them. You simply can’t use escape() on Unicode text.
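A quick way to see the problem is to run both functions on the same characters. This is only a sketch; the two sample characters are arbitrary choices, one that fits in Latin-1 and one that doesn’t.

```javascript
// U+00E9 (e-acute) fits in Latin-1; U+0627 (Arabic alef) does not.
const sampleChars = ['\u00e9', '\u0627'];

for (const ch of sampleChars) {
  // escape() is a legacy function: it emits %XX for values up to 0xFF
  // and the non-standard %uXXXX form for anything above that.
  const legacy = escape(ch);
  // encodeURIComponent() percent-encodes the UTF-8 bytes of the character.
  const utf8 = encodeURIComponent(ch);
  console.log(legacy, utf8);
}
// Logs "%E9 %C3%A9" and then "%u0627 %D8%A7".
```

Neither escape() result is a valid UTF-8 percent-encoding: %E9 is a lone byte that isn’t legal UTF-8 on its own, and %u0627 isn’t standard percent-encoding at all.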

Luckily, all recent browsers support two newer JavaScript functions, encodeURIComponent() and encodeURI(). Both are safe for Unicode text: they encode each character outside the ASCII range as the percent-escaped bytes of its UTF-8 form, and they also do everything escape() did to make sure the text is usable in a URL. The encodeURI() function encodes entire URIs, so it leaves characters such as :?& intact. encodeURIComponent() encodes a string for use as an individual parameter of a URI, so it encodes everything except letters, digits, and the characters -_.~!*'().
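Here is a sketch of how the two functions differ when building a query string. The URL, the parameter names, and the variable names are all made up for the example.

```javascript
// Hypothetical endpoint and parameter names, for illustration only.
const searchBase = 'http://example.com/search';
const searchTerm = 'caf\u00e9 & cr\u00e8me'; // contains a space, "&", and accents

// Encode each parameter value on its own, then assemble the query string.
const searchUrl = searchBase + '?q=' + encodeURIComponent(searchTerm) + '&lang=fr';
console.log(searchUrl);
// http://example.com/search?q=caf%C3%A9%20%26%20cr%C3%A8me&lang=fr

// encodeURI() leaves "&" alone, so using it on a single value would let
// the value spill over into a second parameter on the server.
console.log(encodeURI(searchTerm));
// caf%C3%A9%20&%20cr%C3%A8me
```

The rule of thumb: use encodeURIComponent() on each value you put into a URL, and reserve encodeURI() for a URL that is already assembled.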

In short, if you’re using escape(), use encodeURIComponent() instead. If you’re worried about breaking compatibility with very old browsers, you can test for the existence of the function before using it:

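// typeof is safe even when the function was never defined; a bare
// reference to a missing global would throw a ReferenceError.
if (typeof encodeURIComponent === 'function') {
    string = encodeURIComponent(string);
} else {
    string = escape(string);
}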

Handling UTF-8 in PHP

Internally, PHP treats strings as sequences of single bytes, in effect ISO-8859-1/Latin-1. Latin-1 covers only 256 characters, and its byte values above the ASCII range don’t match UTF-8’s multi-byte sequences, which makes handling UTF-8 text difficult. Most string functions in PHP work on the text one byte at a time, which can split multi-byte characters and leave your output looking like garbled junk. PHP provides a multibyte string function library (mbstring) if your host has compiled it into their PHP build, although it’s sometimes difficult to use and doesn’t provide equivalents to all the string functions PHP normally provides.
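The garbling is easy to reproduce. The sketch below is in JavaScript rather than PHP, to stay consistent with the other examples in this post, and it calls no PHP function; the helper readUtf8AsLatin1 is a made-up name that mimics what byte-oriented code effectively does with UTF-8 input. It assumes TextEncoder is available.

```javascript
// Hypothetical helper: treat each UTF-8 byte as if it were one Latin-1
// character, the way byte-oriented string handling effectively does.
function readUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

const word = 'caf\u00e9';               // "café", 4 characters
const garbled = readUtf8AsLatin1(word);

console.log(garbled);        // cafÃ©  (the two bytes of é shown as two characters)
console.log(word.length);    // 4
console.log(garbled.length); // 5, one per UTF-8 byte
```

This is also why a byte-counting function such as PHP’s strlen() reports 5 for this 4-character word when it is stored as UTF-8.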

PHP handles integers just fine, and Unicode is just a mapping of characters to integers. We can take advantage of that, using some handy functions Scott Reynen has written, to deal with the incoming UTF-8 text. He provides several functions that work well together, allowing you to convert strings to Unicode, convert Unicode to HTML entities for display on a webpage, and do simple string manipulation.

Storing UTF-8 in Non-UTF-8 Databases

The beauty of Scott’s unicode_to_entities_preserving_ascii() function is that it turns UTF-8-encoded text into a string that is represented entirely with ASCII characters. All of the characters outside the ASCII range are turned into numeric HTML entities, like &#1575;. That means you don’t need your database tables to be set to the UTF-8 character set, which often isn’t the default and which you may not be able to change on shared hosting.
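The idea behind that conversion can be sketched in a few lines. This is not Scott Reynen’s implementation, and it is written in JavaScript rather than PHP; toNumericEntities is a hypothetical name, and the sketch only shows the technique of leaving ASCII alone and replacing everything else with its decimal code point.

```javascript
// Hypothetical function showing the technique, not Scott Reynen's code.
function toNumericEntities(text) {
  let out = '';
  // for...of walks the string by code point, so characters beyond U+FFFF
  // are not split into surrogate halves.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    out += cp < 128 ? ch : '&#' + cp + ';';
  }
  return out;
}

console.log(toNumericEntities('\u0627 is alef')); // &#1575; is alef
console.log(toNumericEntities('caf\u00e9'));      // caf&#233;
```

The result contains nothing but ASCII, so any database character set that is a superset of ASCII can store it unchanged.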

This is useful even if your output format isn’t HTML. Now that you have a way to get the text into the database without losing non-English characters, you can convert it back after you get it out for use elsewhere in your app. PHP has a built-in function which will handle this part for you: html_entity_decode.

$original_string = html_entity_decode($string, ENT_NOQUOTES, 'UTF-8');
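For comparison, here is a rough JavaScript sketch of the reverse step. It is not a replacement for html_entity_decode, which also handles named entities; the hypothetical fromNumericEntities below only understands decimal references such as &#1575;.

```javascript
// Hypothetical function: decodes decimal numeric references only.
function fromNumericEntities(text) {
  return text.replace(/&#(\d+);/g, (match, digits) => {
    const cp = Number(digits);
    // Leave the text untouched if the number is not a valid code point.
    return cp <= 0x10ffff ? String.fromCodePoint(cp) : match;
  });
}

console.log(fromNumericEntities('caf&#233;') === 'caf\u00e9');          // true
console.log(fromNumericEntities('&#1575; is alef') === '\u0627 is alef'); // true
```

One caveat of storing text this way: if a user literally typed something like &#233; into a form, decoding will turn that into a character too, unless ampersands are escaped before the text is stored.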

The caveat, of course, is that you can’t search or sort on those strings in the database properly. If you need those abilities, you need to ensure the database, the table, the columns, and the connection are all set to the UTF-8 character set, and that you don’t use any non-multibyte-safe functions on the strings before inserting or after retrieving them.

And there you have it: Handling non-English text in your web applications even in a shared hosting environment.



17 Responses to “Handling UTF-8 in JavaScript, PHP, and Non-UTF8 Databases”

  1. Apache Administrator
    May 25th, 2007

    Dan, this is a fantastic post! I’ve always wanted to know about this stuff, and some stuff I didn’t even think about before! Thanks!

  2. Hamish M
    May 26th, 2007

    Excellent post Dan.

    Working in Montreal, it’s a requirement for most corporate websites to be in both English and French, so I’ve had to deal with these issues in the past, and will certainly deal with them in the future. It can be a slippery slope, particularly when it comes to PHP. Thanks for the tips.

  3. Olivier
    November 30th, 2007

    Hi,

    All my server-side scripts assume the input is latin1 (iso-8859-1), so sending text through JavaScript gives errors. Can JavaScript convert from Unicode to latin1 on the client side? Man, why are people still using anything other than utf-8?!

  4. Slavi
    February 11th, 2008

    Hi thanks for your post.

    The trick with escaping non-English characters is good, but if you have to perform a search like this you have to supply an “escaped” string in the search.

    SELECT *
    FROM table1
    WHERE real_name LIKE '%&ecute;cute%'

    this SQL is just an example.

    Slavi

  5. TK
    March 6th, 2008

    > it’s going to handle UTF-8 encoded text properly without any extra care

    I think you are confusing UTF-8 with Unicode. In fact, you DO need extra care to handle UTF-8 encoded strings.
    For example, 'あ', when Unicode encoded, becomes '\u3042', whereas 'あ' is '\xe3\x81\x82' when UTF-8 encoded.
    The former works, the latter doesn’t when displayed.

  6. dH
    March 7th, 2008

    Thank you for the idea. Mine was the same, but it is good to see similar solutions. Usually storing everything in UTF-8 makes the RDBMS (MySQL, for example) MUCH slower. If you don’t need sorting, and only need to search and display the values correctly, it’s better to encode as html/base64/anything and store as latin-1; it’s just faster.

  7. pascal
    March 12th, 2008

    Extremely useful and clear post, that summarizes utf8 in a great way!!

    I just had to add
    header('Content-Type: text/html; charset=utf-8');

    because I don’t want to edit the Apache settings to add the correct charset

  8. Maarten
    March 25th, 2008

    Note that the html_entity_decode($string, ENT_NOQUOTES, 'UTF-8'); trick doesn’t work in PHP 4 (somehow they fixed it in 5, and it seems they refuse to backport it), but this function might be a good workaround: mb_decode_numericentity()

  9. Mone
    May 19th, 2008

    Great post.
    Just a little note: the first line of the JavaScript code must be
    if (window.encodeURIComponent) {
    otherwise browsers without the encodeURIComponent method will throw an exception instead of executing the escape method.

  10. Mark
    June 24th, 2008

    Hi, if you use encodeURIComponent in JavaScript to store text in the database, what do you use to get the same decoded text back in a Java program? I have used decodeURI but it does not seem to work. What it returns is in ANSI format. I do know how to change from ANSI to UTF-8, but not everything seems to be supported and sometimes parts of the string are lost.

  11. Kashif
    July 10th, 2008

    Really good post indeed. It helped me find and solve a UTF-8 problem with my current project. Thanks Dan
