Dealing with characters outside the ASCII range on the web is tough. It’s tough in other environments too, but particularly for web applications since text needs to move through so many places without being mangled — from user input, through JavaScript, into and out of PHP and string manipulation functions, into and out of databases. If you’re not careful, the text you start with isn’t what you’ll end up with after you’re done handling it. That was the case with W3Counter for a long time, but not any longer. I’ll tell you how.
Unicode is the preferred method of representing text outside the ASCII range, which includes text from virtually all non-English languages. Unicode maps characters to integers (called code points), and covers far more characters than Windows-1252 or ISO-8859-1, the most common character sets used for English documents. Luckily there’s an encoding, UTF-8, which can represent every Unicode character as a sequence of bytes and has wide operating system and browser support.
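To make the difference between a Unicode integer and its UTF-8 bytes concrete, here is a small sketch. It assumes an environment with TextEncoder (current browsers and Node.js); describeCharacter is a made-up helper name, not something from a library.

```javascript
// Unicode assigns each character an integer (its code point);
// UTF-8 is one way of writing that integer down as bytes.
function describeCharacter(ch) {
  const codePoint = ch.codePointAt(0);
  const utf8Bytes = Array.from(new TextEncoder().encode(ch));
  return { codePoint, utf8Bytes };
}

// "A", "é", "ا", "あ" written as escapes so the source file's own
// encoding cannot affect the result.
const samples = ["A", "\u00e9", "\u0627", "\u3042"].map(describeCharacter);

// ASCII stays one byte; the other characters take two or three bytes.
console.log(samples);
```

The ASCII letter keeps its single byte, which is why UTF-8 text that happens to be all ASCII looks identical to plain ASCII text.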
Handling UTF-8 in HTML
The first step in capturing and displaying non-English text is to deliver your webpages with the UTF-8 encoding. This tells the browser to interpret the text of the webpage as UTF-8 sequences, allowing it to display characters UTF-8 can encode that other character sets can’t. There are two places your page tells the browser what encoding to use — the Content-Type HTTP header, and the Content-Type meta tag.
On an Apache 1.3.12 or later server, you can set what content-type header will be sent by default with the AddCharset, AddType, or AddDefaultCharset directives. These can be set in a .htaccess file if you’re on shared hosting and don’t have access to the server’s main configuration file. You can also specify the character set in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
If you’re using IIS, you can find the content-type setting for each file type under the “Headers” menu in the properties of your web site.
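For the Apache case, a minimal .htaccess sketch might look like the fragment below. It assumes your host allows this directive in .htaccess files (it needs the FileInfo override); if it doesn’t, the meta tag is your fallback.

```apache
# Adds "charset=utf-8" to the Content-Type header of
# text/html and text/plain responses served from this directory.
AddDefaultCharset utf-8
```

You can check the result by looking at the Content-Type response header in your browser’s developer tools.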
Handling UTF-8 in JavaScript
JavaScript works with all text as Unicode internally, and the browser decodes a UTF-8 page into that internal form before your script sees it, so text on a UTF-8 page needs no extra care inside JavaScript itself. However, in the context of web application development, JavaScript is often used to pass data to server-side scripts. Whether that’s done by rendering HTML (such as constructing an iframe URL) or through AJAX calls, you may need to send text as a parameter in a URL’s query string.
You’ll often see escape() used to prepare the string for use in a URL; it escapes characters like the ampersand that would otherwise result in a malformed URL. However, escape() doesn’t produce UTF-8: it encodes characters between 128 and 255 as a single Latin-1 byte (é becomes %E9) and everything above that in a non-standard %uXXXX form, so a receiving script that expects UTF-8 won’t be able to interpret them. You simply can’t use escape() on Unicode text.
Luckily, all recent browsers support two newer JavaScript functions, encodeURIComponent() and encodeURI(). These functions are safe for Unicode text: they encode each non-ASCII character as the percent-escaped bytes of its UTF-8 form. The encodeURI() function encodes entire URIs, so it leaves characters such as :?& intact. encodeURIComponent() encodes strings to be individual parameters of a URI, so it also escapes reserved characters such as the ampersand; it encodes all characters except letters, digits, and - _ . ! ~ * ' ( ).
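A quick side-by-side may help show the difference between the three functions. The sample text and the example.com URL below are placeholders chosen for illustration.

```javascript
// "café & crème", written with escapes so the result does not depend
// on how this file itself is encoded.
const text = "caf\u00e9 & cr\u00e8me";

// escape(): legacy; writes é as the single Latin-1 byte %E9.
const legacy = escape(text);

// encodeURIComponent(): for one parameter value; é becomes its two
// UTF-8 bytes and the ampersand is escaped as well.
const component = encodeURIComponent(text);

// encodeURI(): for a whole URL; leaves : / ? & = alone.
const url = encodeURI("http://example.com/search?q=caf\u00e9&lang=fr");

console.log(legacy);    // caf%E9%20%26%20cr%E8me
console.log(component); // caf%C3%A9%20%26%20cr%C3%A8me
console.log(url);       // http://example.com/search?q=caf%C3%A9&lang=fr
```

Note that only the encodeURIComponent() output is safe to drop into a query string as a value: encodeURI() would have left the ampersand in the text unescaped.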
In short, if you’re using escape(), use encodeURIComponent() instead. If you’re worried about breaking compatibility with very old browsers, you can test for the existence of the function before using it:
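if (typeof encodeURIComponent === 'function') {
    // Preferred: percent-encodes the string as UTF-8
    string = encodeURIComponent(string);
} else {
    // Fallback for very old browsers only; not safe for non-ASCII text
    string = escape(string);
}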
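In practice you usually encode several parameters at once. The following is one way that could look; buildQueryString and the /save.php path are hypothetical names used only for this example.

```javascript
// Hypothetical helper: turn { name: value } pairs into a query string,
// encoding both the names and the values.
function buildQueryString(params) {
  return Object.keys(params)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(params[key]))
    .join("&");
}

// "Zoë" is written with an escape so the file's encoding doesn't matter.
const query = buildQueryString({ name: "Zo\u00eb", note: "R&D = fun" });

// "/save.php" is a placeholder for whatever server-side script receives the data.
const requestUrl = "/save.php?" + query;

console.log(requestUrl); // /save.php?name=Zo%C3%AB&note=R%26D%20%3D%20fun
```

Encoding each value separately is the point: the ampersands that separate parameters stay literal, while any ampersand inside a value is escaped.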
Handling UTF-8 in PHP
Internally, PHP treats strings as sequences of bytes, and most of its built-in string functions assume one byte per character, in effect handling text as ISO-8859-1/Latin-1. UTF-8 uses two or more bytes for every character outside the ASCII range, so those functions can miscount, split, or mangle such characters, and your output ends up looking like garbled junk. PHP provides a multibyte string function library (mbstring) if your host has compiled it into their PHP build, although it’s sometimes difficult to use and doesn’t provide equivalents to all the string functions PHP normally provides.
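The garbling is easy to reproduce outside PHP. The sketch below uses JavaScript purely as an illustration of what happens when UTF-8 bytes are read one byte per character; it is not a model of any particular PHP function, and it assumes TextEncoder is available.

```javascript
const original = "\u00e9"; // "é": one character

// UTF-8 stores it as two bytes, 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode(original);

// Reading each byte as its own Latin-1 character, the way
// byte-oriented string code effectively does.
const misread = String.fromCharCode(...utf8Bytes);

console.log(original.length);  // 1
console.log(utf8Bytes.length); // 2
console.log(misread);          // Ã©
```

That two-character "Ã©" in place of "é" is the classic symptom of UTF-8 text being handled as Latin-1 somewhere along the way.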
PHP handles integers just fine, and Unicode is just a mapping of characters to integers. We can take advantage of that, using some handy functions Scott Reynen has written, to deal with the incoming UTF-8 text. He provides several functions that work well together, allowing you to convert strings to Unicode, convert Unicode to HTML entities for display on a webpage, and do simple string manipulation.
Storing UTF-8 in Non-UTF-8 Databases
The beauty of Scott’s unicode_to_entities_preserving_ascii() function is that it turns UTF-8 encoded text into a string that is represented entirely with ASCII characters. All of the characters outside the ASCII range are turned into their HTML escape sequences, like &#1575; (the Arabic letter ا). That means you don’t need your database tables to be set to the UTF-8 character set, which you may not be able to do on shared hosting, and which often isn’t the default.
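To show the transformation itself, here is the same idea sketched in JavaScript. This is an illustration only and not Scott Reynen’s PHP code; the function name toNumericEntitiesPreservingAscii is made up for this example.

```javascript
// Replace every non-ASCII character with a decimal numeric HTML entity,
// leaving ASCII characters untouched.
function toNumericEntitiesPreservingAscii(text) {
  let out = "";
  for (const ch of text) { // for...of walks whole code points
    const codePoint = ch.codePointAt(0);
    out += codePoint < 128 ? ch : "&#" + codePoint + ";";
  }
  return out;
}

// "Café ا", written with escapes so the file's encoding doesn't matter.
const stored = toNumericEntitiesPreservingAscii("Caf\u00e9 \u0627");

console.log(stored); // Caf&#233; &#1575;
```

The result contains only ASCII bytes, so it survives a trip through a Latin-1 column unchanged.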
This is useful even if your output format isn’t HTML. Now that you have a way to get the text into the database without losing non-English characters, you can convert it back after you get it out for use elsewhere in your app. PHP has a built-in function which will handle this part for you: html_entity_decode.
$original_string = html_entity_decode($string, ENT_NOQUOTES, 'UTF-8');
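For decimal numeric entities, that decoding step amounts to the following, shown here in JavaScript for illustration. PHP’s html_entity_decode also handles named entities such as &amp;eacute;, which this sketch deliberately does not; decodeNumericEntities is a made-up name.

```javascript
// Turn decimal numeric entities like &#233; back into characters.
function decodeNumericEntities(text) {
  return text.replace(/&#(\d+);/g, (match, digits) => {
    const codePoint = Number(digits);
    // Leave anything outside the Unicode range as it was.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

const restored = decodeNumericEntities("Caf&#233; &#1575;");

console.log(restored === "Caf\u00e9 \u0627"); // true
```

One consequence of preserving ASCII on the way in: if a user literally typed something like &amp;#233; it is indistinguishable from an encoded character on the way out, so decide up front whether that matters for your data.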
The caveat, of course, is that you can’t search or sort on those strings in the database properly. If you need those abilities, you need to ensure the database, the table, the columns, and the connection are all set to the UTF-8 character set, and that you don’t use any non-multibyte-safe functions on the strings before inserting or after retrieving them.
And there you have it: Handling non-English text in your web applications even in a shared hosting environment.



Apache Administrator
May 25th, 2007
Dan, this is a fantastic post! I’ve always wanted to know about this stuff, and some stuff I didn’t even think about before! Thanks!
Hamish M
May 26th, 2007
Excellent post Dan.
Working in Montreal, it’s a requirement for most corporate websites to be in both English and French, so I’ve had to deal with these issues in the past, and will certainly deal with them in the future. It can be a slippery slope, particularly when it comes to PHP. Thanks for the tips.
Escaping utf-8 strings with javascript : Conside Solutions AB
September 3rd, 2007
[...] Dan Grossman has a nice post about it [...]
Olivier
November 30th, 2007
Hi,
All my server-side scripts assume the input is latin1 (iso-8859-1). So when sending text through javascript this gives errors. Can javascript convert from unicode to latin1 on the client side? Man, why are people still using anything other than utf-8?!
Slavi
February 11th, 2008
Hi thanks for your post.
The trick with escaping non English characters is good but if you have to perform a search like this you have to supply an “escaped” string in the search.
SELECT *
FROM table1
WHERE real_name LIKE '%&eacute;cute%'
this SQL is just an example.
Slavi
TK
March 6th, 2008
> it’s going to handle UTF-8 encoded text properly without any extra care
I think you are confusing UTF-8 with Unicode. In fact, you DO need extra care to handle UTF-8 encoded strings.
For example, 'あ', when Unicode encoded, becomes '\u3042', whereas 'あ' is '\xe3\x81\x82' when UTF-8 encoded.
The former works, the latter doesn’t when displayed.
dH
March 7th, 2008
Thank you for the idea - mine was the same but it is good to see similar solutions. Usually storing everything in UTF8 makes the RDBMS (MySQL, for example) MUCH slower. If you don’t need sorting, just searching and displaying the values correctly, it’s better to encode as html/base64/anything and store as latin-1 - just faster.
pascal
March 12th, 2008
Extremely useful and clear post, that summarizes utf8 in a great way!!
I just had to add
header('Content-Type: text/html; charset=utf-8');
because I don’t want to edit the apache settings to add the correct charset
Maarten
March 25th, 2008
Note that the html_entity_decode($string, ENT_NOQUOTES, 'UTF-8'); trick doesn’t work in php4 (somehow they fixed it in 5, and refuse to backport it seems), but this function might be a good workaround: mb_decode_numericentity()
Mone
May 19th, 2008
great post.
Just a little note, the first line of the javascript code must be
if (window.encodeURIComponent) {
otherwise browsers without the encodeURIComponent method will throw an exception instead of executing the escape method.
utf8 to ascii
June 3rd, 2008
[...] on the web is tough. If you’re not careful, the text you start with isn’t what you’ll end up with. http://www.dangrossman.info/2007/05/25/handling-utf-8-in-javascript-php-and-non-utf8-databases/ [...]
Mark
June 24th, 2008
Hi, if you use encodeURIComponent in javascript to store a text in the database, what do you use to return the same decoded text in a java program? I have used decodeURI but it does not seem to work. What it returns is in ANSI format. I do know how to change from ANSI to UTF8 but not everything seems to be supported and sometimes parts of the string are lost.
Kashif
July 10th, 2008
Really good post indeed. It helped me find and solve a UTF-8 problem with my current project. Thanks Dan