UTF8 can be tricky – especially with PHP

Everybody uses (or should use!) UTF8 these days. I have yet to come across a PHP version that supports it easily and fully, though.
There is sometimes more to it than it seems. This article outlines the basic setup of a CakePHP app using UTF8 and then goes on to the really tricky parts of what is the de facto standard encoding these days.

Note: this post is long overdue and sat in my draft folder for 2+ years. So here it is, quickly published before it gets even more dusty^^
And dusty sure is the right word, with (hopefully) no one using ANSI/ISO-8859-1 anymore these days.

UTF8 and PHP

Use the mb_ functions if you know that you deal with strings that can contain UTF8 chars. So if you want to count the length of such a string:

$length = mb_strlen($string);

If you are simply manipulating strings, you do not always have to use those slower, UTF8-aware functions, though. But when in doubt, always do so.
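To make the difference concrete, here is a minimal sketch comparing the byte-based functions with their mb_ counterparts. The sample string is my own; the "ü" is written as escaped bytes so the result does not depend on how the file itself is saved.

```php
<?php
// "Müller", with "ü" written as its two UTF-8 bytes.
$string = "M\xC3\xBCller";

$byteLength = strlen($string);             // counts bytes
$charLength = mb_strlen($string, 'UTF-8'); // counts characters

// substr() cuts through the middle of "ü"; mb_substr() keeps it whole.
$brokenPrefix = substr($string, 0, 2);
$cleanPrefix = mb_substr($string, 0, 2, 'UTF-8');

printf("%d bytes, %d characters\n", $byteLength, $charLength);
var_dump(mb_check_encoding($brokenPrefix, 'UTF-8')); // is the cut string still valid UTF-8?
var_dump(mb_check_encoding($cleanPrefix, 'UTF-8'));
```

Passing the encoding explicitly avoids relying on whatever mb_internal_encoding() happens to be set to on the server.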

UTF8 and preg_match()

Now this is a tricky one, especially if you don’t want to recompile PHP with the PCRE UTF-8 flag enabled, or if you don’t know about it at all. IMO that should be the default, but it seems it usually isn’t.

Most times, when dealing with UTF8 strings, the /u modifier and \p{L} help:

preg_match('/^\p{L}[\p{L} _.-]+$/u', $username, $matches)

In other cases you might have to add (*UTF8) in your pattern.
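A small sketch of what the /u modifier changes in practice, using the pattern from above. The sample usernames are made up. It also shows a side effect that is easy to miss: with /u, input that is not valid UTF8 makes preg_match() return false rather than 0.

```php
<?php
$pattern = '/^\p{L}[\p{L} _.-]+$/u';

// "Jürgen M." as UTF-8 bytes
$username = "J\xC3\xBCrgen M.";
$matchesUnicode = preg_match($pattern, $username) === 1;

// An ASCII-only pattern without /u rejects the same name.
$matchesAscii = preg_match('/^[a-z][a-z _.-]+$/i', $username) === 1;

// "Jürgen" in ISO-8859-1 bytes, which is not valid UTF-8.
$invalid = "J\xFCrgen";
$result = preg_match($pattern, $invalid);
$isBadUtf8 = ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($matchesUnicode, $matchesAscii, $isBadUtf8);
```

So when validating with /u, it is worth comparing against 1 explicitly (or checking for false) instead of relying on a loose truthiness check.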

UTF8 and CakePHP

CakePHP setup

The main parts are handled in the book, especially in the getting-started section.
But the part that people sometimes get wrong is that the App encoding is "utf-8" while in database.php it’s spelled utf8.
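As a rough sketch of the two spellings side by side, assuming a CakePHP 2.x app with a MySQL datasource. Host, login, password and database name are placeholders of mine, and the core.php line is shown as a comment because it only runs inside CakePHP.

```php
<?php
// app/Config/core.php: the application encoding, spelled with a hyphen.
// Configure::write('App.encoding', 'UTF-8');

// app/Config/database.php: the connection encoding, spelled without a hyphen.
class DATABASE_CONFIG {

    public $default = array(
        'datasource' => 'Database/Mysql',
        'persistent' => false,
        'host' => 'localhost',    // placeholder
        'login' => 'user',        // placeholder
        'password' => 'password', // placeholder
        'database' => 'my_app',   // placeholder
        'encoding' => 'utf8',
    );
}
```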

Make sure you save all files as "UTF8 without BOM" via your IDE as soon as they start to contain UTF8 chars. Failing to do so will cause output issues.
I usually try to avoid this and use Locale translation and mainly English chars in all files as much as possible.

Note: Before adding any UTF8 chars to files, those files are always ANSI (there is no way without the BOM to distinguish those two encoding formats as they are one and the same here). So no matter how often you try to save them as UTF8, they will still be ANSI. That is why most IDEs fall back to ANSI again for such files, in case you wondered.
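If you want to check your files for this, a few lines of PHP are enough. The helpers below are hypothetical (the names are mine, they are not part of PHP or CakePHP) and only sketch the idea: detect and strip a BOM, and test whether content is still pure 7-bit ASCII and therefore identical in both encodings.

```php
<?php
// Hypothetical helpers, not part of PHP or CakePHP.

// True if the content starts with the three-byte UTF-8 BOM.
function hasUtf8Bom($content) {
    return strncmp($content, "\xEF\xBB\xBF", 3) === 0;
}

// Returns the content without a leading UTF-8 BOM.
function stripUtf8Bom($content) {
    return hasUtf8Bom($content) ? (string)substr($content, 3) : $content;
}

// True if the content has no bytes outside the 7-bit ASCII range.
function isPureAscii($content) {
    return preg_match('/[^\x00-\x7F]/', $content) === 0;
}

var_dump(hasUtf8Bom("\xEF\xBB\xBFabc"));
var_dump(isPureAscii('echo "hello";'));
var_dump(isPureAscii("M\xC3\xBCller"));
```

In a real check you would feed these the result of file_get_contents() for each source file.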

Correcting PHP functions

Some PHP functionality has been wrapped in CakePHP to overcome deficiencies regarding Unicode.
String::wordWrap() for example replaces the faulty wordwrap() function.

I also added a few fixes to my Tools plugin as Utility/Utility class:

pregMatch(): Unicode aware replacement for preg_match()
pregMatchAll(): Unicode aware replacement for preg_match_all()
strSplit(): Unicode aware replacement for str_split()

Probably more to come..
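To give an idea of what such a replacement can look like, here is a minimal sketch of a Unicode aware str_split(). It is not the code from the Tools plugin; the function name is hypothetical, and it assumes valid UTF8 input (preg_split() returns false otherwise).

```php
<?php
// Hypothetical helper: a sketch of the idea, not the Tools plugin's actual code.
function unicodeStrSplit($string, $length = 1) {
    // An empty pattern with /u splits between characters instead of bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($length === 1) {
        return $chars;
    }
    // Regroup the single characters into chunks of $length characters.
    return array_map(function ($chunk) {
        return implode('', $chunk);
    }, array_chunk($chars, $length));
}

$umlauts = "\xC3\xA4\xC3\xB6\xC3\xBC"; // "äöü" as UTF-8 bytes

echo count(str_split($umlauts)), " pieces with str_split()\n";
echo count(unicodeStrSplit($umlauts)), " pieces with unicodeStrSplit()\n";
```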

Proper validation

Make sure your validation is Unicode aware. Getting this wrong is probably one of the most common mistakes made by mainly English-speaking devs/people.
They may assume that it works to simply use strlen() or an [a-z] regex or the like, not taking into account that, for example, many perfectly normal first/last names contain special chars.
Validation here should never be too strict. Otherwise a lot of people will be very upset.

So in the above example we do NOT want to use

preg_match('/^[a-z][a-z .-]+$/i', $firstName)

but something more like

preg_match('/^\p{L}[\p{L} .-]+$/u', $firstName)

to validate a first name.
Whether we actually have to validate this further than a simple "not empty" check is a different topic (I don’t think so). But if you really must, PLEASE do not shut people out because their parents gave them non-English names 😉
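In a CakePHP 2.x model this could be wired up roughly as follows, using the core "custom" rule with the pattern from above. The field name, rule key and message are assumptions of mine; the array is shown standalone so that the pattern can be tried outside of CakePHP as well.

```php
<?php
// Sketch of a model's $validate entry; "first_name" is a hypothetical field.
$validate = array(
    'first_name' => array(
        'validName' => array(
            'rule' => array('custom', '/^\p{L}[\p{L} .-]+$/u'),
            'message' => 'Please enter a valid first name',
        ),
    ),
);

// Outside CakePHP, the same pattern can be tried directly.
$pattern = $validate['first_name']['validName']['rule'][1];
$names = array("Zo\xC3\xAB", "Jos\xC3\xA9", "Anne-Marie"); // "Zoë", "José"
foreach ($names as $name) {
    echo $name, ': ', preg_match($pattern, $name) === 1 ? 'ok' : 'rejected', "\n";
}
```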

I had to fix a similar thing in the core a while back, regarding domains/urls.
And this is CakePHP2.5, the current master, so that topic sure is still quite current for some cases. More and more so with further internationalization.

Checklist for your CakePHP app

Ideally, use utf8_unicode_ci as collation for your DB
Your layout should contain <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Apache/nginx should serve files as UTF8 via the header Content-Type: text/html; charset=UTF-8

Outlook

Only with PHP7 (as PHP6 got skipped) might there be a more built-in approach to UTF8. Until then (and maybe even then) we will have to fight with this quite a lot over the next years.

There are even a few popular projects on GitHub around the UTF8 issues, e.g.:

https://github.com/nicolas-grekas/Patchwork-UTF8
https://github.com/heartsentwined/php-utf8
https://github.com/voku/portable-utf8
https://github.com/gatsu/UTF8

Might be worth checking out.

Anything missing? Please let me know.

4 Comments

Costa says:

August 15, 2014 at 22:21

Thanks for posting, saves me the trouble of researching / testing this stuff.

(btw, typo: "collision for your DB" )
Stijn de Witt says:

October 22, 2014 at 22:49

"Note: Before adding any UTF8 chars to files, those files are always ANSI (there is no way without the BOM to distinguish those two encoding formats as they are one and the same here)."

In fact, only the 7-bit ASCII characters are compatible with UTF-8, so in, say, German text you’ll already have problems if you don’t save as UTF-8.

Also, is there a specific reason for using UTF-8 without BOM? I’d reckon that especially because without the BOM distinguishing ASCII from UTF-8 is impossible we should save with BOM. Unless of course there is a very good reason not to.
Mark says:

October 22, 2014 at 23:00

German umlauts for me are part of UTF8 🙂
And using BOM often creates software reading problems. Thus avoiding it is the saner approach in my experience.
voku says:

February 3, 2015 at 20:19
Hi, maybe this is also useful for someone 🙂 -> a PHP function that fixes UTF-8 problems in input and prevents XSS attacks.

https://gist.github.com/voku/2246eefcea1ef2671f18

PS: I updated the "Portable UTF-8" package 🙂 and added a new function that fixed some problems with url-encoded strings!
- fixing percent-encoding e.g.: "D%FCsseldorf"
- fixing html-encoding e.g.: "D&#xFC;sseldorf"
- fixing html + url-encoding e.g.: "D%26%23xFC%3Bsseldorf"
- fixing broken utf-8 e.g.: "DÃ¼sseldorf"
- fixing broken utf-8 + url-encoding e.g.: "D%C3%83%C2%BCsseldorf"
- fixing double url-encoding e.g.: "D%25C3%2583%25C2%25BCsseldorf"
Install via composer: "voku/portable-utf8": "1.0.*"

Usage: UTF8::urldecode($string)

Mfg Lars 🙂
