How to correctly manage UTF-8 with Perl

Perl has native support for different character encodings, including the well-known UTF-8, but by default it assumes Latin-1. This easily leads to problems if you don’t tune certain settings.

Perl is still used for server-side scripts and batch jobs because it is a very practical language for many purposes. But when you use it with encodings like UTF-8, you have to set a couple of pragmas to avoid problems.

First of all, it is useful to tell Perl that your source code is UTF-8. This is required if you use that encoding in hard-coded strings or regular expressions. So we are talking about something like this:

use utf8;
my $s = 'alé';
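
To see what the pragma changes, here is a minimal sketch that reads the same literal with and without it. It assumes the script file itself is saved as UTF-8; the variable names are only illustrative.

```perl
use strict;
use warnings;

# The utf8 pragma is lexically scoped, so each block can choose its own setting.
my $bytes = do { no utf8;  'alé' };   # literal taken as raw bytes
my $chars = do { use utf8; 'alé' };   # literal decoded into characters

# In a UTF-8 file "é" takes two bytes, so the first string is one unit longer.
printf "without pragma: %d, with pragma: %d\n", length($bytes), length($chars);
# prints "without pragma: 4, with pragma: 3"
```

Without the pragma, functions such as length, substr and regular expressions work on the individual bytes of the literal rather than on its characters.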

Then you have to ensure that data read from and written to file handles is correctly decoded and encoded. Perl’s internal string format will always be correct, but when you read or print something you have to set the correct encoding on the handle:

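open my $out, '>:utf8', 'utf8-test.txt' or die "Cannot write utf8-test.txt: $!";  # write as UTF-8
open my $in,  '<:utf8', 'utf8-test.txt' or die "Cannot read utf8-test.txt: $!";   # read as UTF-8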

Easy, but not very handy because you have to set this every time you open a handle.
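
The per-handle approach can be sketched as a small round trip. This is only an illustration: it writes to a temporary file created with File::Temp instead of utf8-test.txt, and it uses the ':encoding(UTF-8)' layer, a stricter variant of ':utf8' that also validates the data it reads.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

# A temporary file, removed automatically when the script exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: characters are encoded to UTF-8 bytes on the way out.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} 'alé';
close $out or die "Cannot close $path: $!";

# Read: UTF-8 bytes are decoded back into characters on the way in.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $read_back = <$in>;
close $in;

my $size_on_disk = -s $path;
printf "%d characters in Perl, %d bytes on disk\n", length($read_back), $size_on_disk;
# prints "3 characters in Perl, 4 bytes on disk"
```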

Perl also gives you a way to set the encoding layer on an already open handle, such as a standard handle:

binmode STDOUT, ':utf8';
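
The effect of binmode is easier to measure on a handle whose output can be inspected. The sketch below uses two in-memory handles as stand-ins for STDOUT (an assumption made purely for illustration) and compares what arrives with and without the layer.

```perl
use strict;
use warnings;
use utf8;

# Handle opened without any layer: the character is written as one Latin-1 byte.
my $raw = '';
open my $plain, '>', \$raw or die "Cannot open in-memory handle: $!";
print {$plain} 'é';
close $plain;

# Same kind of handle, but the layer is added after it was opened.
my $encoded = '';
open my $late, '>', \$encoded or die "Cannot open in-memory handle: $!";
binmode $late, ':utf8';
print {$late} 'é';
close $late;

printf "without layer: %d byte(s), with :utf8: %d byte(s)\n",
    length($raw), length($encoded);
# prints "without layer: 1 byte(s), with :utf8: 2 byte(s)"
```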

But the best way to handle these cases is to use the “open” pragma:

use open ':utf8', ':std';

This tells Perl to use the UTF-8 encoding for all file handles opened in the scope of the pragma and also for STDOUT, STDERR and STDIN, which is what you usually want if you’re using UTF-8.
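
As a rough sketch of the pragma at work, the following script opens a file with plain '>' and '<' modes and still gets UTF-8 on disk. The temporary file from File::Temp is an assumption of the example, not part of the original setup.

```perl
use strict;
use warnings;
use utf8;
use open ':utf8', ':std';
use File::Temp qw(tempfile);

# A temporary file, removed automatically when the script exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer in the mode string: the pragma supplies ':utf8' by default.
open my $out, '>', $path or die "Cannot write $path: $!";
print {$out} 'alé';
close $out or die "Cannot close $path: $!";

open my $in, '<', $path or die "Cannot read $path: $!";
my $line = <$in>;
close $in;

# STDOUT is covered by ':std', so printing the text itself is safe too.
print "read back: $line (", length($line), " characters, ",
    (-s $path), " bytes on disk)\n";
# prints "read back: alé (3 characters, 4 bytes on disk)"
```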

Now that the source code and file handles are correctly encoded, another common case is the database connection. If you’re using a MySQL connection with DBI, you have to pass a simple attribute to the connection:

my $dbh = DBI->connect(
  $dsn, $user, $password,       # placeholders for your own connection settings
  { mysql_enable_utf8 => 1 },
);

This way the DB driver knows that data going to and coming from the database is UTF-8 and will encode/decode characters as required.
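
Conceptually, the flag saves you from converting by hand at the database boundary. The sketch below is not the driver's code, only an illustration using the core Encode module, and the byte string is a hypothetical stand-in for a value as it would travel over the connection.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Hypothetical value as received from the database: raw UTF-8 bytes.
my $from_db_bytes = "al\xC3\xA9";

# Incoming data: bytes become Perl characters.
my $chars = decode('UTF-8', $from_db_bytes);

# Outgoing data: Perl characters become UTF-8 bytes again.
my $to_db_bytes = encode('UTF-8', 'alé');

printf "%d bytes in, %d characters in Perl, %d bytes out\n",
    length($from_db_bytes), length($chars), length($to_db_bytes);
# prints "4 bytes in, 3 characters in Perl, 4 bytes out"
```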
