UTF8 follies

Character encoding issues are shortening my life expectancy. It’s been a few years since I dealt with them regularly, so they bite me from time to time.

Recently I puzzled over a set of strings that were showing up double-encoded in my Postgres database. These strings were converted from incoming Latin-1 to UTF8, travelled around the program with their utf8 flags set, and went through JSON->decode and JSON->encode multiple times without apparent harm, but once in the database, they had the telltale double-encoding gibberish characters. For example:


use Encode;
my $utf8_string = "Télévision extrême à domicile";
print encode('utf8', $utf8_string);
# becomes TÃ©lÃ©vision extrÃªme Ã  domicile
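
A byte-level view makes the double encoding easier to spot. This is only a minimal sketch: it uses escaped code points instead of a literal so the result does not depend on how the source file is saved, and the variable names are illustrative rather than taken from the real program.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Télé" as a character string: \x{e9} is the single code point for é.
my $chars = "T\x{e9}l\x{e9}";

my $once  = encode('UTF-8', $chars);  # correct: each é becomes the bytes C3 A9
my $twice = encode('UTF-8', $once);   # wrong: bytes C3 and A9 are re-read as
                                      # Latin-1 characters and encoded again

print unpack('H*', $once),  "\n";     # 54c3a96cc3a9
print unpack('H*', $twice), "\n";     # 54c383c2a96cc383c2a9
```

The pair C3 83 C2 A9 where C3 A9 should be is the signature of an é that has been encoded twice.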

I spent an absurdly long time capturing strings and dumping during execution:

say qx{echo $str | od -c};

od being a handy Unix utility I became very familiar with in my days of converting bibliographic data.
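
The same inspection can be done without shelling out, which also avoids passing arbitrary strings to the shell. The helper below is hypothetical (the name describe is made up for this sketch); it reports whether the internal utf8 flag is set and lists each code point in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: report the internal utf8 flag and each code point.
sub describe {
    my ($str) = @_;
    my $flag  = utf8::is_utf8($str) ? 'flag on' : 'flag off';
    my $codes = join ' ', map { sprintf '%02x', ord } split //, $str;
    return "$flag: $codes";
}

my $chars = "\x{ea}\x{263a}";           # ê plus a wide character (U+263A)
my $bytes = encode('UTF-8', $chars);    # the same text as UTF-8 bytes

print describe($chars), "\n";   # flag on: ea 263a
print describe($bytes), "\n";   # flag off: c3 aa e2 98 ba
```

A string that shows c3 followed by another high byte where a single accented character was expected has already been encoded at least once.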

Turns out I was focusing on the wrong leads. These strings became values in a hash which was run through JSON->encode before saving to db. What prompted the re-encoding was not the values of the hash, but the keys. These were string literals in my program which were not marked as UTF8. Perl looked at my keys, looked at my values, saw a mixed bag, and decided to run the whole thing through the shredder again, just to be on the safe side.
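
One well-known way a "mixed bag" turns into double encoding is plain string concatenation: when a byte string that already holds UTF-8 is joined to a character string, Perl upgrades the bytes as if they were Latin-1. The sketch below shows only that general mechanism, under the assumption that something similar was at work; it is not a reproduction of the JSON case.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = encode('UTF-8', "\x{e9}");   # é already encoded: bytes C3 A9
my $chars = "\x{263a}";                  # a character string (wide code point)

# Joining the two upgrades $bytes as if each byte were a Latin-1 character.
my $mixed = $bytes . $chars;

# Encoding the mixture encodes é a second time.
my $out = encode('UTF-8', $mixed);
print unpack('H*', $out), "\n";          # c383c2a9e298ba
```

The wide character comes out correctly (E2 98 BA), while the é that was already bytes comes out as C3 83 C2 A9.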

The solution was simple:
use utf8;

Because my keys were string literals in the source code, I want Perl to treat my source code as UTF8. use utf8 tells it to do exactly that.
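
The effect of the pragma on a literal is easy to check, since use utf8 is lexically scoped. This is a small sketch that assumes the file itself is saved as UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($without, $with);
{
    no utf8;            # the literal is read as raw bytes
    $without = "é";
}
{
    use utf8;           # the literal is decoded from UTF-8 into characters
    $with = "é";
}

print length($without), "\n";   # 2 (the bytes C3 A9)
print length($with), "\n";      # 1 (the code point U+00E9)
```

Note that use utf8 only affects how the source file is read. Data arriving from files, sockets, or the database still has to be decoded and encoded explicitly.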


5 Responses to UTF8 follies

  1. Tom Davis says:

    Wow! Thanks for the work and broadcast of your discovery. That seems like something that would definitely hit me over the head sooner or later.

  2. Belden Lyman says:

    Another gotcha can be dumping UTF8 strings into postgresql and pulling them back out as gibberish. Tell DBI to preserve the UTF8ness:

    DBI->connect('blah', 'blah', 'blah', { pg_enable_utf8 => 1 });

  3. perlgerl says:

    Oh yes. I covered that base in my Rose::DB subclass:

    __PACKAGE__->register_db(

    connect_options => {
    pg_server_prepare => 0,
    pg_enable_utf8 => 1,
    },
    post_connect_sql => "SET CLIENT_ENCODING TO 'UTF8';",
    );

  4. mjevans says:

    I’d be interested to see a small example of what you describe where it goes wrong. I use JSON::XS, DBI and hash keys with unicode values all over the place without utf8 and have never seen this. Could you post a small sample?
