Character encoding issues are shortening my life expectancy. It’s been a few years since I dealt with them regularly, so they bite me from time to time.
Recently I puzzled over a set of strings that were showing up double-encoded in my Postgres database. These strings were encoded from incoming Latin 1 to UTF8, travelled around the program with their utf8 flags set, went through JSON->decode and encode multiple times without apparent harm, but once in the database, they had the telltale double-encoding gibberish characters. E.g.:
use Encode;
my $utf8_string = "Télévision extrême à domicile";
print encode('utf8', $utf8_string);
# becomes TÃ©lÃ©vision extrÃªme Ã  domicile
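The same effect can be reproduced without depending on how the source file is saved. This is only a minimal sketch using the core Encode module; the variable names are mine, and the hex escapes stand in for the accented characters.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "Télé" as a character string, written with escapes so the result does not
# depend on the encoding of this source file.
my $chars = "T\x{e9}l\x{e9}";

# Encoding once gives the correct UTF-8 bytes: each é becomes C3 A9.
my $once = encode('UTF-8', $chars);

# Encoding those bytes again treats C3 and A9 as two Latin-1 characters
# (Ã and ©) and encodes each of them: the classic double-encoding.
my $twice = encode('UTF-8', $once);

printf "once:  %s\n", join ' ', map { sprintf('%02X', ord($_)) } split //, $once;
printf "twice: %s\n", join ' ', map { sprintf('%02X', ord($_)) } split //, $twice;
# once:  54 C3 A9 6C C3 A9
# twice: 54 C3 83 C2 A9 6C C3 83 C2 A9

# Decoding once peels off one layer, so a double-encoded value can be
# repaired, provided you are sure it was encoded exactly twice.
my $repaired = decode('UTF-8', $twice);
```

The C3 83 C2 A9 sequence is what a terminal renders as Ã©, so spotting it in a byte dump is a quick way to confirm double-encoding.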
I spent an absurdly long time capturing strings and dumping them during execution:
say qx{echo "$str" | od -c};
od being a handy Unix utility I became very familiar with in my days of converting bibliographic data.
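Roughly the same information is available from inside Perl, without shelling out. A minimal sketch: describe_string is a helper name I made up for illustration, not something from a module, while utf8::is_utf8 is a core function that reports the internal flag.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal utf8 flag, the length in
# characters, and every code point in hex.
sub describe_string {
    my ($str) = @_;
    my $flag  = utf8::is_utf8($str) ? 'on' : 'off';
    my $codes = join ' ', map { sprintf('%02X', ord($_)) } split //, $str;
    return sprintf 'flag=%s length=%d codes=%s', $flag, length($str), $codes;
}

print describe_string("caf\x{e9}"), "\n";     # é as a single character (E9)
print describe_string("caf\xC3\xA9"), "\n";   # é as two UTF-8 bytes (C3 A9)
print describe_string("\x{263A}"), "\n";      # a wide character, flag is on
```

Comparing the length and code points at each stage of the program shows where a string stops being characters and starts being bytes, or the other way around.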
Turns out I was focusing on the wrong leads. These strings became values in a hash which was run through JSON->encode before saving to db. What prompted the re-encoding was not the values of the hash, but the keys. These were string literals in my program which were not marked as UTF8. Perl looked at my keys, looked at my values, saw a mixed bag, and decided to run the whole thing through the shredder again, just to be on the safe side.
The solution was simple:
use utf8;
Because my keys were source-code string literals, I want Perl to read my source code as UTF-8. use utf8 tells it to do exactly that, so literals are decoded into character strings like the rest of my data.
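To see what the pragma changes, here is a small sketch that compares the same literal with and without it. It assumes the file is saved as UTF-8, and it only illustrates how literals are read; it is not a reproduction of the JSON problem described above.

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;    # literals in this block are decoded from UTF-8 into characters
    $with = "é";
}
{
    no utf8;     # literals in this block stay as the raw bytes of the source
    $without = "é";
}

# Assuming a UTF-8 source file: one character versus two bytes.
printf "with use utf8:    length %d\n", length $with;
printf "without use utf8: length %d\n", length $without;
```

The pragma is lexically scoped, which is why the two blocks can disagree inside one file. In practice a single use utf8; at the top of the file is the usual choice.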
Wow! Thanks for the work and broadcast of your discovery. That seems like something that would definitely hit me over the head sooner or later.
Tom, if I can keep one programmer from bashing his head on his desk, my life will not have been in vain.
Another gotcha is dumping UTF8 strings into PostgreSQL and pulling them back out as gibberish. Tell DBI to preserve the UTF8ness:
DBI->connect('blah', 'blah', 'blah', { pg_enable_utf8 => 1 });
Oh yes. I covered that base in my Rose::DB subclass:
I’d be interested to see a small example of what you describe where it goes wrong. I use JSON::XS, DBI and hash keys with Unicode values all over the place without use utf8 and have never seen this. Could you post a small sample?