exo : blah


Tue, 21 Oct 2003

the joy of characters

As has been noted in various places, Unicode is not well understood by many programmers. I'd include myself amongst them. However, I'm a lot better informed than I was. I'm in the process of updating my RSS aggregator to handle Unicode better in versions of Perl that support it.

Useful things I have learned in this process are:

  • Always check that all the software you are using is Unicode-aware. Some versions of Gnome Terminal aren't, and the Unicode-aware version of xterm is a separate program.
  • Make sure you flag your data's encoding in the appropriate way. In my case that meant setting the Content-Type header of mails to text/plain; charset="utf-8".
  • Mixing different encodings leads to surprising results. Life is much easier if you convert everything to a common encoding and then do any munging on it.
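
As a rough illustration of the last two points, here is a minimal sketch in Python rather than Perl. It is not the aggregator's actual code: the helper names `to_text` and `build_mail`, the sample strings and the subject line are all made up for the example. It decodes each piece of input from its own declared encoding into a Unicode string first, works on the combined text, and only then builds a mail whose body is flagged as UTF-8.

```python
from email.message import EmailMessage


def to_text(raw: bytes, declared_encoding: str) -> str:
    """Decode raw bytes into a Unicode string using the encoding the source declared.

    Hypothetical helper: a real aggregator would take the encoding from the
    feed's XML declaration or HTTP headers.
    """
    return raw.decode(declared_encoding)


def build_mail(subject: str, body: str) -> EmailMessage:
    """Build a plain-text mail whose body is encoded and labelled as UTF-8."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # set_content() encodes the body and sets the Content-Type header,
    # including the charset parameter.
    msg.set_content(body, charset="utf-8")
    return msg


# Two items arriving in different encodings (sample data for illustration).
latin1_item = "café".encode("iso-8859-1")
utf8_item = "naïve".encode("utf-8")

# Convert everything to a common form first, then do any munging.
body = to_text(latin1_item, "iso-8859-1") + " / " + to_text(utf8_item, "utf-8")

mail = build_mail("feed update", body)
```

Joining the two byte strings directly and decoding the result as UTF-8 would fail, because the ISO-8859-1 bytes are not valid UTF-8. That is the sort of surprise that converting everything up front avoids.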

Anyway, it's all pretty much there, although for the sake of simplicity I've not tried to make it understand anything other than plain text under versions of Perl before 5.7.0. Maybe once I'm happy with the Unicode stuff under later versions I'll look at that.

posted at: 12:17

all the usual copyright stuff... [ copyright struan donald 2002 - present ], plus license