devel - [sympa-dev] Re: Unicode vs. UTF-8

Subject: Developers of Sympa

From: Olivier Salaün - CRU <address@concealed>
To: Hatuka*nezumi - IKEDA Soji <address@concealed>
Cc: address@concealed
Subject: [sympa-dev] Re: Unicode vs. UTF-8
Date: Mon, 13 Nov 2006 16:53:56 +0100

Thank you for tackling this problem; coping with character encodings is really a nightmare.

A bit of history:
Until release 5.2.x, Sympa did not pay much attention to input/output encodings. Web pages and service messages were encoded using the encoding associated with each language. This had a few side effects, including the fact that a shared web document filename might include 8-bit characters. We noticed that recent Perl interpreters include a transparent decoding/recoding layer that could break the HTML output of Sympa (read http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm). The solution to all our encoding problems therefore seemed to be to use this IO layer to decode everything to Unicode. The drawback of this method is the handling of malformed UTF-8 characters, which occur mostly in templates.

Does your option (c) mean replacing all our

open FILE, "<:encoding($Conf{'filesystem_encoding'})", $file

with a decoding procedure? Or can we still use the IO layer?
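
For readers comparing the two alternatives named in this question, here is a minimal sketch, not Sympa's actual code. It reads the same file once through the PerlIO :encoding layer and once as raw bytes followed by an explicit Encode::decode call. The variable $encoding is a hypothetical stand-in for $Conf{'filesystem_encoding'}, and both helper subs are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical stand-in for $Conf{'filesystem_encoding'}.
my $encoding = 'UTF-8';

# Write a small sample file containing raw UTF-8 bytes ("e acute" = C3 A9).
my ($tmp, $file) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "caf\xC3\xA9\n";
close $tmp or die "close: $!";

# Variant 1: let the PerlIO layer decode while reading.
sub read_with_layer {
    my ($path, $enc) = @_;
    open my $fh, "<:encoding($enc)", $path or die "open $path: $!";
    local $/;                      # slurp the whole file
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Variant 2: read raw bytes, then decode explicitly.
sub read_and_decode {
    my ($path, $enc) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close $fh;
    return decode($enc, $bytes);
}

my $via_layer  = read_with_layer($file, $encoding);
my $via_decode = read_and_decode($file, $encoding);
```

For well-formed input both variants yield the same character string. The explicit variant gives one place to decide what happens to malformed input, since decode accepts an optional CHECK argument.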

I'm afraid I still find it hard to juggle encoding concepts and problems. Could you therefore please summarize the advantages of option (c)?

Can you explain to us what a BOM is (it seems to stand for "byte-order mark")?
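
As background: the byte-order mark is the character U+FEFF placed at the very start of a text stream. In UTF-16 it indicates byte order; in UTF-8 it is encoded as the three bytes EF BB BF and serves only as a signature. The sketch below shows how such a prefix could be detected and removed from a byte string. The helper name strip_utf8_bom is hypothetical and is not part of Sympa or Template-toolkit.

```perl
use strict;
use warnings;

# The UTF-8 encoding of U+FEFF (the byte-order mark) is EF BB BF.
my $UTF8_BOM = "\xEF\xBB\xBF";

# Hypothetical helper: report whether a byte string starts with a
# UTF-8 BOM and return the string with that BOM removed.
sub strip_utf8_bom {
    my ($bytes) = @_;
    my $had_bom = ($bytes =~ s/\A\Q$UTF8_BOM\E//) ? 1 : 0;
    return ($had_bom, $bytes);
}

# A template body with a leading BOM, and the same body without one.
my ($had_bom, $body) = strip_utf8_bom("\xEF\xBB\xBF[% title %]\n");
my ($no_bom,  $same) = strip_utf8_bom("[% title %]\n");
```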

Hatuka*nezumi - IKEDA Soji wrote:

I have been playing with the Sympa 5.3 test release. Almost everything seems to work well, but the following problem can be reproduced again:

- When a customized template (including UTF-8 data beyond the ISO-8859-1 range) is installed, either under $EXPL_DIR or under $DATADIR, it is decoded/encoded as ISO-8859-1 text.

This happens because some processing paths in Sympa do not handle Unicode strings properly; they occasionally strip the utf8 flag from data (in the case above, that path is Template::Parser; MIME::Parser is also known to strip utf8 flags).
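
The effect described here can be reproduced in isolation. The following is only an illustrative sketch using the Encode module, not the code path inside Template::Parser or MIME::Parser: it models the loss of the utf8 flag by decoding UTF-8 bytes as ISO-8859-1, then shows the "double encoding" that results when that text is written out as UTF-8 again.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for U+3042 (HIRAGANA LETTER A): E3 81 82.
my $bytes = "\xE3\x81\x82";

# Correct path: decode the bytes as UTF-8, giving one character.
my $chars = decode('UTF-8', $bytes);

# Faulty path: the same bytes read as ISO-8859-1 become three
# separate characters (U+00E3, U+0081, U+0082).
my $misread = decode('ISO-8859-1', $bytes);

# Re-encoding the misread text as UTF-8 doubles the damage:
# each of the three original bytes turns into two bytes.
my $double = encode('UTF-8', $misread);

printf "correct: %d char, misread: %d chars, re-encoded: %d bytes\n",
    length($chars), length($misread), length($double);
```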

To avoid this problem, there are several options:

(a) Use the undocumented ``UTF-8 BOM'' feature of Template::Provider (as of Template-toolkit 2.14):
    http://www.template-toolkit.org/pipermail/templates/2004-June/006270.html

(b) Force templates' encodings to be Unicode, guessing whether the input is UTF-8 or Unicode.  For example:
    http://search.cpan.org/perldoc?Template::Provider::Encoding

(c) Switch Sympa's internal encoding from Unicode to UTF-8 (byte string).

I suppose the last option is better:

- ``UTF-8 BOM'' is confusing for those who wish to create/edit template text: many text editors silently remove it (and the Unicode Standard neither requires nor recommends a BOM for UTF-8).

- The first two options, (a) and (b), solve only the problems that arise in Template-toolkit.

- The last option (c) will reduce redundant internal encoding/decoding work.  Conversion to UTF-8 will be required only when reading data; no encoding will be required for Web output.
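
To make option (c) concrete, here is a rough sketch of the idea under stated assumptions; it is not taken from Sympa. Input in some known charset is converted to UTF-8 bytes once, at read time, and from then on the data stays a plain byte string (utf8 flag off) that can be written to the Web output unchanged. The helper name to_utf8_bytes and the sample input are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert input bytes in a known charset to
# UTF-8 bytes once, at the time the data is read.
sub to_utf8_bytes {
    my ($bytes, $charset) = @_;
    return encode('UTF-8', decode($charset, $bytes));
}

# Sample input in ISO-8859-1 ("e acute" is the single byte E9).
my $latin1_input = "caf\xE9";

# Internal representation: UTF-8 bytes, with the utf8 flag off.
my $internal = to_utf8_bytes($latin1_input, 'ISO-8859-1');

# Because $internal is already UTF-8 bytes, a UTF-8 Web page could
# print it through a raw (byte-oriented) handle with no further
# encoding step.
my $flag_is_on = utf8::is_utf8($internal) ? 1 : 0;
```

The trade-off in this design is that length, substr, and regular expressions then operate on bytes rather than characters, so any code that needs character semantics has to decode locally.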

I'd like to hear developers' opinions on this issue.

[sympa-dev] Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/12/2006
- [sympa-dev] Re: Unicode vs. UTF-8, Olivier Salaün - CRU, 11/13/2006
  - [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/14/2006
    - [sympa-dev] Re: Unicode vs. UTF-8, Olivier Salaün - CRU, 11/14/2006
      - [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/15/2006
      - [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/18/2006