Subject: Developers of Sympa

[sympa-dev] Re: Unicode vs. UTF-8

From: Olivier Salaün - CRU <address@concealed>
To: Hatuka*nezumi - IKEDA Soji <address@concealed>
Cc: address@concealed
Subject: [sympa-dev] Re: Unicode vs. UTF-8
Date: Tue, 14 Nov 2006 11:27:49 +0100

Hatuka*nezumi - IKEDA Soji wrote:

I'm afraid I still find it hard to juggle encoding concepts and problems. Could you therefore please summarize the advantages of using option (c)?

The advantages are: it should hopefully avoid the problems caused by mixing Unicode strings (wide characters, in Perl terms) with byte strings.  Additionally, it is expected to reduce redundant internal encode/decode steps.
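
To make the "mixture" problem concrete, here is a minimal sketch. It uses Python rather than Perl, and the sample text is made up; it only imitates, by hand, the effect of UTF-8 bytes being treated as one character per byte and then encoded a second time.

```python
# Illustration only: Python stands in for Perl, and the sample text is made up.
text = "caf\u00e9"                  # character (Unicode) string: "café"
utf8_bytes = text.encode("utf-8")   # the same text as a UTF-8 byte string

# Mistake: treat each UTF-8 byte as one Latin-1 character, then encode again.
mistaken = utf8_bytes.decode("latin-1")
double_encoded = mistaken.encode("utf-8")

# Decoding exactly once, at the boundary, round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")
```

Keeping a single internal representation means there is only one place where this kind of slip can occur.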

The following two tables describe what I mean.  Please correct me if I have misunderstood.

N.B.: ``byte'' in the following tables means a byte string that may contain non-UTF-8 or binary data.

(I) In current Sympa, internal text processing is carried out assuming Unicode strings.  The encoding of each source and the required conversions are:

It's a brilliant idea to summarize things in a table; it should help sort things out.

Sources               : Encoding            : Required (used) conversion
............................................................................
Core of system
  gettext()           : locale charset      : to Unicode
  config file         : filesystem_encoding : PerlIO layer by open()
  template
    (source)          : UTF-8               : (currently broken)
    (parameters)      : Unicode or UTF-8    : Template::Directive::OUTPUT hack
    (parsed result)   : Unicode or UTF-8    : ditto.
  shared file name    : Q-encoded UTF-8     : Q-encode/decode & Encode::decode

Input from users
  message headers     : MIME-encoded        : to Unicode (for templates)
  HTTP input
    (POST parameters) : UTF-8               : PerlIO layer by binmode()
    (file upload)     : byte                : none

Output to users
  message via list    : byte                : none (except custom_subject)
  service message     : mixture of any      : mail::reformat_message()
  HTTP output
    (HTML)            : Unicode             : PerlIO layer by binmode()
    (file download)   : byte                : none
.................................................................


(II) If internal text processing were carried out assuming UTF-8,
  we would only need to distinguish between UTF-8 and byte
  (* marks changed items):

Sources               : Encoding            : Required conversion
............................................................................
Core of system
  gettext()           : locale charset      :*to UTF-8
  config file         : filesystem_encoding :*to UTF-8
  template
    (source)          : UTF-8               :*none
    (parameters)      :*UTF-8               :*none
    (parsed result)   :*UTF-8               :*none
  shared file name    : Q-encoded UTF-8     :*Q-encode/decode

Input from users
  message headers     : MIME-encoded        :*to UTF-8 (for templates)
  HTTP input
    (POST parameters) : UTF-8               :*none
    (file upload)     : byte                : none

Output to users
  message via list    : byte                : none (except custom_subject)
  service message     : mixture of both     : mail::reformat_message()
  HTTP output
    (HTML)            : UTF-8               :*none
    (file download)   : byte                : none
.................................................................
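
As an illustration of the "message headers : MIME-encoded : to UTF-8" row, a conversion of that kind could be sketched as follows. This is Python, not Sympa's Perl code; the function name `header_to_utf8` and the sample header are hypothetical.

```python
from email.header import decode_header

def header_to_utf8(raw_header: str) -> bytes:
    """Decode a MIME-encoded header and return its text as UTF-8 bytes."""
    parts = []
    for value, charset in decode_header(raw_header):
        if isinstance(value, bytes):
            # Assumption: chunks without a declared charset are ASCII.
            value = value.decode(charset or "ascii", errors="replace")
        parts.append(value)
    return "".join(parts).encode("utf-8")

subject_utf8 = header_to_utf8("=?ISO-8859-1?Q?Caf=E9?=")
```

The result is an ordinary UTF-8 byte string, so it can be passed on without any further conversion step.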

In summary, the changes needed to switch from Unicode to UTF-8 are likely to be:
- Read config files raw, then convert them to UTF-8.
- Change the output of Language::gettext() from Unicode to UTF-8.
- Remove the Template::Directive::OUTPUT hack (see src/tt2.pl).
- Disable the PerlIO layer in wwsympa.fcgi.
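
The first item in the list above (read the raw bytes, then convert) might look roughly like this. It is a Python sketch under stated assumptions, not a patch for Sympa: `read_config_as_utf8` is a hypothetical helper, and ISO-8859-1 is only an example value for filesystem_encoding.

```python
import os
import tempfile

def read_config_as_utf8(path: str, source_encoding: str) -> bytes:
    """Read a file as raw bytes, then re-encode its text as UTF-8."""
    with open(path, "rb") as handle:  # binary mode: no decoding layer applied
        raw = handle.read()
    return raw.decode(source_encoding).encode("utf-8")

# Demonstration with a throwaway file written in ISO-8859-1 (made-up content).
with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write("title Caf\u00e9\n".encode("iso-8859-1"))
config_utf8 = read_config_as_utf8(sample.name, "iso-8859-1")
os.unlink(sample.name)
```

After this step the rest of the program only ever sees UTF-8 bytes, whatever the file's original encoding was.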

It is probably worth building a prototype to check that these changes do not cause unexpected behavior.
Do you have the time and energy to make a patch that switches from Unicode to UTF-8?

Can you explain to us what a BOM is (it seems to stand for "byte-order mark")?

You are right.  Concretely, the so-called ``UTF-8 BOM'' is the byte
sequence "\xEF\xBB\xBF" (the UTF-8 encoding of U+FEFF) prepended to text
data (although ``byte order'' is meaningless for UTF-8).
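
The relationship between U+FEFF and those three bytes can be checked directly. The snippet below is Python, used purely for illustration; the sample data is made up.

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three-byte "UTF-8 BOM".
bom_bytes = "\ufeff".encode("utf-8")

data = codecs.BOM_UTF8 + b"hello"

# A plain UTF-8 decode keeps U+FEFF as the first character ...
kept = data.decode("utf-8")
# ... while the "utf-8-sig" codec drops a leading BOM if one is present.
stripped = data.decode("utf-8-sig")
```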

In the current Template-Toolkit, the "UNICODE" option may be used to
distinguish Unicode-oriented templates from byte-oriented templates
according to whether or not they begin with a ``BOM'' (for more
details, see the source of Template::Provider).
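
The idea of BOM-based discrimination can be sketched as below. This is a Python illustration of the general technique only; it is not Template::Provider's code, the function name `load_template` is hypothetical, and only the UTF-8 BOM is handled.

```python
import codecs
from typing import Union

def load_template(data: bytes, unicode_option: bool = True) -> Union[str, bytes]:
    """Return decoded text if the data starts with a UTF-8 BOM, else raw bytes."""
    if unicode_option and data.startswith(codecs.BOM_UTF8):
        # BOM present: treat as a Unicode-oriented template and drop the BOM.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM (or option off): treat as a byte-oriented template, untouched.
    return data

with_bom = load_template(codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8"))
without_bom = load_template("caf\u00e9".encode("utf-8"))
```

Two templates with the same visible content thus end up as different kinds of string, depending only on three invisible leading bytes.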

I believe this BOM feature is something of a kludge.  It can cause
some confusion (as I describe below).

Thanks for providing these explanations.
