devel - [sympa-dev] Re: Unicode vs. UTF-8

Subject: Developers of Sympa

From: Hatuka*nezumi - IKEDA Soji <address@concealed>
To: Olivier Salaün - CRU <address@concealed>
Cc: address@concealed
Subject: [sympa-dev] Re: Unicode vs. UTF-8
Date: Tue, 14 Nov 2006 19:02:50 +0900

On Mon, 13 Nov 2006 16:53:56 +0100
Olivier Salaün - CRU <address@concealed> wrote:

> Thank you for tackling this problem; coping with character encodings is
> really a nightmare.
>
> A bit of history:
> Until release 5.2.x, Sympa did not care much about input/output
> encodings. Web pages and service messages were encoded using the
> encoding associated with each language. We had a few side effects,
> including the fact that a shared web document filename might include
> 8-bit characters. We noticed that the latest Perl interpreter included a
> transparent decoding/recoding layer that could break the HTML output of
> Sympa (read http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm).
> Therefore the solution to all our encoding problems seemed to be to use
> this IO layer to decode everything to Unicode. The drawback of this
> method is the handling of malformed UTF-8 characters, which happens
> mainly with templates.
>
> Does your option (c) mean replacing all our
>
> open FILE, "<:encoding($Conf{'filesystem_encoding'})", $file
>
> with a decoding procedure? Or can we still use the IO layer?
>
> I'm afraid I still find it hard to juggle encoding concepts and
> problems. Therefore, could you please summarize the advantages of using
> option (c)?

The advantages are: it should avoid the trouble caused by mixing
Unicode strings (wide characters, in Perl terms) with byte strings.
It is also expected to reduce redundant internal encode/decode work.
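
To illustrate the kind of trouble I mean, here is a minimal sketch (not
Sympa code; the variable names are only for illustration). It assumes one
piece of text has been decoded to Unicode while another is still a raw
UTF-8 byte string, and shows what happens when they are concatenated:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "e acute" as raw UTF-8 bytes, e.g. read from a file without any layer.
my $bytes = "\xC3\xA9";
# The same text decoded to a Perl character (Unicode) string.
my $chars = decode('UTF-8', $bytes);

# Concatenation silently upgrades $bytes as if it were ISO-8859-1,
# so the byte half is double-encoded when the result is output.
my $mixed = encode('UTF-8', $chars . $bytes);

# Keeping everything as UTF-8 bytes avoids the implicit upgrade.
my $clean = encode('UTF-8', $chars) . $bytes;

printf "mixed: %d bytes, clean: %d bytes\n", length($mixed), length($clean);
# prints "mixed: 6 bytes, clean: 4 bytes"
```

The extra two bytes in the mixed result are the mojibake a user would see
in the output.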

The following two tables describe what I mean. Please correct me
if I have misunderstood anything.

N.B.: ``byte'' in the following tables means a byte string that
probably contains non-UTF-8 or binary data.

(I) In current Sympa, internal text processing is carried out
assuming Unicode strings. The encoding of each source and the
conversion required are:

Sources               : Encoding            : Required (used) conversion
............................................................................
Core of system
  gettext()           : locale charset      : to Unicode
  config file         : filesystem_encoding : PerlIO layer by open()
  template
    (source)          : UTF-8               : (currently broken)
    (parameters)      : Unicode or UTF-8    : Template::Directive::OUTPUT hack
    (parsed result)   : Unicode or UTF-8    : ditto.
  shared file name    : Q-encoded UTF-8     : Q-encode/decode & Encode::decode

Input from users
  message headers     : MIME-encoded        : to Unicode (for templates)
  HTTP input
    (POST parameters) : UTF-8               : PerlIO layer by binmode()
    (file upload)     : byte                : none

Output to users
  message via list    : byte                : none (except custom_subject)
  service message     : mixture of any      : mail::reformat_message()
  HTTP output
    (HTML)            : Unicode             : PerlIO layer by binmode()
    (file download)   : byte                : none
............................................................................

(II) If internal text processing assumed UTF-8 instead, we would
only need to distinguish between UTF-8 and byte (items marked
with * are changed):

Sources               : Encoding            : Required conversion
............................................................................
Core of system
  gettext()           : locale charset      :*to UTF-8
  config file         : filesystem_encoding :*to UTF-8
  template
    (source)          : UTF-8               :*none
    (parameters)      :*UTF-8               :*none
    (parsed result)   :*UTF-8               :*none
  shared file name    : Q-encoded UTF-8     :*Q-encode/decode

Input from users
  message headers     : MIME-encoded        :*to UTF-8 (for templates)
  HTTP input
    (POST parameters) : UTF-8               :*none
    (file upload)     : byte                : none

Output to users
  message via list    : byte                : none (except custom_subject)
  service message     : mixture of both     : mail::reformat_message()
  HTTP output
    (HTML)            : UTF-8               :*none
    (file download)   : byte                : none
............................................................................

In summary, the changes needed to switch from Unicode to UTF-8 are likely to be:
- Read config files raw, then convert them to UTF-8.
- Change the output of Language::gettext() from Unicode to UTF-8.
- Remove the Template::Directive::OUTPUT hack (see src/tt2.pl).
- Disable the PerlIO layer in wwsympa.fcgi.
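
The first item in the list above could look roughly like the following
sketch. The helper name to_utf8_bytes() is hypothetical, not an existing
Sympa function; in practice the octets would come from a handle opened
with '<:raw' and the source encoding from filesystem_encoding:

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: convert raw octets in a known encoding
# to a UTF-8 byte string (no utf8 flag, no PerlIO layer involved).
sub to_utf8_bytes {
    my ($octets, $source_encoding) = @_;
    # from_to() converts its first argument in place; it works on our
    # private copy here, so the caller's data is left untouched.
    defined from_to($octets, $source_encoding, 'UTF-8')
        or die "cannot convert from $source_encoding";
    return $octets;
}

my $latin1 = "caf\xE9";    # "cafe" with e acute, in ISO-8859-1
my $utf8   = to_utf8_bytes($latin1, 'iso-8859-1');

printf "%d bytes in, %d bytes out\n", length($latin1), length($utf8);
```

The result is still a plain byte string, so it can be passed to the
templates and written to the Web output without further encoding.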

> Can you explain to us what a BOM is (it seems to stand for "byte-order mark")?

You are right. Concretely, the so-called ``UTF-8 BOM'' is the byte
sequence "\xEF\xBB\xBF" (U+FEFF encoded in UTF-8) prepended to text
data (though ``byte order'' is meaningless for UTF-8).
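
As a small sketch of what this means in practice (strip_utf8_bom() is a
hypothetical helper for illustration, not part of Sympa or
Template-Toolkit), the BOM can be derived from U+FEFF and removed from
the head of a byte string like this:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF encoded as UTF-8 gives the three BOM bytes.
my $bom = encode('UTF-8', "\x{FEFF}");

# Hypothetical helper: remove a leading UTF-8 BOM from a byte string.
sub strip_utf8_bom {
    my ($octets) = @_;
    $octets =~ s/\A\xEF\xBB\xBF//;
    return $octets;
}

my $with_bom = $bom . "[% title %]";
my $stripped = strip_utf8_bom($with_bom);

printf "BOM is %d bytes; template body is \"%s\"\n", length($bom), $stripped;
```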

In the current Template-Toolkit, the "UNICODE" option may be used to
distinguish Unicode-oriented templates from byte-oriented templates
according to whether or not they have a ``BOM'' (for more details,
check the source of Template::Provider).

This BOM feature is a sort of kludge, I believe. It can cause
some confusion (as I wrote below).

> Hatuka*nezumi - IKEDA Soji wrote:
> > I have been playing with the Sympa 5.3 test release. Almost everything
> > seems to work well, but the following problem is reproduced again:
> >
> > - When a customized template (including UTF-8 data beyond the ISO-8859-1
> > range) is installed, either under $EXPL_DIR or under $DATADIR, it is
> > decoded/encoded as ISO-8859-1 text.
> >
> > This happens because some processing paths in Sympa do not handle
> > Unicode strings properly; they occasionally strip the utf8 flag from data
> > (in the case above, that path is Template::Parser; MIME::Parser is also
> > known to strip utf8 flags).
> >
> > To avoid this problem, there are several options:
> >
> > (a) Use the undocumented ``UTF-8 BOM'' feature of Template::Provider (as of
> > Template-toolkit 2.14):
> >
> > http://www.template-toolkit.org/pipermail/templates/2004-June/006270.html
> >
> > (b) Force the templates' encoding to be Unicode, guessing whether the
> > input is UTF-8 or Unicode. For example:
> > http://search.cpan.org/perldoc?Template::Provider::Encoding
> >
> > (c) Switch Sympa's internal encoding from Unicode to UTF-8 (byte string).
> >
> > I suppose the last option is the best:
> >
> > - ``UTF-8 BOM'' is confusing for those who wish to create/edit template
> > text: many text editors silently remove it (the Unicode standard neither
> > requires nor recommends a BOM for UTF-8).
> >
> > - The former two options, (a) and (b), solve only the problems caused
> > by Template-toolkit.
> >
> > - The last option, (c), will reduce redundant internal encoding/decoding
> > work. Conversion to UTF-8 will be required only when reading data; no
> > encoding will be required for Web output.
> >
> > I'd like to hear the developers' opinions on this issue.

--- nezumi

[sympa-dev] Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/12/2006
- [sympa-dev] Re: Unicode vs. UTF-8, Olivier Salaün - CRU, 11/13/2006
  - [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/14/2006
    - [sympa-dev] Re: Unicode vs. UTF-8, Olivier Salaün - CRU, 11/14/2006
      - [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/15/2006
      - [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/18/2006
