Subject: Developers of Sympa
List archive
- From: Olivier Salaün - CRU <address@concealed>
- To: Hatuka*nezumi - IKEDA Soji <address@concealed>
- Cc: address@concealed
- Subject: [sympa-dev] Re: Unicode vs. UTF-8
- Date: Mon, 13 Nov 2006 16:53:56 +0100
 
Thank you for tackling this problem; coping with character encodings is really a nightmare.

A bit of history: until release 5.2.x, Sympa paid little attention to input/output encodings. Web pages and service messages were encoded using the encoding associated with each language. This had a few side effects, including the fact that a shared web document filename might include 8-bit characters. We noticed that the latest Perl interpreter included a transparent decoding/recoding layer that could break the HTML output of Sympa (see http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm). Therefore the solution for all our encoding problems seemed to be to use this IO layer to decode everything to Unicode. The drawback of this method is the handling of malformed UTF-8 characters, which occur mainly in templates.

Does your option (c) mean replacing all our open FILE, "<:encoding($Conf{'filesystem_encoding'})", $file calls with a decoding procedure? Or can we still use the IO layer?

I'm afraid I still find it hard to juggle encoding concepts and problems. Therefore, could you please summarize the advantages of option (c)? Can you also explain to us what a BOM is (it seems to stand for "byte-order mark")?

Hatuka*nezumi - IKEDA Soji wrote:

I have been playing with the Sympa 5.3 test release. Almost everything seems to work nicely. But the following phenomenon occurs again:
- When a customized template (including UTF-8 data beyond the ISO-8859-1 range) is installed, either under $EXPL_DIR or under $DATADIR, it is decoded/encoded as ISO-8859-1 text.
This happens because some processing paths in Sympa do not handle Unicode strings properly; they occasionally strip the utf8 flag from data (in the case above, that path is Template::Parser; MIME::Parser is also known to strip utf8 flags).
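The effect of a lost utf8 flag can be reproduced outside Sympa. The following is only a minimal sketch using the core Encode module; the sample character and variable names are illustrative assumptions, not taken from Sympa's code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+3042 (HIRAGANA LETTER A) lies outside ISO-8859-1; in UTF-8 it is 3 bytes.
my $bytes = "\xE3\x81\x82";            # UTF-8 byte string, utf8 flag off
my $chars = decode('UTF-8', $bytes);   # character string, utf8 flag on

printf "bytes: length=%d flag=%d\n", length($bytes), utf8::is_utf8($bytes) ? 1 : 0;
printf "chars: length=%d flag=%d\n", length($chars), utf8::is_utf8($chars) ? 1 : 0;

# When flagless bytes are later treated as characters and encoded again,
# each byte is read as an ISO-8859-1 character and re-encoded on its own
# (double encoding): 3 bytes become 6.
my $mangled = encode('UTF-8', $bytes);
printf "mangled: length=%d\n", length($mangled);
```

The doubled length of the last value is the typical symptom of text being handled as ISO-8859-1 somewhere along the path.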
To avoid this problem, there are several options:
(a) Use the undocumented ``UTF-8 BOM'' feature of Template::Provider (as of Template-toolkit 2.14):
    http://www.template-toolkit.org/pipermail/templates/2004-June/006270.html
(b) Force the templates' encoding to be Unicode, guessing whether the input is UTF-8 or Unicode.  For example:
    http://search.cpan.org/perldoc?Template::Provider::Encoding
(c) Switch Sympa's internal encoding from Unicode to UTF-8 (byte string).
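For reference, a UTF-8 BOM is the three-byte sequence EF BB BF (the UTF-8 encoding of U+FEFF) placed at the very start of a file. The sketch below shows the kind of check that option (a) depends on; `strip_utf8_bom` is a hypothetical helper written for illustration and is not part of Template-toolkit or Sympa.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Template-toolkit or Sympa function):
# removes a leading UTF-8 BOM from a byte string and reports
# whether one was present.
sub strip_utf8_bom {
    my ($bytes) = @_;
    my $had_bom = ($bytes =~ s/\A\xEF\xBB\xBF//) ? 1 : 0;
    return ($bytes, $had_bom);
}

# A template that an editor saved with a BOM in front of its content.
my ($text, $had_bom) = strip_utf8_bom("\xEF\xBB\xBF[% title %]\n");
print "BOM found: $had_bom\n";
```

A template whose BOM has been removed by an editor would give `$had_bom == 0` here, so any detection based on it would no longer apply.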
I suppose the last option is better:
- The ``UTF-8 BOM'' is confusing for those who wish to create/edit template text: many text editors silently remove it (strictly speaking, the Unicode standard neither requires nor recommends a BOM for UTF-8).
- The former two options, (a) and (b), solve possible problems only within Template-toolkit.
- The last option (c) will reduce redundant internal encoding/decoding tasks.  Conversion to UTF-8 will be required only when reading data; no encoding will be required for Web output.
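As a rough illustration of option (c), data could be converted to UTF-8 bytes once at read time and then passed through unchanged. This is a sketch under assumptions: `to_internal_utf8` and its `$source_encoding` argument are hypothetical names, and only `Encode::from_to` is an existing library call.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: convert incoming data to UTF-8 bytes once,
# when it is read. $source_encoding stands in for a per-language or
# filesystem encoding setting.
sub to_internal_utf8 {
    my ($raw, $source_encoding) = @_;
    my $copy = $raw;    # from_to() converts its first argument in place
    from_to($copy, $source_encoding, 'UTF-8')
        unless $source_encoding =~ /\Autf-?8\z/i;
    return $copy;
}

my $latin1   = "caf\xE9";    # "café" as ISO-8859-1 bytes
my $internal = to_internal_utf8($latin1, 'ISO-8859-1');

# Output needs no encoding layer: the bytes are already UTF-8.
binmode STDOUT;
print $internal, "\n";
```

In this sketch the internal value is always a byte string with the utf8 flag off, so a module that drops the flag has nothing to lose.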
I'd like to hear the developers' opinions on this issue.
  
- [sympa-dev] Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/12/2006
- [sympa-dev] Re: Unicode vs. UTF-8, Olivier Salaün - CRU, 11/13/2006
- [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/14/2006
- [sympa-dev] Re: Unicode vs. UTF-8, Olivier Salaün - CRU, 11/14/2006
- [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/15/2006
- [sympa-dev] Re: Unicode vs. UTF-8, Hatuka*nezumi - IKEDA Soji, 11/18/2006
 
 