Subject: Developers of Sympa
List archive
- From: Olivier Salaün - CRU <address@concealed>
- To: Hatuka*nezumi - IKEDA Soji <address@concealed>
- Cc: address@concealed
- Subject: [sympa-dev] Re: Unicode vs. UTF-8
- Date: Mon, 13 Nov 2006 16:53:56 +0100
Thank you for tackling this problem; coping with character encodings
is really a nightmare.

A bit of history: until release 5.2.x, Sympa did not care much about input/output encodings. Web pages and service messages were encoded using the encoding associated with each language. We had a few side effects, including the fact that a shared web document filename might include 8-bit characters.

We noticed that the latest Perl interpreter includes a transparent decoding/recoding layer that could break the HTML output of Sympa (read http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm). Therefore the solution for all our encoding problems seemed to be to use this IO layer to decode everything to Unicode. The drawback of this method is the handling of malformed UTF-8 characters, which happens mainly with templates.

Does your option (c) mean replacing all our

    open FILE, "<:encoding($Conf{'filesystem_encoding'})", $file

calls with a decoding procedure? Or can we still use the IO layer?

I'm afraid I still find it hard to juggle encoding concepts and problems. Therefore, could you please summarize the advantages of option (c)? Can you explain to us what a BOM is (it seems to stand for "byte-order mark")?

Hatuka*nezumi - IKEDA Soji wrote:

> I have been playing with the Sympa 5.3 test release. Almost everything
> seems to work well, but the following problem shows up again:
>
> - When customized templates (including UTF-8 data beyond the ISO-8859-1
>   range) are installed, either under $EXPL_DIR or under $DATADIR, they
>   are decoded/encoded as ISO-8859-1 text.
>
> This happens because some processing paths in Sympa do not handle
> Unicode strings properly; they occasionally strip the utf8 flag from
> data (in the case above, that path is Template::Parser; MIME::Parser is
> also known to strip utf8 flags).
>
> To avoid this problem, there are several options:
>
> (a) Use the undocumented ``UTF-8 BOM'' feature of Template::Provider (as
>     of Template-toolkit 2.14):
>     http://www.template-toolkit.org/pipermail/templates/2004-June/006270.html
>
> (b) Force the templates' encoding to be Unicode, guessing whether the
>     input is UTF-8 or Unicode. For example:
>     http://search.cpan.org/perldoc?Template::Provider::Encoding
>
> (c) Switch Sympa's internal encoding from Unicode to UTF-8 (byte string).
>
> I suppose the last option is better:
>
> - ``UTF-8 BOM'' is confusing for those who wish to create/edit template
>   text: many text editors silently remove it (essentially, a BOM is not
>   allowed by the official UTF-8 feature).
>
> - The former two options, (a) and (b), solve possible problems only for
>   Template-toolkit.
>
> - The last option, (c), will reduce redundant internal encoding/decoding
>   tasks. Decoding to UTF-8 will be required only when reading data; no
>   encoding will be required for Web output.
>
> I'd like to hear the developers' opinions on this issue.
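
For readers following the thread, here is a minimal Perl sketch of the two points discussed above: what the UTF-8 BOM looks like as bytes, and how reading a file through the IO layer (a character string with the utf8 flag set) differs from reading it as a UTF-8 byte string in the spirit of option (c). The temporary file and the sample text are made up for the illustration; this is not Sympa code, and it only assumes the core Encode and File::Temp modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# The UTF-8 BOM is the character U+FEFF encoded as UTF-8 (bytes EF BB BF).
my $bom = encode('UTF-8', "\x{FEFF}");

# Hypothetical template content: a BOM followed by three characters
# that lie beyond the ISO-8859-1 range.
my $text  = "\x{65E5}\x{672C}\x{8A9E}";
my $bytes = $bom . encode('UTF-8', $text);

# Write the raw bytes to a temporary file (illustration only).
my ($fh, $file) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $bytes;
close $fh or die "close: $!";

# IO-layer approach: Perl decodes while reading and returns characters.
open my $in, '<:encoding(UTF-8)', $file or die "open: $!";
my $chars = do { local $/; <$in> };
close $in;

# Byte-string approach: read the file untouched, as UTF-8 bytes.
open $in, '<:raw', $file or die "open: $!";
my $raw = do { local $/; <$in> };
close $in;

# Remove a leading BOM if present, in each representation.
(my $chars_nobom = $chars) =~ s/\A\x{FEFF}//;
(my $raw_nobom   = $raw)   =~ s/\A\xEF\xBB\xBF//;

# The symptom reported in the thread: UTF-8 bytes treated as ISO-8859-1
# turn every byte into one (wrong) character.
my $mojibake = decode('ISO-8859-1', $raw_nobom);

printf "BOM bytes:        %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bom;
printf "IO layer:         %d characters, utf8 flag %s\n",
    length($chars_nobom), utf8::is_utf8($chars_nobom) ? 'on' : 'off';
printf "byte string:      %d bytes, utf8 flag %s\n",
    length($raw_nobom), utf8::is_utf8($raw_nobom) ? 'on' : 'off';
printf "as ISO-8859-1:    %d characters\n", length($mojibake);
printf "decode(raw) eq IO-layer string: %s\n",
    decode('UTF-8', $raw_nobom) eq $chars_nobom ? 'yes' : 'no';
```

Both reads carry the same information: decoding the byte string explicitly gives the same characters as the IO layer. The practical difference is when the decoding happens and whether downstream modules have to preserve the utf8 flag.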