Subject: Developers of Sympa
List archive
- From: Olivier Salaün - CRU <address@concealed>
- To: Hatuka*nezumi - IKEDA Soji <address@concealed>
- Cc: address@concealed
- Subject: [sympa-dev] Re: Unicode vs. UTF-8
- Date: Tue, 14 Nov 2006 11:27:49 +0100
Hatuka*nezumi - IKEDA Soji wrote:
> > I'm afraid I still find it hard to juggle encoding concepts and
> > problems. Therefore, could you please summarize the advantages of
> > using option (c)?
>
> The advantages are: it may avoid the trouble caused by mixing Unicode
> (wide characters, in Perl terms) and byte strings. Additionally, it is
> expected to reduce redundant internal encode/decode tasks.
>
> The following two tables describe what I mean. Please correct me if I
> have misunderstood.
>
> N.B.: ``byte'' in the following tables means a byte string that may
> contain non-UTF-8 or binary data.
>
> (I) In current Sympa, internal text processing is carried out assuming
> Unicode strings. The encoding of each source and the required
> conversions are:

That's a brilliant idea to summarize things through this table; it should help sort things out.

> Sources                : Encoding            : Required (used) conversion
> ............................................................................
> Core of system
>   gettext()            : locale charset      : to Unicode
>   config file          : filesystem_encoding : PerlIO layer by open()
>   template
>     (source)           : UTF-8               : (currently broken)
>     (parameters)       : Unicode or UTF-8    : Template::Directive::OUTPUT hack
>     (parsed result)    : Unicode or UTF-8    : ditto.
>   shared file name     : Q-encoded UTF-8     : Q-encode/decode & Encode::decode
> Input from users
>   message headers      : MIME-encoded        : to Unicode (for templates)
>   HTTP input
>     (POST parameters)  : UTF-8               : PerlIO layer by binmode()
>     (file upload)      : byte                : none
> Output to users
>   message via list     : byte                : none (except custom_subject)
>   service message      : mixture of any      : mail::reformat_message()
>   HTTP output
>     (HTML)             : Unicode             : PerlIO layer by binmode()
>     (file download)    : byte                : none
> ............................................................................
>
> (II) If internal text processing were carried out assuming UTF-8, we
> would only need to distinguish between UTF-8 and byte (changed items
> are marked with *):
>
> Sources                : Encoding            : Required conversion
> ............................................................................
> Core of system
>   gettext()            : locale charset      :*to UTF-8
>   config file          : filesystem_encoding :*to UTF-8
>   template
>     (source)           : UTF-8               :*none
>     (parameters)       :*UTF-8               :*none
>     (parsed result)    :*UTF-8               :*none
>   shared file name     : Q-encoded UTF-8     :*Q-encode/decode
> Input from users
>   message headers      : MIME-encoded        :*to UTF-8 (for templates)
>   HTTP input
>     (POST parameters)  : UTF-8               :*none
>     (file upload)      : byte                : none
> Output to users
>   message via list     : byte                : none (except custom_subject)
>   service message      : mixture of both     : mail::reformat_message()
>   HTTP output
>     (HTML)             : UTF-8               :*none
>     (file download)    : byte                : none
> ............................................................................
>
> After all, the changes needed to switch from Unicode to UTF-8 are
> likely:
> - To read raw config files, then convert them to UTF-8.
> - To change Language::gettext()'s output from Unicode to UTF-8.
> - To remove the Template::Directive::OUTPUT hack (see src/tt2.pl).
> - To disable the PerlIO layer on wwsympa.fcgi.

It is probably worth building a prototype to check that we don't get unexpected behavior after these changes. Do you have some time/energy to make a patch that switches from Unicode to UTF-8?

> > Can you explain to us what a BOM is (it seems to stand for
> > "byte-order mark")?
>
> You are right. Concretely, the so-called ``UTF-8 BOM'' is the sequence
> "\xEF\xBB\xBF" (UCS U+FEFF) prepended to text data (though ``byte
> order'' is meaningless in UTF-8). In current Template-Toolkit, the
> "UNICODE" option may be used to distinguish Unicode-oriented templates
> from byte-oriented templates according to whether they have a ``BOM''
> or not (for more details, check the source of Template::Provider).
> This BOM feature is a sort of kludge, I believe. It can cause some
> confusion (as I wrote below).

Thanks for providing these explanations.
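To make the BOM explanation above concrete, here is a minimal sketch of BOM-based discrimination between "BOM-marked" and plain byte data. It is written in Python only for illustration: Python's `bytes` and `str` stand in for Perl's byte strings and wide-character strings. The helper name `decode_template` is hypothetical and is not part of Sympa or Template-Toolkit; this is not how Template::Provider is implemented.

```python
import codecs

# The so-called UTF-8 BOM is U+FEFF encoded as UTF-8: the bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def decode_template(data: bytes):
    """Hypothetical helper: decode UTF-8 bytes and report whether a BOM was present.

    Returns a (text, had_bom) pair. The "utf-8-sig" codec strips one
    leading BOM if there is one and otherwise behaves like plain UTF-8.
    """
    had_bom = data.startswith(UTF8_BOM)
    text = data.decode("utf-8-sig")
    return text, had_bom


# Illustrative inputs (assumed for this sketch, not taken from Sympa).
plain = "Salaün".encode("utf-8")   # UTF-8 bytes without a BOM
marked = UTF8_BOM + plain          # the same bytes with a BOM prepended

print(decode_template(plain))
print(decode_template(marked))
```

Both inputs decode to the same text; only the flag differs. That is why a BOM can cause confusion: two files with identical visible content may be treated differently depending on three invisible leading bytes.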