Subject: Developers of Sympa
- From: Olivier Salaün - CRU <address@concealed>
- To: Hatuka*nezumi - IKEDA Soji <address@concealed>
- Cc: address@concealed
- Subject: [sympa-dev] Re: Unicode vs. UTF-8
- Date: Tue, 14 Nov 2006 11:27:49 +0100
Hatuka*nezumi - IKEDA Soji wrote:
The advantages are: it may avoid the trouble caused by mixing Unicode strings (wide characters, in Perl terms) with byte strings. Additionally, it is expected to reduce redundant internal encode/decode work. The following two tables describe what I mean. Please correct me if I have misunderstood.

N.B.: ``byte'' in the following tables means a byte string that may contain non-UTF-8 or binary data.

(I) In current Sympa, internal text processing is carried out assuming Unicode strings, where the encoding of each source and the required conversions are:

That's a brilliant idea to summarize things through this table; it should help sort things out.

Sources               : Encoding            : Required (used) conversion
............................................................................
Core of system
  gettext()           : locale charset      : to Unicode
  config file         : filesystem_encoding : PerlIO layer by open()
  template
    (source)          : UTF-8               : (currently broken)
    (parameters)      : Unicode or UTF-8    : Template::Directive::OUTPUT hack
    (parsed result)   : Unicode or UTF-8    : ditto.
  shared file name    : Q-encoded UTF-8     : Q-encode/decode & Encode::decode
Input from users
  message headers     : MIME-encoded        : to Unicode (for templates)
  HTTP input
    (POST parameters) : UTF-8               : PerlIO layer by binmode()
    (file upload)     : byte                : none
Output to users
  message via list    : byte                : none (except custom_subject)
  service message     : mixture of any      : mail::reformat_message()
  HTTP output
    (HTML)            : Unicode             : PerlIO layer by binmode()
    (file download)   : byte                : none
.................................................................
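
The mixture problem that table (I) is concerned with can be reproduced in a few lines. The following is a minimal sketch, not Sympa code: the sample string and variable names are assumptions chosen for illustration, and only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "é" (U+00E9), as they would arrive from a file or socket.
my $bytes = "\xC3\xA9";

# Decoding yields a Perl Unicode (wide character) string of one character.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;    # counts bytes
my $char_len = length $chars;    # counts characters

# Concatenating the two kinds makes Perl treat each byte as a Latin-1
# character, so the result holds U+00C3, U+00A9 and U+00E9.
my $mixed     = $bytes . $chars;
my $mixed_len = length $mixed;

# Encoding that mixture to UTF-8 double-encodes the original bytes.
my $out = encode('UTF-8', $mixed);

printf "%d %d %d %d\n", $byte_len, $char_len, $mixed_len, length $out;
```

The byte string and the decoded string describe the same text but report different lengths, and joining them without an explicit conversion produces garbled output without raising any warning.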
(II) If internal text processing were carried out assuming UTF-8,
we would only need to distinguish between UTF-8 and byte strings
(* marks changed items):
Sources               : Encoding            : Required conversion
............................................................................
Core of system
  gettext()           : locale charset      :*to UTF-8
  config file         : filesystem_encoding :*to UTF-8
  template
    (source)          : UTF-8               :*none
    (parameters)      :*UTF-8               :*none
    (parsed result)   :*UTF-8               :*none
  shared file name    : Q-encoded UTF-8     :*Q-encode/decode
Input from users
  message headers     : MIME-encoded        :*to UTF-8 (for templates)
  HTTP input
    (POST parameters) : UTF-8               :*none
    (file upload)     : byte                : none
Outputs to users
  message via list    : byte                : none (except custom_subject)
  service message     : mixture of both     : mail::reformat_message()
  HTTP output
    (HTML)            : UTF-8               :*none
    (file download)   : byte                : none
.................................................................
In the end, the changes needed to switch from Unicode to UTF-8 are likely to be:
- Read raw config files, then convert them to UTF-8.
- Change the output of Language::gettext() from Unicode to UTF-8.
- Remove the Template::Directive::OUTPUT hack (see src/tt2.pl).
- Disable the PerlIO layer in wwsympa.fcgi.
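
The first two changes in this list both amount to converting bytes in some source charset into UTF-8 bytes at the input boundary. A minimal sketch of that conversion follows. The helper name to_utf8_bytes and the ISO-8859-1 sample are assumptions made for illustration and do not come from the Sympa source.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert a raw byte string from $charset to UTF-8 bytes.
sub to_utf8_bytes {
    my ($raw, $charset) = @_;
    my $chars = decode($charset, $raw);    # bytes -> Unicode characters
    return encode('UTF-8', $chars);        # Unicode characters -> UTF-8 bytes
}

# "café" as ISO-8859-1 bytes, standing in for a line read raw from a
# config file or a string returned in the locale charset.
my $latin1_line = "caf\xE9";
my $utf8_line   = to_utf8_bytes($latin1_line, 'ISO-8859-1');

# The result is a plain byte string, not a Perl Unicode string.
my $is_wide = utf8::is_utf8($utf8_line) ? 1 : 0;
```

With this approach the Unicode string exists only inside the helper, and everything handed to the rest of the program is UTF-8 bytes, which is the situation described in table (II).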
It is probably worth building a prototype to check that we don't get
unexpected behavior after these changes. Do you have some time/energy to make a patch that switches from Unicode to UTF-8?
You are right. Concretely, the so-called ``UTF-8 BOM'' is the sequence "\xEF\xBB\xBF" (UCS U+FEFF) prepended to text data (though ``byte order'' is meaningless in UTF-8). In current Template-Toolkit, the "UNICODE" option may be used to distinguish Unicode-oriented templates from byte-oriented templates according to whether they have a ``BOM'' or not (for more details, check the source of Template::Provider). This BOM feature is a sort of kludge, I believe. It can cause some confusion (as I wrote below).

Thanks for providing these explanations.
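
To make the byte sequence concrete, here is a minimal sketch of checking for and stripping a leading ``UTF-8 BOM'' on raw template bytes. It does not reproduce the logic of Template::Provider. The helper names has_utf8_bom and strip_utf8_bom and the sample template text are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The UTF-8 encoding of U+FEFF, as given in the message above.
my $UTF8_BOM = "\xEF\xBB\xBF";

# The same three bytes, obtained by encoding the character itself.
my $encoded_feff = encode('UTF-8', "\x{FEFF}");

# Hypothetical helper: 1 if the raw bytes start with the BOM, else 0.
sub has_utf8_bom {
    my ($raw) = @_;
    return substr($raw, 0, 3) eq $UTF8_BOM ? 1 : 0;
}

# Hypothetical helper: return a copy of the bytes without a leading BOM.
sub strip_utf8_bom {
    my ($raw) = @_;
    $raw =~ s/\A\xEF\xBB\xBF//;
    return $raw;
}

my $without_bom = '[% title %]';
my $with_bom    = $UTF8_BOM . $without_bom;

printf "%d %d\n", has_utf8_bom($with_bom), has_utf8_bom($without_bom);
```

Two templates with identical visible text can therefore differ only in these three leading bytes, which is how the confusion mentioned above can arise.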