devel - Re: [sympa-dev] Charset/encoding for e-mail message

Subject: Developers of Sympa


From: Olivier Salaün - CRU <address@concealed>
To: Hatuka*nezumi - IKEDA Soji <address@concealed>
Cc: address@concealed, address@concealed
Subject: Re: [sympa-dev] Charset/encoding for e-mail message
Date: Wed, 27 Sep 2006 16:55:16 +0200

Hi,

Hatuka*nezumi - IKEDA Soji wrote:

Currently, there are the following problems with e-mail encoding in Sympa ---

o On some locales, specific charset & header encoding & body
transfer-encoding are de facto standard for e-mail messages.

--- For example on ja_JP locale, ISO-2022-JP & BASE64 & 7BIT are
commonly used. UTF-8 / QUOTED-PRINTABLE are much less common.
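
To illustrate why ISO-2022-JP pairs naturally with a 7BIT body transfer encoding, here is a minimal Python sketch (not Sympa code; the helper name `is_7bit_clean` and the sample text are made up for illustration). It only checks whether the encoded bytes stay below 0x80.

```python
def is_7bit_clean(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (hypothetical helper)."""
    return all(b < 0x80 for b in data)

text = "日本語のメール"  # sample text, chosen for illustration only

iso_bytes = text.encode("iso2022_jp")  # escape sequences + 7-bit byte pairs
utf8_bytes = text.encode("utf-8")      # multibyte sequences with the high bit set

print(is_7bit_clean(iso_bytes))   # ISO-2022-JP can travel as 7BIT
print(is_7bit_clean(utf8_bytes))  # UTF-8 needs 8BIT, QUOTED-PRINTABLE or BASE64
```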

Do you mean that we should change the default encoding from UTF-8 to ISO-2022-JP?
Note that this encoding is only used for "service messages" sent by Sympa (welcome message, error report, ...).
Are there any problems reading UTF-8 encoded emails with standard mail clients in Japan?

o On the other hand, various charsets are used for Web interface.

--- For ja_JP: EUC-JP, SHIFT_JIS, UTF-8 and also ISO-2022-JP
are used (from a coding point of view, since ISO-2022-JP prevents
HTML-entity escape, 8-bit schemes are preferred).
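
One way to read the remark about HTML-entity escaping: in ISO-2022-JP, the bytes of a double-byte character can coincide with ASCII characters such as "&", so a byte-level escape corrupts the text. The Python sketch below is an illustration under that assumption, not taken from Sympa; the naive `replace` call stands in for any escaper that works on raw bytes.

```python
text = "う"  # HIRAGANA LETTER U, JIS X 0208 code 0x2426
encoded = text.encode("iso2022_jp")

# Between the escape sequences, the character is the byte pair 0x24 0x26,
# which looks exactly like the ASCII characters "$&".
print(encoded)

# A naive byte-level HTML escape rewrites the second byte of the pair,
# so the result no longer represents the original character.
naive = encoded.replace(b"&", b"&amp;")
print(naive)
```

With EUC-JP, SHIFT_JIS or UTF-8, "&" and "<" never appear inside a multibyte character, so this particular collision does not occur.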

We've fixed this problem in the upcoming version (the current development version): all web pages are now recoded to UTF-8, and so are the web archives. We no longer have problems with mixtures of encodings in web pages.
Along with this new version we've added a new sympa.conf parameter that lets the listmaster declare which encoding is used on the filesystem.

o Also for other multibyte / non-Latin charsets, the BASE64 (B)
encoding scheme is preferred or is often the de facto standard.

I assume you are referring to the "service messages" that Sympa sends, because Sympa does not alter messages sent to mailing lists (except when custom_subject is used).
Should Sympa Base64-encode both the message body and the header fields?
What kind of problems occur when Quoted-Printable is used?

o MIME::Words::encode_mimewords() breaks multibyte character
boundaries in encoded headers. cf.:
http://rt.cpan.org/Public/Bug/Display.html?id=13027
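
As an illustration of the boundary problem, the following Python sketch splits a header value into B-encoded words without ever cutting a character in half. It is not the attached patch and not MIME::Words code; the function name `encode_b_words` is hypothetical, and the approach assumes a stateless charset such as UTF-8 (a stateful charset like ISO-2022-JP would need its escape sequences handled per word).

```python
import base64

MAX_ENCODED_WORD = 75  # RFC 2047 limit on the length of one encoded-word

def encode_b_words(text, charset="utf-8"):
    """Split text into B-encoded words on character boundaries (sketch)."""
    prefix = f"=?{charset.upper()}?B?"
    suffix = "?="
    room = MAX_ENCODED_WORD - len(prefix) - len(suffix)
    max_bytes = (room // 4) * 3  # raw bytes that fit once Base64-expanded

    words = []
    chunk = b""
    for ch in text:
        raw = ch.encode(charset)
        # Start a new encoded-word before this character would overflow,
        # so a multibyte character is never split across two words.
        if chunk and len(chunk) + len(raw) > max_bytes:
            words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
            chunk = b""
        chunk += raw
    if chunk:
        words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
    return words

for word in encode_b_words("日本語" * 20):
    print(word)
```

Because each word holds only whole characters, every encoded-word can be decoded on its own, which is what a splitter working on raw bytes cannot guarantee.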

With the attached patch I have tried to solve these problems. Though this
patch can be applied to the current branch, if my attempt agrees with the
policy of Sympa development, I'll continue working on the dev branch.

Since you've developed an alternative to MIME::Words, we'd much prefer that you make it a separate CPAN module that Sympa would use instead of the MIME::Words module. Actually, we gave the same answer to Peter Szabo, who sent us a similar proposal (check https://www.szszi.hu/wiki/Sympa4Patches). He has decided to build a new CPAN module called MIME::AltWords that would fix all the Unicode problems of MIME::Words. I also suggested to him the option of extending Encode::MIME::Header (see http://search.cpan.org/~dankogai/Encode-2.18/lib/Encode/MIME/Header.pm).

Obviously you and Peter have done similar work.
We'd prefer to have the best of both your codes ;-)
Why not work together on this new CPAN module?
Peter is Cc'ed.

Notes on attached patch ---

- On locales where e-mail messages require charset conversion,
gettext(_charset_) should return a locale-targeted charset.
For example, for ja_JP above, this might need to be EUC-JP
(note that the filesystem encoding may differ from this charset).

- The preferred encoding scheme for UTF-8 in header fields varies by
language context. The shorter one will be selected.

- Minimalism: texts containing no non-ASCII characters should be
labeled as US-ASCII / 7BIT.

* This patch is incomplete. Message bodies aren't converted using the
charset/encoding for e-mail: how should I handle message bodies
generated from tt2?
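
The "shorter one" and "minimalism" notes above can be sketched in Python as follows. This is an illustration only, not the patch itself: the names `q_length`, `b_length` and `pick_header_encoding` are hypothetical, and the set of characters left unencoded in Q encoding is the conservative one RFC 2047 allows inside a phrase.

```python
import string

# Bytes that Q encoding may leave as-is inside a phrase (RFC 2047, section 5).
Q_SAFE = set((string.ascii_letters + string.digits + "!*+-/").encode("ascii"))

def q_length(raw: bytes) -> int:
    """Length of the Q-encoded payload: 1 per safe byte or space, else 3."""
    return sum(1 if (b in Q_SAFE or b == 0x20) else 3 for b in raw)

def b_length(raw: bytes) -> int:
    """Length of the Base64-encoded payload, padding included."""
    return 4 * ((len(raw) + 2) // 3)

def pick_header_encoding(text: str, charset: str = "utf-8"):
    """Return (charset, scheme); scheme is None when no encoding is needed."""
    if text.isascii():
        return ("us-ascii", None)  # minimalism: plain ASCII stays unencoded
    raw = text.encode(charset)
    scheme = "Q" if q_length(raw) <= b_length(raw) else "B"
    return (charset, scheme)

print(pick_header_encoding("Welcome"))
print(pick_header_encoding("Bienvenue à la liste"))
print(pick_header_encoding("日本語"))
```

Mostly-Latin text tends to come out shorter in Q, while text made up entirely of multibyte characters comes out shorter in B, which matches the observation that the preferred scheme depends on the language.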

Could you send us a lighter version of your patch, without the new versions of the xx_mimewords() subroutines (cf. the paragraph above)? Thanks, and sorry for creating more work for you.

BTW this must be a FAQ: How should "Sympa" be pronounced, whether
"sympa(-thetic)" in English, "sympa(-thique)" en français, another
or ...everything?

We pronounce it the French way, but most non-French people pronounce it the English way.
What would be the Japanese way? Please send us an MP3...
We might start a collection of MP3s on sympa.org, pronounced in each language.

[sympa-dev] Charset/encoding for e-mail message, Hatuka*nezumi - IKEDA Soji, 09/24/2006
- Re: [sympa-dev] Charset/encoding for e-mail message, Olivier Salaün - CRU, 09/27/2006
  - Re: [sympa-dev] Charset/encoding for e-mail message, Hatuka*nezumi - IKEDA Soji, 09/28/2006
  - Re: Pronounciation of "Sympa" was Re: [sympa-dev] Charset/encoding for e-mail message, Hatuka*nezumi - IKEDA Soji, 09/28/2006
