devel - Re: [sympa-dev] Charset/encoding for e-mail message

Subject: Developers of Sympa

From: Hatuka*nezumi - IKEDA Soji <address@concealed>
To: Olivier Salaün - CRU <address@concealed>
Cc: address@concealed, address@concealed
Subject: Re: [sympa-dev] Charset/encoding for e-mail message
Date: Thu, 28 Sep 2006 19:27:37 +0900

On Wed, 27 Sep 2006 16:55:16 +0200
Olivier Salaün - CRU <address@concealed> wrote:

> Note that this encoding is only used for "service messages" sent by
> Sympa (welcome message, error report,...)

That's right. My patch affects only the "service messages" generated
by Sympa. Headers and bodies of messages from users are left intact
(except for custom_subject).

ISO-2022-JP should not be used for internal processing: it would
simply cause other difficulties.

> Are there any problems to read UTF-8 encoded emails with standard mail
> clients in Japan ?

This is a frequently asked question, and the answer is yes, there are.

Particularly in CJK locales, e-mail clients are often expected to
support, at a minimum, only the "legacy" character sets (e.g. GB2312,
BIG5, ISO-2022-JP, EUC-KR). Long-established services (like
e-mail) tend to require "legacy" charsets; newer services
(like WWW) use various charsets including UTF-8.

One reason for this situation, I guess, is that the character sets
for CJK languages are large (thousands, or in some cases tens of
thousands, of characters), and there is no completely obvious mapping
between them and Unicode that all vendors agree on (the mappings are
not algorithmic, as the one for ISO-8859-1 is).
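
To illustrate the mapping disagreement, here is a minimal Python sketch. It is not from the original discussion; it assumes the two Shift_JIS mapping tables shipped with Python (`shift_jis`, based on JIS X 0208, and `cp932`, Microsoft's vendor variant) and decodes the same two bytes with each.

```python
# Illustration only: one Shift_JIS byte pair, two vendor mapping tables.
raw = b"\x81\x60"

as_jis = raw.decode("shift_jis")   # JIS X 0208 based table
as_cp932 = raw.decode("cp932")     # Microsoft's variant of the table

print(f"shift_jis -> U+{ord(as_jis):04X}")
print(f"cp932     -> U+{ord(as_cp932):04X}")
print("same Unicode character:", as_jis == as_cp932)
```

The two tables yield different Unicode code points for the same bytes, so a round trip through Unicode between differently configured systems can silently change the character.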

So one particular "legacy" charset is often mandatory, while MUAs do
not necessarily support UTF-8.

Perhaps this situation will eventually change, if we can wait
several years or more.

> > o On the other hand, various charsets are used for Web interface.
> >
> > --- For ja_JP: EUC-JP, SHIFT_JIS, UTF-8 and also ISO-2022-JP
> > are used (from coding view, since ISO-2022-JP prevents
> > HTML-entity escape, 8-bit schema are preferred).
> >
> We've fixed this problem in the version to come (current development
> version) : all web pages are now recoded to UTF-8, so are web archives.
> We don't have problem anymore with mixtures of encodings in web pages.
> Along with this new version we've added a new sympa.conf parameter for
> the listmaster to declare what encoding is used on the filesystem.

I have decided to work on the development version from now on.

> > o Also for other multibyte / non-Latin charsets, BASE64 (B)
> > encoding scheme is preferred or often de facto.
> >
> I assume you refer to "service messages" that Sympa sends because Sympa
> does not alter messages sent to mailing lists (except when
> custom_subject is used).
> Should Sympa Base64-encode both message body and header fields ?
> What kind of problems happen if using Quoted-Printable ?

MIME (RFC 2047) says that non-ASCII text in header fields must be
encoded (using either BASE64 or QUOTED-PRINTABLE).
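
As a minimal sketch of this rule, assuming Python's standard `email.header` module (not the code used by Sympa or by the patch under discussion), a Japanese subject can be turned into an ASCII-only header value and decoded again. The sample subject string is made up for illustration.

```python
from email.header import Header, decode_header

# Hypothetical subject line, used only for this illustration.
subject = "日本語の件名"

# Encode the header value using the ISO-2022-JP charset.
encoded = Header(subject, charset="iso-2022-jp").encode()
print(encoded)

# Decode it again to confirm nothing was lost.
decoded = "".join(
    part.decode(charset) if isinstance(part, bytes) else part
    for part, charset in decode_header(encoded)
)
print(decoded == subject)
```

Python's library happens to choose the BASE64 ("B") form for ISO-2022-JP headers, which matches the preference described in this thread.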

For message bodies, whether encoding is recommended depends on the
charset; for example, ISO-2022-JP bodies are recommended to be sent
unencoded (7BIT). Bodies in 8-bit charsets might need to be
encoded for compatibility.
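
A small Python sketch, not part of the original message, shows why ISO-2022-JP bodies can go out as 7BIT while other charsets may need a transfer encoding. The helper `is_7bit` and the sample text are assumptions made for this illustration; the `Charset` lookup only reports the defaults of Python's own `email` package.

```python
from email.charset import Charset, BASE64

def is_7bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (hypothetical helper)."""
    return all(byte < 0x80 for byte in data)

text = "こんにちは"  # sample body text, chosen for illustration

iso2022jp_bytes = text.encode("iso-2022-jp")
utf8_bytes = text.encode("utf-8")

print("ISO-2022-JP is 7-bit:", is_7bit(iso2022jp_bytes))
print("UTF-8 is 7-bit:      ", is_7bit(utf8_bytes))

# Python's email package uses matching defaults: no body transfer
# encoding for ISO-2022-JP, BASE64 for UTF-8.
print(Charset("iso-2022-jp").body_encoding is None)
print(Charset("utf-8").body_encoding == BASE64)
```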

Answer to the second question:

An encoded string consisting mainly of non-Latin characters is shorter
in BASE64 than in QUOTED-PRINTABLE (either way, the encoded string won't
be human-readable). So implementations for non-Latin languages
usually use BASE64 (when encoding is needed).
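
The size difference can be checked with a short Python sketch. It assumes an 8-bit charset (EUC-JP is picked here purely as an example) and a made-up sample string; neither comes from the original message.

```python
import base64
import quopri

# Sample text in an 8-bit charset; both choices are assumptions.
text = "メーリングリストへようこそ"
raw = text.encode("euc-jp")

b64 = base64.b64encode(raw)    # roughly 4 output characters per 3 bytes
qp = quopri.encodestring(raw)  # 3 output characters per 8-bit byte

print("raw bytes:       ", len(raw))
print("BASE64:          ", len(b64))
print("QUOTED-PRINTABLE:", len(qp))
```

For text made up almost entirely of 8-bit bytes, QUOTED-PRINTABLE roughly triples the size while BASE64 adds about one third.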

Since this is fairly common practice, QUOTED-PRINTABLE in Japanese
messages is not readable by some (unsophisticated) MUAs, and is sometimes
even treated as a sign of spam by mail scanners.

> Since you've developed an alternative to MIME::Words, we'd much prefer
> that you make it a separate CPAN module that Sympa would use instead of
> the MIME::Words module. Actually we gave the same answer to Peter Szabo
> who sent us a similar proposition (check
> https://www.szszi.hu/wiki/Sympa4Patches). He has decided to build a new
> CPAN module called MIME::AltWords that would fix all the unicode
> problems of MIME::Words. I also suggested to him the option of extending
> Encode::MIME::Header (see
> http://search.cpan.org/~dankogai/Encode-2.18/lib/Encode/MIME/Header.pm).
>
> Obviously Peter and you have done similar work.
> We'd prefer having the best of both your codes ;-)
> Why not work together on this new CPAN module ?
> Peter is Cced.

Gosh, have I been re-inventing the wheel?

> Could you send us a lighter version of your patch without the new
> versions of xx_mimewords() subroutine (cf above paragraph). Thanks and
> sorry for making you more work.

With pleasure!

--- nezumi

[sympa-dev] Charset/encoding for e-mail message, Hatuka*nezumi - IKEDA Soji, 09/24/2006
- Re: [sympa-dev] Charset/encoding for e-mail message, Olivier Salaün - CRU, 09/27/2006
  - Re: [sympa-dev] Charset/encoding for e-mail message, Hatuka*nezumi - IKEDA Soji, 09/28/2006
  - Re: Pronounciation of "Sympa" was Re: [sympa-dev] Charset/encoding for e-mail message, Hatuka*nezumi - IKEDA Soji, 09/28/2006
