devel - Re: [sympa-dev] Charset/encoding for e-mail message

Subject: Developers of Sympa

From: Hatuka*nezumi - IKEDA Soji <address@concealed>
To: Olivier Salaün - CRU <address@concealed>
Cc: address@concealed
Subject: Re: [sympa-dev] Charset/encoding for e-mail message
Date: Thu, 19 Oct 2006 19:57:43 +0900

On Wed, 18 Oct 2006 12:09:36 +0200
Olivier Salaün - CRU <address@concealed> wrote:

> We have applied, tested and eventually committed your patch in the dev
> CVS branch.

Thanks for your quick feedback!

> The 'A' (standing for Automatic, I suppose) is definitely a good idea
> since the choice of using either Base64 or Quoted-Printable highly
> depends on the language used. The 'A' option is not documented in your
> module CPAN pages though.

perldoc MIME::EncWords
| =item encode_mimewords RAW, [OPTS]
:
| =item Encoding
:
| You may also specify ``special'' values: C<"a"> will automatically choose
| recommended encoding to use (with charset conversion if alternative
| charset is recommended: see L<MIME::Charset>);
| C<"s"> will choose shorter one of either C<"q"> or C<"b">.

> > - src/mail.pm:reformat_message(): reformat outgoing service messages
> > (along with charset conversion for Japanese messages).
> >
> Does reformat_message() also alter message bodies? There might be
> issues with S/MIME signed messages, which should not be altered, or the
> signature is broken...

I haven't checked whether it breaks S/MIME data or not
(reformat_message() is called in mail_file(), just before
sending()). If it is an issue, the following change leaves bodies in
multipart/signed or multipart/encrypted parts untouched:

--- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 ---
--- src/mail.pm 18 Oct 2006 10:07:46 -0000 1.37
+++ src/mail.pm 18 Oct 2006 12:16:05 -0000
@@ -822,4 +822,5 @@

my $eff_type = $part->effective_type;
+ return $part if $eff_type =~ m{^multipart/(signed|encrypted)$};

if ($part->parts) {
--- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---
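
The effect of that early return can be sketched outside Sympa. The following is only an illustration in Python (not Sympa's Perl code): the function name parts_to_reformat and the sample message are hypothetical, and only the content-type pattern is taken from the patch above.

```python
import re
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Same pattern as in the patch: these containers must stay byte-for-byte intact.
PROTECTED = re.compile(r"^multipart/(signed|encrypted)$")

def parts_to_reformat(part):
    """Return the leaf parts that a reformatting pass would be allowed to touch."""
    if PROTECTED.match(part.get_content_type()):
        return []  # skip the whole subtree, as the early return does
    if part.is_multipart():
        leaves = []
        for sub in part.get_payload():
            leaves.extend(parts_to_reformat(sub))
        return leaves
    return [part]

# Hypothetical message: one plain notice plus one signed container.
signed = MIMEMultipart("signed")
signed.attach(MIMEText("signed body", "plain", "utf-8"))
outer = MIMEMultipart("mixed")
outer.attach(MIMEText("service notice", "plain", "utf-8"))
outer.attach(signed)

touched = [p.get_payload(decode=True) for p in parts_to_reformat(outer)]
print(touched)
```

Only the plain notice is offered for reformatting; nothing below the multipart/signed container is visited.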

> I had to replace calls to croak() with proper error handling; those were
> not acceptable within a daemon.

It was my corner-cutting... Errors should reach the users.

> > * To support charset conversion for Japanese messages, _charset_ of
> > po/ja.po must be changed from UTF-8 to EUC-JP (how about Rosetta stuff?).
> >
> You're right, the PO file should be transcoded to EUC-JP.
> Rosetta is one option to translate Sympa GUI, other options include
> using other software such as Kbabel. And actually we think about
> providing a translation service (vs software) on our sympa.org website.
> It would be based on a home-made software or on Pootle.
>
> Anyway I tried transcoding ja.po from utf-8 to euc-jp using iconv,
> but without success:
>
> % iconv -f utf-8 -t eucJP -o /tmp/ja.po po/ja.po
>
> The problems I got were at PO catalog compile time:
<<snip>>
> Therefore I aborted the process. If you have a clue of what the problem
> might be...

% iconv -f utf-8 -t eucJP /tmp/ja.po | sed -e 's/; charset=UTF-8/; charset=EUC-JP/i' > po/ja.po

will give the desired result. I attached the result, with translations
revised by myself (this may be useful for the tests discussed
below).
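
The two steps in that pipeline (convert the bytes, then fix the charset declared in the PO header so that it matches) can be sketched as follows. This is a minimal Python illustration, not part of Sympa; the function name transcode_po and the sample catalog entry are made up for the example.

```python
import re

def transcode_po(utf8_bytes):
    """Convert PO catalog bytes from UTF-8 to EUC-JP and fix the declared charset."""
    text = utf8_bytes.decode("utf-8")
    # Same substitution as the sed expression above, case-insensitive.
    text = re.sub(r"; charset=UTF-8", "; charset=EUC-JP", text, flags=re.IGNORECASE)
    return text.encode("euc_jp")

# Hypothetical two-entry catalog: the header entry and one translated string.
sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    '\n'
    'msgid "_language_"\n'
    'msgstr "日本語"\n'
).encode("utf-8")

converted = transcode_po(sample)
print(converted)
```

If only the bytes are converted and the header still declares UTF-8, the catalog contents and its declaration disagree, which is one plausible source of compile-time complaints.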

> > - src/Message.pm:new(), src/List.pm:distribute_msg():
> > custom_subject processing was improved. It will handle mixed-charset
> > situations better.
> >
> I noticed that : it's great.
> BTW it looks like the way you call MIME::EncWords::encode_mimewords() is
> not documented on CPAN, ie :
>
> MIME::EncWords::encode_mimewords([
> [$s1, $enc1],
> [$s2, $enc2]
> ], Encoding=>'A', Field=>'Subject');

| =item encode_mimewords RAW, [OPTS]
:
| RAW may be a Unicode string when Unicode/multibyte support is enabled
| (see L<MIME::Charset/USE_ENCODE>).
| Furthermore, RAW may be a reference to that returned
| by L<"decode_mimewords"> on array context. In latter case "Charset"
| option (see below) will be overridden.

I will add a note about this in the next release...
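
For readers unfamiliar with the idea of passing (string, charset) pairs, here is a rough analogue using Python's standard email.header.Header, which also accepts segments with different charsets. It is not MIME::EncWords and does not reproduce its 'A' option; the segment texts are invented for the example.

```python
from email.header import Header

# Hypothetical mixed-charset subject: a Latin-1 segment and a Japanese one.
segments = [
    ("Réunion ", "iso-8859-1"),
    ("日本語", "iso-2022-jp"),
]

header = Header(header_name="Subject")
for text, charset in segments:
    # Each segment keeps its own charset instead of forcing a single one.
    header.append(text, charset)

encoded = header.encode()
print(encoded)
```

Each segment becomes its own encoded-word, so neither charset has to be able to represent the other segment's characters.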

> > Problem ---
> >
> > In addition to the patch described above, I tried replacing
> > MIME::Words::decode_mimewords(STRING)
> > with
> > MIME::EncWords::decode_mimewords(STRING, Charset=>CHARSET).
> >
> > But I cannot work out which CHARSET should be used to feed decoded data
> > to TT2 templates. With any charset (including _UNICODE_), TT2 seems to
> > corrupt the data fed to it. See bug #1059.
> >
> Can you provide a bit more explanations regarding this problem ?

Strings used for interpolation in TT2 are interpreted as if they
were encoded in ISO-8859-1. Curious as it is, this is what happens ---

- When a byte string ``é'' (latin small letter e with acute) encoded
by UTF-8, "\xC3\xA9", is fed to TT2, output contains ``Ã©'',
"\xC3\x83\xC2\xA9" (UTF-8 representation of ISO-8859-1
interpretation of "\xC3\xA9").
- When a Unicode string ``é'', "\x{00E9}", is fed, output contains
``Ã©'', "\x{00C3}\x{00A9}" (Perl internal representation of
ISO-8859-1 interpretation of the Unicode string with utf8 flag
forced to be off).
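
The byte sequences in the first case can be reproduced with a few lines of Python. This only illustrates the misinterpretation itself (UTF-8 bytes read as ISO-8859-1 and encoded to UTF-8 again); it says nothing about where inside TT2 or Perl the step happens.

```python
# UTF-8 bytes of "é", as fed to the template engine.
raw = "é".encode("utf-8")

# The bytes are wrongly read as ISO-8859-1: one character per byte.
misread = raw.decode("iso-8859-1")

# Writing that string out as UTF-8 gives the doubly encoded result.
output = misread.encode("utf-8")

print(raw, misread, output)
```

The second case is the same mistake seen from the other side: the two characters of the misread string are exactly U+00C3 and U+00A9.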

Environment ---

Perl 5.8.5
Template-Toolkit 2.15
MIME::Charset 0.04.1
MIME::EncWords 0.03
Settings in sympa.conf:
lang ja_JP
filesystem_encoding utf-8

Modifications ---

I replaced
MIME::Words::decode_mimewords(STRING)
by
MIME::EncWords::decode_mimewords(STRING, Charset=>'utf8')
on CVS HEAD.

The problems are ---

(a) Not reproduced on:
Web:
- List subject.
INFO Service message:
- List subject.

(b) Reproduced on:
INFO Service message:
- Description of list.
Web:
- Help pages installed into web_tt2/ja_JP/ (UTF-8 is used).

Afterwards, I made a quick hack on src/tt2.pl (see the attached
patch). With it, these no longer seem to be reproduced.

(c) The following seem to be a compound of other factors; they are
encoded in the charset obtained from gettext("_charset_") and then
interpreted as ISO-8859-1:

Web:
- Language names in language box.
- Dropdown box of "digest" parameter.
- Perhaps anywhere strftime()'ed date appear.
INFO Service message:
- Days of Digest.

For example ``日本語'' is shown as ``ÆüËÜ¸ì''
("\xC6\xFC\xCB\xDC\xB8\xEC" in EUC-JP and ISO-8859-1,
respectively) in the language box. ``Español'' is truncated
to ``Espa''.
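
The Japanese example can be checked byte for byte. The Python sketch below only reproduces the EUC-JP-read-as-ISO-8859-1 step; it does not attempt to explain the ``Español'' truncation.

```python
label = "日本語"

# Encoded in the catalog charset (EUC-JP in this case).
euc_bytes = label.encode("euc_jp")

# The same bytes displayed as if they were ISO-8859-1.
shown = euc_bytes.decode("iso-8859-1")

print(euc_bytes, shown)
```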

> > Other known bugs ---
> >
> > - When address headers of service messages include non-ASCII characters,
> > headers will be encoded incorrectly.
> >
> > It is advisable that structured headers (address fields, parenthesized
> > comments, parameters, ...) be handled
> > separately by some appropriate functions.
> >
> I'm afraid I don't understand what you mean ?

I mean that, for example, a header:

To: Modérateurs de la liste somelist <address@concealed>

will be encoded as:

To: =?ISO-8859-1?Q?Mod=E9rateurs_de_la_liste_somelist_<somelist-editor@so?=
=?ISO-8859-1?Q?me.dom.ain>?=

N.B.: This result _would be_ MIME-compliant if the field were _not_ a
structured header field. Although the original MIME::Words splits on
natural word separators (i.e. spaces), such separators are not
necessarily obvious in languages written without word spacing (CJK, Thai, ...).

In TT2 templates, this is avoided by the attached (second) patch,
but it may not be a general solution. I believe that structured
header fields in general need to be parsed and constructed by separate
functions, not only by ones that process B/Q encodings.
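
What "handled separately" means for an address field can be sketched with Python's standard email.utils.formataddr, which encodes only the display name and leaves the address itself outside the encoded-word. This is an illustration in another language, not a proposal for Sympa's code; the address is the one visible in the encoded example above.

```python
from email.utils import formataddr

display_name = "Modérateurs de la liste somelist"
address = "somelist-editor@some.dom.ain"

# Only the display name is MIME-encoded; the angle-bracketed address stays literal.
field = formataddr((display_name, address), charset="iso-8859-1")

print("To: " + field)
```

A mail client can then still parse the address, because the angle brackets and the addr-spec are never hidden inside an encoded-word.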

*

All modifications described in this message are compiled into the
last attachment. An actual (maybe tentative) installation is
running here: http://sympa.nezumi.nu/sympa

I will reply to the remainder of your message later. Thanks again.

--- nezumi

Attachment: ja.po.gz
Description: GNU Zip compressed data

Attachment: sympa-MAIN-20061015-tt2_utf8.patch
Description: Binary data

Attachment: sympa-MAIN-20061018-tt2_qencode.patch
Description: Binary data

Attachment: sympa-MAIN-20061019-mail_encoding_suppl.patch
Description: Binary data
