devel - Re: [sympa-dev] Charset/encoding for e-mail message

Subject: Developers of Sympa

From: Olivier Salaün - CRU <address@concealed>
To: Hatuka*nezumi - IKEDA Soji <address@concealed>
Cc: address@concealed
Subject: Re: [sympa-dev] Charset/encoding for e-mail message
Date: Thu, 19 Oct 2006 17:49:13 +0200

Hatuka*nezumi - IKEDA Soji wrote: [...]

- src/mail.pm:reformat_message(): reformat outgoing service messages (along with charset conversion for Japanese messages).

Does reformat_message() also alter message bodies? There might be
issues with S/MIME signed messages that should not be altered, or the
signature is broken...

I have not checked whether it breaks S/MIME-encrypted data or not
(reformat_message() will be called in mail_file(), just before
sending()).  If it is an issue, bodies in the multipart/signed or
multipart/encrypted parts won't be touched with this change:

We'll probably need to add similar code...thanks for the patch.

--- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 ---
--- src/mail.pm 18 Oct 2006 10:07:46 -0000      1.37
+++ src/mail.pm 18 Oct 2006 12:16:05 -0000
@@ -822,4 +822,5 @@

     my $eff_type = $part->effective_type;
+    return $part if $eff_type =~ m{^multipart/(signed|encrypted)$};

     if ($part->parts) {
--- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---
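
The same guard can be sketched outside Sympa. What follows is a minimal Python illustration, not the Perl code from src/mail.pm: the function name reformat_parts, the transform callback and the sample message are all hypothetical. Only the rule of returning early on multipart/signed and multipart/encrypted comes from the patch above.

```python
from email import message_from_string
from email.message import Message


def reformat_parts(part: Message, transform) -> Message:
    """Apply transform(text) to text/plain leaves, skipping protected subtrees."""
    ctype = part.get_content_type()
    # Signed or encrypted containers are returned untouched: changing any
    # byte inside them would invalidate the signature.
    if ctype in ("multipart/signed", "multipart/encrypted"):
        return part
    if part.is_multipart():
        for child in part.get_payload():
            reformat_parts(child, transform)
    elif ctype == "text/plain":
        part.set_payload(transform(part.get_payload()))
    return part


# Hypothetical sample: one plain part and one signed container.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="outer"\n'
    "\n"
    "--outer\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--outer\n"
    'Content-Type: multipart/signed; boundary="inner";'
    ' protocol="application/pkcs7-signature"\n'
    "\n"
    "--inner\n"
    "Content-Type: text/plain\n"
    "\n"
    "signed text\n"
    "--inner\n"
    "Content-Type: application/pkcs7-signature\n"
    "\n"
    "c2lnbmF0dXJl\n"
    "--inner--\n"
    "--outer--\n"
)

msg = reformat_parts(message_from_string(RAW), str.upper)
plain_part, signed_part = msg.get_payload()
```

Here the outer text/plain part is upper-cased, while everything under the signed container keeps its original content.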

[...]

  * To support charset conversion for Japanese messages, _charset_ of po/ja.po must be changed from UTF-8 to EUC-JP (how about Rosetta stuff?).

[...]

Anyway I tried transcoding ja.po from utf-8 to euc-jp using iconv,
but without success :

    % iconv -f utf-8 -t eucJP -o /tmp/ja.po po/ja.po

The problems I got were at PO catalog compile time :

<<snip>>

Therefore I aborted the process. If you have a clue of what the problem 
might be...

    % iconv -f utf-8 -t eucJP /tmp/ja.po | sed -e 's/; charset=UTF-8/; charset=EUC-JP/i' > po/ja.po

will give the desired result.  I attached the result, with translations
revised by myself (this may be useful for the tests discussed
below).
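
For readers without iconv and sed at hand, the two steps of that pipeline (transcode the bytes, then rewrite the charset declaration in the PO header) can be sketched in Python. This is only an illustration: transcode_po and the sample catalog are hypothetical, and "euc_jp" is Python's codec name for EUC-JP.

```python
import re


def transcode_po(po_bytes: bytes, src: str = "utf-8", dst: str = "euc_jp",
                 dst_label: str = "EUC-JP") -> bytes:
    """Transcode a PO catalog and rewrite its declared charset to match."""
    text = po_bytes.decode(src)
    # Counterpart of the sed expression: case-insensitive rewrite of the
    # "; charset=UTF-8" declaration. re.sub replaces every match, whereas
    # the sed command above replaces the first match on each line.
    text = re.sub(r"; charset=UTF-8", "; charset=" + dst_label, text,
                  flags=re.IGNORECASE)
    return text.encode(dst)


# Hypothetical two-entry catalog, stored as UTF-8.
SAMPLE_PO = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=utf-8\\n"\n'
    "\n"
    'msgid "Japanese"\n'
    'msgstr "日本語"\n'
).encode("utf-8")

converted = transcode_po(SAMPLE_PO)
```

Transcoding without updating the declaration (or the reverse) leaves a catalog whose header contradicts its bytes, so the two steps belong together.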

I also did (manually) change the charset in the PO files.
Strangely your ja.po file compiles perfectly ; I'll commit it in CVS, thanks.

[...]

Problem ---

In addition to the patch described above, I tried replacing
    MIME::Words::decode_mimewords(STRING)
with
    MIME::EncWords::decode_mimewords(STRING, Charset=>CHARSET).

But I cannot determine which CHARSET should be used to feed decoded data to TT2 templates.  With any charset (including _UNICODE_), TT2 seems to corrupt the data it is fed.  See bug #1059.
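
The idea behind passing a Charset argument, namely decoding each encoded-word according to its own label and then handing everything on in one single charset, can be sketched with Python's standard library. This is not MIME::EncWords; decode_mimewords_to is a hypothetical helper, and treating unlabelled chunks as ASCII is an assumption.

```python
from email.header import decode_header


def decode_mimewords_to(value: str, charset: str = "utf-8") -> bytes:
    """Decode RFC 2047 encoded-words and re-encode the result in one charset."""
    pieces = []
    for chunk, chunk_charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Assumption: chunks without a charset label are plain ASCII.
            chunk = chunk.decode(chunk_charset or "ascii", errors="replace")
        pieces.append(chunk)
    return "".join(pieces).encode(charset)


latin1_subject = decode_mimewords_to("=?ISO-8859-1?Q?Mod=E9rateurs?= de la liste")
eucjp_subject = decode_mimewords_to("=?EUC-JP?B?xvzL3Ljs?=")
```

Whatever consumes the result then has to be told which charset was chosen; the symptoms described in this thread are what happens when that second half is missing.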

Can you provide a bit more explanations regarding this problem ?

Strings used for interpolation in TT2 are interpreted as if they
were encoded in ISO-8859-1.  However curious this may be ---

- When a byte string ``é'' (latin small letter e with acute) encoded
  by UTF-8, "\xC3\xA9", is fed to TT2, output contains ``Ã©'',
  "\xC3\x83\xC2\xA9" (UTF-8 representation of ISO-8859-1
  interpretation of "\xC3\xA9").
- When a Unicode string ``é'', "\x{00E9}", is fed, output contains
  ``Ã©'', "\x{00C3}\x{00A9}" (Perl internal representation of
  ISO-8859-1 interpretation of the Unicode string with utf8 flag
  forced to be off).
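
The byte sequences in the first case can be reproduced in a few lines of Python. This only demonstrates the general "UTF-8 read as ISO-8859-1" effect; it says nothing about where inside TT2 or Sympa the misreading happens.

```python
# UTF-8 bytes of U+00E9, as fed to the template engine.
utf8_bytes = "é".encode("utf-8")            # b"\xc3\xa9"

# Each byte is taken for one ISO-8859-1 character ...
misread = utf8_bytes.decode("iso-8859-1")   # U+00C3 U+00A9

# ... and the result is written out as UTF-8 again.
double_encoded = misread.encode("utf-8")    # b"\xc3\x83\xc2\xa9"

# As long as no byte was lost, the damage is reversible.
repaired = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
```

The second case ends in the same two characters, U+00C3 U+00A9, which is what one would expect if the internal UTF-8 bytes of the string are read one by one as ISO-8859-1.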

I'm wondering if your problem might be related to something I fixed yesterday in the CVS HEAD :

The logging subroutine (do_log()) recodes its parameters from UTF-8 to the filesystem_encoding. (This is required because syslogd does not seem to cope well with UTF-8.) I found out, while applying your patch, that do_log() was not only recoding the values of the parameters but also the variables themselves. I fixed this. Therefore can you try with the latest CVS HEAD before we continue investigating this topic ?
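
The difference between recoding the values and recoding the caller's variables can be illustrated with a loose Python analogy. The function names are hypothetical and this is not the do_log() code; it only contrasts a function that overwrites what the caller passed in with one that works on a copy.

```python
def recode_in_place(params, encoding="iso-8859-1"):
    """Buggy variant: overwrites the caller's own list entries."""
    for i, value in enumerate(params):
        params[i] = value.encode(encoding, errors="replace")
    return params


def recoded_copy(params, encoding="iso-8859-1"):
    """Fixed variant: builds a new list and leaves the caller's data alone."""
    return [value.encode(encoding, errors="replace") for value in params]


caller_args = ["somelist", "Modérateurs"]
for_syslog = recoded_copy(caller_args)       # caller_args still holds text

clobbered_args = ["somelist", "Modérateurs"]
recode_in_place(clobbered_args)              # caller now holds recoded bytes
```

With the in-place variant, any code that keeps using the variables after the log call works on recoded data, which would then surface as garbled text far away from the logging code.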

[...]

The problems are ---

(a) Not reproduced on:
  Web:
    - List subject.
  INFO Service message:
    - List subject.

(b) Reproduced on:
  INFO Service message:
    - Description of list.

I was not able to reproduce this problem ; but maybe I need to try with non-ISO-8859-1 data...

  Web:
    - Help pages installed into web_tt2/ja_JP/ (UTF-8 is used).

  Afterwards, I made a quick hack on src/tt2.pl (see the attached
  patch).  After that, this no longer seems to be reproduced.

If the problem persists, please provide us with a step-by-step way to reproduce it.

(c) The following seem to be compounded by other factors; they are
  encoded in the charset returned by gettext("_charset_") and then
  interpreted as ISO-8859-1:

  Web:
    - Language names in language box.
    - Dropdown box of "digest" parameter.
    - Perhaps anywhere a strftime()'ed date appears.
  INFO Service message:
    - Days of Digest.

  For example ``日本語'' is shown as ``ÆüËÜ¸ì''
  ("\xC6\xFC\xCB\xDC\xB8\xEC" read as EUC-JP and as ISO-8859-1,
  respectively) in the language box.  ``Español'' is truncated
  to ``Espa''.
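
The language-box example can be checked the same way. In this Python sketch the helper misread is hypothetical; it encodes text in the charset actually used and decodes it in the charset wrongly assumed.

```python
def misread(text: str, actual: str, assumed: str) -> str:
    """Return what `text` looks like when its bytes are read in the wrong charset."""
    return text.encode(actual).decode(assumed)


# Bytes produced for the Japanese language name under EUC-JP.
eucjp_bytes = "日本語".encode("euc_jp")

# What a page declared as ISO-8859-1 would display for those bytes.
shown = misread("日本語", "euc_jp", "iso-8859-1")

# Reading the very same bytes as EUC-JP again restores the name.
restored = shown.encode("iso-8859-1").decode("euc_jp")
```

This covers only the substitution of characters; it does not explain the truncation of ``Español'', whose cause is not identified in this message.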

I'll try to find out what is causing this...

Other known bugs ---

- When address headers of service messages include non-ASCII characters, the headers will be encoded incorrectly.

  Structured headers (address fields, parenthesized comments, parameters,...) should be handled
  separately by appropriate functions.

I'm afraid I don't understand what you mean ?

I mean that, for example, a header:

  To: Modérateurs de la liste somelist <address@concealed>

will be encoded as:

  To: =?ISO-8859-1?Q?Mod=E9rateurs_de_la_liste_somelist_<somelist-editor@so?=
   =?ISO-8859-1?Q?me.dom.ain>?=

N.B.: This result _would be_ MIME-compliant if it were _not_ a structured
  header field.  Though the original MIME::Words takes care of natural
  word separators (i.e. spaces), such separators are not
  necessarily obvious in non-word-spacing languages (CJK, Thai, ...).

In TT2 templates, this will be avoided by the attached (second) patch,
but this may not be a general solution.

Another solution is to put the [% FILTER qencode %] at the right place in the TT2 files, example below :

To: [% FILTER qencode %][%|loc(list.name)%]Moderators of list %1[%END%][%END%] <[% list.name %]-editor@[% list.host %]>

I've fixed the mail_tt2 files according to this.
I don't know if we still need your patch...

I believe that
structured header fields in general need to be parsed/constructed
by other functions, not just by ones that process B/Q encodings.
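
What "constructed by other functions" can look like is sketched below with Python's standard library, which builds an address field from its parts and encodes only the display name. This is an illustration of the principle, not a proposal for Sympa's Perl code; the address is the one visible in the encoded example above.

```python
from email.header import decode_header
from email.utils import formataddr, parseaddr

# Build the field from (display name, address): only the phrase becomes an
# encoded-word, the angle-bracketed address is left as plain ASCII.
to_value = formataddr(
    ("Modérateurs de la liste somelist", "somelist-editor@some.dom.ain"),
    charset="iso-8859-1",
)

# A receiving parser can still separate the two components ...
display, address = parseaddr(to_value)

# ... and decode the phrase on its own.
raw_phrase, phrase_charset = decode_header(display)[0]
phrase = raw_phrase.decode(phrase_charset)
```

The point is the order of operations: the field is split into its syntactic components first, and B/Q encoding is applied per component, never to the whole field body at once.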

What other solution do you propose ?

All modifications described in this message are compiled into the
last attachment.  An actual (maybe tentative) installation is
running here: http://sympa.nezumi.nu/sympa

I will reply to the remainder of your message later.  Thanks again.

We'll have a close look at the patches you provided, thanks.

BTW : In your previous message, you reported a problem related to duplicated header fields. I found out that the problem was related to our mail::mail_file() subroutine incorrectly detecting folded header fields. Here is the patch : http://sourcesup.cru.fr/cgi/viewcvs.cgi/sympa/src/mail.pm?r1=1.37&r2=1.38&makepatch=1&diff_format=u
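
For reference, the rule for recognising a folded header field is that a line starting with a space or a tab continues the previous field. The Python sketch below shows that rule only; unfold_headers and the sample header block are hypothetical, and the actual change to mail::mail_file() is the one in the linked patch.

```python
from typing import List


def unfold_headers(header_block: str) -> List[str]:
    """Join continuation lines (leading space or tab) onto the previous field."""
    fields: List[str] = []
    for line in header_block.splitlines():
        if line[:1] in (" ", "\t") and fields:
            # Folded line: part of the previous field, not a new one.
            # The leading whitespace is kept; only the line break goes away.
            fields[-1] += line
        else:
            fields.append(line)
    return fields


RAW_HEADERS = (
    "To: =?ISO-8859-1?Q?Mod=E9rateurs?=\n"
    " =?ISO-8859-1?Q?_de_la_liste?= <somelist-editor@some.dom.ain>\n"
    "Subject: test\n"
    "\tcontinued\n"
    "X-Loop: somelist"
)

fields = unfold_headers(RAW_HEADERS)
```

A parser that misses this rule treats each continuation line as a separate field, which is one way duplicated or mangled header fields can appear.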
