
devel - Re: [sympa-dev] Charset/encoding for e-mail message

Subject: Developers of Sympa

  • From: Olivier Salaün - CRU <address@concealed>
  • To: Hatuka*nezumi - IKEDA Soji <address@concealed>
  • Cc: address@concealed
  • Subject: Re: [sympa-dev] Charset/encoding for e-mail message
  • Date: Thu, 19 Oct 2006 17:49:13 +0200

Hatuka*nezumi - IKEDA Soji wrote: [...]
- src/mail.pm:reformat_message(): reformat outgoing service messages (along with charset conversion for Japanese messages).
      
Does reformat_message() also alter message bodies? There might be 
issues with S/MIME signed messages, which must not be altered, 
otherwise the signature is broken...
    
I haven't checked whether it breaks S/MIME-encrypted data or not
(reformat_message() will be called in mail_file(), just before
sending()).  If it is an issue, the following patch leaves bodies in
multipart/signed or multipart/encrypted parts untouched:
  
We'll probably need to add similar code...thanks for the patch.
--- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 --- >8 ---
--- src/mail.pm 18 Oct 2006 10:07:46 -0000      1.37
+++ src/mail.pm 18 Oct 2006 12:16:05 -0000
@@ -822,4 +822,5 @@

     my $eff_type = $part->effective_type;
+    return $part if $eff_type =~ m{^multipart/(signed|encrypted)$};

     if ($part->parts) {
--- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---
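
The effect of the one-line patch above can be sketched outside Sympa. This is a minimal Python illustration using the standard `email` package, not the Perl code in src/mail.pm; `reformat_part`, `PROTECTED` and the sample message are hypothetical names and data made up for the example. It assumes the goal is only to stop descending into signed or encrypted subtrees.

```python
from email import message_from_string
from email.message import Message

# Content types whose subtrees must be passed through byte-for-byte.
PROTECTED = {"multipart/signed", "multipart/encrypted"}

def reformat_part(part: Message, touched: list) -> Message:
    """Walk a MIME tree, leaving signed/encrypted subtrees alone."""
    if part.get_content_type() in PROTECTED:
        return part  # same early return as in the patch
    if part.is_multipart():
        for sub in part.get_payload():
            reformat_part(sub, touched)
    else:
        # Stand-in for the real work (charset conversion etc.).
        touched.append(part.get_content_type())
    return part

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: text/plain; charset="utf-8"

service text
--outer
Content-Type: multipart/signed; protocol="application/pkcs7-signature"; boundary="inner"

--inner
Content-Type: text/plain

signed text
--inner
Content-Type: application/pkcs7-signature

(signature placeholder)
--inner--
--outer--
"""

msg = message_from_string(RAW)
touched = []
reformat_part(msg, touched)
print(touched)
```

Only the outer text/plain part is visited; the text/plain part nested inside multipart/signed is never reached.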
  
[...]
  * To support charset conversion for Japanese messages, _charset_ of po/ja.po must be changed from UTF-8 to EUC-JP (how about Rosetta stuff?).
      
[...]

Anyway, I tried transcoding ja.po from UTF-8 to EUC-JP using iconv, 
but without success:

    % iconv -f utf-8 -t eucJP -o /tmp/ja.po po/ja.po

The problems I got occurred at PO catalog compile time:
    
<<snip>>
  
Therefore I aborted the process. If you have a clue of what the problem 
might be...
    
% iconv -f utf-8 -t eucJP /tmp/ja.po | sed -e 's/; charset=UTF-8/; charset=EUC-JP/i' > po/ja.po

will give the desired result.  I attached the result, with translations
revised by myself (this may be useful for the tests discussed
below).
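
The point of the iconv | sed pipeline is that both the bytes and the charset declared in the PO header have to change together. A minimal Python sketch of the same two steps follows, under stated assumptions: `po_utf8_to_eucjp` is a hypothetical helper, the sample catalog is made up, and only the first charset declaration in the file (assumed to be the header) is rewritten.

```python
import re

def po_utf8_to_eucjp(po_bytes: bytes) -> bytes:
    """Transcode a PO catalog from UTF-8 to EUC-JP and fix its header."""
    text = po_bytes.decode("utf-8")
    # Rewrite the declared charset, case-insensitively like sed's /i flag.
    text = re.sub(r"; charset=utf-8", "; charset=EUC-JP", text,
                  count=1, flags=re.IGNORECASE)
    return text.encode("euc_jp")

# Made-up two-entry catalog, for illustration only.
sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    '\n'
    'msgid "Japanese"\n'
    'msgstr "日本語"\n'
).encode("utf-8")

converted = po_utf8_to_eucjp(sample)
print(converted)
```

Transcoding without updating the header (or the reverse) leaves a catalog whose declared charset disagrees with its bytes.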
  
I also changed the charset in the PO files (manually).
Strangely, your ja.po file compiles perfectly; I'll commit it to CVS, thanks.
[...]
  
Problem ---

In addition to the patch described above, I tried replacing
    MIME::Words::decode_mimewords(STRING)
with
    MIME::EncWords::decode_mimewords(STRING, Charset=>CHARSET).

But I cannot determine which CHARSET should be used to feed decoded data to TT2 templates.  With any charset (including _UNICODE_), TT2 seems to corrupt the data it is fed.  See bug #1059.
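
For readers unfamiliar with the operation, decoding MIME encoded-words means turning header text such as `=?ISO-8859-1?Q?...?=` back into characters. The sketch below uses Python's standard `email.header` module, not MIME::Words or MIME::EncWords, and decodes to a Unicode string rather than to bytes in a chosen target charset; the `decode_mimewords` wrapper and the sample header are made up for illustration.

```python
from email.header import decode_header, make_header

def decode_mimewords(value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value to a Unicode string."""
    return str(make_header(decode_header(value)))

encoded = "=?ISO-8859-1?Q?Mod=E9rateurs_de_la_liste?="
decoded = decode_mimewords(encoded)
print(decoded)
```

Whatever consumes the result then has to agree on whether it is receiving characters or bytes in a particular charset, which is the open question raised above.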
      
Can you explain this problem in a bit more detail?
    
Strings used for interpolation in TT2 are interpreted as if they
were encoded in ISO-8859-1.  Curious as it is ---

- When a byte string ``é'' (latin small letter e with acute) encoded
  by UTF-8, "\xC3\xA9", is fed to TT2, output contains ``Ã©'',
  "\xC3\x83\xC2\xA9" (UTF-8 representation of ISO-8859-1
  interpretation of "\xC3\xA9").
- When a Unicode string ``é'', "\x{00E9}", is fed, output contains
  ``Ã©'', "\x{00C3}\x{00A9}" (Perl internal representation of
  ISO-8859-1 interpretation of the Unicode string with utf8 flag
  forced to be off).
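
The byte sequences quoted in the two cases above can be reproduced by simulating the misinterpretation directly. This is a Python sketch of the encoding arithmetic only; it does not involve TT2 or Perl's utf8 flag, and the variable names are illustrative.

```python
# Case 1: UTF-8 bytes of "é" are misread as ISO-8859-1, then re-encoded.
utf8_bytes = "é".encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")   # two characters, U+00C3 U+00A9
double_encoded = misread.encode("utf-8")

# Case 2: the same two code points are what appears when the UTF-8
# bytes of U+00E9 are viewed one byte per character.
code_points = [f"U+{ord(ch):04X}" for ch in misread]

print(utf8_bytes, double_encoded, code_points)
```

Both cases come down to the same step: a UTF-8 byte sequence being treated as a string of single-byte ISO-8859-1 characters.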
  
I'm wondering if your problem might be related to something I fixed yesterday in the CVS HEAD :
The logging subroutine (do_log()) recodes its parameters from UTF-8 to the filesystem_encoding (this is required because syslogd does not seem to cope well with UTF-8). While applying your patch, I found out that do_log() was recoding not only the values of the parameters but also the variables themselves. I fixed this. Could you therefore try the latest CVS HEAD before we investigate this topic further?
[...]

The problems are ---

(a) Not reproduced on:
  Web:
    - List subject.
  INFO Service message:
    - List subject.

(b) Reproduced on:
  INFO Service message:
    - Description of list.
  
I was not able to reproduce this problem ; but maybe I need to try with non-ISO-8859-1 data...
  Web:
    - Help pages installed into web_tt2/ja_JP/ (UTF-8 is used).

  Afterwards, I made a quick hack on src/tt2.pl (see the attached
  patch).  With it, the problem no longer seems to be reproduced.
  
If the problem persists, please give us a step-by-step way to reproduce it.
(c) The following seem to be compounded by other factors; they are
  encoded in the charset returned by gettext("_charset_"), then interpreted
  as ISO-8859-1:

  Web:
    - Language names in language box.
    - Dropdown box of "digest" parameter.
    - Perhaps anywhere strftime()'ed date appear.
  INFO Service message:
    - Days of Digest.

  For example ``日本語'' is shown as ``ÆüËܸì''
  ("\xC6\xFC\xCB\xDC\xB8\xEC" read as EUC-JP and as ISO-8859-1,
  respectively) in the language box.  ``Español'' is truncated
  to ``Espa''.
  
I'll try to find out what is causing this...
Other known bugs ---

- When address headers of service messages include non-ASCII characters, the headers will be encoded incorrectly.

  Structured headers (address fields, parenthesized comments, parameters,...) should be handled
  separately by appropriate functions.
  
      
I'm afraid I don't understand what you mean ?
    
I mean that, for example, a header:

  To: Modérateurs de la liste somelist <address@concealed>

will be encoded as:

  To: =?ISO-8859-1?Q?Mod=E9rateurs_de_la_liste_somelist_<somelist-editor@so?=
   =?ISO-8859-1?Q?me.dom.ain>?=

N.B.: This result _would be_ MIME-compliant if this were _not_ a structured
  header field.  Though the original MIME::Words takes care of natural
  word separators (i.e. spaces), such separators are not
  necessarily obvious in non-word-spacing languages (CJK, Thai, ...).
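
What a structure-aware encoder would produce for the header above can be sketched with Python's standard `email.utils.formataddr`, which encodes only the display name and leaves the angle-bracketed address as plain ASCII. This is an illustration of the desired result, not of Sympa's code or of the attached patch; the variable names are made up, and the address is the one visible in the encoded example.

```python
from email.utils import formataddr, parseaddr

display_name = "Modérateurs de la liste somelist"
address = "somelist-editor@some.dom.ain"

# Only the phrase is turned into an encoded-word; the addr-spec stays raw.
header_value = formataddr((display_name, address), charset="iso-8859-1")
parsed_name, parsed_address = parseaddr(header_value)

print("To: " + header_value)
```

Because the address never enters an encoded-word, a receiving MTA or mail client can still parse it.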

On TT2 templates, this will be avoided by the attached (second) patch,
but this may not be a general solution.  
Another solution is to put [% FILTER qencode %] at the right place in the TT2 files, as in the example below:
To: [% FILTER qencode %][%|loc(list.name)%]Moderators of list %1[%END%][%END%] <[% list.name %]-editor@[% list.host %]>
I've fixed the mail_tt2 files according to this.
I don't know if we still need your patch...
I believe that
structured header fields in general need to be parsed/constructed
by dedicated functions, not merely passed through B/Q encoding.
  
What other solution do you propose ?
All modifications described in this message are compiled into the
last attachment.  An actual (maybe tentative) installation is
running here: http://sympa.nezumi.nu/sympa

I will reply to the remainder of your message later.  Thanks again.
  
We'll have a close look at the patches you provided, thanks.

BTW : In your previous message, you reported a problem related to duplicated header fields. I found out that the problem was related to our mail::mail_file() subroutine incorrectly detecting folded header fields. Here is the patch : http://sourcesup.cru.fr/cgi/viewcvs.cgi/sympa/src/mail.pm?r1=1.37&r2=1.38&makepatch=1&diff_format=u
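
For context, a folded header field is one logical field split over several physical lines, where every continuation line begins with a space or tab. The Python sketch below shows one simple way to detect and join such lines; it is not the fix in the linked patch, `unfold_headers` is a hypothetical helper, and it simplifies by collapsing the continuation's leading whitespace to a single space.

```python
def unfold_headers(lines):
    """Join continuation lines (leading space or tab) onto the previous field."""
    fields = []
    for line in lines:
        if line[:1] in (" ", "\t") and fields:
            # Continuation of the previous header field.
            fields[-1] += " " + line.strip()
        else:
            fields.append(line.rstrip("\r\n"))
    return fields

raw = [
    "To: =?ISO-8859-1?Q?Mod=E9rateurs?=",
    " <somelist-editor@some.dom.ain>",
    "Subject: test",
]
fields = unfold_headers(raw)
print(fields)
```

Treating a continuation line as the start of a new field is one way a header can end up duplicated or split.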



