devel - Re: [sympa-dev] Charset/encoding for e-mail message

Subject: Developers of Sympa

From: Olivier Salaün - CRU <address@concealed>
To: Hatuka*nezumi - IKEDA Soji <address@concealed>
Cc: address@concealed
Subject: Re: [sympa-dev] Charset/encoding for e-mail message
Date: Wed, 18 Oct 2006 12:09:36 +0200

Hi,

Hatuka*nezumi - IKEDA Soji wrote:

Following suggestions, I released two new Perl modules, MIME::Charset and MIME::EncWords (like Peter's work, this is another alternative to MIME::Words, but it focuses on supporting multibyte
charsets):
    http://search.cpan.org/perldoc/MIME%3A%3ACharset
    http://search.cpan.org/perldoc/MIME%3A%3AEncWords

The revised patch (attached) is based on these modules.  Changes made are ---

We have applied, tested and finally committed your patch to the dev CVS branch.
Below are a few comments and questions:

- encode_mimewords()/decode_mimewords() of MIME::Words are replaced by the alternative ones of MIME::EncWords (but
  the modification is incomplete, as described later).

  o encode_mimewords() automatically chooses appropriate encoding
    (B, Q or unencoded) by enhanced ``Encoding="A"'' option.

The 'A' option (standing for Automatic, I suppose) is definitely a good idea, since the choice between Base64 and Quoted-Printable depends heavily on the language used. The 'A' option is not documented on your module's CPAN pages, though.
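
To illustrate why the choice depends on the language, here is a minimal sketch that compares the size of the encoded text under "B" and "Q" for a French and a Japanese string. It uses only core Perl modules (Encode, MIME::Base64), not MIME::EncWords, and it is not a description of how the 'A' option decides internally. The helper names b_length and q_length and the sample strings are invented for the example, and the counts ignore the "=?charset?X?...?=" wrapper and line folding.

```perl
use strict;
use warnings;
use utf8;                               # the string literals below are UTF-8 source text
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Length of the encoded text if the bytes are sent with the "B" encoding (Base64).
sub b_length {
    my ($bytes) = @_;
    return length encode_base64($bytes, '');    # '' suppresses the trailing newline
}

# Length of the encoded text if the bytes are sent with the "Q" encoding.
# Printable ASCII costs one character (space becomes "_"); any other byte,
# and the reserved characters "=", "?" and "_", cost three ("=XX").
sub q_length {
    my ($bytes) = @_;
    my $escaped = () = $bytes =~ /[^\x20-\x7E]|[=?_]/g;
    return length($bytes) + 2 * $escaped;
}

# Hypothetical sample subjects, each in a charset commonly used for that language.
my %samples = (
    French   => encode('iso-8859-1', 'Réunion générale'),
    Japanese => encode('euc-jp',     '日本語のメール'),
);

for my $lang (sort keys %samples) {
    printf "%-9s B=%d Q=%d\n", $lang,
        b_length($samples{$lang}), q_length($samples{$lang});
}
```

Mostly-ASCII text comes out shorter, and stays readable, with "Q", while text made entirely of multibyte characters is much shorter with "B".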

- src/mail.pm:reformat_message(): reformat outgoing service messages (along with charset conversion for Japanese messages).

Does reformat_message() also alter message bodies? That could be a problem for S/MIME signed messages, which must not be altered, otherwise the signature is broken...
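
A minimal sketch of the concern, using core modules only (Encode, Digest::SHA): a signature covers the exact bytes of the signed part, so the same text re-encoded in another charset no longer matches the digest. This is a simplification. Real S/MIME digests the whole signed MIME entity, and the helper name body_digest and the sample text are invented for the example.

```perl
use strict;
use warnings;
use utf8;                               # the string literal below is UTF-8 source text
use Encode qw(encode);
use Digest::SHA qw(sha256_hex);

# Digest of a body as it would travel on the wire, in the given charset.
sub body_digest {
    my ($text, $charset) = @_;
    return sha256_hex(encode($charset, $text));
}

my $text = "日本語のメール\r\n";

my $before = body_digest($text, 'UTF-8');        # body as originally signed
my $after  = body_digest($text, 'iso-2022-jp');  # same text after charset conversion

print $before eq $after
    ? "digest unchanged\n"
    : "digest changed: a signature over the original bytes would no longer verify\n";
```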

I had to replace the calls to croak() with proper error handling; they were not acceptable within a daemon.

  * To support charset conversion for Japanese messages, _charset_ of po/ja.po must be changed from UTF-8 to EUC-JP (how about Rosetta stuff?).

You're right, the PO file should be transcoded to EUC-JP.
Rosetta is one option for translating the Sympa GUI; other options include software such as KBabel. We are actually thinking about providing a translation service (as opposed to software) on our sympa.org website. It would be based on home-made software or on Pootle.

Anyway, I tried transcoding ja.po from UTF-8 to EUC-JP using iconv, but without success:

% iconv -f utf-8 -t eucJP -o /tmp/ja.po po/ja.po

The problems appeared when compiling the PO catalog:

msgfmt -o ja.mo ja.po
ja.po:29:2: invalid multibyte sequence
ja.po:29:4: invalid multibyte sequence
ja.po:29:5: invalid multibyte sequence
ja.po:29:6: invalid multibyte sequence
ja.po:29:8: invalid multibyte sequence
ja.po:29:9: invalid multibyte sequence
ja.po:29:11: invalid multibyte sequence
ja.po:29:12: invalid multibyte sequence
ja.po:29:13: invalid multibyte sequence
ja.po:29:14: invalid multibyte sequence
ja.po:29:15: invalid multibyte sequence
ja.po:29:27: invalid multibyte sequence
ja.po:29:28: invalid multibyte sequence
ja.po:29:36: invalid multibyte sequence
ja.po:29:38: invalid multibyte sequence
ja.po:29:39: invalid multibyte sequence
ja.po:29:41: invalid multibyte sequence
ja.po:29:42: invalid multibyte sequence
ja.po:29:43: invalid multibyte sequence
ja.po:29:44: invalid multibyte sequence
msgfmt: too many errors, aborting

So I gave up on the conversion. If you have any idea what the problem might be...
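
One possible explanation, which is an assumption and not something confirmed in this thread: iconv converts the bytes but leaves the charset declared in the PO header entry untouched, so msgfmt may still be checking EUC-JP bytes against "charset=UTF-8". The sketch below, using only the core Encode module, shows that EUC-JP bytes are not valid UTF-8 and how the header line could be rewritten. The helper names is_valid_in and set_po_charset and the sample header are invented for the example.

```perl
use strict;
use warnings;
use utf8;                               # the string literal below is UTF-8 source text
use Encode qw(encode decode);

# Return 1 if the byte string is well-formed in the given charset, 0 otherwise.
sub is_valid_in {
    my ($charset, $bytes) = @_;
    my $copy = $bytes;                  # decode() with a CHECK value may modify its argument
    return eval { decode($charset, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Rewrite the charset declared in the Content-Type line of a PO header entry.
sub set_po_charset {
    my ($po_text, $charset) = @_;
    $po_text =~ s/^("Content-Type:\s*text\/plain;\s*charset=)[^\\"]+/$1$charset/mi;
    return $po_text;
}

# What iconv would produce for a Japanese string.
my $eucjp = encode('euc-jp', '日本語');

print "valid as UTF-8:  ", is_valid_in('UTF-8',  $eucjp), "\n";
print "valid as EUC-JP: ", is_valid_in('euc-jp', $eucjp), "\n";

# Hypothetical header line as it would still read after the iconv run.
my $header = qq{"Content-Type: text/plain; charset=UTF-8\\n"\n};
print set_po_charset($header, 'EUC-JP');
```

If that is the cause, changing the declared charset in /tmp/ja.po before running msgfmt should be enough.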

- src/Message.pm:new(), src/List.pm:distribute_msg():
  custom_subject processing was improved.  It will handle mixed-charset situations better.

I noticed that; it's great.
BTW, it looks like the way you call MIME::EncWords::encode_mimewords() is not documented on CPAN, i.e.:

MIME::EncWords::encode_mimewords([
        [$s1, $enc1],
        [$s2, $enc2]
        ], Encoding=>'A', Field=>'Subject');

Problem ---

In addition to the patch described above, I tried replacing
    MIME::Words::decode_mimewords(STRING)
with
    MIME::EncWords::decode_mimewords(STRING, Charset=>CHARSET).

But I cannot work out which CHARSET should be used to feed decoded data to TT2 templates.  With any charset (including _UNICODE_), TT2 seems to corrupt the data fed to it.  See bug #1059.

Can you explain this problem in a bit more detail?

Other known bugs ---

- When address headers of service messages include non-ASCII characters, the headers are encoded incorrectly.

  It is advisable to handle structured headers (address fields, parenthesized comments, parameters,...) separately,
  with appropriate functions.

I'm afraid I don't understand what you mean.
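
One reading of the point about structured headers, offered as a guess at the intent: in an address field only the display name may become an encoded-word, while the address itself and the angle brackets must stay literal, so the field cannot be passed through the encoder as one string. The sketch below uses only core modules (Encode, MIME::Base64). The helper names encode_phrase and format_address and the example.org address are invented for the example. It also skips the quoting of ASCII names containing special characters and the 75-character limit on encoded-words.

```perl
use strict;
use warnings;
use utf8;                               # the string literal below is UTF-8 source text
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Encode one piece of free text as a single RFC 2047 "B" encoded-word.
# Plain ASCII text is returned unchanged.
sub encode_phrase {
    my ($phrase) = @_;
    return $phrase unless $phrase =~ /[^\x00-\x7F]/;
    return '=?UTF-8?B?' . encode_base64(encode('UTF-8', $phrase), '') . '?=';
}

# Build an address header value: only the display name is encoded,
# the address and the angle brackets are left as they are.
sub format_address {
    my ($name, $addr) = @_;
    return encode_phrase($name) . " <$addr>";
}

print format_address('池田', 'soji@example.org'), "\n";
print format_address('Soji', 'soji@example.org'), "\n";
```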

- Headers for some service messages (at least MIME-Version, Content-Type and Content-Transfer-Encoding) are duplicated.
  This doesn't seem to be caused by this patch.

I can't reproduce this problem on our server.
Please file a bug report with enough information for us to reproduce it.

Thank you for this great contribution.
I'm sure users of multibyte charsets will appreciate it :-)

