devel - Re: [sympa-dev] Charset/encoding for e-mail message

Subject: Developers of Sympa

  • From: Olivier Salaün - CRU <address@concealed>
  • To: Hatuka*nezumi - IKEDA Soji <address@concealed>
  • Cc: address@concealed
  • Subject: Re: [sympa-dev] Charset/encoding for e-mail message
  • Date: Wed, 18 Oct 2006 12:09:36 +0200

Hi,

Hatuka*nezumi - IKEDA Soji wrote:
Following suggestions, I released new Perl modules, MIME::Charset and MIME::EncWords (similar to Peter's work, this is another alternative to MIME::Words, but it focuses on supporting multibyte
charsets):
    http://search.cpan.org/perldoc/MIME%3A%3ACharset
    http://search.cpan.org/perldoc/MIME%3A%3AEncWords

The revised patch (attached) is based on these modules.  Changes made are ---
  
We have applied, tested and eventually committed your patch in the dev CVS branch.
Below are a few comments and questions:
- encode_mimewords()/decode_mimewords() of MIME::Words are replaced by the alternative ones of MIME::EncWords (but
  the modification is incomplete, as described later).

  o encode_mimewords() automatically chooses appropriate encoding
    (B, Q or unencoded) by enhanced ``Encoding="A"'' option.
  
The 'A' (standing for Automatic, I suppose) is definitely a good idea, since the choice between Base64 and Quoted-Printable depends heavily on the language used. The 'A' option is not documented on your module's CPAN pages, though.
- src/mail.pm:reformat_message(): reformat outgoing service messages (along with charset conversion for Japanese messages).
  
Does reformat_message() also alter message bodies? That could cause problems with S/MIME-signed messages, which must not be altered or the signature breaks.

I had to replace the calls to croak() with proper error handling; they were not acceptable within a daemon.
  * To support charset conversion for Japanese messages, _charset_ of po/ja.po must be changed from UTF-8 to EUC-JP (how about Rosetta stuff?).
  
You're right, the PO file should be transcoded to EUC-JP.
Rosetta is one option for translating the Sympa GUI; other options include software such as KBabel. We are also thinking about providing a translation service (as opposed to software) on our sympa.org website. It would be based either on home-made software or on Pootle.

Anyway, I tried transcoding ja.po from UTF-8 to EUC-JP using iconv, but without success:
% iconv -f utf-8 -t eucJP -o /tmp/ja.po po/ja.po
The problems showed up when compiling the PO catalog:
msgfmt -o ja.mo ja.po
ja.po:29:2: invalid multibyte sequence
ja.po:29:4: invalid multibyte sequence
ja.po:29:5: invalid multibyte sequence
ja.po:29:6: invalid multibyte sequence
ja.po:29:8: invalid multibyte sequence
ja.po:29:9: invalid multibyte sequence
ja.po:29:11: invalid multibyte sequence
ja.po:29:12: invalid multibyte sequence
ja.po:29:13: invalid multibyte sequence
ja.po:29:14: invalid multibyte sequence
ja.po:29:15: invalid multibyte sequence
ja.po:29:27: invalid multibyte sequence
ja.po:29:28: invalid multibyte sequence
ja.po:29:36: invalid multibyte sequence
ja.po:29:38: invalid multibyte sequence
ja.po:29:39: invalid multibyte sequence
ja.po:29:41: invalid multibyte sequence
ja.po:29:42: invalid multibyte sequence
ja.po:29:43: invalid multibyte sequence
ja.po:29:44: invalid multibyte sequence
msgfmt: too many errors, aborting
Therefore I aborted the process. If you have any clue about what the problem might be...
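
A possible explanation, offered as an assumption rather than a confirmed diagnosis: iconv converts the bytes of the file but leaves the charset=UTF-8 declaration in the PO header untouched, so msgfmt would still try to read the EUC-JP bytes as UTF-8 and report invalid multibyte sequences. The sketch below rewrites the header declaration and transcodes in one pass. The sample catalog, the temporary paths and the variable names are made up for illustration and stand in for po/ja.po; only mktemp, sed and iconv are used.

```shell
#!/bin/sh
# Sketch: transcode a PO catalog from UTF-8 to EUC-JP and keep the charset
# declared in its header consistent with the new bytes.
set -eu

# Hypothetical scratch locations; a real run would point src at po/ja.po.
workdir=$(mktemp -d)
src="$workdir/ja.utf8.po"
dst="$workdir/ja.eucjp.po"

# Tiny stand-in catalog. printf '%s\n' prints each argument literally,
# so the \n inside the header strings stays as backslash-n, as PO expects.
printf '%s\n' \
  'msgid ""' \
  'msgstr ""' \
  '"Content-Type: text/plain; charset=UTF-8\n"' \
  '"Content-Transfer-Encoding: 8bit\n"' \
  '' \
  'msgid "Hello"' \
  'msgstr "こんにちは"' > "$src"

# 1. Rewrite the declared charset while the text is still UTF-8.
# 2. Convert the bytes themselves to EUC-JP.
# Assumption: the header spells the charset exactly as "charset=UTF-8".
LC_ALL=C sed 's/charset=UTF-8/charset=EUC-JP/' "$src" \
  | iconv -f UTF-8 -t EUC-JP > "$dst"

echo "wrote $dst"
```

If the header of the real ja.po spells the charset differently (lowercase, for instance), the sed pattern needs adjusting. Where gettext's msgconv is available, `msgconv -t EUC-JP` is intended to convert the catalog and update the header together, which may be simpler. After either route, msgfmt can be rerun on the result to see whether the errors persist.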
- src/Message.pm:new(), src/List.pm:distribute_msg():
  custom_subject processing was improved.  It will handle mixed-charset situations better.
  
I noticed that; it's great.
BTW, it looks like the way you call MIME::EncWords::encode_mimewords() is not documented on CPAN, i.e.:
MIME::EncWords::encode_mimewords([
        [$s1, $enc1],
        [$s2, $enc2]
        ], Encoding=>'A', Field=>'Subject');
Problem ---

In addition to the patch described above, I tried replacing
    MIME::Words::decode_mimewords(STRING)
with
    MIME::EncWords::decode_mimewords(STRING, Charset=>CHARSET).

But I cannot work out which CHARSET should be used to feed decoded data to TT2 templates.  With any charset (including _UNICODE_), TT2 seems to break the data fed to it.  See bug #1059.
  
Can you explain this problem in a bit more detail?
Other known bugs ---

- When address headers of service messages include non-ASCII characters, the headers will be encoded incorrectly.

  It is advisable that structured headers (address fields, parenthesized comments, parameters,...) be handled
  separately by appropriate functions.
  
I'm afraid I don't understand what you mean.
- Headers for some service messages (at least MIME-Version, Content-Type and Content-Transfer-Encoding) are duplicated.
  This doesn't seem to be caused by this patch.
  
I can't reproduce this problem on our server.
Please file a bug report with enough information for us to reproduce it.

Thank you for this great contribution.
I'm sure users of multibyte charsets will appreciate it :-)
