devel - Re: [sympa-developpers] Sympatic unicode ?

Subject: Developers of Sympa

  • From: IKEDA Soji <address@concealed>
  • To: "Stefan Hornburg (Racke)" <address@concealed>
  • Cc: address@concealed
  • Subject: Re: [sympa-developpers] Sympatic unicode ?
  • Date: Fri, 9 Mar 2018 12:03:12 +0900

On Thu, 8 Mar 2018 16:31:38 +0100
"Stefan Hornburg (Racke)" <address@concealed> wrote:

> On 03/08/2018 04:24 PM, Soji Ikeda wrote:
> > racke,
> >
> > On 2018/03/08 23:58, Stefan Hornburg (Racke) <address@concealed> wrote:
> >
> >>>> On 03/08/2018 12:52 PM, Marc Chantreux wrote:
> >>>> On Fri, Mar 02, 2018 at 05:55:22PM +0900, Soji Ikeda wrote:
> >>>> They should not be read / written through :utf8 layer, but :bytes
> >>>> layer.
> >>>> E.g. following operations should use :bytes layer:
> >>>> - Opening messages on disk.
> >>>> - Opening pipe to sendmail.
> >>
> >> We should rather use CPAN modules than opening a pipe to sendmail ...
> >
> > I don’t mind if it is performed by a wrapping module. I described the
> > cases where the :utf8 layer should not be used (see also the comment below).
> >
> >>>
> >>> what's the point of using :bytes everywhere just because mails should be
> >>> serialized this way ?
> >>>
> >>> those special cases (even if they happen frequently) should be wrapped
> >>> into functions that ensure the correctness.
> >>>
> >>> regards
> >>> marc
> >>
> >> Yes, I would agree with Marc.
> >>
> >> We are doing the following inside our Dancer apps:
> >>
> >>
> >> # the dumper shows \x{20ac}, so html and text are decoded.
> >> email {
> >> %args,
> >> body => encode( 'UTF-8', $text ),
> >> type => 'text',
> >> attach => {
> >> Charset => 'utf-8',
> >> Data => encode( 'UTF-8', $html ),
> >> Encoding => "quoted-printable",
> >> Type => "text/html"
> >> },
> >> multipart => 'alternative',
> >> };
> >>
> >> Here "email" is basically a wrapper around Email::Sender
> >> (https://metacpan.org/pod/Dancer2::Plugin::Email#DESCRIPTION).
> >
> > In that case the email is crafted by the program itself: the internal
> > encoding may be Unicode and the resulting message may be freely encoded
> > to UTF-8 (or another charset).
> >
> > However, we have to process incoming messages that may be encoded with a
> > legacy charset and transfer-encoding. Because we should keep the content
> > unchanged octet by octet (otherwise we might break the integrity of
> > signatures etc.), it may not be decoded to Unicode. After all, we have to
> > treat a message as a byte string, not as text data.
> In the normal case a raw email is just ASCII, isn't that correct? Which
> exceptions do you know?

I don't think that is the normal case.


RFC 5322 states that a message "is composed of characters with values
in the range of 1 through 127 and interpreted as US-ASCII
characters." This means that messages can be transmitted over
7-bit paths (transports allowing only the range of 1 through 127).

To use characters beyond US-ASCII while conforming to RFC 5322
(formerly RFC 822), MIME introduced two layers of encoding mechanisms:
charset and transfer-encoding (or header encoding, for header fields).

For example, this message <https://git.io/vAx1h> (test data for PR#8)
includes a body text encoded in the GB18030 charset and then encoded
with the BASE64 transfer-encoding.

If we decode this message and then encode it again, two problems are
possible:

- Charset conversion is not round-trip safe in general. Thus the
re-encoded byte sequence may not be identical to the original.

(As I wrote about normalization forms, there may also be a problem
with UTF-8.)

- The result of transfer-encoding is ambiguous: in the example above,
each encoded line is folded at 76 bytes, but the line length varies
between implementations of BASE64 encoders (likewise QUOTED-PRINTABLE).
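
The second point can be illustrated with a small sketch. It is not taken
from Sympa; the payload and the 64-character folding of the "original"
message are assumptions made up for illustration (RFC 2045 only sets a
maximum of 76 characters per encoded line). It uses only the core module
MIME::Base64.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical payload: arbitrary octets standing in for a body
# in some legacy charset.
my $payload = join '', map { chr($_) } 1 .. 120;

# Assume the sender's encoder folded lines at 64 characters with CRLF.
my $unfolded = encode_base64($payload, '');
(my $original = $unfolded) =~ s/(.{1,64})/$1\r\n/g;

# Decode, then re-encode with MIME::Base64's default folding
# (76 characters per line, terminated by "\n").
my $decoded   = decode_base64($original);
my $reencoded = encode_base64($decoded);

my $same_content = $decoded eq $payload;     # the content survives
my $same_octets  = $reencoded eq $original;  # the encoded octets do not

print "content preserved: ", ($same_content ? "yes" : "no"), "\n";
print "octets preserved:  ", ($same_octets  ? "yes" : "no"), "\n";
```

The decoded content is the same, but the re-encoded part differs from the
original octets, so anything computed over those octets (such as a
signature) would no longer match.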

This is what actually happened with earlier versions of Sympa:
when Sympa decoded the whole or a part of a message and encoded it
again, the signature was broken. Even if we adopt Unicode, we have to
take care of this.


Moreover, messages may include octets beyond the range of US-ASCII:
the requirement of 7-bit transport is becoming a thing of the past,
and the 8BIT transfer-encoding (introduced by MIME; it means that no
encoding is applied) can be used.

This also caused an actual problem: with Sympa 6.1.x, which used a
database for the spool, stored messages were occasionally broken
because 8-bit charsets (ISO-8859-1, ISO-8859-2, ...) and the 8BIT
transfer-encoding were used.

I think the lesson we can learn is: we have to treat a message as a
BLOB, not as text.
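
As a rough sketch of why the I/O layer matters (again not Sympa code),
the following writes the same octets through two different PerlIO layers.
The message fragment and the helper write_through() are hypothetical, and
the sketch uses :raw as the usual choice for binary data where the thread
speaks of :bytes. In-memory filehandles stand in for files on disk or a
pipe to sendmail.

```perl
use strict;
use warnings;

# Hypothetical message fragment: ISO-8859-1 text ("cafe" with e-acute)
# sent with the 8BIT transfer-encoding.
my $octets = "Content-Type: text/plain; charset=ISO-8859-1\r\n"
           . "Content-Transfer-Encoding: 8bit\r\n\r\n"
           . "caf\xE9\r\n";

# Hypothetical helper: write $data to an in-memory file through the
# given layer and return the octets that were actually stored.
sub write_through {
    my ($layer, $data) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open: $!";
    print {$fh} $data;
    close $fh or die "close: $!";
    return $buffer;
}

my $as_raw  = write_through(':raw',  $octets);
my $as_utf8 = write_through(':utf8', $octets);

print "raw layer kept octets:  ", ($as_raw  eq $octets ? "yes" : "no"), "\n";
print "utf8 layer kept octets: ", ($as_utf8 eq $octets ? "yes" : "no"), "\n";
```

Through the :utf8 layer the single octet 0xE9 is written as the two
octets 0xC3 0xA9, so the stored message is no longer the one that was
received, although its headers still declare ISO-8859-1.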


Regards,
-- Soji


> > Does my description miss the point?
>
> I think your sentence above clarified your earlier descriptions.
>
> Regards
> Racke
>
> >
> > Regards,
> > — Soji
> >
> >> Regards
> >> Racke
> >>
> >> --
> >> Ecommerce and Linux consulting + Perl and web application programming.
> >> Debian and Sympa administration. Provisioning with Ansible.
> >
> >
>
>
> --
> Ecommerce and Linux consulting + Perl and web application programming.
> Debian and Sympa administration. Provisioning with Ansible.


--
Conversion Co., Ltd.
IT Solution Department, System Solution Group 1, IKEDA Soji
Access Oimachi Bldg. 4F, 1-49-15 Oi, Shinagawa-ku, Tokyo 140-0014, Japan
e-mail address@concealed TEL 03-6429-2880
https://www.conversion.co.jp/


